Reaction Networks and the Metric Structure of Chemical Space(s)


Article

Reaction Networks and the Metric Structure of Chemical Space(s) Dmitrij Rappoport J. Phys. Chem. A, Just Accepted Manuscript • DOI: 10.1021/acs.jpca.9b00519 • Publication Date (Web): 08 Mar 2019 Downloaded from http://pubs.acs.org on March 9, 2019




The Journal of Physical Chemistry

Dmitrij Rappoport*
Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
E-mail: [email protected]

Abstract

In this paper we develop a formal definition of chemical space as a discrete metric space of molecules and analyze its properties. To this end, we utilize the shortest path metric on reaction networks to define a distance function between molecules of the same stoichiometry (number of atoms). The distance between molecules with different stoichiometries is formalized by making use of the partial ordering of stoichiometries with respect to inclusion. Calculations of fractal dimension on metric spaces for individual stoichiometries show that they have low intrinsic dimensionality, about an order of magnitude less than the dimension of the underlying reactive potential energy surface. Our findings suggest that efficient search strategies on chemical space can be designed that take advantage of its metric structure.

1 Introduction

Chemical space 1–5 is a strange place, both inexplicably vast and imprecisely defined. It is usually cited as a metaphor for the rich and diverse set of possible chemical structures, framing the search for new compounds in the language of space exploration. But if we abstract away from the poetic notion, the case for a rigorous definition of chemical space is easily stated. A space is just a set having some formal structure. 6,7 Knowing this structure makes finding the chemical structures of interest and comparing them to each other easier and more efficient. As a result, applications such as machine learning for chemistry, synthetic accessibility prediction, and encoding of chemical structures should become more powerful. What's more, a formal framework provides a language to reason about chemical systems and their relationships.

This paper explores a working definition of chemical space as an ordered set of metric spaces over stoichiometry-preserving reaction networks. The latter are discretized models of reactive potential energy surfaces (PES) with continuous trajectories replaced by discrete heuristic transformation rules. Our construction relies on three main concepts. A metric space is a set having a distance function, also called a metric. 6,7 We utilize reaction networks in the transition network (TN) representation 8,9 to define the metric spaces. The network nodes of these reaction networks are collections of molecules (flasks) with constant stoichiometry, while the network edges describe heuristic reactive transformations (bond breaking and bond making events). The reaction networks in the TN representation possess a natural distance function, the shortest path metric, which counts the smallest number of network edges (transformations) required to move from one network node to another. 10–12 The set of network nodes together with the attendant shortest path metric is a representation of the chemical reactivity within the given stoichiometry and defines a metric space, which we will refer to as a stoichiometric chemical space (SCS). Finally, we observe that SCS for different stoichiometries are related to each other by a partial ordering. 13–16 A partial ordering defines a less-or-equal relationship between objects but allows for some pairs to be incomparable, that is, neither object is less than or equal to the other.

Taken together, the three concepts (reaction networks, shortest path metric, and partial ordering of stoichiometries) provide us with the formal framework to measure distances between arbitrary molecules. We will be primarily concerned with the analysis of stoichiometric chemical spaces constructed from reaction networks and the shortest path metric in this paper. The applications of the partial ordering property will be addressed in detail in future work.

To test the utility of the above definitions in realistic chemical systems, we investigate the stoichiometric chemical spaces for closed-shell two-carbon and three-carbon organic compounds of carbon, hydrogen, and oxygen. With a robust definition of distances on reaction networks at our disposal, our interest is in analyzing the geometry of the SCS, especially their dimension and characteristic length scales. These parameters are directly related to the efficiency of searching in metric spaces. 17,18 Our main result is that SCS are low-dimensional, small-world, and broad-scale, but not scale-free. There are two reasons why these findings are highly encouraging. First, they indicate that the intrinsic dimension of reactive PES is much smaller than their formal dimension. Second, our definition of chemical space appears to be useful in practice because it enables efficient exploration and search. If some interesting applications are to result from it, then our theoretical exercise of constructing a set of new definitions should be well justified.

Many elements of this work have been previously discussed in the chemistry literature. Reaction networks play a central role in kinetic modeling of complex chemical reactions 19–22 and origins of life research. 23–26 The analysis of their graph structures has provided insights into mechanistic complexity and catalysis. 27–29 Independently, automatic synthesis prediction 30–35 spurred intensive research in compact representations of molecules and reactions and utilized graph theory to explore the network of synthetic chemistry. 36–39 Our definition of the shortest path metric on the set of bond breaking and bond making events is similar to the reaction distance definition by Kvasnička and co-workers 40–43 and the chemical distance defined on the set of bond–electron (BE) matrices of Dugundji and Ugi. 31,44,45 More broadly, graph-theoretical concepts have been successfully used for developing structure descriptors 46–48 and structure generation. 49–52 The ideas of metrics 10 and partial orderings 15,16 have been discussed in the context of topological descriptors of molecules and chemical reactions. Our construction of chemical space benefits from the insights of many of these seminal works.

This paper is organized as follows. We develop the details of the chemical space construction and analysis in Sec. 2. In Sec. 3 we analyze size, dimension, average lengths, and degree distributions in stoichiometric chemical spaces of closed-shell organic compounds with two or three carbon atoms. We discuss the implications of our results for searching in chemical space in Sec. 4 and present conclusions in Sec. 5.

2 Methods

2.1 Reaction Networks

We begin by introducing the necessary terminology for the chemical space construction. The reaction networks in the TN representation and the iterative heuristics-aided quantum chemistry (HAQC) procedure for generating them are described elsewhere. 8,9 Briefly, the nodes of the reaction network in the TN representation are collections of molecules (flasks) subject to the constraint that their stoichiometry (the total number of atoms of each element) be fixed. We use the Hill notation 53 $C_{\nu_C} H_{\nu_H} O_{\nu_O} \ldots$ to describe stoichiometries of flasks and reaction networks. While we limit ourselves to compounds consisting of carbon, hydrogen, and oxygen in this work, the extension to all covalent compounds is straightforward.

The edges of the reaction networks in the TN representation are stoichiometry-preserving reactive transformations chosen from a set of heuristic transformation rules. The rule set, shown in Table 1, describes bond breaking and bond formation patterns and is (i) complete with respect to the normal polarity rules ($\delta_H < \delta_C < \delta_O$, where $\delta$ denotes the element's electronegativity); (ii) reversible, meaning that for each bond breaking pattern, the corresponding bond formation pattern must also be included; (iii) non-redundant, owing to the fictitious (de)polarization rules (Table 1) used to describe multiple-bond reactivity. In the language of "arrow pushing", 54,55 if a reaction can be decomposed as a sequence of "double-barbed" arrow pushes (electron-pair transfers), observing the restrictions given below, it is representable by the heuristic transformation steps in Table 1.

The HAQC procedure 8,9 generates new network nodes from existing ones by applying heuristic transformation rules. The energies of the new network nodes are then evaluated using quantum chemical structure optimizations. Because we are interested in the global structure of chemical space, we enumerate all possible reactive transformations in this work. For other applications, graph search techniques are a much more efficient alternative. The iterative HAQC procedure converges when no new (non-redundant) nodes are produced. The converged reaction network is independent of the composition of the initial network node and is completely defined by the stoichiometry, the set of heuristic transformation rules, and the method for evaluating flask energies. It represents the full corpus of chemical reactivity for the given stoichiometry consistent with the heuristic transformation rules.

For efficiency, we make three simplifying assumptions in this work. As mentioned above, we only include polar reactivity rules assuming normal polarity. This restriction ensures that each node in a reaction network has the same average formal oxidation state, separating the non-redox reactivity (within a reaction network) from redox reactivity (between reaction networks). Further, we disallow bond breaks adjacent to charged atoms, which prevents the formation of intermediates with multiply charged atoms and electron-sextet (carbene) species. Finally, we limit the sum of absolute charges on each individual molecule to be 2 or less. None of these limitations are fundamental: we impose them primarily for practical reasons, to exclude unstable and high-energy species. Our assumptions are worth revisiting in the future but are reasonable for the purposes of this work.

We performed all calculations using our HAQC implementation in the open-source colibri code. 56 The structure optimizations used the PM7 semiempirical method implemented in the MOPAC program 57 and the solvent effects were approximated by the conductor-like solvation model (COSMO) 58 with the effective dielectric constant of water, ε = 78.4. The energy of the proton was computed assuming a neutral aqueous solution (pH = 7). We will not consider the energy distributions in this work and will focus on the structural aspects of the reaction networks. For applications of the energy profiles computed by the HAQC procedure to reaction mechanism predictions, the reader is referred to Ref. 9.

Table 1: Heuristic transformation rules used in this work. Abbreviated notation is given in parentheses.

Bond dissociation                    Bond association
H–C → H⁺ ··· C⁻    (H1 C)            H⁺ ··· C⁻ → C–H    (H1C)
C–C → C⁺ ··· C⁻    (C1 C)            C⁺ ··· C⁻ → C–C    (C1C)
C–O → C⁺ ··· O⁻    (C1 O)            C⁺ ··· O⁻ → C–O    (C1O)
H–O → H⁺ ··· O⁻    (H1 O)            H⁺ ··· O⁻ → O–H    (H1O)

Polarization                         Depolarization
C=O → C⁺–O⁻        (C2 O)            C⁺–O⁻ → C=O        (C2O)
C=C → C⁺–C⁻        (C2 C)            C⁺–C⁻ → C=C        (C2C)
C≡C → C⁺=C⁻        (C3 C)            C⁺=C⁻ → C≡C        (C3C)
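The iterative construction described in this section (apply the rules to known nodes until no new nodes appear) can be illustrated as a closure computation. The following is only a minimal sketch under stated assumptions, not the colibri implementation: `enumerate_network`, `toy_rules`, and the flask labels "A" to "D" are hypothetical placeholders, the rule table is an arbitrary reversible toy, and no energies are evaluated.

```python
from collections import deque

def enumerate_network(seed, apply_rules):
    """Grow a network from `seed` until no new nodes are produced.

    `apply_rules(node)` is a hypothetical callback returning the nodes
    reachable from `node` by one heuristic transformation.
    """
    nodes = {seed}
    edges = set()
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for product in apply_rules(node):
            edges.add((node, product))
            if product not in nodes:  # keep only non-redundant nodes
                nodes.add(product)
                queue.append(product)
    return nodes, edges

# Toy, reversible rule table (placeholder labels, not real flasks).
TOY_TABLE = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B", "D"], "D": ["B", "C"]}

def toy_rules(node):
    return TOY_TABLE[node]

nodes_from_a, edges_from_a = enumerate_network("A", toy_rules)
nodes_from_c, edges_from_c = enumerate_network("C", toy_rules)
```

In this toy case the converged node and edge sets do not depend on the seed, which mirrors the statement above that the converged network is independent of the initial network node when the rules are reversible.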

2.2 Stoichiometric Chemical Spaces

A metric space is a set of elements $S$ possessing a distance function $d(a, b)$ (metric), which is defined for all $a, b \in S$ and has the following properties: 6,7

$$
\begin{aligned}
d(a, b) &\geq 0 && \text{(positiveness);} \\
d(a, b) &= 0 \text{ if and only if } a = b && \text{(isolation);} \\
d(a, b) &= d(b, a) && \text{(symmetry);} \\
d(a, c) &\leq d(a, b) + d(b, c) && \text{(triangle inequality).}
\end{aligned}
\tag{1}
$$

We define the stoichiometric chemical space (SCS) $\Sigma_{\nu_C,\nu_H,\nu_O,\ldots}$ as the metric space whose elements are the network nodes of the reaction network of the $C_{\nu_C} H_{\nu_H} O_{\nu_O} \ldots$ stoichiometry and whose distance function is the shortest path distance (metric). 10–12 The shortest path distance $d_s(K, L)$ between the network nodes $K$ and $L$ is the smallest number of network edges needed to reach $L$ starting from $K$. In order for the shortest path metric on a directed network to satisfy the properties in Eq. 1, every pair of network nodes must be connected by at least one path (these networks are called strongly connected); moreover, the lengths of the forward and the reverse paths must be equal. These properties are ensured by the HAQC construction provided that the heuristic transformation rules are reversible: every edge then has a reverse edge, so any path can be traversed backwards in the same number of steps.

While the shortest path metric fits our intuitive understanding of a distance as a measure of "how far apart" a pair of nodes are, it is far from the only acceptable metric. Any function fulfilling the properties of Eq. 1 is allowed, with the choice of metric determined by the application. Several alternative graph metrics have been developed for molecular graphs. 10,11 The heuristic kinetic feasibility metrics developed in our previous paper 9 are applicable as weighted shortest path metrics. We will not be concerned with alternative distance functions on networks and will use the terms "reaction network" and "SCS" interchangeably if the metric is clear from the context. The notation $\Sigma_{\nu_C,\nu_H,\nu_O,\ldots}$ will refer to both the SCS and the underlying reaction network of stoichiometry $C_{\nu_C} H_{\nu_H} O_{\nu_O} \ldots$. We will write $\Sigma$ for an arbitrary SCS whose stoichiometry is not further specified.
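The shortest path distance and the conditions of Eq. 1 can be checked numerically on small networks. The sketch below is illustrative only: it uses breadth-first search on an adjacency dictionary, and the function names and the two toy networks (placeholder node labels) are assumptions introduced here, not part of the HAQC code.

```python
from collections import deque
from itertools import product

def bfs_distances(adjacency, source):
    """Shortest path lengths (edge counts) from `source` to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def all_pairs(adjacency):
    return {k: bfs_distances(adjacency, k) for k in adjacency}

def satisfies_metric_axioms(d):
    """Check Eq. 1 for a table d[a][b] of shortest path lengths."""
    nodes = list(d)
    for a, b in product(nodes, repeat=2):
        if b not in d[a]:                      # not strongly connected
            return False
        if d[a][b] < 0:                        # positiveness
            return False
        if (d[a][b] == 0) != (a == b):         # isolation
            return False
        if d[a][b] != d[b].get(a):             # symmetry
            return False
    for a, b, c in product(nodes, repeat=3):   # triangle inequality
        if d[a][c] > d[a][b] + d[b][c]:
            return False
    return True

# Reversible toy network: every edge has a reverse edge.
reversible = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B", "D"], "D": ["B", "C"]}
# One-way cycle: strongly connected, but forward and reverse lengths differ.
one_way = {"A": ["B"], "B": ["C"], "C": ["A"]}

d_rev = all_pairs(reversible)
d_one = all_pairs(one_way)
```

The one-way cycle shows why strong connectivity alone is not sufficient: the distance from "A" to "B" is 1 but the distance back is 2, so symmetry fails.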

2.3 Partial Ordering of Stoichiometric Chemical Spaces

We are only able to briefly sketch the basic definitions of partial ordering and partially ordered sets (posets) here as they apply to stoichiometric chemical spaces. For more background and details, the reader is referred to Refs. 13–16. A partial ordering (denoted $\preceq$) is defined in general as a relation on a set $T$ having the properties:

$$
\begin{aligned}
& X \preceq X && \text{(reflexivity);} \\
& \text{If } X \preceq Y \text{ and } Y \preceq X, \text{ then } X = Y && \text{(antisymmetry);} \\
& \text{If } X \preceq Y \text{ and } Y \preceq Z, \text{ then } X \preceq Z && \text{(transitivity).}
\end{aligned}
\tag{2}
$$

Note that it is not required that every distinct pair of elements $X, Y \in T$ is comparable, that is, that either $X \preceq Y$ or $Y \preceq X$ holds. The relaxation of the requirement that all elements be comparable is what distinguishes partial ordering relations from the better known total orderings, of which the less-or-equal relation on the set of real numbers $\mathbb{R}$ is an example. In the case of interest to us, the partial ordering is defined on the set of all SCS and signifies the inclusion of one stoichiometry in the other. Specifically, we define

$$
\Sigma_{\nu_C',\nu_H',\nu_O',\ldots} \preceq \Sigma_{\nu_C,\nu_H,\nu_O,\ldots} \quad \text{if } \nu_E' \leq \nu_E, \; E = C, H, O, \ldots
\tag{3}
$$
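The inclusion relation of Eq. 3 amounts to an element-wise comparison of atom counts. As a minimal sketch, assuming that a stoichiometry is stored as a tuple (ν_C, ν_H, ν_O) and using hypothetical helper names `included` and `comparable`:

```python
def included(x, y):
    """True if stoichiometry x is included in y (Eq. 3): every count in x <= y."""
    return all(a <= b for a, b in zip(x, y))

def comparable(x, y):
    """True if the two stoichiometries are ordered one way or the other."""
    return included(x, y) or included(y, x)

# Stoichiometries as (nu_C, nu_H, nu_O) tuples.
ch4 = (1, 4, 0)
co2 = (1, 0, 2)
c2h4o2 = (2, 4, 2)

ch4_vs_co2 = comparable(ch4, co2)        # neither includes the other
ch4_in_c2h4o2 = included(ch4, c2h4o2)
co2_in_c2h4o2 = included(co2, c2h4o2)
```

Because the comparison is element-wise, two stoichiometries are incomparable whenever each has more atoms of some element than the other, as is the case for CH4 and CO2 here.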

Two arbitrary SCS need not be comparable; for example, we observe that $\Sigma_{1,4,0} \not\preceq \Sigma_{1,0,2}$ and $\Sigma_{1,0,2} \not\preceq \Sigma_{1,4,0}$. It is easy to verify that the inclusion relation satisfies the properties of Eq. 2 and thus is a partial ordering.

Our use of stoichiometric inclusion to define a partial ordering on SCS is motivated by the following observation. We recall that the nodes of an SCS are flasks, that is, collections of one or more molecules with a constant total stoichiometry. For example, the space $\Sigma_{2,4,2}$ contains flasks of stoichiometry C2H4O2 such as acetic acid (CH3COOH) and hydroxyacetaldehyde (HOCH2CHO), but also a combination of methane and carbon dioxide (CH4 + CO2) or ketene and water (CH2=C=O + H2O). We call the network nodes consisting of only one molecule irreducible, and those with two or more molecules reducible. At the same time, we can also consider the constituent molecules of a reducible flask $K \in \Sigma$ individually as flasks in another SCS, $K' \in \Sigma'$, where $\Sigma'$ corresponds to the molecule's stoichiometry. It is clear that in this case $\Sigma' \preceq \Sigma$ by the definition in Eq. 3, for example, $\Sigma_{1,4,0} \preceq \Sigma_{2,4,2}$ and $\Sigma_{1,0,2} \preceq \Sigma_{2,4,2}$. In other words, the partial ordering on the set of SCS describes which molecular stoichiometries can occur in reducible flasks.

Using the properties of partial orderings, we can make two assertions. Consider the SCS $\Sigma_{\nu_C',\nu_H',\nu_O',\ldots}$, $\Sigma_{\nu_C'',\nu_H'',\nu_O'',\ldots}$, and $\Sigma_{\nu_C,\nu_H,\nu_O,\ldots}$ with the property

$$
\nu_E' + \nu_E'' = \nu_E, \quad E = C, H, O, \ldots
\tag{4}
$$

For convenience, we call the first two spaces component spaces and the last one the combination space. We can now show that (i) the component spaces are included in the combination space,

$$
\Sigma_{\nu_C',\nu_H',\nu_O',\ldots}, \; \Sigma_{\nu_C'',\nu_H'',\nu_O'',\ldots} \preceq \Sigma_{\nu_C,\nu_H,\nu_O,\ldots};
\tag{5}
$$

and (ii) any combination of flasks $K' + K''$ from the component spaces, $K' \in \Sigma_{\nu_C',\nu_H',\nu_O',\ldots}$, $K'' \in \Sigma_{\nu_C'',\nu_H'',\nu_O'',\ldots}$, is a valid reducible flask belonging to the combination space,

$$
K' + K'' \in \Sigma_{\nu_C,\nu_H,\nu_O,\ldots}.
\tag{6}
$$

The first assertion follows directly from the definition of the partial ordering, Eq. 3. The second assertion is satisfied if all three SCS are non-empty and are constructed by the same set of heuristic transformation rules. The set of all combinations $K' + K''$ is the product (Kronecker) graph 59,60 of the networks $\Sigma_{\nu_C',\nu_H',\nu_O',\ldots}$ and $\Sigma_{\nu_C'',\nu_H'',\nu_O'',\ldots}$ and is a subnetwork of $\Sigma_{\nu_C,\nu_H,\nu_O,\ldots}$ by virtue of assertion (ii).

The partial ordering of SCS helps us complete the constructive definition of a distance between molecules of different stoichiometries and thus of the metric version of chemical space. To this end, we map the molecules onto irreducible flasks $K' \in \Sigma_{\nu_C',\nu_H',\nu_O',\ldots}$, $K'' \in \Sigma_{\nu_C'',\nu_H'',\nu_O'',\ldots}$. We distinguish two cases. If both molecules have the same stoichiometry, then $\Sigma_{\nu_C',\nu_H',\nu_O',\ldots} = \Sigma_{\nu_C'',\nu_H'',\nu_O'',\ldots}$ and the distance between the molecules is given by the shortest path metric $d_s(K', K'')$ on this metric space. Otherwise, we find the smallest SCS $\Sigma_{\nu_C,\nu_H,\nu_O,\ldots}$ that includes both molecules by setting

$$
\nu_E = \max(\nu_E', \nu_E''), \quad E = C, H, O, \ldots
\tag{7}
$$

(in the theory of partial orderings, this element is known as the join of $\Sigma_{\nu_C',\nu_H',\nu_O',\ldots}$ and $\Sigma_{\nu_C'',\nu_H'',\nu_O'',\ldots}$). According to assertion (ii), there exist subsets $S'$ and $S''$ of $\Sigma_{\nu_C,\nu_H,\nu_O,\ldots}$ containing $K'$ and $K''$, respectively. Therefore, we can define the distance between $K'$ and $K''$ as the Hausdorff distance 7,61 between the sets $S'$ and $S''$,

$$
d(K', K'') = d_H(S', S'') = \max\left\{ \max_{L' \in S'} \min_{L'' \in S''} d_s(L', L''), \; \max_{L'' \in S''} \min_{L' \in S'} d_s(L', L'') \right\},
\tag{8}
$$

where $d_s(L', L'')$ is the shortest path metric on $\Sigma_{\nu_C,\nu_H,\nu_O,\ldots}$. The Hausdorff distance fulfills the properties of a distance function, Eq. 1, between subsets of a metric space, which means that $d(K', K'')$ is a properly defined metric. Another desirable characteristic of the Hausdorff distance is that it reduces to the underlying shortest path metric if the subsets $S'$ and $S''$ each contain only one element. This makes the two cases we distinguished above consistent with each other. Intuitively, the Hausdorff distance can be viewed as the thickness of the smallest "shell" around one subset necessary to completely enclose the other.

We have now defined a theoretical framework for measuring distances between arbitrary molecules that we can with some justification call chemical space. In the remainder of this paper, we will study the properties of the SCS. We will investigate the properties of the partial ordering and the Hausdorff distance in future work.
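The join of Eq. 7 and the Hausdorff distance of Eq. 8 are both short computations once a distance function is available. The sketch below is an illustration under assumptions: stoichiometries are (ν_C, ν_H, ν_O) tuples, the subsets are passed in explicitly (how S' and S'' are selected is not specified here), and the stand-in metric is the distance |a − b| between integer-labeled nodes on a linear chain rather than a real reaction network.

```python
def join(x, y):
    """Smallest stoichiometry including both x and y (Eq. 7)."""
    return tuple(max(a, b) for a, b in zip(x, y))

def hausdorff(dist, s1, s2):
    """Hausdorff distance between node subsets s1 and s2 (Eq. 8).

    `dist(a, b)` is any symmetric metric on the nodes.
    """
    forward = max(min(dist(a, b) for b in s2) for a in s1)
    backward = max(min(dist(a, b) for a in s1) for b in s2)
    return max(forward, backward)

def chain_distance(a, b):
    # Stand-in metric: shortest path length on a linear chain of nodes.
    return abs(a - b)

joined = join((1, 4, 0), (1, 0, 2))                    # CH4 and CO2
d_sets = hausdorff(chain_distance, {0, 1}, {1, 4})
d_single = hausdorff(chain_distance, {2}, {5})         # singleton subsets
```

For the singleton subsets the result equals the underlying distance between the two nodes, which is the consistency property noted above. For the two-element subsets, the value is set by node 4, whose nearest neighbor in the other subset is three steps away.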

2.4 Network Dimension by Box Covering Method

Dimension can be intuitively understood as the number of degrees of freedom or of independent measurements in the system. This definition works well for line segments (dimension $d = 1$), squares ($d = 2$), and cubes ($d = 3$). This concept can be extended to objects of non-integer dimension (fractals) having metric structure by considering their box covering dimension $d_b$. 62–64 The procedure for computing the box covering dimension is as follows. The object is partitioned into disjoint subsets (boxes) such that the largest distance within each box is equal to a fixed value $l$. In general, the number of boxes in such a partition scales with a negative (non-integer) power of the box length, $N_b(l) \sim l^{-d_b}$, in the limit of $l \to 0$. In that case, the box covering dimension can be computed by linear regression of $\ln N_b(l)$ against $\ln l$ for a series of box coverings with different values of the box length $l$. For non-fractal objects, the box covering dimension coincides with the conventional definition of dimension. However, non-integer dimension is associated with the self-similarity of fractal objects and is considered one of their typical characteristics.

Computing the optimal box covering of a network has exponentially scaling computational cost, and approximations are typically needed. 65–67 We used the implementations of the approximate compact box burning (CBB), maximum excluded mass burning (MEMB), and greedy coloring algorithms by Song and co-workers to compute the box covering dimension. 68,69 The greedy coloring algorithm converts the box covering problem into a node coloring problem on a related graph and solves the latter using a greedy algorithm. The CBB and MEMB algorithms use breadth-first search starting with a randomly chosen network node and construct boxes with the maximum total number of nodes (CBB algorithm) or the maximum number of nodes within a given radius from a central node (MEMB algorithm). The CBB, MEMB, and greedy coloring algorithms produce slightly different numerical results from each other. However, the qualitative results are in excellent agreement between all three methods. We found that the CBB algorithm was the most efficient for the reaction networks considered in this work. We report the box covering dimension computed using the CBB method in Sec. 3 and those computed by the MEMB and greedy coloring methods in Sec. S2 of the SI.
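The regression step described above can be illustrated with a deliberately simple covering heuristic. This sketch is not the CBB, MEMB, or greedy coloring algorithm of Song and co-workers: it is a naive first-fit covering written for illustration, and all names are hypothetical. As a further assumption, the box length is taken as the largest allowed in-box distance plus one, so that a linear chain of n nodes gives exactly n/l boxes.

```python
import math
from collections import deque

def bfs_distances(adjacency, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def greedy_box_count(adjacency, max_dist):
    """Number of boxes in a first-fit covering of a connected network.

    Nodes join the current box only if they are within `max_dist`
    of every node already in it.
    """
    dist = {v: bfs_distances(adjacency, v) for v in adjacency}
    uncovered = list(adjacency)
    boxes = 0
    while uncovered:
        box = [uncovered[0]]
        for v in uncovered[1:]:
            if all(dist[v][u] <= max_dist for u in box):
                box.append(v)
        members = set(box)
        uncovered = [v for v in uncovered if v not in members]
        boxes += 1
    return boxes

def box_dimension(adjacency, max_dists):
    """Negative least-squares slope of ln N_b against ln l, with l = max_dist + 1."""
    xs = [math.log(m + 1) for m in max_dists]
    ys = [math.log(greedy_box_count(adjacency, m)) for m in max_dists]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return -slope

# Linear chain of 64 nodes: a one-dimensional test object.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 64] for i in range(64)}
counts = [greedy_box_count(chain, m) for m in (0, 1, 3, 7)]
chain_dimension = box_dimension(chain, (0, 1, 3, 7))
```

On the chain the box counts halve each time the box length doubles, so the fitted dimension is 1, as expected for a line segment. The first-fit result depends on node ordering and is generally not an optimal covering, which is one reason why the dedicated algorithms cited above are used in practice.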

3 Results

We have constructed reaction networks of compounds of carbon, hydrogen, and oxygen with $\nu_C = 2, 3$ using the HAQC procedure. For easier comparison, we group the reaction networks by their average carbon oxidation state $\xi_C = (-\nu_H + 2\nu_O)/\nu_C$, for which the lower and upper bounds $-4 \leq \xi_C \leq 4$ are fixed by valence rules. The networks belonging to the same value of $\xi_C$ form a homologous series of stoichiometries, $C_{\nu_C} H_{\nu_{H,0}+2r} O_{\nu_{O,0}+r}$ with $r = 0, 1, 2, \ldots$, which are generated by successively adding water molecules to the smallest reaction network ($r = 0$).

Each reaction network was characterized by the following key properties: network size, fractal dimension, average distance, and degree distribution. In focusing on these properties, we are looking to explore the correlations between stoichiometries and the global network invariants of the corresponding reaction networks. Moreover, we are interested in structural properties that determine the efficiency of search algorithms in chemical spaces. Network size (Sec. 3.1) is the number of nodes in the network, which determines the size of the search space. Within the HAQC procedure, each network node encompasses one or more distinct energy minima of the reactive PES belonging to a given structural formula. The fractal dimension of networks (Sec. 3.2) generalizes the notion of the number of degrees of freedom inherent in the system. The position of a node in a linear path is completely specified by one coordinate value, while for a $d$-dimensional rectangular lattice, the fractal dimension is equal to $d$, the dimension of the coordinate space. A complementary characteristic is the average distance (Sec. 3.3) between network nodes, which establishes a characteristic length scale along each degree of freedom. Finally, the node degree (Sec. 3.4) is the number of neighboring nodes. The distribution of node degrees is a measure of translational symmetry in the system: homogeneous networks possess narrow degree distributions, while broad degree distributions are typical of heterogeneous networks lacking translational symmetry.

The comparison with a $d$-dimensional rectangular lattice is once again instructive here. The average distance between lattice nodes grows as $n^{1/d}$, where $n$ is the number of nodes, and all lattice nodes have the same degree $2d$. In contrast, many complex networks show an average distance that is proportional to $\ln n$ (small-world networks 70–74) or a very broad power-law degree distribution (scale-free networks 71,75–78). These structural properties turn out to be helpful in searching these networks efficiently. We will return to this question in Sec. 4.
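The grouping by average carbon oxidation state introduced at the start of this section can be made concrete with a few lines of arithmetic. The following is a small sketch; the function names are hypothetical, and the series shown starts from C2O2 purely as an example.

```python
from fractions import Fraction

def carbon_oxidation_state(n_c, n_h, n_o):
    """Average carbon oxidation state xi_C = (-nu_H + 2 nu_O) / nu_C."""
    return Fraction(-n_h + 2 * n_o, n_c)

def homologous_series(n_c, n_h0, n_o0, r_max):
    """Stoichiometries obtained by adding r = 0..r_max water molecules."""
    return [(n_c, n_h0 + 2 * r, n_o0 + r) for r in range(r_max + 1)]

series = homologous_series(2, 0, 2, 3)      # C2O2, C2H2O3, C2H4O4, C2H6O5
states = [carbon_oxidation_state(*s) for s in series]
atom_counts = [sum(s) for s in series]      # nu = nu_C + nu_H + nu_O
```

Adding a water molecule changes the numerator by −2 + 2 = 0, so every member of a series has the same value of ξ_C while the number of atoms grows by three per step. Exact fractions are used because ξ_C takes non-integer values such as 2/3 for three-carbon compounds.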


3.1 Network Size

The network size (number of nodes) is shown as a function of the total number of atoms $\nu = \nu_C + \nu_H + \nu_O$ in Fig. 1. The lines indicate the homologous series for different values of the average carbon oxidation state $\xi_C$. Several of the reaction networks consist of a single node under the allowed heuristic transformation rules (C2, C2O2), while the largest two-carbon reaction network considered in this work belongs to the C2H16O9 stoichiometry and has 13,555 nodes. The double-logarithmic plot shows that the network size within each homologous series grows no faster than a polynomial function of the number of atoms. The smallest reaction networks contain only a few nodes and increase by an order of magnitude or more in size as water molecules are added. As the networks become larger, the growth rates gradually decrease, resulting in scaling curves that are concave downwards. In the asymptotic limit of infinite dilution ($r \to \infty$), the differences between average carbon oxidation states disappear, and the scaling of the network size is entirely due to the reactivity of water molecules (breaking and formation of O–H bonds). Therefore, all scaling curves in Fig. 1 must become linear and parallel to each other in the limiting case of $\nu \to \infty$.

We note the marked difference between the reaction networks corresponding to formally positive (and zero) and negative oxidation states of carbon. The oxygen-rich networks with $\xi_C \geq 0$ are of similar size for the same number of atoms and form a cluster of overlapping scaling curves in Fig. 1. In contrast, the hydrogen-rich networks with $\xi_C < 0$ are smaller in size and become progressively smaller toward the negative end of the carbon redox scale. This difference in behavior is expected since hydrocarbons admit only a limited range of polar reactions (not taking into account redox reactions) compared to oxygen-containing compounds.

The reaction networks of three-carbon compounds ($\nu_C = 3$) show a faster polynomial increase in size with the number of atoms than those of two-carbon compounds (note the difference in vertical axes) but are otherwise qualitatively similar. The three-carbon networks with the fewest nodes are those having C3 and C3O3 stoichiometries, each containing only 2 nodes. The largest reaction network in this work has C3H6O3 stoichiometry and consists of 91,107 nodes. For full details of network size for all networks considered in this work, see Tables S1 and S2 in Sec. S1 of the Supporting Information (SI).

Figure 1: Scaling of network size n with number of atoms ν for reaction networks with νC = 2 (left panel) and νC = 3 (right panel). Lines connect reaction networks with the same average carbon oxidation state ξC. Note the double logarithmic scale.
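The grouping of stoichiometries into homologous series can be made concrete with a short sketch. It assumes the usual formal oxidation states H = +1 and O = −2 and a neutral stoichiometry, so that ξC = (2νO − νH)/νC; this formula is an assumption of the sketch rather than a quotation from the text, although it reproduces the range −4 ≤ ξC ≤ 4 quoted for the two-carbon networks. The function names are hypothetical.

```python
from fractions import Fraction


def carbon_oxidation_state(n_c, n_h, n_o):
    """Average carbon oxidation state for a neutral C/H/O stoichiometry.

    Assumes formal oxidation states H = +1 and O = -2, so that
    n_c * xi_C + n_h - 2 * n_o = 0.
    """
    if n_c <= 0:
        raise ValueError("at least one carbon atom is required")
    return Fraction(2 * n_o - n_h, n_c)


def homologous_series(n_c, n_h, n_o, n_water):
    """Stoichiometries obtained by adding 0..n_water water molecules."""
    return [(n_c, n_h + 2 * r, n_o + r) for r in range(n_water + 1)]


# Adding water leaves the average carbon oxidation state unchanged,
# which is why each line in Fig. 1 corresponds to a single value of xi_C.
series = homologous_series(2, 8, 0, 3)  # C2H8, C2H10O, C2H12O2, C2H14O3
states = {carbon_oxidation_state(*s) for s in series}
```

Exact fractions are used because the three-carbon networks have non-integer values of ξC such as 4/3.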

3.2 Network Dimension

We present the box covering dimension computed by the CBB method, d_b^CBB, of νC = 2, 3 reaction networks in Fig. 2. The error bars show the standard error of the linear regression. The reaction networks are low-dimensional objects with values of the fractal dimension between d_b^CBB = 0.23 ± 0.19 for the C3H2 stoichiometry and d_b^CBB = 3.68 ± 0.18 for C3H8O6. The fractal dimension increases logarithmically with the network size n to a very good approximation. We also note that the scaling relation is universal on the set of reaction networks considered here: the behavior of the two- and three-carbon networks and of different homologous series (see Sec. 3.1) is indistinguishable.

The fractal dimension is notably lower than the dimension of the reactive PES, which is defined on the (3ν − 6)-dimensional coordinate space for a ν-atomic molecule. It is clear that the coordinate space describes the full range of molecular dynamics and contains too much information if our interest is only in reactive transformations. However, separation of the reactive and non-reactive degrees of freedom is not trivial. On the other hand, the fractal dimension is an intrinsic property of SCS and is independent of the coordinate space in which it is embedded. 17,18 The finding that the intrinsic dimensionality of the SCS is a slowly growing function of the system size suggests that our construction amounts to a heuristic dimensionality reduction method for the reactive PES. We discuss the implications of these findings in Sec. 4.

The greedy coloring and the MEMB algorithms for computing the box covering dimension produce results in excellent qualitative agreement with the CBB method. The full results are included in Sec. S2 of the SI. The scaling plots for the greedy coloring and MEMB fractal dimension as a function of network size are given in Sec. S3 of the SI.
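To illustrate how a box covering dimension is obtained from a network, the following minimal sketch uses the greedy coloring variant of box covering, not the CBB method reported in Fig. 2, and it is not the code used in this work. It assumes the common convention that a box of size ℓ_B contains only nodes at mutual distance less than ℓ_B, and it estimates the dimension as the negative slope of ln N_B against ln ℓ_B. All function names are hypothetical.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances (number of edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def count_boxes(adj, box_size):
    """Greedy-coloring estimate of the number of boxes N_B.

    Nodes at distance >= box_size must not share a box, so they are
    given different colors; each color class is one box.
    """
    nodes = sorted(adj)
    dist = {u: bfs_distances(adj, u) for u in nodes}
    color = {}
    for i, u in enumerate(nodes):
        forbidden = {
            color[v]
            for v in nodes[:i]
            if dist[u].get(v, math.inf) >= box_size
        }
        c = 0
        while c in forbidden:
            c += 1
        color[u] = c
    return len(set(color.values()))


def box_dimension(adj, box_sizes):
    """Negative least-squares slope of ln N_B versus ln box size."""
    xs = [math.log(size) for size in box_sizes]
    ys = [math.log(count_boxes(adj, size)) for size in box_sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope /= sum((x - mx) ** 2 for x in xs)
    return -slope


def path_graph(n):
    """Linear chain of n nodes, a one-dimensional test case."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


d_path = box_dimension(path_graph(64), [1, 2, 4, 8, 16])
```

A linear chain is used as the test case because its box covering dimension is known to be 1. The greedy result depends on the node ordering in general, so production codes typically repeat the coloring over many random orderings and keep the smallest N_B.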

Figure 2: Fractal dimension d_b^CBB of νC = 2 (left panel) and νC = 3 networks (right panel) using the compact box burning (CBB) method. Note the logarithmic horizontal scale.


3.3 Average Distance

The scaling of the average distance l̄ with the network size n in νC = 2 and νC = 3 reaction networks is shown in Fig. 3. The reaction networks show the small-world property, 70–74 which manifests itself as a logarithmic increase of l̄ with the network size n. The nodes in a small-world network are separated by only a few steps even in very large networks. However, we should point out that this finding does not imply easy synthesizability. Our distance metric, by design, does not take into account the path energetics, and short paths might be formally plausible but not thermodynamically or kinetically feasible. Weighted path metrics are better suited than the shortest path metric for reaction mechanism prediction, see Sec. 4 and Ref. 9. It is important to note that small-world networks are a heterogeneous grouping of complex networks. The small-world property is present in such different types of networks as Erdős–Rényi random graphs, scale-free networks, random geometric networks, certain Kronecker graphs, and more. 70,71,79–81 Therefore, it should not be taken to imply a particular generative model or functional characteristic.
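The average distance discussed here is the mean of the shortest-path lengths over all pairs of distinct nodes. A minimal sketch for an unweighted, connected network stored as an adjacency dictionary is given below; the representation and the function name are assumptions of the sketch, and single-node networks are assigned an average distance of zero by convention.

```python
from collections import deque


def average_shortest_path_length(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes.

    adj maps each node to an iterable of its neighbors. The network is
    assumed to be connected; unreachable pairs are simply not counted.
    """
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Four-node chain 0-1-2-3 and the complete graph on four nodes.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
complete = {i: [j for j in range(4) if j != i] for i in range(4)}
l_chain = average_shortest_path_length(chain)
l_complete = average_shortest_path_length(complete)
```

One breadth-first search per node costs O(n(n + m)) in total for n nodes and m edges, which is affordable for networks of the size considered here.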

Figure 3: Average shortest path length l̄ of νC = 2 (left panel) and νC = 3 reaction networks (right panel) as a function of network size n. Note the logarithmic horizontal scale.


3.4 Degree Distribution

A particularly well-studied class of small-world networks is that of scale-free networks. 71,75–78 The defining characteristic of these networks is their highly heterogeneous structure with several highly connected nodes (hubs) and many low-degree nodes. The scale-free property has been linked to robustness and error tolerance of these networks. The node degree (number of neighbors) k in scale-free networks follows an extremely heavy-tailed power-law distribution p(k) ∼ k^(−γ) with the exponent γ between 2 and 3. Among networks having scale-free structure are some models of metabolic networks. In fact, their scale-free property has been proposed as a particularly robust and thus evolutionarily advantageous organizing principle. 82,83 Since the metabolic networks are embedded in chemical reaction networks, it is of interest to analyze whether the reaction networks constructed in this work also possess the scale-free property.

Fig. 4 gives an overview of the degree histograms in two-carbon reaction networks. The networks are arranged by average carbon oxidation state −4 ≤ ξC ≤ 4 increasing from top to bottom and the number of oxygen atoms 0 ≤ νO ≤ 6 increasing from left to right. The degree histograms show heavy-tailed but peaked degree distributions, which are well approximated by lognormal distributions. Networks with this type of degree distribution have been referred to as broad-scale. 80 As the degree histograms show, the reaction networks considered here have a few highly connected nodes at the upper range of the degree distributions; however, they lack the large pool of peripheral, low-degree nodes. Instead, the peak of the distribution is close to the median and shifts to higher values with increasing number of atoms. The average node degree scales logarithmically with the network size, see Sec. S2 of the SI. The degree histograms of νC = 3 networks are qualitatively similar and are displayed in Sec. S4 of the SI.
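The quantities entering the degree analysis can be sketched in a few lines. The example below bins node degrees with bin width Δk = 2, reports the median, and estimates lognormal parameters as the mean and standard deviation of ln k. The bin edges (starting at zero) and the maximum-likelihood fit are assumptions of this sketch; the fitting procedure used for the reported results is not specified in this section. The function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def degree_histogram(degrees, bin_width=2):
    """Counts per bin, keyed by the lower bin edge (bins start at 0)."""
    counts = Counter((k // bin_width) * bin_width for k in degrees)
    return dict(sorted(counts.items()))


def fit_lognormal(degrees):
    """Maximum-likelihood lognormal parameters (mu, sigma) of ln k.

    Nodes of degree zero are skipped because ln 0 is undefined.
    """
    logs = [math.log(k) for k in degrees if k > 0]
    return statistics.fmean(logs), statistics.pstdev(logs)


# Degrees of a small illustrative network (made-up numbers).
example_degrees = [1, 2, 3, 4, 4, 5]
hist = degree_histogram(example_degrees)
median_degree = statistics.median(example_degrees)
mu, sigma = fit_lognormal([2, 8])
```

A peaked histogram whose maximum lies near the median, as described above, is what distinguishes a broad-scale distribution from a power law, for which the lowest-degree bin would be the most populated.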

Figure 4: Degree histograms of νC = 2 reaction networks with bin width ∆k = 2. The networks are arranged by average carbon oxidation state −4 ≤ ξC ≤ 4 increasing from top to bottom and the number of oxygen atoms 0 ≤ νO ≤ 6 increasing from left to right. The median node degree is indicated by a horizontal solid line. Panels are labeled by stoichiometry (C2H8 through C2H4O6); the axis label is degree k.


4 Discussion

The metaphor of a vast chemical space arose in response to the practical challenges in performing systematic exploration of molecular structures on a large scale. Most of these challenges can be traced back to the properties of potential energy surfaces (PES). The PES of a ν-atomic system is defined on a (3ν − 6)-dimensional coordinate space and has a number of distinct minima that increases exponentially with ν. 84–86 The number of transition states increases even more rapidly. 85,86 While semi-local optimization techniques are very efficient for finding energy minima and transition states, 87–91 the global exploration of the energy landscapes 86 has to contend with the so-called curse of dimensionality, which describes an array of counterintuitive properties of high-dimensional spaces that make search in these spaces costly and inefficient. 92,93 Some general approaches to address the curse of dimensionality are dimensionality reduction, embedding methods, clustering, and randomization. 94,95

Our chemical space construction is an attempt to provide a formal framework that does not directly depend on the high-dimensional coordinate space. Instead, we create a discrete model of chemical space with metric structure. For simplicity, we suppress all issues related to lack of minima (repulsive potentials), symmetry, stereoisomerism, and conformations. The distance function on this space is defined by a three-step construction. First, we associate molecular structures with nodes of a stoichiometry-preserving reaction network (TN representation, Sec. 2.1). The edges of the reaction network in the TN representation correspond to heuristic transformation rules (bond breaking and bond making events). Second, we introduce the shortest path metric as a distance function on the set of the nodes of the reaction network. These two steps produce a metric structure between molecules of the same stoichiometry, which we refer to as a stoichiometric chemical space (SCS, Sec. 2.2).
Finally, we define a partial ordering on the set of SCS with respect to stoichiometric inclusion, which allows us to assign a distance between molecules of different stoichiometry as the Hausdorff distance on the join metric space (Sec. 2.3).

We should note that the definitions used in our chemical space construction, while sensible, are not unique. As discussed in Sec. 2.1, the set of heuristic transformation rules only includes transformation rules consistent with normal polarity, that is, positive hydrogen and negative oxygen, see Table 1. These rules successfully describe the mechanisms of polar reactions (substitutions, additions, eliminations) and pericyclic reactions (sigmatropic and electrocyclic reactions, cycloadditions) within the HAQC approach. 9 However, reactions with inverted polarity patterns (also known as Umpolung reactions), 96 radical reactions, and redox reactions are excluded by construction. The extension to a larger rule set is straightforward but requires higher-level quantum chemical methods in order to describe the structures of high-energy intermediates such as radicals, electron-sextet compounds, and carbanions.

Similarly, the shortest path metric can be replaced by a number of alternative distance functions on networks. A simple generalization of the shortest path metric is the weighted path metric d_w(K, L). 11 It is obtained by first assigning weights W(E) to all directed edges E in the network. The weighted path distance is then defined as the smallest sum of edge weights over all paths P = {E} between the nodes K and L,

\[
d_w(K, L) = \min_{P} \sum_{E \in P} W(E) . \tag{9}
\]
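For non-negative edge weights, the minimum over paths in Eq. 9 can be evaluated with Dijkstra's algorithm. The sketch below is one possible implementation under that assumption; the adjacency format, the function name, and the example weights are hypothetical and are not taken from this work.

```python
import heapq
import math


def weighted_path_distance(edges, source, target):
    """Smallest sum of edge weights over all directed paths (Eq. 9).

    edges maps a node to a list of (neighbor, weight) pairs with
    weight >= 0. Returns math.inf if target cannot be reached.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in edges.get(u, ()):
            candidate = d + w
            if candidate < best.get(v, math.inf):
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return math.inf


# Direct step K -> L is penalized; the two-step route via A is cheaper.
weighted = {"K": [("A", 1.0), ("L", 5.0)], "A": [("L", 1.0)], "L": []}
unit = {u: [(v, 1.0) for v, _ in nbrs] for u, nbrs in weighted.items()}

d_weighted = weighted_path_distance(weighted, "K", "L")
d_unit = weighted_path_distance(unit, "K", "L")
```

In this toy network the weighted distance follows the two-step route, whereas replacing every weight by 1 selects the direct edge, which shows how the choice of W(E) can change which path defines the distance.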

With all edge weights W(E) = 1, we recover the shortest path metric. The weighted path metric introduces more empiricism into our construction; however, it is better suited to convey the differences between heuristic transformation rules than the shortest path metric, which implies uniform weights W(E) = 1. A plausible choice for edge weights W(E) is a kinetic feasibility heuristic. 9

Another choice of the distance function on networks helps us illustrate some of the pitfalls of formal definitions. The discrete metric d_d(K, L) can be defined on an arbitrary set by saying that the distance from any point to any other point is 1, while the distance between a point and itself is 0. The discrete metric trivially fulfills the properties of a distance function, Eq. 1. However, we can easily convince ourselves that it is unsuitable for defining a dimension on the metric space using the box covering algorithm and that it does not provide us with useful insight: because all distinct points are at distance 1, a single box covers the whole set as soon as the box size exceeds 1, and no scaling regime exists. As the example of the discrete metric illustrates, the mere fact that we can define a metric structure is less important than its applications.

Using the shortest path metric, we constructed metric spaces that are low-dimensional, small-world, and broad-scale but not scale-free. Each of these characteristics has interesting implications for exploration and search in chemical space. It is a well-known result in computational geometry and nearest-neighbor search methods that the efficiency of search on metric spaces is determined by their intrinsic dimension. 17,18 Therefore, search on SCS should provide a meaningful speedup to molecular discovery due to their low intrinsic dimension compared to the dimension of the coordinate space. Our empirical finding of the logarithmic increase of the fractal dimension of SCS with the network size also has interesting parallels with another result from computational geometry, the Johnson–Lindenstrauss lemma. The latter states that a set of N_p points in R^d can be projected with minimal distortion onto a space with dimension d′ ∼ ln N_p. 97 This embedding has been recently shown to be optimal. 98 While the network size grows polynomially with the number of atoms within the homologous series (Sec. 3.1), we should assume that the network size has an exponential dependence on the system size in general. 84–86 Given our results, it is reasonable to conjecture that SCS will continue to be comparatively low-dimensional objects for larger stoichiometries.

Efficient search in complex networks has received a great deal of attention, especially in the context of decentralized message routing in computer networks.
70,71,73,99 While the small-world property entails that the shortest path from the source node to the target node has only a few steps, finding the specific target node by a random walk still requires exploring a significant portion of the network. 100,101 Two heuristic search strategies show great promise for locating the target node efficiently. One such strategy is based on the addition of long-range hops with a probability depending on the fractal dimension of the network. 70,99,102 Alternatively, one can combine the network search algorithm with a network-independent measure of similarity (for example, structural descriptors 47) in order to reduce the number of network nodes visited. 103,104 We note that in scale-free networks, yet another class of heuristic search strategies based on local node degrees is applicable. 73,100,101

The formalism and the chemical space properties explored in this paper aim to provide an impetus for several interesting applications. The efficient search techniques for low-dimensional, small-world networks can be combined with the HAQC procedure to explore chemical space more efficiently. The distance function between molecular structures is directly usable in kernel-based learning methods or as a measure of structural diversity in molecular libraries. Network centrality characteristics on SCS are an interesting alternative to the existing synthetic accessibility measures. Finally, the potential of the SCS construction as a heuristic dimensionality reduction technique and for molecular structure encoding is worth exploring.
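The second search strategy mentioned above, combining network search with a network-independent similarity measure, can be sketched as a greedy best-first search. In this illustration a square grid stands in for the network and the Manhattan distance between node coordinates stands in for a structural dissimilarity; both are placeholders chosen for the sketch, and the function names are hypothetical.

```python
import heapq


def guided_search(adj, source, target, dissimilarity):
    """Greedy best-first search toward target.

    Always expands the discovered node that looks most similar to the
    target. Returns the nodes in expansion order, or None if the target
    is unreachable.
    """
    seen = {source}
    heap = [(dissimilarity(source, target), source)]
    order = []
    while heap:
        _, u = heapq.heappop(heap)
        order.append(u)
        if u == target:
            return order
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                heapq.heappush(heap, (dissimilarity(v, target), v))
    return None


def grid_graph(n):
    """n x n square grid; nodes are (x, y) coordinate pairs."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y): [
            (x + dx, y + dy)
            for dx, dy in steps
            if 0 <= x + dx < n and 0 <= y + dy < n
        ]
        for x in range(n)
        for y in range(n)
    }


def manhattan(a, b):
    """Placeholder for a network-independent dissimilarity measure."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


visited = guided_search(grid_graph(5), (0, 0), (4, 4), manhattan)
```

With a perfectly informative dissimilarity, as in this toy case, the search expands only the nodes along one shortest path instead of the whole network. A real structural descriptor is only loosely correlated with network distance, so the savings would be smaller and would need to be measured.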

5 Conclusions

If the reaction networks in the TN representation are just discretized models of the reactive PES, shouldn’t their study produce equivalent results? After the foregoing discussion, it seems appropriate to compare and contrast the two models. The PES is continuous and derivable from first principles. At the same time, it is high-dimensional and, for some applications, contains too much information, often prohibitively so. In contrast, the reaction networks are discrete, rich in internal structure, and, as we demonstrated in this work, relatively low-dimensional. An advantage of discrete structures is that they lend themselves to enumeration, which we have used in this work. Another technique, heuristic search, seems to be promising for molecular discovery. By studying the interrelation between the continuous and the discrete mental pictures, we have an opportunity to better understand them both. Many ideas from discrete mathematics, for example, the theory of networks and metric spaces, may be useful tools in this endeavor.


Supporting Information Available

Network Size and Composition of νC = 2, 3 Reaction Networks; Network Characteristics of νC = 2, 3 Reaction Networks; Fractal Dimension of Reaction Networks from MEMB and Greedy Coloring Methods; Node Degree Histograms of νC = 3 Reaction Networks.

Acknowledgement

This paper is dedicated to the memory of Reinhart Ahlrichs (1940–2016). The author is very grateful to Dr. Matthew Lockett for helping with the institutional support through the University of North Carolina at Chapel Hill.

References

(1) Oprea, T. I.; Gottfries, J. Chemography: The Art of Navigating in Chemical Space. J. Comb. Chem. 2001, 3, 157–166.
(2) Shoichet, B. K. Virtual Screening of Chemical Libraries. Nature 2004, 432, 862–865.
(3) Dobson, C. M. Chemical Space and Biology. Nature 2004, 432, 824–828.
(4) Lipinski, C.; Hopkins, A. Navigating Chemical Space for Biology and Medicine. Nature 2004, 432, 855–861.
(5) Reymond, J.-L.; van Deursen, R.; Blum, L. C.; Ruddigkeit, L. Chemical Space as a Source for New Drugs. MedChemComm 2010, 1, 30–38.
(6) Kaplansky, I. Set Theory and Metric Spaces; Allyn and Bacon: Boston MA, 1972.
(7) Searcóid, M. Ó. Metric Spaces; Springer: London, 2007.


(8) Rappoport, D.; Galvin, C. J.; Zubarev, D. Y.; Aspuru-Guzik, A. Complex Chemical Reaction Networks from Heuristics-Aided Quantum Chemistry. J. Chem. Theory Comput. 2014, 10, 897–907.
(9) Rappoport, D.; Aspuru-Guzik, A. Predicting Feasible Organic Reaction Pathways Using Heuristically Aided Quantum Chemistry. 2018, submitted, preprint: http://dx.doi.org/chemrxiv.6649565.v1 (accessed Jan 17, 2019).
(10) Klein, D. J. In Topology in Chemistry. Discrete Mathematics of Molecules; Rouvray, D. H., King, R. B., Eds.; Horwood: Chichester UK, 2002; Chapter 10, pp 292–315.
(11) Deza, E.; Deza, M.-M. Dictionary of Distances; Elsevier: Amsterdam, 2006.
(12) Goddard, W.; Oellermann, O. R. In Structural Analysis of Complex Networks; Dehmer, M., Ed.; Birkhäuser: Boston MA, 2011; pp 49–72.
(13) Preparata, F. P.; Yeh, R. T. Introduction to Discrete Structures; Addison-Wesley: Reading MA, 1973.
(14) Trotter, W. T. In Partially Ordered Sets; Graham, R. L., Grötschel, M., Lovász, L., Eds.; Elsevier: Amsterdam, 1995; Vol. 1; Chapter 8, pp 433–480.
(15) Merrifield, R. E.; Simmons, H. E. Topological Methods in Chemistry; Wiley: New York, 1989.
(16) Klein, D. J.; Babić, D. Partial Orderings in Chemistry. J. Chem. Inf. Comput. Sci. 1997, 37, 656–671.
(17) Clarkson, K. L. In Nearest-Neighbor Methods in Learning and Vision; Shakhnarovich, G., Darrell, T., Indyk, P., Eds.; MIT Press: Cambridge MA, 2005; pp 15–59.

(18) Andoni, A.; Indyk, P. In Nearest Neighbors in High-dimensional Spaces, 3rd ed.; Goodman, J. E., O’Rourke, J., Tóth, C. D., Eds.; CRC Press: Boca Raton FL, 2018; Chapter 43, pp 1135–1155.
(19) Helfferich, F. G. Kinetics of Multistep Reactions, 2nd ed.; Elsevier: Amsterdam, 2004.
(20) Swihart, M. T. In Modeling of Chemical Reactions; Carr, R. W., Ed.; Elsevier, 2007; Chapter 5, pp 185–242.
(21) Green Jr., W. H. Predictive Kinetics: A New Approach for the 21st Century. Adv. Chem. Eng. 2007, 32, 1–313.
(22) Gupta, U.; Le, T.; Hu, W.-S.; Bhan, A.; Daoutidis, P. Automated Network Generation and Analysis of Biochemical Reaction Pathways Using RING. Metab. Eng. 2018, 49, 84–93.
(23) Luisi, P. L. The Emergence of Life, 2nd ed.; Cambridge University Press: Cambridge UK, 2016.
(24) Szostak, J. W. Origins of Life: Systems Chemistry on Early Earth. Nature 2009, 459, 171–172.
(25) Eschenmoser, A. Etiology of Potentially Primordial Biomolecular Structures: From Vitamin B12 to the Nucleic Acids and an Inquiry into the Chemistry of Life’s Origin: A Retrospective. Angew. Chem. Int. Ed. 2011, 50, 12412–12472.
(26) Ruiz-Mirazo, K.; Briones, C.; de la Escosura, A. Prebiotic Systems Chemistry: New Perspectives for the Origins of Life. Chem. Rev. 2014, 114, 285–366.
(27) Bonchev, D., Mekenyan, O., Eds. Graph Theoretical Approaches to Chemical Reactivity; Kluwer: Dordrecht Netherlands, 1994.
(28) Temkin, O. N.; Zeigarnik, A. V.; Bonchev, D. G. Chemical Reaction Networks: A Graph-Theoretical Approach; CRC Press: Boca Raton FL, 1996.


(29) Zeigarnik, A. V.; Bruk, L. G.; Temkin, O. N.; Likholobov, V. A.; Maier, L. I. Computer-Aided Studies of Reaction Mechanisms. Russ. Chem. Rev. 1996, 65, 117–130.
(30) Corey, E. J.; Wipke, W. T. Computer-Assisted Design of Complex Organic Syntheses. Science 1969, 166, 178–192.
(31) Dugundji, J.; Ugi, I. An Algebraic Model of Constitutional Chemistry as a Basis for Chemical Computer Programs. Top. Curr. Chem. 1973, 39, 19–64.
(32) Ugi, I.; Bauer, J.; Bley, K.; Dengler, A.; Dietz, A.; Fontain, E.; Gruber, B.; Herges, R.; Knauer, M.; Reitsam, K.; et al. Computer-Assisted Solution of Chemical Problems—The Historical Development and the Present State of the Art of a New Discipline of Chemistry. Angew. Chem. Int. Ed. 1993, 32, 201–227.
(33) Jorgensen, W. L.; Laird, E. R.; Gushurst, A. J. CAMEO: A Program for the Logical Prediction of the Products of Organic Reactions. Pure Appl. Chem. 1990, 62, 1921–1932.
(34) Coley, C. W.; Barzilay, R.; Jaakkola, T. S.; Green, W. H.; Jensen, K. F. Prediction of Organic Reaction Outcomes Using Machine Learning. ACS Cent. Sci. 2017, 3, 434–443.
(35) Segler, M. H. S.; Preuss, M.; Waller, M. P. Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018, 555, 604–610.
(36) Gothard, C. M.; Soh, S.; Gothard, N. A.; Kowalczyk, B.; Wei, Y.; Baytekin, B.; Grzybowski, B. A. Rewiring Chemistry: Algorithmic Discovery and Experimental Validation of One-Pot Reactions in the Network of Organic Chemistry. Angew. Chem. Int. Ed. 2012, 51, 7922–7927.


(37) Kowalik, M.; Gothard, C. M.; Drews, A. M.; Gothard, N. A.; Weckiewicz, A.; Fuller, P. E.; Grzybowski, B. A.; Bishop, K. J. M. Parallel Optimization of Synthetic Pathways Within the Network of Organic Chemistry. Angew. Chem. Int. Ed. 2012, 51, 7928–7932.
(38) Szymkuć, S.; Gajewska, E. P.; Klucznik, T.; Molga, K.; Dittwald, P.; Startek, M.; Bajczyk, M.; Grzybowski, B. A. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew. Chem. Int. Ed. 2016, 55, 5904–5937.
(39) Bajczyk, M. D.; Dittwald, P.; Wolos, A.; Szymkuć, S.; Grzybowski, B. A. Discovery and Enumeration of Organic-Chemical and Biomimetic Reaction Cycles Within the Network of Chemistry. Angew. Chem. Int. Ed. 2018, 57, 2367–2371.
(40) Kvasnička, V.; Pospíchal, J. Graph-Theoretical Interpretation of Ugi’s Concept of the Reaction Network. J. Math. Chem. 1990, 5, 309–322.
(41) Kvasnička, V.; Pospíchal, J. Chemical and Reaction Metrics for Graph-Theoretical Model of Organic Chemistry. J. Mol. Struct. THEOCHEM 1991, 227, 17–42.
(42) Kvasnička, V.; Pospíchal, J.; Baláž, V. Reaction and Chemical Distances and Reaction Graphs. Theor. Chim. Acta 1991, 79, 65–79.
(43) Baláž, V.; Kvasnička, V.; Pospíchal, J. Two Metrics in a Graph Theory Modeling of Organic Chemistry. Discrete Appl. Math. 1992, 35, 1–19.
(44) Dugundji, J.; Gillespie, P.; Marquarding, D.; Ugi, I.; Ramirez, F. In Chemical Applications of Graph Theory; Balaban, A. T., Ed.; Academic Press: New York, 1976; pp 108–174.
(45) Ugi, I.; Stein, N.; Knauer, M.; Gruber, B.; Bley, K.; Weidinger, R. New Elements in the Representation of the Logical Structure of Chemistry by Qualitative Mathematical Models and Corresponding Data Structures. Top. Curr. Chem. 1993, 166, 199–233.


(46) Devillers, J., Balaban, A. T., Eds. Topological Indices and Related Descriptors in QSAR and QSPR; CRC Press: Boca Raton FL, 2000.
(47) Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics, 2nd ed.; Wiley–VCH: Weinheim, 2009.
(48) Estrada, E.; Bonchev, D. In Handbook of Graph Theory, 2nd ed.; Gross, J. L., Yellen, J., Zhang, P., Eds.; CRC Press: Boca Raton FL, 2014; Chapter 13, pp 1538–1582.
(49) Faulon, J.-L.; Visco Jr., D. P.; Roe, D. In Reviews in Computational Chemistry; Lipkowitz, K. B., Larter, R., Cundari, T. R., Eds.; Wiley: Hoboken NJ, 2005; Vol. 21; Chapter 3, pp 209–286.
(50) Meringer, M. In Handbook of Chemoinformatics Algorithms; Faulon, J.-L., Ed.; CRC Press: Boca Raton FL, 2010; Chapter 8, pp 233–267.
(51) Blum, L. C.; Reymond, J.-L. 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. J. Am. Chem. Soc. 2009, 131, 8732–8733.
(52) Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875.
(53) Hill, E. A. On a System of Indexing Chemical Literature; Adopted by the Classification Division of the U.S. Patent Office. J. Am. Chem. Soc. 1900, 22, 478–494.
(54) Grossman, R. B. The Art of Writing Reasonable Organic Reaction Mechanisms; Springer: New York, 2003.
(55) Levy, D. E. Arrow-Pushing in Organic Chemistry. An Easy Approach to Understanding Reaction Mechanisms; Wiley: Hoboken NJ, 2008.


(56) Rappoport, D. Colibri Is Your Lightweight and Gregarious Chemistry Explorer, v.0.9.7. 2017; https://bitbucket.org/rappoport/colibri (accessed Jan 17, 2019).
(57) Stewart, J. J. P. Optimization of Parameters for Semiempirical Methods VI: More Modifications to the NDDO Approximations and Re-Optimization of Parameters. J. Mol. Model. 2013, 19, 1–32.
(58) Klamt, A.; Schüürmann, G. COSMO: A New Approach to Dielectric Screening in Solvents with Explicit Expressions for the Screening Energy and its Gradient. J. Chem. Soc., Perkin Trans. 2 1993, 799–805.
(59) Imrich, W.; Klavžar, S. Product Graphs. Structure and Recognition; Wiley: New York, 2000.
(60) Hammack, R.; Imrich, W.; Klavžar, S. Handbook of Product Graphs, 2nd ed.; CRC Press: Boca Raton FL, 2011.
(61) Hausdorff, F. Set Theory, 2nd ed.; Chelsea: New York, 1962.
(62) Mandelbrot, B. B. Self-Affine Fractals and Fractal Dimension. Phys. Scripta 1985, 32, 257–260.
(63) Theiler, J. Estimating Fractal Dimension. J. Opt. Soc. Am. A 1990, 7, 1055–1073.
(64) Falconer, K. Fractal Geometry. Mathematical Foundations and Applications, 3rd ed.; Wiley: Chichester UK, 2014.
(65) Song, C.; Havlin, S.; Makse, H. A. Self-Similarity of Complex Networks. Nature 2005, 433, 392–395.
(66) Kim, J. S.; Goh, K. I.; Kahng, B.; Kim, D. Fractality and Self-Similarity in Scale-Free Networks. New J. Phys. 2007, 9, 177–177.


(67) Gallos, L. K.; Song, C.; Makse, H. A. A Review of Fractality and Self-Similarity in Complex Networks. Physica A 2007, 386, 686–691.
(68) Song, C.; Gallos, L. K.; Havlin, S.; Makse, H. A. How to Calculate the Fractal Dimension of a Complex Network: the Box Covering Algorithm. J. Stat. Mech. Theory Exp. 2007, P03006.
(69) Makse, H.; Rozenfeld, H. Box Counting Algorithms for Fractal Dimension Calculations. 2018; http://www-levich.engr.ccny.cuny.edu/webpage/hmakse/software-and-data/ (accessed Jan 17, 2019).
(70) Kleinberg, J. The Small-World Phenomenon: An Algorithmic Perspective. Proceedings of the 32nd Annual ACM Symposium on Theory of Computing. New York, 2000; pp 163–170.
(71) Barrat, A.; Barthélemy, M.; Vespignani, A. Dynamical Processes on Complex Networks; Cambridge University Press: Cambridge UK, 2008.
(72) Newman, M. E. J. Networks: An Introduction; Oxford University Press: Oxford UK, 2009.
(73) Cohen, R.; Havlin, S. Complex Networks. Structure, Robustness and Function; Cambridge University Press: Cambridge UK, 2010.
(74) Estrada, E. The Structure of Complex Networks; Oxford University Press: Oxford UK, 2011.
(75) Barabási, A.-L.; Albert, R. Emergence of Scaling in Random Networks. Science 1999, 286, 509–512.
(76) Newman, M. E. J. Power Laws, Pareto Distributions and Zipf’s Law. Contemp. Phys. 2005, 46, 323–351.


(77) Caldarelli, G. Scale-Free Networks; Oxford University Press: Oxford UK, 2007.
(78) Clauset, A.; Shalizi, C. R.; Newman, M. E. J. Power-Law Distributions in Empirical Data. SIAM Rev. 2009, 51, 661–703.
(79) Watts, D. J.; Strogatz, S. H. Collective Dynamics of ‘Small-World’ Networks. Nature 1998, 393, 440–442.
(80) Amaral, L. A. N.; Scala, A.; Barthélémy, M.; Stanley, H. E. Classes of Small-World Networks. Proc. Nat. Acad. Sci. 2000, 97, 11149–11152.
(81) Barrat, A.; Weigt, M. On the Properties of Small-World Network Models. Eur. Phys. J. B 2000, 13, 547–560.
(82) Jeong, H.; Tombor, B.; Oltvai, Z. N.; Barabási, A.-L. The Large-Scale Organization of Metabolic Networks. Nature 2000, 407, 651–654.
(83) Wagner, A.; Fell, D. A. The Small World Inside Large Metabolic Networks. Proc. R. Soc. B 2001, 268, 1803–1810.
(84) Stillinger, F. H. Exponential Multiplicity of Inherent Structures. Phys. Rev. E 1999, 59, 48–51.
(85) Wales, D. J.; Doye, J. P. K. Stationary Points and Dynamics in High-Dimensional Systems. J. Chem. Phys. 2003, 119, 12409–12416.
(86) Wales, D. J. Energy Landscapes; Cambridge University Press: Cambridge UK, 2003.
(87) Henkelman, G.; Jóhannesson, G.; Jónsson, H. In Theoretical Methods in Condensed Phase Chemistry; Schwartz, S. D., Ed.; Kluwer: Dordrecht Netherlands, 2002; Chapter 10, pp 269–302.


(88) Hratchian, H. P.; Schlegel, H. B. In Theory and Applications of Computational Chemistry; Dykstra, C. E., Frenking, G., Kim, K. S., Scuseria, G. E., Eds.; Elsevier: Amsterdam, 2005; Chapter 10, pp 195–249.
(89) Schlegel, H. B. Geometry Optimization. WIREs Comput. Mol. Sci. 2011, 1, 790–809.
(90) Suleimanov, Y. V.; Green, W. H. Automated Discovery of Elementary Chemical Reaction Steps Using Freezing String and Berny Optimization Methods. J. Chem. Theory Comput. 2015, 11, 4248–4259.
(91) Dewyer, A. L.; Argüelles, A. J.; Zimmerman, P. M. Methods for Exploring Reaction Space in Molecular Systems. WIREs Comput. Mol. Sci. 2018, 8, e1354.
(92) Bellman, R. E. Adaptive Control Processes; Princeton University Press: Princeton NJ, 1961.
(93) Zimek, A. In Data Clustering. Algorithms and Applications; Aggarwal, C. C., Reddy, C. K., Eds.; CRC Press: Boca Raton FL, 2014; Chapter 9, pp 201–230.
(94) Samet, H. Foundations of Multidimensional and Metric Data Structures; Morgan Kaufmann: San Francisco CA, 2006.
(95) Skillicorn, D. B. Understanding High-Dimensional Spaces; Springer: Heidelberg, 2012.
(96) Seebach, D. Methods of Reactivity Umpolung. Angew. Chem. Int. Ed. 1979, 18, 239–258.
(97) Johnson, W. B.; Lindenstrauss, J. Extensions of Lipschitz Mappings into a Hilbert Space. Contemp. Math. 1984, 26, 189–206.
(98) Larsen, K. G.; Nelson, J. Optimality of the Johnson–Lindenstrauss Lemma. Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science. Berkeley CA, 2017; pp 633–638.


(99) Kleinberg, J. M. Navigation in a Small World. Nature 2000, 406, 845–845.
(100) Adamic, L. A.; Lukose, R. M.; Puniyani, A. R.; Huberman, B. A. Search in Power-Law Networks. Phys. Rev. E 2001, 64, 046135.
(101) Adamic, L. A.; Lukose, R. M.; Huberman, B. A. In Handbook of Graphs and Networks; Bornholdt, S., Schuster, H. G., Eds.; Wiley–VCH: Weinheim, 2003; Chapter 13, pp 295–317.
(102) Weng, T.; Small, M.; Zhang, J.; Hui, P. Lévy Walk Navigation in Complex Networks: A Distinct Relation between Optimal Transport Exponent and Network Dimension. Sci. Rep. 2015, 5, 17309.
(103) Menczer, F. Growing and Navigating the Small World Web by Local Content. Proc. Nat. Acad. Sci. 2002, 99, 14014–14019.
(104) Liben-Nowell, D.; Novak, J.; Kumar, R.; Raghavan, P.; Tomkins, A. Geographic Routing in Social Networks. Proc. Nat. Acad. Sci. 2005, 102, 11623–11628.


TOC Graphic (image not reproduced; node labels: C2H4O2, C2H4O, C2H2O2, C2H4, C2H2O, C2H2, C2O2, C2O, C2)


Biography

Dmitrij Rappoport earned his doctoral degree in theoretical chemistry from the University of Karlsruhe, Germany. He completed postdoctoral studies with Filipp Furche at the University of California, Irvine and with Alán Aspuru-Guzik at Harvard University. In parallel with his data engineering work in the private sector, he is conducting research in affiliation with the University of North Carolina at Chapel Hill. His current research interests include reaction networks, complexity in chemistry, and molecular response properties.
