Predicting Feasible Organic Reaction Pathways Using Heuristically

Jun 4, 2019 - (5−8) Complex chemical reactions, which consist of networks of concurrent ..... correctly rank the reaction pathways such that feasibl...
0 downloads 0 Views 3MB Size
Article Cite This: J. Chem. Theory Comput. 2019, 15, 4099−4112

pubs.acs.org/JCTC

Predicting Feasible Organic Reaction Pathways Using Heuristically Aided Quantum Chemistry Dmitrij Rappoport* and Alań Aspuru-Guzik* Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138, United States

Downloaded via UNIV OF SOUTHERN INDIANA on July 17, 2019 at 06:07:31 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.

S Supporting Information *

ABSTRACT: Studying organic reaction mechanisms using quantum chemical methods requires from the researcher an extensive knowledge of both organic chemistry and first-principles computation. The need for empirical knowledge arises because any reasonably complete exploration of the potential energy surfaces (PES) of organic reactions is computationally prohibitive. We have previously introduced the heuristically-aided quantum chemistry (HAQC) approach to modeling complex chemical reactions, which abstracts the empirical knowledge in terms of chemical heuristicssimple rules guiding the PES explorationand combines them with structure optimizations using quantum chemical methods. The HAQC approach makes use of heuristic kinetic criteria for selecting reaction paths that are not only plausible, that is, consistent with the empirical rules of organic reactivity, but also feasible under the reaction conditions. In this work, we develop heuristic kinetic feasibility criteria, which correctly predict feasible reaction pathways for a wide range of simple polar (substitutions, additions, and eliminations) and pericyclic organic reactions (cyclizations, sigmatropic shifts, and cycloadditions). In contrast to knowledge-based reaction mechanism prediction methods, the same kinetic heuristics are successful in classifying reaction pathways as feasible or infeasible across this diverse set of reaction mechanisms. We discuss the energy profiles of HAQC and their potential applications in machine learning of chemical reactivity.

1. INTRODUCTION

present considerable challenges for quantum chemical modeling due to the involved topology of the reactive potential energy surface (PES) with numerous energy minima connected by transition states (TS). In our previous work, we have developed the heuristically aided quantum chemistry (HAQC) approach to modeling complex chemical reactions,13 which leverages the existing empirical rules of polar organic reactions as powerful heuristics for formulating plausible reactive transformations, followed by structure optimizations using quantum chemical methods. By means of the HAQC methodology, we were able to reproduce

The understanding of the existing and the design of new reaction pathways in organic chemistry build on a large body of empirical knowledge in organic reaction mechanisms accumulated over the course of more than a century. The success of physical organic chemistry and, more recently, quantum chemical computation created a theoretical foundation of organic reactivity and also provided practical guidance for developing synthetic routes.1−4 In parallel to these advances, the past decade has seen a shift in focus to complex biological and chemical systems.5−8 Complex chemical reactions, which consist of networks of concurrent and interlinked reaction paths, are increasingly recognized as the rule rather than the exception in chemistry.5,9−12 At the same time, these systems © 2019 American Chemical Society

Received: February 10, 2019 Published: June 4, 2019 4099

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

2. CHEMICAL HEURISTICS The network representation affords a radically simplified but powerful discrete model of a complex chemical reaction that gives insight into its structure, properties, and dynamics in an efficient way. In this and the preceding work, we employ the transition network (TN) representation, which may be thought of as a discretized version of the reactive PES.27 The network nodes in the TN representation describe f lasks closed collections of molecules, whose stoichiometry is kept constant throughout the reaction networkwhile the directed network edges are stoichiometry-preserving reactive transformations. A sequence of directed network edges connecting a pair of (not necessarily adjacent) flasks comprises a reaction pathway or chemical mechanism. In the HAQC approach, each transformation implements one of the predefined transformation heuristics, which are derived from empirical rules chemists use to write reasonable mechanisms for reactions. This set of transformation heuristics, together with the HAQC approach, defines the scope of the chemical reaction network, i.e., the kinds of chemical reactions that the TN model is capable of describing, as well as the granularity of the description. Organic synthesis planning has a long tradition of using empirical rules to encode known synthetic reactions in the manner of an expert system, beginning with the seminal work on the LHASA system by Corey and Wipke,28 and including works by Ugi and Dugundji,29,30 Jorgensen,31 and many others, most recently by Grzybowski and co-workers.12,32 Using machine learning techniques holds significant promise for simulating reaction outcomes.33−35 In this work, we do not attempt to collect the known empirical rules and instead employ a set of heuristics describing the most basic reactive transformationsheteropolar breaking and formation of single, double, and triple bonds (Table 1). Our motivation

the key products and reaction paths of the formose reaction and to explore the structure and properties of the corresponding reaction network.13 However, further improvements are necessary to the HAQC methodology in selecting reaction paths, which are not only plausible, that is, consistent with the empirical rules of organic reactivity, but are also thermodynamically and kinetically feasible under the reaction conditions, broadly defined. While the reaction thermodynamics can be directly estimated from the molecule energies computed using quantum chemical methods, the formal derivation of kinetic parameters requires the knowledge of transition-state energies. TS computations have seen many methodological improvements,14−18 they still present a major computational bottleneck in studying complex chemical reactions. Promising approaches for accelerating TS search include ab initio molecular dynamics,19,20 automated transition-state search algorithms,21,22 heuristics-based generation of reactive complexes,23,24 and manual steering.25,26 In this work we explore an alternative approach, which avoids the explicit TS calculations and instead employs heuristic criteria to determine kinetic feasibility. We define formal requirements that kinetic heuristics should fulfill and develop several simple proposals, the performance of which we assess using a set of well-characterized organic reaction mechanisms. Predictions of reaction pathways encompass two related but distinct tasks: finding one or all feasible pathways leading from a given reactant to a given product (single-source, single-product reaction prediction), and identifying all products, for which at least one feasible pathway exists from a given reactant (singlesource, multiproduct reaction prediction). In both cases, the results are dependent on the reaction conditions such as solvent, temperature, presence of catalysts, etc. All these factors also determine the relative yields of the feasible reaction products. For the purposes of this work, we wish to abstract from the specific reaction conditions and consider a generalized feasibility of reaction pathways in typical reaction conditions, while admitting a certain degree of fuzziness under this definition. Furthermore, we distinguish only between feasible and infeasible pathways, reducing the task at hand to a simple binary classification problem. As we show in this work, this type of binary classification is achievable and is in line with the accumulated empirical knowledge of organic reaction mechanisms. Moreover, we are able to develop a consensus heuristic kinetic criterion, which applies across a broad range of polar organic reactions within the HAQC approach. This result is highly encouraging and raises the prospect for a robust and inexpensive general predictor of organic reactivity, complementing ab initio TS computations. Potential applications of the HAQC methodology to organic synthesis and complex chemical reactions, such as the ones arising in the context of origins of life, are the primary motivator for this and future work. This paper is structured as follows. We review the HAQC approach and introduce the relevant definitions in section 2. In section 3, we briefly describe our implementation of the HAQC scheme and introduce the reaction network models generated in this work. Section 4 contains a study of heuristic kinetic criteria for feasibility classification in the single-source, single-product problem for a series of well-characterized organic reactions including aliphatic additions, substitutions, and eliminations, aromatic substitution, and pericyclic reactions. The results are discussed in section 5, and the conclusions of this work are given in section 6.

Table 1. Elementary Transformation Heuristics Used in This Work (X = F, Cl, Br, I)a bond dissociation

a

bond association

HC → H+ ··· C− CC → C+ ··· C− CO → C+ ··· O− CX → C+ ··· X− XC → X+ ··· C− HO → H+ ··· O− HX → H+ ··· X− XX → X+ ··· X− polarization

(H1̷ C) (C1̷ C) (C1̷ O) (C1̷ X) (X1̷ C) (H1̷ O) (H1̷ X) (X1̷ X)

H+ ··· C− → HC C+ ··· C− → CC C+ ··· O− → CO C+ ··· X− → CX X+ ··· C− → XC H+ ··· O− → HO H+ ··· X− → HX X+ ··· X− → XX depolarization

(H1C) (C1C) (C1O) (C1X) (X1C) (H1O) (H1X) (X1X)

CO → C+O− CC → C+C− CC → C+C−

(C2̷ O) (C2̷ C) (C3̷ C)

C+O− → CO C+C− → CC C+C− → CC

(C2O) (C2C) (C3C)

Abbreviated notation is given in parentheses.

for using this much simpler rule set is to broaden the scope of the chemical reactivity we can describe and to reduce knowledge bias. At the same time, we rely on energy-based heuristic criteria for choosing thermodynamically and kinetically feasible transformations. We note that by making the specific choice of elementary transformation rules, we deliberately leave aside reactions involving radicals, electrondeficient compounds, e.g., carbenes, and reactions with inverted polarity. These extensions would require an enlarged set of transformation heuristics and are outside of the scope of 4100

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

transition-state theory, Hammond’s postulate, and computational considerations. A heuristic kinetic criterion WK→L for the N-step reaction path K → L = {K = K0, K1, ..., KN = L} thus should possess the following properties: 1. Non-Negativity. WK→L ≥ 0. 2. Additivity. WK→L→M = WK→L + WL→M for consecutive reaction paths K → L and L → M. 3. Hammond’s Postulate. WK→L should approximate the activation energy of the reaction path K → L (up to a constant factor) and should thus be most sensitive to energy maximum of the energy profile along the reaction path K → L, corresponding to the highest-energy intermediate flask. 4. Length Penalty. WK→L should increase roughly linearly with the number of steps N between K and L. 5. Detailed Balance. The heuristic kinetic criteria for the forward and the backward reaction should be related to each other via the energy difference between the initial and the final flasks.

this work. Similarly, we limit ourselves to the chemistry of the elements C, H, O, and halogens for the purposes of this work. The set of elementary transformation heuristics is chosen to be reversible (that is, each transformation appears in both the forward and the reverse direction) and stoichiometrypreserving. While the choice of bond dissociations and associations as transformation heuristics is hardly surprising, the polarization and depolarization of multiple bonds in Table 1 deserve an additional comment. By writing double and triple bonds in their polarized forms, we do not mean to imply that these charge-separated structures are distinct intermediates in the reaction mechanisms or distinct energy minima on the corresponding reactive PES. Depending on the molecular structure and the reaction mechanism, this might or might not be the case. In either case, we consider (de)polarization steps as purely fictitious constructs, which are formally energyneutral and are a useful device for expressing the reactivity of double and triple bonds. Finally, we note that with the above definitions, all polar reactions involving the elements C, H, O with normal polarity as well as halogens in their oxidation states of 0 and −1 can be derived from the elementary transformation heuristics by composition. While the elementary transformation heuristics effectively narrow down all conceivable reaction mechanisms to the subset of those consistent with the predefined rules of chemical reactivity (plausible reaction paths), they do not rule out highenergy intermediates, e.g., strained small rings, or high-barrier reaction paths, such as the breaking of unactivated C−H and C−C bonds. This is the price we have to pay for the broader scope of organic reactions covered by the elementary transformation heuristics. As a consequence, it is necessary to further select for reaction paths that are not only plausible but also feasible under the reaction conditions, as judged by thermodynamic and kinetic criteria. The thermodynamic and kinetic criteria developed in our previous work showed promise in minimizing the number of infeasible reaction paths.13 The thermodynamic criterion can be taken to be the energy difference ΔEK→L = EL − EK between the initial (reactant) flask K and the final (product) flask L and does not require additional assumptions. By contrast, the kinetic criterion must be approximated by heuristic expressions if we are to avoid expensive TS computations. One of the two heuristic kinetic criteria considered previously is climb Wc,K→L, which yields the energy of the highest point in the energy profile of reaction path relative to the initial flask. This criterion follows from the idea of Hammond’s postulate that allows us to take the energy of the highest-energy intermediate flask to approximate the activation barrier of the rate-limiting step.1,36 The second kinetic criterion is arc Wa,K→L, which modifies the climb criterion as to penalize longer reaction paths by adding a constant penalty factor α for each step. In both cases, higher values of the heuristic kinetic criteria indicate a less favorable reaction path. The feasible reaction paths are thus determined by setting thermodynamic and kinetic threshold values such that all reaction paths with ΔEK→L ≤ ΔE and WK→L ≤ W̅ are assumed as feasible, and all others are considered infeasible.13 The focus of the present work is on generalizing this successful strategy to other organic reactions and on refining our capability for discriminating between feasible and infeasible reaction paths. To aid the criterion selection, we first define a set of expectations that a useful heuristic kinetic criterion should fulfill, in keeping with the qualitative picture of

WK → L − WL → K = ΔE K → L = E L − E K

(1)

The above properties allow us to interpret the heuristic kinetic criterion WK→L as the analog of the energy factor for the reaction path K → L in the Boltzmann distribution of all plausible reaction paths with its rate given by vK → L ∝ exp( −βWK → L)

(2)

where β = 1/(kBT) with Boltzmann constant kB and absolute temperature T. Additivity of WK→L ensures that multiplication of the probabilities for each step of the reaction pathway generates the probability of the entire pathway. The length penalty makes longer reaction paths relatively less likely and qualitatively describes a branching of reaction pathways at each elementary step. Hammond’s postulate provides an approximation for the transition-state energy of a reaction pathway by the energy of the highest-energy intermediate. Finally, if the heuristic kinetic criterion WK→L satisfies the detailed balance condition in eq 1, the rates of the forward and backward reactions computed according to eq 2 obey the detailed balance in the stationary state. The heuristic kinetic criteria introduced in our previous work13 are climb and arc. The climb criterion is defined by N−1

Wc,K → L =

∑ max(EK

−E K i ,0)

i+1

i=0

(3)

where EKi is the energy of the intermediate Ki. The climb kinetic criterion gives the difference between the highestenergy intermediate and the initial reactant and thus generalizes the intuitive notion of the activation energy to the multistep case. The climb criterion also fulfills the above requirements 1−3. However, the discriminating power of the climb criterion is modest, since it fails to account for the length of the reaction pathway and thus gives preference to long, lowbarrier reaction mechanisms. The arc kinetic criterion introduces a constant penalty α per reaction step, N−1

Wa , K → L =

∑ [(E K i=0

i+1

− E K i)2 + α 2]1/2

(4)

where α ≥ 0 is an adjustable parameter. α = 0 is equivalent to the climb criterion, while for α > 0, the value of the arc 4101

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation Table 2. Reaction Network Models Generated in This Worka

n is the number of network nodes; e is the number of network edges. bFlasks with sum of absolute atomic charges Σ|c| ≤ 4 allowed.

a

Wk,K → L = Ä É1/2 N−1 Å ÅÅ ij 1 − exp( −κ(E K − E K )2 ) yzÑÑÑÑ ÅÅ zzÑÑ i+1 i ∑ ÅÅÅÅ(EK i+1 − EK i)2 + α 2jjjjj zzÑÑÑ 2 z + − κ − 1 exp( ( E E ) ) Å K i+1 Ki i=0 Å k {ÑÑÑÖ ÅÇ

criterion increases approximately linearly with the number of steps, thus additionally fulfilling the requirement 4 listed above. In the limit α → ∞, the arc criterion favors the pathways K → L with the smallest number of steps. The arc criterion performs relatively well for the reaction pathways involved in the formose reaction and predicts the mechanisms of the enolization, hemiacetal formation, and aldol addition in agreement with empirical knowledge. Moreover, the key products and reaction paths of the formose reaction are correctly reproduced by using the arc criterion with α = 1 eV.13 At the same time, the arc criterion penalizes every elementary transformation equally, even the formally energy-neutral (de)polarization steps (Table 1). In order to prevent the latter, we define the modified kinetic arc criterion karc Wk,K→L by

(5)

where α ≥ 0 and κ ≥ 0 are adjustable parameters. This functional form attenuates the penalty α per reaction step depending on the corresponding energy change. In the limit of large energy changes, the attenuating term tends to 1, and the karc criterion increases approximately linearly with the number of steps. At the same time, energy-neutral steps (EKi+1 = EKi), in particular the fictitious (de)polarization of multiple bonds, do not contribute to the karc criterion. The parameter κ adjusts the rate of transition between the limiting cases. κ = 0 corresponds to the climb criterion, while κ → ∞ recovers the arc criterion. 4102

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

3. BUILDING LARGE CHEMICAL REACTION NETWORKS Automatic construction (or discovery) of reaction mechanisms is a fundamental task in kinetic modeling,11,37−42 ab initio simulation of reaction dynamics,43−46 and synthesis planning.12,28,30,32,47−49 The HAQC approach13 seeks the middle ground between database-driven empirical methods12,28,30,32,48 and ab initio PES exploration.14−18,50,51 The basic unit of data used in constructing reaction mechanisms by HAQC is the flask, which refers to a collection of discrete molecules, subject to the constraint that only flasks with equal total stoichiometries may be transformed into one another. The constituent molecules of a flask are represented by their structural formulas and are considered as independent of each other. As a consequence, the network model containing flasks as its nodes and their mutual transformations as edges is welldefined as a discretized model of the reactive PES. The corresponding energy evaluations involve only structure optimizations of the individual molecules and are cheaply and easily performed. At the same time, the mapping between molecules and flasks is of the many-to-many type and requires careful data synchronization. Another central consideration of the HAQC approach is the relationship between the structural formulas and the corresponding three-dimensional molecular geometries used for energy evaluations. The elementary transformation heuristics are formulated in the language of structural formulas in terms of bond breaking and bond making (Table 1), whereas quantum chemical structure optimizations have no notion of localized bonds and the optimized structures need to be validated with respect to the atomic connectivities and potentially rejected if rearrangements or fragmentations occur. Due to the hybrid structure of the HAQC approach, the problem of building chemical reaction networks is divisible into a series of interlocking simple tasks, which can be efficiently executed in parallel. The communication between parallel processes thus becomes a primary concern and can be achieved using a centralized database. These requirements are quite unlike the situation in traditional quantum chemical computation, which is typically constrained by CPU, memory, and disk space needs. The open-source prototype Python implementation of the HAQC approach called Colibri52 is described in section S1 of the Supporting Information. In order to evaluate the predictive power of the arc and karc heuristics and to optimize their adjustable parameters, we applied them to a set of 14 simple organic reactions (Table 2) representing a variety of polar reactions (substitutions, additions, and eliminations) as well as pericyclic reactions (6π electrocyclization, Claisen rearrangement, and Diels− Alder and ene reactions). The appeal of these reaction mechanisms is that their major pathways are well characterized, and experimental and theoretical reference data on their thermodynamics and kinetics are available in the literature. For each of the reactions in the set, we generated the complete reaction network starting with the flask representing the reactants and repeatedly applying the transformation heuristics (Table 1) until convergence was reached and no new flasks could be made. All structure optimizations were performed using the PM7 semiempirical method as implemented in the MOPAC program53 and the solvent effects were approximated by the conductor-like solvation model (COSMO)54 with the effective dielectric constant of water, ε

= 78.4. The energy of the proton was computed by assuming a neutral aqueous solution (pH = 7). To prevent computations of many high-energy intermediates, only flasks with the sum of absolute atomic charges of Σ|c| ≤ 2 were retained, unless noted otherwise in Table 2. The complete network of each reaction is the exhaustive description of the corresponding reactive system within the framework of HAQC. It contains all plausible reactive pathways between the reactant and product flasks. Since the reaction network may contain cyclespathways starting and ending on the same node55the path length between a pair of nodes is generally unbounded, and a reasonable cutoff for the path length Nmax has to be defined in practice. For each of the reactions in Table 2, we determined the number of plausible reaction pathways (np) by enumerating all paths not containing cycles (simple paths) of length N ≤ Nmax between the reactant and product flasks. We identified the feasible reaction pathways for these reactions (nf) by comparing them with the accepted reaction mechanisms from the organic chemistry literature (see section 4.3 and sections S3 and S4 of the Supporting Information). The statistics of the plausible and empirically feasible reaction pathways are given in Table 3. Table 3. Reactants, Products, Plausible (np) and Empirically Feasible (nf) Pathways, and Maximum Reactive Pathway Lengths (Nmax) of the Reactions Considered in This Worka

a

reaction

np

nf

Nmax

tautomerization (TA) diol formation (DI) SN1 substitution (S1) SN2 substitution (S2) Br2 addition (BA) HBr addition (HA) E1 elimination (E1) SEAr substitution (SA) epoxide hydrolysis (EP) esterification (ES) Claisen rearrangement (CL) 6π electrocyclization (6C) Diels−Alder reaction (DA) ene reaction (EN)

2 38 94 102 14 28 28 96 61 523 264 12241 5532 10443

1 7 5 5 2 3 3 4 2 9 108 56 248 138

8 8 6 6 6 6 6 6 6 6 6 8 8 8

See Table 2 for reaction definitions.

4. APPLICATIONS TO REACTION MECHANISMS 4.1. Kinetic Feasibility Heuristics. We evaluated the performance of the arc and karc heuristic kinetic feasibility criteria on the basis of their ability to correctly rank the reaction pathways such that feasible pathways are assigned lower values of the kinetic heuristic than infeasible pathways. Given the simplicity of the arc and karc kinetic heuristics, we do not expect them to predict the activation energies in a quantitative fashion. The correct ranking, however, is sufficient for a binary classification of reaction pathways, whereby pathways WK→L ≤ W̅ are assumed as feasible and those with WK→L > W̅ are considered infeasible. For the single-source, single-product problem treated here, the thermodynamic criterion does not contribute to the classification. The ability to distinguish feasible from infeasible reaction pathways using the HAQC approach is valuable as a screening tool for complex reaction networks or as the first stage in multistage computational studies of reaction mechanisms. 4103

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

Figure 1. Classification of reaction pathways as feasible/infeasible using the arc heuristic Wa with α = 1 eV for the reactions considered in this work. Empirically feasible reaction pathways are marked by blue lines; infeasible pathways are marked by red lines. The interval between the feasible reaction pathway with the highest Wa value and the infeasible reaction pathway with the lowest Wa value (feasibility gap) is highlighted by gray shading. W̅ a = 6 eV estimates the consensus threshold for binary classification or reaction pathways as feasible or infeasible. See Table 2 for reaction definitions.

Figure 2. Classification of reaction pathways as feasible/infeasible using the karc heuristic Wk with α = 1 eV, κ = 5 eV−2 for the reactions considered in this work. W̅ k = 6 eV estimates the consensus threshold for binary feasible/infeasible classification. See Table 2 for reaction definitions and Figure 1 for additional details.

In order to facilitate the comparison between the reaction pathways modeled by HAQC and the literature reaction mechanisms, two observations should be made. First, the HAQC procedure, by design, divides reaction steps into individual bond breaking and bond making events (elementary transformations). For many reactions, several distinct orderings of elementary transformations are plausible (as defined above) and consistent with the accepted reaction mechanism. Additionally, both arc and karc criteria are invariant with respect to the ordering of the energy changes (eq 4 and eq 5, respectively). Therefore, we combine the m reaction pathways

comprising the same set of elementary transformations in different orderings and having the same value of the kinetic heuristic under one reaction pathway with the multiplicity m. Second, we note that not all elementary transformations change the flask energy; indeed, the (de)polarization steps (Table 1) are considered energy-neutral within the HAQC procedure. To maintain consistency with these definitions, we treat all flasks connected to each other by (de)polarization as an equivalence class and assign the energy of the flask with the most saturated valences (lowest sum of absolute charges Σ|c|) to all its members. While this approach is not the only possible 4104

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

Figure 3. Comparison of predicted energy differences ΔE with literature standard reaction enthalpies ΔH°r for reactions of Table 2 (left). Comparison of karc heuristics Wk with α = 1 eV, κ = 5 eV−2 to literature activation energies Ea for reactions of Table 2 (right). Literature data for gas-phase reactions are shown by empty circles; data for solution reaction are shown by solid circles. Linear regression plots for gas-phase and solution standard reaction enthalpies are indicated by dashed and solid lines, respectively. The heavy vertical line in the right panel denotes W̅ k = 6 eV.

all reaction mechanisms in Figure 2. The smallest feasibility gap of 0.054 eV is observed for the Claisen rearrangement (CL). The important result of the correct ranking is that the karc heuristic enables a reliable binary classification of reaction pathways as feasible or infeasible using a (reaction-specific) threshold value W̅ k for each reaction mechanism. Finally, we examine the consistency of the predictions by the karc heuristic across different reaction mechanisms. A useful procedure for de novo prediction of feasible reaction pathways depends on a reaction-independent estimate. Using the karc criterion with α = 1 eV and κ = 5 eV−2 we can determine a consensus threshold of W̅ k = 6 eV (shown as a heavy vertical line in Figure 2), with which 9 out of 14 reactions are predicted to have at least one feasible reaction pathway under a comparable set of conditions. We also note that the remaining 5 reactions either require catalysts (bromine addition (BA), aromatic substitution (SA), and epoxide hydrolysis (EP)) or elevated temperatures (Diels−Alder (DA) and ene (EN) reactions). The results for the arc and karc heuristic criteria using selected other values of the α and κ parameters are given in section S5 of the Supporting Information. An optimal set of parameters for arc and karc kinetic heuristics can be obtained in a systematic manner by optimizing a metric for the quality of prediction. A frequently used metric for classification problems is the balanced F measure,56 which combines into a single measure the proportion of correct predictions among predicted positive results (precision) and the proportion of true positive results correctly identified by the predictor (recall). We define feasible pathways as positive results and consider the empirical feasibility as the ground truth. The balanced F measure takes on values between 0 and 1, wherein higher values correspond to better classification performance. We optimized the parameters α, κ, and the feasibility threshold W̅ k using a Bayesian optimization algorithm with Tree of Parzen Estimators, as implemented in the hyperopt package.57 The loss function for the Bayesian optimization was defined as (1 − F1(α,κ,W̅ k)), where F1 is the balanced F measure. We found that the karc heuristic with α = 7.2 eV, κ = 16.2 eV−2, and W̅ k = 20.2 eV is optimal with respect to the balanced F measure (F1 = 0.871). The set of parameters determined by manual search above (α = 1 eV, κ = 5 eV−2, and W̅ k = 6 eV) was less accurate with respect to the balanced F measure (F1 = 0.418); however,

one for dealing with (de)polarization steps, it is the simplest and helps prevent numerical noise. Figure 1 gives an overview of the classification results for the reactions of Table 2 using the arc heuristic kinetic feasibility criterion Wa with α = 1 eV. For each reaction mechanism, the plausible pathways are ordered by their Wa values and are depicted as vertical lines. Empirically feasible reaction pathways are shown in blue, empirically infeasible pathways are in red. As a visual guide, the interval between the feasible pathway with the highest Wa value and the infeasible pathway with the lowest Wa value (feasibility gap, by analogy with the orbital gap of molecular orbital theory) is highlighted by gray shading. As Figure 1 illustrates, even the simple arc heuristic succeeds in correctly ranking the reaction pathways by feasibility for 10 out of 14 reaction mechanisms (except diol formation (DI), esterification (ES), and Diels−Alder (DA) and ene (EN) reactions), with the feasible reaction pathways having lower values of the kinetic heuristic than the infeasible reaction pathways. Any choice of the threshold value W̅ a from within the (positive) feasibility gap will result in a correct classification of the reaction pathways as feasible or infeasible for a given reaction mechanism. We further observe from Figure 1 that the feasibility gaps for different reaction mechanisms vary significantly with respect to their location and width on the Wa scale. This is perhaps to be expected, given that the reaction mechanisms considered in this work occur at different temperatures, a fact we purposely ignored in our broad definition of kinetic feasibility. The best consensus threshold for the arc criterion with α = 1 eV across the set of reaction mechanisms considered here is W̅ a = 6 eV (as indicated by a heavy vertical line in Figure 1). The karc heuristic criterion Wk with α = 1 eV and κ = 5 eV−2 improves upon the arc heuristic by reducing the length penalty for elementary transformations with small energy changes. In particular, the fictitious (de)polarization of multiple bonds is not penalized (see section 2 for details). As a result, the values of the karc heuristic are shifted downward compared to the arc heuristic with the same α parameter (Figure 2). This correction improves the performance of the karc heuristic and provides a correct ranking of the reaction pathways by kinetic feasibility for all 14 reaction mechanisms considered in this work. This fact is visualized by positive feasibility gaps for 4105

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

Figure 4. Feasible tautomerization (TA) reaction pathway. Path multiplicity is given in parentheses.

increasing the threshold to W̅ k = 7 eV yielded classification performance close to optimal (F1 = 0.823). 4.2. Thermodynamic and Kinetic Reaction Parameters. Figure 3 illustrates the correlations between the predicted thermodynamic and kinetic heuristics and the corresponding literature values. We compare the predicted energy differences ΔE to standard reaction enthalpies ΔHr° (left panel) and the values of the karc Wk heuristic with α = 1 eV, κ = 5 eV−2 to the activation energies Ea from the literature. By virtue of eq 2, W is the heuristic equivalent of the activation energy. When multiple reference values are available, we give preference to experimental data or the results of high-level computations. We include both gas-phase and solution data because the computational model used in this work takes into account continuum solvation effects via the COSMO model but does not include specific solvation. The complete data set is given in section S2 of the Supporting Information. As expected, the predicted reaction energies show a good degree of correlation with the literature data (Figure 3 (left)). This result underscores the good accuracy of the PM7 semiempirical method for organic thermochemistry53 and the fact that no additional approximations are introduced for reaction thermodynamics. We find that the agreement is somewhat better between the predicted reaction energies and the gas-phase enthalpies. This discrepancy can be attributed to specific solvent effects, especially for the reactions producing alcohols (epoxidation (EP), SN1 (S1), and SN2 (S2) substitution), which show the greatest differences. The comparison of the karc heuristic kinetic feasibility criterion Wk with α = 1 eV, κ = 5 eV−2 with literature activation energies underscores the fundamental challenge of the HAQC procedure, the accurate prediction of reaction barriers in solution. The karc heuristic in Figure 3 (right) does not appear to be quantitatively correlated with activation energies across 14 reaction mechanisms considered in this work. This shows that while providing an excellent relative measure of kinetic feasibility for each reaction, the karc heuristic does not constitute an absolute kinetic scale. This finding suggests that additional work needs to be done in order to better align the individual kinetic feasibility scales. At the same time, the existence of a practical consensus threshold W̅ k = 6 eV for the karc heuristic with α = 1 eV, κ = 5 eV−2 indicates that the empirical notion of reaction feasibility can be estimated by kinetic heuristics in a computationally inexpensive way. 4.3. Reaction Mechanism Prediction. We briefly discuss here the predicted reaction mechanisms for three polar reactions (keto−enol tautomerization and SN1 and SN2 substitutions) and two pericyclic reactions (Claisen rearrangement and Diels−Alder reaction). The reader is referred to sections S3 and S4 of the Supporting Information for the complete discussion of the predicted reaction mechanisms and the comparison with literature data. The tautomerization of vinyl alcohol in the gas phase is governed by the Woodward−Hoffmann rules58−60 and proceeds via an antarafacial 1,3-H shift with a highly strained and energetically unfavorable transition state.61−63 In water, the tautomerization reaction comprises a deprotonation and a

Figure 5. Energy profile of feasible tautomerization (TA) reaction pathway. Flasks containing only neutral compounds are denoted by solid circles; flasks with charged compounds are shown as empty circles. The flask energies are given with reference to the energy of the initial flask.

protonation step and is catalyzed by both acids and bases.62,64−66 The HAQC procedure predicts one feasible pathway (Figure 4) with a single-barrier energy profile (Figure 5). The first two steps of the predicted pathway are fictitious (de)polarization steps that transfer the positive charge to the oxygen atom and allow for a subsequent deprotonation. The maximum of the energy profile corresponds to the chargeseparated intermediate and describes the proton shift from the oxygen to the α-carbon atom, consistent with the experimental reaction mechanism. Nucleophilic substitution at the saturated carbon center occurs on a mechanistic spectrum between the concerted SN2 mechanism and the stepwise SN1 mechanism.4,67−72 The HAQC approach reconstructs both mechanisms as a sequence of a polar C−Br bond breaking and a C−O bond formation steps (Figure 6 and Figure 7). While the substitutions in both substrates feature qualitatively similar single-barrier energy profiles, they differ in the height of the energy maxima relative to the reaction energies (Figure 8). Notice the difference in scale. For the stepwise SN1 mechanism, the energy maximum corresponds directly to the high-energy carbenium ion intermediate and is a structural model for the transition states of the steps 0 → 1 and 1 → 2 of the S1−0 reaction pathway according to Hammond’s postulate. In contrast, in order to model the concerted SN2 mechanism within the HAQC formalism, we may need to introduce an additional conceptual step, that is, to assume that the transition states of the steps 0 → 1 and 1 → 2 of the S2−0 reaction pathway coalesce at higher energies into a single transition state for the combined 0 → 2 elementary step. This reasoning parallels Jencks’ concept of enforced concertedness, which relates the stability of the high-energy intermediate to qualitative 4106

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

Figure 6. Feasible SN1 substitution (S1) reaction pathways. Path multiplicities are given in parentheses.

Figure 7. Feasible SN2 substitution (S2) reaction pathways. Path multiplicities are given in parentheses.

Figure 8. Energy profiles of feasible SN1 (S1, left) and SN2 (S2, right) substitution reaction pathways. See caption of Figure 5 for more details.

Figure 9. Feasible Claisen rearrangement (CL) and Diels−Alder (DA) reaction pathways. Path multiplicities are given in parentheses.

changes in the reaction mechanism.73−75 As a result, the curvature of the HAQC energy profile may be a useful predictor for the degree of concertedness within a reaction pathway. The alternative substitution mechanisms (S1−1 and S2−1) are dominated by the dissociation of water, which overwhelms the details of the substitution mechanism (Figure 8).

The Claisen rearrangement of vinyl allyl ether to 4-pentenal and the cycloaddition of 1,3-butadiene with ethene are prototypical examples of sigmatropic rearrangements72,76−82 and the Diels−Alder reaction,4,72,83−94 respectively. Both of them are known to follow concerted mechanisms without identifiable intermediates. Therefore, they can be viewed as critical tests of the HAQC procedure based on heteropolar 4107

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

Figure 10. Energy profiles of feasible Claisen rearrangement (CL) and Diels−Alder (DA) reaction pathways. See caption of Figure 5 for more details.

by the reaction kineticsas a sequence of one or more transformation heuristics of Table 1. However, concerted reaction mechanisms can be represented within the HAQC formalism by merging adjacent elementary transformation steps into larger concerted steps. Based on our observations so far, the concept of enforced concertedness73−75 may provide useful guidance for further research. While the HAQC energy profiles are useful conceptual devices and seem to correctly capture some patterns of chemical reactivity, they should not be overinterpreted. It is perhaps best to think of them as taking an intermediate position between the qualitative “arrow pushing” reaction diagrams and the complex realities of multidimensional reactive PESs. Importantly, the HAQC energy profiles should not be taken as the literal traces of reactive trajectories in energy space. We highlight three fundamental differences: The points of the energy profile are discrete, and the connections between them are nothing more than visual guides. Each of the points corresponds to a local energy minimum on the PES, even if it appears to be a maximum of the energy profile. While the relationship between transition-state structures and highenergy reaction intermediates is implied by Hammond's postulate, it is not quantitative. Finally, the abscissa of the energy profile is not a true reaction coordinate, recording the overall changes in molecular structure. Instead, it consists of discrete bond making and bond breaking steps, wherein each bond transformation event is weighted equally. While this simple approach is a useful first step in constructing kinetic heuristics, it neglects the differences in bond distances and bond strengths between bonds of different types. Improved variants of the simple HAQC energy profiles could take the latter into account by using abstract measures of molecular change95 or molecular similarity96,97 as their abscissas instead of the step index. 5.2. Search Methods for Reaction Mechanism Prediction. Taking into consideration only the narrow goal of predicting feasible reaction pathways, we have pursued a somewhat circuitous and inefficient approach in this work. We have first constructed an exhaustive transition network (TN) model of plausible elementary transformations, only to filter it down to the feasible reaction pathways using thermodynamic and kinetic feasibility heuristics. Only between 50% (for

transformation heuristics (Table 1). For both the Claisen rearrangement and the Diels−Alder reaction, the predicted reaction mechanisms are consistent with their “arrow pushing” models, under the assumption that the electron pairs are transferred one at a time (Figure 9). The energy profiles have the expected single-barrier shape, and their large curvatures are indicative of enforced concertedness (Figure 10). The Claisen rearrangement requires at least four steps (as indicated by a solid circle in Figure 10 (left)) but can additionally include a pair of energy-neutral polarization/depolarization steps. Similarly, the Diels−Alder reaction takes place in six steps or in eight steps if an additional polarization/depolarization cycle is included. By construction, the (de)polarization steps do not contribute to the arc or karc heuristics, and we do not consider the longer pathways as distinct mechanisms. As an interesting side effect of the stepwise electron pair transfers, the sequencing of the bond formations in the Diels−Alder reaction has a small but nonzero effect and results in two neardegenerate reaction pathways (DA−0 and DA−1 in Figure 10). We present a more detailed discussion of all reaction mechanisms considered in this work in sections S3 and S4 of the Supporting Information.

5. DISCUSSION 5.1. Interpretation of Energy Profiles. Having analyzed a diverse set of 14 polar and pericyclic reaction mechanisms, we are now in the position to make some tentative generalizations. The most immediately apparent commonality of the feasible reaction pathways considered so far is that their energy profiles have the intuitively expected single-barrier shape. Among polar reactions, we find only two exceptions: the HBr addition and elimination reactions, in which the energy profiles are entirely nonincreasing and nondecreasing, respectively. For polar reactions, the correspondence between the high-energy charged intermediates and the transition state is explainable by Hammond’s postulate. But perhaps more surprisingly, we also find that almost all pericyclic reactions considered in this work give rise to single-barrier energy profiles. The 6π electrocyclization reaction is an exception, however, having a strictly downhill energy profile. By design, the HAQC approach tends to err on the side of being too granular and to represent any elementary reactionas defined 4108

DOI: 10.1021/acs.jctc.9b00126 J. Chem. Theory Comput. 2019, 15, 4099−4112

Article

Journal of Chemical Theory and Computation

networks in the TN representation are relatively sparse; i.e., each network node is connected to only a small fraction of the overall network. Sparse directed networks are characterized by small values of network density

tautomerization) and 0.5% (for 6π electrocyclization) of all plausible reaction pathways in Table 3 are empirically feasible. Our approach is justified because we are additionally interested in comparing and optimizing kinetic heuristics. However, given a fixed set of feasibility criteria, we are free to choose a much more efficient strategy based on local search.98,99 The formal properties of the kinetic heuristics, specifically non-negativity and additivity (see section 2), are designed to make them applicable in the context of local search or network-flow optimization algorithms.100 The simplest local-search variant of the HAQC approach might use a global kinetic feasibility threshold W̅ k along with an additional local kinetic feasibility threshold w̅ k = W̅ k/Nmax + δwk, where Nmax is the maximum path length and δwk is a local slack parameter. Within this local-threshold approach, one would in every step only accept elementary transformations Ki → Ki+1 (0 ≤ i ≤ Nmax − 1), for which WKi→Ki+1 ≤ w̅ k (local threshold criterion) and WK0→Ki+1 ≤ W̅ k (global threshold criterion). Path search algorithms are exceptionally versatile and widely used.55,99 The breadth-first search algorithm55 is suitable for finding all reaction pathways with up to Nmax steps between the given initial and final flasks (single-source, single-product reaction prediction), which is the focus of this work. But with only small modifications, this approach can be adapted to solve different problems in reaction mechanism prediction. The feasibility of a particular reaction, given the initial and final flasks, translates into finding only the path with the smallest value of the kinetic feasibility heuristic instead of all paths and can be readily solved by a standard shortest-path algorithm. Breadth-first search can also be extended to solve the singlesource, multiproduct reaction problem. The only significant difference to this work is that we need to take into consideration both kinetic and thermodynamic feasibility of the predicted reaction pathways.13 Provided that a suitable calibration exists, it should even be possible to estimate overall reaction rates and product compositions using eq 2. 5.3. Reaction Networks and Their Properties. While local search is likely the most efficient approach to reaction mechanism prediction, the full reaction networks in the TN representation are worthy of a closer examination in their own right.27 We contrast here the TN model with the interaction network (IN) representation, which plays a central role in kinetic modeling11,37−42,101−108 and is usually the implied meaning of the term reaction network.7,9,12,32,103 The definition of the TN representation, in which the network nodes correspond to flasks and the network edges are stoichiometry-preserving transformations, has a number of practically useful consequences. First, the reaction networks in the TN representation are large but fundamentally finite, meaning that the exhaustive generation of the reaction network by the HAQC procedure is guaranteed to terminate after a finite number of steps. The network size in the TN representation increases exponentially with the number of atoms, making it impractical to generate the entire network for large enough systems. Nevertheless, as we have demonstrated in this work, reaction networks in the TN representation remain of manageable size for some systems relevant to chemical reactivity (up to eight non-hydrogen atoms,