A Predictive Tool for Electrophilic Aromatic Substitutions Using

4 days ago - Herein we present a machine learning model able to predict the reactive site of an electrophilic aromatic substitution with an accuracy o...
0 downloads 0 Views 752KB Size
Subscriber access provided by UNIV OF LOUISIANA

Article

A Predictive Tool for Electrophilic Aromatic Substitutions Using Machine Learning Anna Tomberg, Magnus J. Johansson, and Per-Ola Norrby J. Org. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.joc.8b02270 • Publication Date (Web): 18 Oct 2018 Downloaded from http://pubs.acs.org on October 19, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Organic Chemistry

A Predictive Tool for Electrophilic Aromatic Substitutions Using Machine Learning Anna Tomberg,†, £ Magnus J. Johansson,‡ Per-Ola Norrby*£ †Discovery Sciences ‡Cardiovascular, Renal and Metabolism £Pharmaceutical Sciences IMED Biotech Unit, AstraZeneca, Gothenburg, Sweden. Email: [email protected]

Abstract At the early stages of the drug development process, thousands of compounds are synthesized in order to attain the best possible potency and pharmacokinetic properties. Once successful scaffolds are identified, large libraries of analogues are made, which is a challenging and time-consuming task. Recently, Late Stage Functionalization (LSF) has become increasingly prominent since these reactions selectively functionalize C-H bonds, allowing to quickly produce analogues. Classical electrophilic aromatic halogenations are a powerful type of reaction in the (LSF) toolkit. However, the introduction of an electrophile in a regioselective manner on a drug-like molecule is a challenging task. Herein we present a machine learning model able to predict the reactive site of an electrophilic aromatic substitution with an accuracy of 93% (internal validation set). The model takes as input a SMILES of a compound and uses six quantum mechanics descriptors to identify its reactive site(s). On an external validation set, 90% of all molecules were correctly predicted.

Introduction Late stage functionalization (LSF) deals with the delicate differentiation between C-H bonds of various hybridization, electronic and steric environment in densely functionalized molecules.1-3 Employing this strategy, an advanced scaffold, such as drug-like compound, produces a library of structural analogues through the introduction of diversity on specific functional groups. Most commonly, LSF approaches rely on a multitude of mechanistic manifolds to address the widest possible set of C-H bonds. Substitution of hydrogen can be divided into three different classes, each with many subtypes: 1. Initial abstraction of a hydrogen to give a reactive intermediate (cation, radical, anion, organometallic), which is subsequently combined with a reagent to give a new functional group. This class includes both directed and non-directed C-H activation with transition metal complexes,46 as well as deprotonation or radical abstraction of the most susceptible C-H bond.7-9 2. Concerted breaking of the C-H bond and formation of the new C-X bond, including most carbene(oid)/nitrene(oid)/oxene(oid) additions into electron rich C-H bonds,10,11 but also certain aromatic substitutions such as iodination (vide infra). 3. Addition of a reagent to an unsaturated (normally aromatic) carbon, followed by subsequent elimination of the hydrogen in the attacked position, including examples like the Minisci reaction1214 and Electrophilic Aromatic Substitution (EAS).15 Especially powerful are reactions allowing for the installation of not only a single diversification but rather a LSF handle for the introduction of several different functional groups (diversity oriented LSF). Along with recent developments in C-H borylation chemistry, largely developed by Malecza/Smith16,17 and Hartwig groups,18 a complementary tool in terms of regiochemical outcome is classical electrophilic aromatic halogenations. This reaction yields compounds where a single point of halogenation can give rise to diverse sets of compounds, with widely differentiated properties, using well established cross coupling chemistry.

ACS Paragon Plus Environment

The Journal of Organic Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

In an emerging area of organic chemistry, such as LSF, where C-H bonds rather than C-X are functionalized, tools that predict the regiochemical outcome of reactions targeting the omnipresent C-H bonds are needed. Within AstraZeneca, we have recently embarked on a program where we want to support synthetic chemists with tools to select suitable LSF conditions for desired transformations. To this end, we will develop predictive tools for a range of common C-H activation reactions built on a common platform. In the future, we aim to develop models for both reactivity and selectivity of a set of diverse LSF reactions. Herein we will present our first efforts in this direction, a predictive classifier model able to distinguish between aromatic carbons that are active and inactive towards electrophilic aromatic substitution. Background Electrophilic aromatic substitution is one of the standard organic transformations taught in undergraduate synthesis. The classical mechanism involves addition of an electrophile to the aromatic ring to form a σcomplex (the Wheland intermediate, an arenium ion) followed by deprotonation to give the observed substitution product (Scheme 1). The reactivity and regioselectivity would generally be rationalized from the ability of the substituents to stabilize or destabilize the intermediate arenium ion.15 Schoolbook: R

E+

R

E

-H+

H -complex

R

E

R

E

Realistic: R

E+

E R -complex

-complex -H+

R

E

-H+



H

R

E

Scheme 1. Mechanisms of EAS. The schoolbook mechanism does not give the full picture: the reactivity and selectivity only correlate to some extent. Relative reactivity of simple substrates like benzene and toluene can vary by several orders of magnitude depending on the electrophile and the reaction conditions, whereas selectivity variations are much smaller.19A more realistic mechanism can rationalize some of these observations by the formation of a -complex, which can sometimes be rate-limiting but not selectivity-determining (Scheme 1). 20,21 Furthermore, for some electrophiles such as iodine, the σ-complex may not be a true intermediate, but rather a transition state with a concerted deprotonation, detectable by a significant kinetic isotope effect.22 However, brominations and chlorinations seem to be reasonably well approximated by the schoolbook mechanism. A general approach to selectivity predictions is to calculate all transition states on the path from starting materials to products. This direct approach is challenging, in part because the product ratio may not always be connected to transition states on the potential energy surface,20 but also due to the inherent difficulty in automating transition state searches for multiple steps and conformations. A simpler approach relies on

ACS Paragon Plus Environment

Page 2 of 16

Page 3 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Organic Chemistry

based on the Hammond postulate, which states that late transition states resemble the products significantly more than the reactants.23 For electrophilic aromatic halogenations with Br and Cl, this allows to use the σcomplex as a model system for regioselectivity.15 Liljenberg et al. have utilized this approach using DFT optimization of the arenium intermediates.24 An alternative way forward is to predict relative reactivity from properties of the reactants. This is the principle behind empirical substituent constants, such as electronic Hammett parameters and steric Taft parameters.25 Automated applications of such rules were implemented already in the 1980’s.26 However, empirical parameters are not well suited to our purpose. It is sometimes necessary to move outside the space of existing substituent parameters, and the relationship becomes more complex with polycyclic heteroaromatic systems. Methods based on unbiased QM calculations are therefore preferred. A computational approach based on Frontier Molecular Orbitals was advanced for EAS by Fukui et al. already in the 1950’s.27 Later contributions have extended this approach to other QM-derived atomic properties.2832

Much of the earlier work in this field has been focused on all-carbon aromatic substrates, particularly on substituted benzenes. Substrates for LSF in the pharmaceutical industry contain a large proportion of heteroaromatics, both mono- and polycyclic. Recognizing this need, later contributors have assembled larger and more diverse sets of compounds for training and/or testing and focused on high-throughput methods. Kruszyk et al. observed that reactivity correlates with NMR chemical shifts, and developed a model where EAS reactivity is based on predicted chemical shifts from already developed models in commercial software.33 Interestingly, this easily applied method gave good qualitative predictions (80%). Very recently, the same group revealed a different method, RegioSQM, which uses simple semi-empirical methods to calculate σ-complexes with proton as a model electrophile.34 The method can be tuned to minimize false negatives or false positives by arbitrary selection of an energy threshold. By accepting all isomers within 3 kcal/mol of the minimum energy structure, the authors could show that for 96% of all molecules at least one experimental reactive site was identified correctly (together with a number of unverified positives). We note that a potential problem in the type of data set used here is that only one product has been reported for most experiments, whereas careful analysis frequently reveals the presence of regioisomers in the reaction mixture. Thus, “false positives” need not be wrong, they may in fact correspond to isomers that are present but unreported, whereas false negatives are much more of a concern. Our goal is to develop an automated method to support our LSF workflow with high accuracy prediction of reaction outcome for EAS (in particular bromination) of typical drug candidates. Based on our perceived needs and the methods tested in the literature, we aim to develop a classifier method that will identify one or a few reactive sites in a drug-like molecule. The method should satisfy the following requirements:  Avoid false negatives, allow occasional false positives,  Avoid empirical data; instead use computed properties,  Use atomic descriptors that can be computed once and be applied to a range of reaction types,  Accept input as 1D (e.g., SMILES), 2D (e.g., drawing), or 3D molecules,  Have an automated workflow,  Deliver results on the same or next day. With these aims and our currently available methods in mind, we have excluded methods based on direct determination of transition states, as well as methods requiring extensive conformational searching at high levels of theory, noting in passing that such methods might be productively applied in the near future.35 For

ACS Paragon Plus Environment

The Journal of Organic Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

our models, we will utilize DFT-derived ground state descriptors as input for the machine learning models. Furthermore, we will also augment these with energies calculated using a reduced form of RegioSQM34 (vide infra) and used as atomic descriptors for the site of protonation. Traditionally, descriptor-based quantitative structure-reactivity relationships (QSRR) have been based on multidimensional regression to find linear free energy relationships.36,37Linear models for yields have also been derived using machine learning methods.38,39 However, the data available for training and testing is mostly binary, identifying one aromatic carbon in each molecule as reactive. With this type of data, linear regression to derive a typical QSRR model is not feasible. Instead, we will utilize atom- and bond-level descriptors, and test different types of machine learning (ML) approaches to derive reactivity classifier models. We will measure the success of the method in three different ways. Firstly, the model will be developed on an atomic level, and therefore we will apply an atom-level test, where each aromatic carbon is simply classified as reactive or unreactive. This way of measuring the performance of the models will be used for method development. The majority of our compounds has only one atom that reported as reactive; the baseline prediction (all atoms assigned as unreactive) will thus yield an accuracy of 75%. This is the number that atom-level predictions should be compared to. Secondly, we will apply a stringent molecule-level test, where a molecular prediction will only be considered successful if all aromatic atoms in the molecule are correctly classified as either reactive or unreactive. The baseline here will be a random model. With only 25% of the available aromatic carbons being assigned as reactive, and on average five aromatic C-Hs per molecule, the probability for a random correct prediction is approximately 0.25*(1-0.25)4 = 8%. Thirdly, we will apply a more realistic molecule-based test where we will allow one false positive per molecule. This is based on the assumption that undetected regioisomers will be present in the experiments underlying our data, and thus the model should not be penalized for predicting an isomer that may have been present without being detected. Beyond the three global tests, we will also utilize the ability of some machine learning methods to assign a probability to a binary prediction. This will allow us to identify the most probable reactive site in each molecule, irrespective of how many atoms have been assigned as reactive. This test will be intermediate between the second and third global test and will recognize that a substrate that was found to be unreactive will sometimes become reactive under harsher conditions. From a user perspective, this mode of running has the advantage of always delivering exactly one predicted product per molecule. Machine Learning Models. Although machine learning models have become a hot topic only in the recent decade, these algorithms have been around since the middle of the 20th century.40 In fact, the now very QSRR models are examples of machine learning, based on linear regression.41 Such early models used few molecular descriptors, and were applicable to only small series of similar compounds. As the number of relevant descriptors and the molecular complexity grew, linear models became less helpful, shifting the spotlight to different methods for data fitting. Logistic Regression. Being a special case of linear regression, logistic regression is a statistical method for analyzing a dataset used to predict a binary outcome.42 A linear relationship between the descriptors and the experimental results is used to estimate the probability of each carbon to be active or not towards EAS. The sigmoid function is applied to the probabilities in order to output a binary classification (Figure 1).

ACS Paragon Plus Environment

Page 4 of 16

Page 5 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Organic Chemistry

Figure 1. Underlying mechanism of logistic regression. Support Vector Machine. An SVM model is another example of binary classifier, that categorizes the data by separating it using a hyperplane.43 Figure 2 illustrates how an SVM would classify data with 2 descriptors, where the hyperplane is simply a line dividing the points. First, the slope of the line would be determined, then the best separation would be achieved by optimizing the distance between each point and the line. This is done by maximizing the length of the support vectors (green lines connecting the data points to the separation line). The more descriptors are used in a model the higher the dimension of the hyperplane, thus visualizing the boundary between classes becomes more difficult.

Figure 2. Underlying mechanism of support vector machine. Artificial Neural Network. An artificial neural network got its name from the seeming similarity to a nervous system. Artificial neurons are connected by a network, which can compute values from inputs. The structure of an ANN can be split into three main parts (Figure 3).44 The input layer receives the descriptors for the model to process. The input layer passes this information to the hidden layers, which apply transformations to these values until they reach the output layer. This last part of the neural network returns the output value that corresponds to the prediction of model. The learning process is achieved by optimizing the importance (or weights) of the connections between neurons. Algorithms like backpropagation are often used to this end. Neural networks are black boxes, meaning that it is rather difficult to understand how each descriptor is treated within the network.

Figure 3. Underlying mechanism of an artificial neural network. Random Forest. Random forests are ensembles of decision trees of various sizes that individually make their own predictions (Figure 4).45 These predictions are then merged together for a final collective decision, usually by majority vote. The decision trees built within the forest formulate their own sets of rules based on limited information, which makes them poor predictors on their own. However, when many poor

ACS Paragon Plus Environment

The Journal of Organic Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 16

predictors are assembled into a forest, the predictive power increases substantially while also avoiding the problem of overfitting.

Figure 4. Underlying mechanism of a random forest classifier. Computational Details Assembly of molecular sets. In this work, experimental results were obtained from a mix of compounds sets published in literature34,46 as well as results mined from AstraZeneca’s reaction library (ELN). As such, only the compounds taken from literature were used for the training, cross-validation and internal testing of the models. The molecules obtained from the ELN were supplemented with some literature results and used as a separate, external testing set. The sets from literature were screened for duplicates, which were removed, resulting in a total of 694 molecules, out of which 602 (2994 aromatic unsubstituted carbons of interest) were used for training and internal validation and 92 were set apart as the external validation. It is important to consider that because the sets are unbalanced (there are more positions that are not reactive towards EAS), the base line for any model is not 50%, but rather 75% for this particular data. This value is obtained if all positions were labeled as inactive. All molecules and the corresponding experimental sites of reaction for electrophilic aromatic substitution can be found in the Supporting Information. Although it is not indicated, we assume that the sites of reaction reported in literature are for the major product only. Additionally, it was noticed that the -SF5 group is not properly treated by the force field used in RDkit47 (MMFF), so the positions on the fluorine atoms need manual adjustment before proceeding with optimization. Generation of descriptors. To compute the necessary descriptors for the training of the models, a program (MAP.py) was written in python. The protocol and the program’s dependencies are stated in the Supporting Information. For each compound, the following procedure was applied. First the SMILES were converted into a 3D structure after running a short conformational analysis using MMFF (or UFF if some atom type was not available). Taking the lowest energy conformer, a geometry optimization was done in Gaussian using TPSSh/Def2SVP. Using this geometry, all the descriptors, except reduced-RegioSQM energies, were obtained (Table 1). Table 1. Descriptors used for modeling. Descriptor Partial Charges C-H bond order

Source Program DDEC648 DDEC6

ACS Paragon Plus Environment

Abbreviation Q CHBO

Page 7 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Organic Chemistry

Sum of bond orders around arom. C DDEC6 SBO Condensed Fukui coefficients for From Gaussian orbitals Fukui electrophilic attack Atomic solvent accessible surface GEPOL9349 SAS Reduced-RegioSQM energies Based on RegioSQM34 rRSQM Partial charges and bond orders were computed using DDEC6, a method available through the Chargemol code (open source). This charge assignment is a combination of the Iterative Hirshfled and Atom-InMolecule approaches,48 leading to little basis-set and conformation dependency as well as good handling of buried atoms. Condensed Fukui coefficients for electrophilic attack (𝑓𝑘― ) were computed using the Frontier Molecular Orbitals (FMO) approximation. Using this simplification, the coefficients were obtained from the highest occupied molecular orbital (HOMO) of a molecule by equation 1,50 where the Fukui index for an atom 𝑘 are obtained from the LCAO coefficients 𝑐𝐻𝑂𝑀𝑂(𝑖) and the overlap matrix elements 𝑆𝑖𝑗. 𝑓𝑘― = ∑𝑎

[𝑐𝐻𝑂𝑀𝑂(𝑎)2 + 𝑐𝐻𝑂𝑀𝑂(𝑎)∑𝑏 ≠ 𝑎𝑐𝐻𝑂𝑀𝑂(𝑏)𝑆𝑎𝑏]

∈ 𝑘

(1)

The solvent accessible surface values for each atom were obtained using the GEPOL9349 program (open source). This method first builds a set of solvent excluding spherical surfaces using a probe of a given radius (default value of 1.4 was used here). These surfaces are then divided in triangular tesserae and used to calculate the area exposed to the solvent for each atom of the molecule. Lastly, the calculation of the reduced RegioSQM energies was largely based on the RegioSQM method, but several changes have been made. The procedure begins the same way, by identifying all aromatic carbons that are bonded to a hydrogen. For each position, a protonated molecule is produced. The next steps are different from the RegioSQM approach. For each protonated state, we run a conformational search and select only the lowest conformer. This conformer is then optimized using PM3 in chloroform (using Gaussian). The PM3 energies are compared among each other and the value relative to the lowest energy is assigned to the corresponding carbons (see Figure 5). It is this relative PM3 energy that becomes the descriptor for the machine learning models.

ACS Paragon Plus Environment

The Journal of Organic Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 5. Descriptors generation example. The various descriptors for aromatic carbons of each compound were then assembled into input vectors of the form shown in Figure 6. We noted that equivalent positions could get different descriptors due to numerical errors or selection of one particular conformation. In such cases, the values were averaged and set to be equal on the equivalent positions before proceeding. The models shuffled and randomized which data points to learn from at each iteration.

Figure 6. Vector representation of data collected for each aromatic carbon.

Machine learning models. Four classifiers were tested: logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), and random forest (RF). The code was written in python using RDkit47 and sklearn51 packages. The details of the implementation are collected in the Supporting information. For each classifier, the following procedure was applied. First, the model with default parameters was tested ten times. For this, the data set was randomly split into a training and validation sets, used to train the model and obtain an estimate of the accuracy, respectively. Note that for ANN, SVM and LR, the data was standardized by removing the mean and scaling to unit variance. This step was done using the StandardScaler feature of the preprocessing functionality in sklearn. Next, the data set was once again randomly split into 3 sets: training, cross-validation and testing. Using the training and cross-validation sets, hyperparameter optimization was completed. The accuracy on the testing set was obtained from the model trained on the best set of parameters. This step was performed 10 times.

Results and Discussion Atomic Descriptors. In complex molecules, the reactivity of the different atoms is determined by a combination of many factors. Thus, accurately predicting which aromatic carbon in such molecules will be prone to EAS would require studying more than one property. Using chemical intuition and experience, chemists are often able to rationalize the observed products, but for a predictive model, quantitative descriptors are needed. There exist hundreds of molecular descriptors, however regioselectivity requires the use of atomic properties. As such, we selected a few descriptors that we could easily interpret and that mechanistically make sense to use in the setting of an EAS reaction. The most promising descriptors were the condensed Fukui coefficients. These coefficients are reactivity indices that quantitatively define the relative nucleophilicity of each atom in a molecule. The higher the value of the coefficient, the more reactive the atom will be towards a hypothetical electrophile. Originally,

ACS Paragon Plus Environment

Page 8 of 16

Page 9 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Organic Chemistry

Fukui indices are molecular descriptors, giving the reactivity of an entire molecule relative to a reference. To be useful for the prediction of regioselectivity, this descriptor must be condensed to each atom of a molecule. This can be done in several ways;52 we have chosen the FMO approximation where the Fukui coefficients are estimated from the density of the HOMO. This is a fast method that requires only the molecular orbitals as input and produces similar results to other approaches. Another way to describe the reactivity of an atom towards an electrophile is its partial charge: electrophiles would be attracted to carbons that are more negatively charged. While there are many charging schemes available, we have chosen to use the DDEC6 scheme.48 This method uses a combination of the Iterative Hirshfled and Atom-In-Molecule approaches to assign partial charges to atoms. DDEC6 charges do not suffer from basis set dependence and are practically unaffected by conformation. Additionally, unlike methods derived from the electrostatic potential, this charging scheme is able to assign reasonable charges to buried atoms. Bond orders around aromatic carbons can be used to relate to reactivity since, in general, stronger bonds would lead to a more stable intermediate, and thus would indicate a more reactive position. We chose two types of descriptors, aromatic C-H bond order and the sum of all bond orders connected to the carbon. The only steric descriptor used in this work is the atomic solvent accessible surface. Sterically hindered positions would have a lesser chance to interact with the surrounding molecules, be it solvent or electrophiles, and are expected to be less reactive. The area exposed to the solvent for each atom is computed by splitting the molecular solvent accessible surface and summing up all the sections connected to the same atom.49 This descriptor was computed using the optimized geometry of the input molecule. Different sizes of probes can be employed to generate the initial molecular surface and mimic various electrophiles. However, the experimental data used in this work are mostly for halogenation reactions, thus a small probe of the default diameter was used. After collecting more experimental results with bulkier electrophiles, the models can be re-trained to take the size of the electrophile into consideration. While the grand goal of this work is to produce a model for the prediction of relative amounts of product, not enough quality data is currently available for training. However, lots of studies report the major reactivity site, thus a binary classifier model to predict whether an aromatic carbon would be active or not toward EAS is achievable. It is important to keep in mind that the computational models can only learn from what they are trained on. This introduces a limitation faced by any model that requires training: if the model is presented with an input far different from what it has learned from, only limited accuracy can be expected. To ensure the robustness of the models, descriptors that have strong linear correlations must be removed. As shown on Figure 7, the descriptors selected showed little to no correlation. In order to verify our hypothesis that condensed Fukui coefficients would have the highest predicting power among the

ACS Paragon Plus Environment Figure 7. Correlation heatmaps between descriptors.

The Journal of Organic Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

descriptors chosen, we assembled the data into vectors as described in the Methodology and proceeded to training the different models. Model Training. We were interested in comparing how a classical regression model would perform compared to other machine learning algorithms. Thus, a total four models were implemented using (1) logistic regression, (2) support vector machine, (3) artificial neural network, and (4) random forest. The performance of these models will be assessed using the atom-level accuracy, since the carbons’ descriptors of all compounds will be scrambled and partitioned randomly. To begin, default parameters were used in all four models to get an idea of their performance on the dataset. The data was randomly split into training and testing tests (80:20) and the average accuracy was computed over 10 runs (Table 2). Logistic regression was found to have the lowest accuracy of the four models, which is probably due to the impossibility of separating the data using only linear relationships. The other three models performed similarly to each other, with random forests displaying the best results.

Table 2. Machine learning models and their corresponding accuracies. Accuracy (%) averaged over 10 runs Model LR SVM NN RF

Default parameters 84.1 85.7 86.7 88.7

Best parameters 84.5 87.36 87.01 89.88

Next, the random forest and logistic regression models were used to rank the descriptors by applying the recursive feature elimination procedure implemented in sklearn. The ranking is done by recursively considering smaller and smaller sets of descriptors and estimating the importance of each type on the accuracy of the model. The resulting rankings of descriptors were the same for both models (see Table S4 in Supporting Information). Fukui coefficients were indeed found to have the highest predictive power, followed by C-H bond orders and partial charges. The Fukui function, which measures the localized density of the most easily removed electrons, is expected to correlate well with reactivity towards electrophiles, and thus the good performance here is expected. Surprisingly, the solvent accessible surface was among the lowest ranked descriptors. This can be due to the training data consisting of mostly halogenations, rather small electrophiles that are not as impacted by steric hindrance. We believe that considering bulkier electrophiles will require a more significant input from steric descriptors. Internal validation. The performance of a model is affected by many parameters, which can be adjusted to obtain better results. Parameter optimization was next performed for the four models. The data set was once again randomly split into 3 parts for each run: training, cross-validation, and testing sets. While changing the parameters of the models, the accuracy was computed on the cross-validation set. Using the best set of parameters for each model (details in Supporting Information), the accuracy on the testing set was obtained. The results over 10 runs are collected in Table 2. Once again, random forest outperformed other models, with an average accuracy of 89.9%. The best random forest model (Model-1) obtained this way had an accuracy of 91.7% (with F1 score of 0.82 and AU-ROC of 0.87). Since the training of the classifier required scrambling of the entries, only atom-level accuracy was calculated in this step.

ACS Paragon Plus Environment

Page 10 of 16

Page 11 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Organic Chemistry

Reduced RegioSQM. The idea to predict the regioselectivity of EAS reactions by the means of the σcomplex (Scheme 1) was successfully implemented in a method called RegioSQM.34 In this program, all potentially reactive aromatic carbons are identified and a σ-complex is formed by protonating each position. For each protonated state, up to 20 conformers are generated and optimized using PM3 in chloroform. Finally, the energies of these molecules are compared, and the positions corresponding to the protonated molecules with the lowest energies are labeled as reaction sites. The authors report an overall success rate of 81%, when a cutoff of 1 kcal/mol is used. That is, all sites within 1 kcal/mol of the lowest energy conformer are labeled as active. More generally, in 92% of cases, an experimentally observed reactive site was among the positions identified as active by this method, using the 1 kcal/mol cutoff. Loosening the cutoff energy to 3 kcal/mol increases the number of correct predictions to 96%, but comes at an expense of a higher number of false positives. Instead of computing the energies of several conformers for each protonated state, we thought of reducing the number of conformers generated to the bare minimum. Thus, for each potential reactive aromatic carbon, only the conformer with the lowest energy is retained, as computed by the force field. Using this reduced RegioSQM (rRSQM) approach, we obtained 88% accuracy on the same set with a cutoff of 3 kcal/mol (see Supporting Information for details). Thus, using the rRSQM approach or the Random Forest classifier yields a similar predictive power overall. Lastly, the four classifier models were retrained after combining all six descriptors. The rRSQM descriptor contained the relative energies of the σ-complexes corresponding to each aromatic carbon, which allowed removal of the arbitrary cutoff value. As summarized in Table 3, the accuracy of the combined models increased, with Random Forest showing the best predictive power of 93.1% as average on 10 runs, and best model (Model-2) having an accuracy of 94.99% (F1 score of 0.90, AU-ROC of 0.94). Table 3. Machine learning models and their accuracy using all descriptors including rRSQM. Model LR SVM NN RF

Accuracy (%) over 10 runs 91.39 92.04 91.40 93.11

To better understand the limitations of our models, an external validation was needed. Using this new set, the results for entire molecules could be examined. External validation. As a final validation step for our models, a set of 92 electrophilic aromatic halogenation reactions (429 data points) were compiled from AstraZeneca’s database as well as mined from literature. The Random Forest was re-trained using all the data gathered in the previous step (Model-3). Although the model was trained on mainly brominations, we were curious to see how transferable this classifier is to other electrophiles. This external validation set comprised of 62 brominations and 18 chlorinations. Additionally, 12 iodinations were assembled, but will be treated as a separate set.

ACS Paragon Plus Environment

The Journal of Organic Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Table 4. Results for the external validation sets with different electrophiles. Halogen Br Cl I Br and Cl Total

Atom-Level accuracy (%) Model-3

89.9 93.8 66.1 90.9 87.6

The results obtained by running this external validation set (brominations and chlorinations) through the final model are summarized in Table 4. Overall, 91% of all carbons were labeled correctly and 82% of all reactive sites were identified. Since the accuracy obtained was similar to the one computed for the internal validation, we can conclude that the model is not over-trained, and transferrable to chlorinations as well. Unsurprisingly, the model shows the lowest prediction power for iodinations. This is expected since the relative energies of the σ-complexes were used as one of its descriptors. As explained previously, while being predictive for brominations and chlorinations, this descriptor is not relevant for iodinations.22 Thus, these data points will not be used to evaluate the performance of the model. Next, we could apply the molecule-level test to see how many molecules were correctly labeled on all of their aromatic C-Hs. Out of 80 molecules, 23 had at least one wrongly assigned carbon, resulting in 71% of all molecules perfectly labelled. To gain insight, incorrectly predicted molecules were examined. Analysis of Model-3. After inspecting the incorrectly labelled molecules, all errors could be classified into three groups. Examples of each type are presented below. One of the advantages of the Random Forest classifier is the possibility to see how confident the model is in its prediction. Thus, we used a color scheme to indicate the positions predicted with high confidence in green (above 70%) and the ones with low confidence (between 50 and 70%) with orange. The first type of error identified one site as reactive, whereas another site was found experimentally (7 molecules); there was no pattern observed linking these compounds together (Figure 8). To address this type of error, we could add more data points on similar molecules to the training set, and possibly also add a tautomeric enumeration to the method.

Figure 8. Examples of error type 1. Black circle indicates the experimental reactive site, green and orange circles show the predicted sites prediction confidence below 70% respectively. Another type of error is illustrated in Figure 9, where the molecules were predicted to have no reactive site at all (6 molecules). This is possible because, unlike models such as RegioSQM where each molecule is studied as a whole, the random forest treats every carbon as a separate prediction. Interestingly, for these compounds, the experimentally determined reactive site corresponds to the highest % confidence of reactivity in almost all cases. For example, on the right-most compound on Figure 9, the carbon circled in black was found to have a 33% probability of being reactive (and thus has been classified as unreactive), while all the other aromatic carbons have a reaction probability of less than 1%. To ensure that at least one

ACS Paragon Plus Environment

Page 12 of 16

Page 13 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Organic Chemistry

reaction site is output, if all carbons of a molecule are active with a certainty below 50%, the one with the highest percentage will be labelled as the most likely reactive site.

Figure 9. Examples of error type 2. Black circle indicates the experimental reactive site. The last type of error are false positives (Figure 10). In this case, the reactive site of each compound was correctly identified, but additional false positives (carbons labeled inactive experimentally, but predicted active) were also found (10 molecules). As stated previously, the false positives may in actuality be reactive carbons leading to minor products, which are often omitted in publications. For this reason, in a more realistic molecule-based test, one false positive per molecule will be allowed.

Figure 10. Examples of error type 3. Black circle indicates the experimental reactive site, green and orange circles show the predicted sites with above 70%, and below 70% confidence respectively.

Final result. The implementation of the correction for errors of type 2 and 3 led to the final molecule-based metric of the performance of our model. With these slightly looser criteria, Model-3 is able to label all carbons correctly in 72 molecules out of 80, resulting in a molecule-based accuracy of 90%. Conclusion In summary, the goal of this study was to investigate if a machine learning algorithm can learn to correctly classify carbons as active or inactive towards electrophilic aromatic substitution. Using a combination of QM and topological descriptors, four models were trained: logistic regression, support vector machine, artificial neural network classifier and a random forest classifier. Based on our results for the training on electrophilic aromatic bromination, Random Forest model outperformed other models, demonstrating the ability to correctly label 93% of unsubstituted aromatic carbons in the internal validation set. Additionally, the transferability of the model was tested on an external validation set with mixed halogenations (brominations, cholorinations and iodinations). The classifier displayed excellent performance on chlorinations and brominations, while the predictive power on iodinations was much lower. This drop in accuracy was expected since iodinations follow a slightly different reaction mechanism, thus the descriptors used in this model were not applicable in this case. Supplementary to a high accuracy, the Random forest provides a confidence of prediction, allowing the user to know how certain the model is in its prediction for each carbon. One advantage of the current model is its flexibility. As more data becomes available, the model can be rapidly retrained to increase the accuracy. We also expect to revisit the more general reaction in the future, adding descriptors to characterize the electrophile. A possible future extension is to also include reaction conditions and to aim for a true reactivity model, but we believe that this will require acquisition of additional experimental data obtained using focused libraries under well-controlled conditions.

ACS Paragon Plus Environment

The Journal of Organic Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The final model is currently being deployed within AstraZeneca, to support medicinal chemists working with late stage functionalization. Our expected use is to identify in which cases we can employ electrophilic bromination to affect a desired substitution. This is our first step in generating a toolbox that will eventually allow chemists to choose among several reactions depending on desired site of functionalization. Supporting Information. Additional information of the methodology and testing results, SMILES and 3D coordinates (with labeled experimental reaction sites) of all compounds used in training and validation, details on descriptors generation workflow, descriptors of all compounds. The python code is available at https://github.com/Ianiusha/AutoLSF/tree/master/EAS. Acknowledgements. We are grateful to Ola Engkvist, Philipp Buerger, Christian Tyrchan, and Christian Sköld for helpful discussions in the preparation of this manuscript. AT is a fellow of the AstraZeneca postdoc programme. References (1) Cernak, T.; Dykstra, K. D.; Tyagarajan, S.; Vachal, P.; Krska, S. W. The medicinal chemist's toolbox for late stage functionalization of drug-like molecules. Chem. Soc. Rev. 2016, 45, 546576. (2) Fier, P. S.; Hartwig, J. F. Synthesis and Late-Stage Functionalization of Complex Molecules through C-H Fluorination and Nucleophilic Aromatic Substitution. J. Am. Chem. Soc. 2014, 136, 10139-10147. (3) Chen, D. Y. K.; Youn, S. W. C-H Activation: A Complementary Tool in the Total Synthesis of Complex Natural Products. Chem. - Eur. J. 2012, 18, 9452-9474. (4) Xue, X.-S.; Ji, P.; Zhou, B.; Cheng, J.-P. The Essential Role of Bond Energetics in C–H Activation/Functionalization. Chem. Rev. 2017, 117, 8622-8648. (5) Murakami, K.; Yamada, S.; Kaneda, T.; Itami, K. C–H Functionalization of Azines. Chem. Rev. 2017, 117, 9302-9332. (6) Gensch, T.; Hopkinson, M. N.; Glorius, F.; Wencel-Delord, J. Mild metal-catalyzed C–H activation: examples and concepts. Chem. Soc. Rev. 2016, 45, 2900-2936. (7) Yi, H.; Zhang, G.; Wang, H.; Huang, Z.; Wang, J.; Singh, A. K.; Lei, A. Recent Advances in Radical C–H Activation/Radical Cross-Coupling. Chem. Rev. 2017, 117, 9016-9085. (8) Gormisky, P. E.; White, M. C. Catalyst-Controlled Aliphatic C–H Oxidations with a Predictive Model for Site-Selectivity. J. Am. Chem. Soc. 2013, 135, 14052-14055. (9) Bigi, M. A.; Reed, S. A.; White, M. C. Directed Metal (Oxo) Aliphatic C–H Hydroxylations: Overriding Substrate Bias. J. Am. Chem. Soc. 2012, 134, 9721-9726. (10) Chiappini, N. D.; Mack, J. B. C.; Du Bois, J. Intermolecular C(sp3)−H Amination of Complex Molecules. Angew. Chem. Int. Ed. 2018, 57, 4956-4959. (11) Roizen, J. L.; Zalatan, D. N.; Du Bois, J. Selective Intermolecular Amination of CH Bonds at Tertiary Carbon Centers. Angew. Chem. Int. Ed. 2013, 52, 11343-11346. (12) Duncton, M. A. J. Minisci reactions: Versatile CH-functionalizations for medicinal chemists. MedChemComm 2011, 2, 1135-1161. (13) Molander, G. A.; Colombel, V.; Braz, V. A. Direct Alkylation of Heteroaryls Using Potassium Alkyl- and Alkoxymethyltrifluoroborates. Org. Lett. 2011, 13, 1852-1855. (14) Seiple, I. B.; Su, S.; Rodriguez, R. A.; Gianatassio, R.; Fujiwara, Y.; Sobel, A. L.; Baran, P. S. Direct C−H Arylation of Electron-Deficient Heterocycles with Arylboronic Acids. J. Am. Chem. Soc. 2010, 132, 13194-13196. (15) Clayden, J. G., N.; Warren, S.; Wothers, P.: Organic Chemistry; Oxford University Press: New York, 2012. (16) Cho, J.-Y.; Tse, M. K.; Holmes, D.; Maleczka, R. E.; Smith, M. R. Remarkably Selective Iridium Catalysts for the Elaboration of Aromatic C-H Bonds. Science 2002, 295, 305-308.

ACS Paragon Plus Environment

Page 14 of 16

Page 15 of 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

The Journal of Organic Chemistry

(17) Preshlock, S. M.; Ghaffari, B.; Maligres, P. E.; Krska, S. W.; Maleczka, R. E.; Smith, M. R. High-Throughput Optimization of Ir-Catalyzed C–H Borylation: A Tutorial for Practical Applications. J. Am. Chem. Soc. 2013, 135, 7572-7582. (18) Larsen, M. A.; Hartwig, J. F. Iridium-Catalyzed C–H Borylation of Heteroarenes: Scope, Regioselectivity, Application to Late-Stage Functionalization, and Mechanism. J. Am. Chem. Soc. 2014, 136, 4287-4299. (19) Olah, G. A. Aromatic substitution. XXVIII. Mechanism of electrophilic aromatic substitutions. Acc. Chem. Res. 1971, 4, 240-248. (20) Nieves-Quinones, Y.; Singleton, D. A. Dynamics and the Regiochemistry of Nitration of Toluene. J. Am. Chem. Soc. 2016, 138, 15167-15176. (21) Liljenberg, M.; Stenlid, J. H.; Brinck, T. Mechanism and regioselectivity of electrophilic aromatic nitration in solution: the validity of the transition state approach. J. Mol. Model. 2017, 24, 15. (22) Liljenberg, M.; Stenlid, J. H.; Brinck, T. Theoretical Investigation into Rate-Determining Factors in Electrophilic Aromatic Halogenation. J. Phys. Chem. A 2018, 122, 3270-3279. (23) Hammond, G. S. A Correlation of Reaction Rates. J. Am. Chem. Soc. 1955, 77, 334-338. (24) Liljenberg, M.; Brinck, T.; Herschend, B.; Rein, T.; Rockwell, G.; Svensson, M. Validation of a Computational Model for Predicting the Site for Electrophilic Substitution in Aromatic Systems. J. Org. Chem. 2010, 75, 4696-4705. (25) Hansch, C.; Leo, A.; Taft, R. W. A survey of Hammett substituent constants and resonance and field parameters. Chem. Rev. 1991, 91, 165-195. (26) Bures, M. G.; Roos-Kozel, B. L.; Jorgensen, W. L. Computer-assisted mechanistic evaluation of organic reactions. 11. Electrophilic aromatic substitution. J. Org. Chem. 1985, 50, 4490-4498. (27) Molecular Orbital Theory of Orientation in Aromatic, Heteroaromatic, and Other Conjugated Molecules. J. Chem. Phys. 1954, 22, 1433-1442. (28) Hirao, H.; Ohwada, T. Theoretical Study of Reactivities in Electrophilic Aromatic Substitution Reactions:  Reactive Hybrid Orbital Analysis. J. Phys. Chem. A 2003, 107, 2875-2881. (29) Cao, J.; Chen, F. Theoretical study on the correlation of the experimental nucleophilic and electrophilic reaction rates of aromatic compounds with the prediction results of theoretical methods. Youji Huaxue 2016, 36, 2463-2471. (30) Zhou, Z.; Parr, R. G. Activation hardness: new index for describing the orientation of electrophilic aromatic substitution. J. Am. Chem. Soc. 1990, 112, 5720-5724. (31) Clark, L. A.; Ellis, D. E.; Snurr, R. Q. A Fukui function overlap method for predicting reactivity in sterically complex systems. J. Chem. Phys. 2001, 114, 2580-2591. (32) Liu, Y.; Yang, Z.-Z.; Zhao, D.-X. Rationalization of regioselectivity of electrophilic substitution reaction for cyclic compounds in terms of Dpb values. Chin. Chem. Lett. 2015, 26, 553-556. (33) Kruszyk, M.; Jessing, M.; Kristensen, J. L.; Jørgensen, M. Computational Methods to Predict the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems. J. Org. Chem. 2016, 81, 5128-5134. (34) Kromann, J. C.; Jensen, J. H.; Kruszyk, M.; Jessing, M.; Jorgensen, M. Fast and accurate prediction of the regioselectivity of electrophilic aromatic substitution reactions. Chem Sci 2018, 9, 660665. (35) Guan, Y.; Wheeler, S. E. Automated Quantum Mechanical Predictions of Enantioselectivity in a Rhodium-Catalyzed Asymmetric Hydrogenation. Angew. Chem. Int. Ed. 2017, 56, 9101-9105. (36) Oslob, J. D.; Åkermark, B.; Helquist, P.; Norrby, P.-O. Steric Influences on the Selectivity in Palladium-Catalyzed Allylation. Organometallics 1997, 16, 3015-3021. (37) Sigman, M. S.; Harper, K. C.; Bess, E. N.; Milo, A. The Development of Multidimensional Analysis Tools for Asymmetric Catalysis and Beyond. Acc. Chem. Res. 2016, 49, 1292-1301. (38) Ahneman, D. T.; Estrada, J. G.; Lin, S.; Dreher, S. D.; Doyle, A. G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 2018, DOI: 10.1126/science.aar5169.

ACS Paragon Plus Environment

The Journal of Organic Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(39) Nielsen, M. K.; Ahneman, D. T.; Riera, O.; Doyle, A. G. Deoxyfluorination with Sulfonyl Fluorides: Navigating Reaction Space with Machine Learning. J. Am. Chem. Soc. 2018, 140, 5004-5008. (40) A history of machine learning. https://cloud.withgoogle.com/build/data-analytics/explorehistory-machine-learning (accessed 11/07/2018 2018). (41) Mitchell, J. B. O. Machine learning methods in chemoinformatics. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2014, 4, 468-481. (42) Shalev-Shwartz, S.; Ben-David, S.: Logistic Regression In Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press, 2014. (43) Shalev-Shwartz, S.; Ben-David, S.: Support Vector Machines In Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press, 2014. (44) Shalev-Shwartz, S.; Ben-David, S.: Neural Networks. In Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press, 2014. (45) Shalev-Shwartz, S.; Ben-David, S.: Random Forests. In Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press, 2014. (46) Brown, J. J.; Cockroft, S. L. Aromatic reactivity revealed: beyond resonance theory and frontier orbitals. Chem. Sci. 2013, 4, 1772-1780. (47) RDKit: Open-source cheminformatics. (48) T.A. Manz, N. G. L. DDEC6: A Method for Computing Even-Tempered Net Atomic Charges in Periodic and Nonperiodic Materials. ArXiv e-prints 2015. (49) Silla, E.; Tunon, I.; Pascual-Ahuir, J. L. GEPOL: an improved description of molecular surfaces. II. Computing the molecular area and volume. J. Comput. Chem. 1991, 12, 1077-1088. (50) Sanchez-Marquez, J.; Zorrilla, D.; Sanchez-Coronilla, A.; de, l. S. D. M.; Navas, J.; Fernandez-Lorenzo, C.; Alcantara, R.; Martin-Calleja, J. Introducing "UCA-FUKUI" software: reactivityindex calculations. J. Mol. Model. 2014, 20, 2492. (51) Pedregosa F., V. G., Gramfort A., Michel, V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825-2830. (52) Bultinck, P.; Fias, S.; Van Alsenoy, C.; Ayers, P. W.; Carbo-Dorca, R. Critical thoughts on computing atom condensed Fukui functions. J. Chem. Phys. 2007, 127, 034102.

For Table of Contents Only

ACS Paragon Plus Environment

Page 16 of 16