SCScore: Synthetic Complexity Learned from a ... - ACS Publications

Article

SCScore: Synthetic Complexity Learned from a Reaction Corpus Connor W. Coley, Luke Rogers, William H. Green, and Klavs F. Jensen J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00622 • Publication Date (Web): 08 Jan 2018 Downloaded from http://pubs.acs.org on January 11, 2018

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Journal of Chemical Information and Modeling

SCScore: Synthetic Complexity Learned from a Reaction Corpus

Connor W. Coley, Luke Rogers, William H. Green,∗ and Klavs F. Jensen∗

Department of Chemical Engineering, Massachusetts Institute of Technology; 77 Massachusetts Avenue, Cambridge, MA 02139

E-mail: [email protected]; [email protected]

Abstract

Several definitions of molecular complexity exist to facilitate prioritization of lead compounds, to identify diversity-inducing and complexifying reactions, and to guide retrosynthetic searches. In this work, we focus on synthetic complexity and reformalize its definition to correlate with the expected number of reaction steps required to produce a target molecule, with implicit knowledge about what compounds are reasonable starting materials. We train a neural network model on twelve million reactions from the Reaxys database to impose a pairwise inequality constraint enforcing the premise of this definition: that on average, the products of published chemical reactions should be more synthetically complex than their corresponding reactants. The learned metric (SCScore) exhibits highly-desirable nonlinear behavior, particularly in recognizing increases in synthetic complexity throughout a number of linear synthetic routes.

Introduction

The difficulty of quantifying “complexity” is apparent when simply considering the numerous ways one can define complexity: should a large multifunctional, multichiral compound be
considered complex? While the total synthesis of a steroid is long and challenging, a synthesis starting from readily available compounds like cholesterol might only require a few reaction steps. Herein, we focus on the specific definition of complexity that is “synthetic complexity”. In simple terms, we want molecules that are easy to synthesize to have a low score and molecules that are hard to synthesize to have a high score. Synthetic complexity (or its opposite, synthetic accessibility) can be used in drug discovery (as well summarized by Méndez-Lucio and Medina-Franco 1) as a scoring metric for virtual compound libraries and to assist medicinal chemists in prioritizing certain lead compounds over others, 2,3 for designing compound libraries in diversity-oriented synthesis, 4 or for computer-assisted synthesis planning as a metric for determining how promising each candidate disconnection is while performing a retrosynthetic search. 5–7

The standard procedure for quantifying complexity (including both synthetic complexity and its more general counterpart, “molecular complexity”) is to either (a) construct a heuristic definition based on domain expertise, or (b) design a model that can be regressed on ground truth data. Expert chemists may be asked to score drug-like molecules to validate an expert-defined heuristic in the former case, or to fit a regression in the latter case.

Existing approaches to quantifying complexity can be broadly categorized into the following:

1. Scores calculated from a small number of expert-selected structural descriptors 8–12
2. Scores calculated through some sort of automated retrosynthetic analysis 13–16
3. Scores calculated by regressing a large number of structural descriptors 17
4. Scores calculated using fragment or substructure-level contributions 18–23

All of these techniques must be regressed or trained on some ground truth data. For retrosynthetic techniques, this may be the ease of finding a synthetic route starting from simpler starting materials, sometimes as a binary easy/hard score. 16 For numerical scores,
often a survey of expert chemists is used as the data set. 12,17 Some studies have examined what features are most prevalent in large chemical databases, most notably Ertl and Schuffenhauer’s use of the PubChem database to identify rare structural motifs when defining their SA_Score metric, 20 or Fukunishi et al.’s 21 use of commercially-available compound databases to correlate complexity with price. Despite numerous studies seeking to quantify complexity, published reactions are rarely used as the basis for scoring. An important exception is the approach proposed by Heifets, 24 who trains a model to predict the number of synthetic steps needed to reach a given chemical from buyable compounds based on 43,976 molecules from the patent literature. Each molecule was labeled with the number of reaction steps required to produce it from a fixed set of commercially-available compounds. We propose a methodology for using precedent reaction knowledge to learn a function for evaluating synthetic complexity. This synthetic complexity score provides an additional metric by which virtual screening or molecular design pipelines can prioritize compounds as a complement to other metrics. Specifically, we train a neural network model to quantify synthetic complexity scores in a manner that correlates with the number of reaction steps required to produce a compound, but does not rely on the availability of multistep reaction pathways, chemist rankings, or rigid encoding of buyable compounds for training. We refer to this learned complexity score as the SCScore.

Approach

Desirable Properties

Our definition of synthetic complexity arises from one simple premise: if a synthetic route between compounds A and C goes through intermediate B, then the reaction steps A → B and B → C should each be considered productive reaction steps, reflected by an increase in synthetic complexity. There are, of course, many domain-specific definitions of what a “reasonable” or “good” synthetic route is. However, a synthetic complexity score should trend upwards during a multistep synthesis. 8,24 Equivalently, the retrosynthetic analysis of a target compound should allow us to find a pathway to buyable chemicals with only “downhill” moves that reduce complexity. This enables the prioritization of retrosynthetic disconnections during an automated retrosynthesis search. This prioritization has traditionally been done with simple expert heuristics like the length of molecules’ canonical SMILES 25 representation raised to the 3/2 power (SMILES3/2). 5 More generally, aligned with the intent of one of the earliest complexity metrics from Bertz, 8 the metric should reflect when progress is being made in a synthetic sequence.

We would also like the synthetic complexity score to be insensitive to the exact starting materials database. For a specific retrosynthetic program, it might make sense to specify precisely which compounds are available. For a universal scoring function, we prefer to keep this definition general and work under the assumption that the most synthetically simple compounds are reasonable starting materials, either purchased directly or synthesized straightforwardly. Chemicals commonly used as reactants are necessarily easier to obtain than products, which provides an indication of what might be commercially available without restricting the analysis to depend on a specific compound database.
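The SMILES-length heuristic mentioned above is simple enough to sketch directly. The snippet below is a minimal illustration only: the function names are hypothetical, and it assumes the input strings have already been canonicalized by some upstream toolkit.

```python
from typing import List


def smiles_length_score(smiles: str) -> float:
    """Heuristic complexity: SMILES string length raised to the 3/2 power."""
    return len(smiles) ** 1.5


def rank_precursors(candidates: List[str]) -> List[str]:
    """Order candidate precursors from lowest to highest heuristic score.

    Hypothetical helper: a retrosynthesis search could expand the
    lowest-scoring ("most downhill") candidates first.
    """
    return sorted(candidates, key=smiles_length_score)
```

For instance, `rank_precursors(["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1"])` places ethanol first and aspirin last, because the score depends only on string length. That dependence is also why the heuristic cannot distinguish molecules of similar size.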

Details of Approach

We aim to find a suitable function f with flexible parameterization (θ) to map any molecule m to a numerical synthetic complexity score (Equation 1).

\[ \mathrm{SCScore}(m) \equiv f(m; \theta) \tag{1} \]

In order to satisfy the premise of our definition of synthetic complexity, we seek to impose an inequality constraint on the complexity assigned to chemicals appearing in each reaction example {R1 + R2 + ... + Rn −−→ P}, where Ri are the reactants and P is the major product.

Figure 1: Visualization of the approach to quantifying synthetic complexity (SCScore). Starting from a reaction network of known reactions (a), we wish to find a means of arranging all known chemicals so that products are scored more highly than each of their corresponding reactants (b) as a means of assigning a quantitative synthetic complexity score to every known compound (c) using a parameterized model that is generalizable to novel compounds. In reality, the network of known reactions used for training and testing consists of millions of examples involving millions of distinct chemical species.

As shown in Equation 2, this means that the score assigned to a reaction product should be at least as large as any of the scores assigned to the corresponding reactants.

\[ f(P; \theta) \ge \max_i f(R_i; \theta) \quad \forall \, (R_1 + R_2 + \dots + R_n \longrightarrow P) \tag{2} \]
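As a small illustration of Equation 2, the check below tests whether a single reaction example satisfies the constraint. It is a sketch under stated assumptions: `score` stands in for the learned function f(·; θ), and the toy scores are invented purely for demonstration.

```python
from typing import Callable, Sequence


def satisfies_constraint(score: Callable[[str], float],
                         reactants: Sequence[str],
                         product: str) -> bool:
    """Return True if the product scores at least as high as every reactant."""
    return score(product) >= max(score(r) for r in reactants)


# Toy scores on the 1-5 scale, invented for illustration (not model output).
TOY_SCORES = {"CCO": 1.2, "CC(=O)O": 1.3, "CC(=O)OCC": 1.9}
```

With these toy values, the esterification `CCO + CC(=O)O -> CC(=O)OCC` satisfies the constraint, while the reverse direction does not.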

To formalize this training objective, we divide our set of reaction examples {R1 + R2 + ... + Rn −−→ P} into a set of molecular pairs {R, P} for which the constraint should hold. Rather than seek an exact solution to achieve this separation, we define a hinge loss function to match this pairwise ranking objective

\[ L(\theta) = \sum_{(R,P)} \max\bigl(0,\; f_0 - [f(P; \theta) - f(R; \theta)]\bigr) \tag{3} \]

where the inclusion of f0 > 0 encourages a minimum degree of separation between reactants and products. L(θ) represents a sum of “penalties” from each reactant and product pair, and is the function that we aim to minimize during model training. Note that because the model f(m; θ) takes a single molecule as input and outputs a single score, two model evaluations are used per (R, P) pair to calculate the difference in their scores.

An important consequence of using a hinge loss function (i.e., max(0, x)) instead of a binary cross-entropy loss function (as would be more typical for a pairwise ranking task) is that the model will not try to maximize the separation of scores between reactants and products. If the model recognizes a reactant as significantly simpler than a product, with a difference exceeding f0, then that example does not contribute to L(θ) and will not be used to update model parameters. This mitigates the risk of overfitting, where molecules that only appear as products would be inclined toward the maximum score; such molecules need to be only slightly more complex than the reactants required to synthesize them. During training, a large fraction of examples will readily satisfy f(P; θ) − f(R; θ) ≥ f0, while a much smaller number of examples will contribute to the model’s learned nuance. This is analogous to how a support vector regression model may depend on a subset of training examples as
support vectors, although all examples have the potential to affect the final model.

A visualization of this approach is shown in Figure 1. Starting from a database or network of known reactions (1a), we wish to find a means of arranging all known chemicals so that products are scored more highly than each of their corresponding reactants (1b) as a means of assigning each a quantitative SCScore (1c). If we do so using an appropriately parameterized model, the scoring function will be generalizable to novel compounds.

Our reaction-based definition affords the following advantages:

1. The perceived difficulty of synthesizing a molecule is directly informed by historical trends pertaining to how products are made and, implicitly, in how many steps. Each molecule is analyzed in the context of all known molecules as they appear in all known reactions.
2. The SCScore is an aggregate of experimental information and does not suffer from potential personal biases (e.g., reactions very familiar to chemists on paper might actually be used with low frequency in practice) or from the limitations of automated retrosynthetic analyses (i.e., missing reaction rules or making unrealistic disconnections). Previous studies have shown significant disagreement between chemists when scoring synthetic accessibility and when prioritizing compounds during drug discovery, suggesting that a data-driven model could provide a useful complement to subjective chemist evaluations. 26,27
3. By reformulating the problem of predicting synthetic complexity as an analysis of reactions instead of substances, we avoid the need for ground truth (molecule, value) pairs and instead learn from ranked (molecule, molecule) pairs. This enables the use of much larger data sets (i.e., reaction databases) for training instead of small, expert-defined data sets.
4. Synthetic complexity should be nonlinear with respect to the number of atoms or functional groups present; it might be that it is easier to prepare the protected R−NHBoc version of a compound rather than the R−NH2, which should be reflected by the primary amine having a higher synthetic complexity score. This precludes the use of simple additive models combining fragment contributions, as the whole molecule must be analyzed simultaneously to understand this nuance. It also necessitates the analysis of reactions to understand when protections or deprotections should be considered productive steps.
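The pairwise objective of Equation 3 can be sketched in a few lines. This is an illustrative sketch, not the authors' training code: `score` is a placeholder for f(·; θ), the helper names are hypothetical, and the default offset of 0.25 is the value reported in the Implementation section.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def reaction_to_pairs(reactants: Sequence[str], product: str) -> List[Tuple[str, str]]:
    """Expand one reaction R1 + ... + Rn -> P into (reactant, product) pairs."""
    return [(r, product) for r in reactants]


def hinge_loss(pairs: Iterable[Tuple[str, str]],
               score: Callable[[str], float],
               f0: float = 0.25) -> float:
    """Sum of max(0, f0 - [f(P) - f(R)]) over all (R, P) pairs."""
    total = 0.0
    for reactant, product in pairs:
        separation = score(product) - score(reactant)
        # A pair already separated by at least f0 contributes nothing,
        # so it would produce no gradient during training.
        total += max(0.0, f0 - separation)
    return total
```

With toy scores A = 1.0, B = 1.5, and C = 1.6, the pair (A, B) is separated by more than 0.25 and contributes zero, while (B, C) is separated by only 0.1 and contributes a penalty of about 0.15.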

Implementation

We use a Morgan fingerprint of radius 2 (similar to an Extended-Connectivity Fingerprint of diameter 4 (ECFP4) 28) as implemented in the RDKit, 29 including chirality, folded to a bit vector of length 1024, as the input to our model. Two other fingerprint representations, length-2048 boolean fingerprints and length-1024 integer fingerprints, were also tested but were not found to offer significant advantages (cf. Tables S4 and S5). The model, implemented in Python 2.7 using TensorFlow, 30 is a feedforward neural network consisting of five hidden layers with 300 hidden nodes each, ReLU activation with bias, and a single sigmoid output. No regularization (e.g., L2, Dropout) was used. The output score ∈ (0, 1) is linearly scaled to (1, 5) to be more natural for human interpretation. We set the offset in the hinge loss to be f0 = 0.25 so that an “equilibrium” separation of all intermediates in a 16-step synthesis could fit within the range (1, 5). Because synthetic complexity, as defined, should be nonlinear with respect to structure, we did not explore simpler model architectures that may have enabled straightforward interpretation as a fragment-contribution approach does. A schematic of the model and more training details can be found in the Supporting Information (Figure S1).

As our data source, we use the ca. 12 million reactions in Reaxys 31 that have (a) a reaction structure file parseable by RDKit, (b) at least one set of reaction details labeled as a “single-step” reaction, and (c) a single recorded major product. This last criterion ensures that trivial side products (e.g., salts, water) are excluded. These are expanded to ca. 22 million (reactant, product) pairs. We use a randomized 80:10:10 training:validation:testing split. Note that although entries are not filtered by publication year, there is a natural bias toward recently-published reactions because the rate of publishing is growing exponentially.
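The choice of f0 = 0.25 follows from simple arithmetic on the output range, which the sketch below makes explicit. The function names are hypothetical, and the sketch assumes the offset is applied on the scaled (1, 5) range, as the 16-step argument above implies.

```python
def scale_output(raw: float, low: float = 1.0, high: float = 5.0) -> float:
    """Linearly map a sigmoid output in (0, 1) onto the (low, high) range."""
    return low + (high - low) * raw


def max_separated_steps(f0: float = 0.25, low: float = 1.0, high: float = 5.0) -> float:
    """Number of consecutive steps, each raising the score by f0, that fit in the range."""
    return (high - low) / f0
```

Here `max_separated_steps()` evaluates to (5 − 1) / 0.25 = 16, matching the 16-step synthesis mentioned above, and a raw sigmoid output of 0.5 maps to a score of 3.0.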

Assumptions of Approach

In how we have structured the learning task, a molecule is considered “hard to make” (i.e., would be assigned a high score) if it appears to require many reaction steps to synthesize through conventional means. There are some subtleties in this assumption that warrant mention:

1. Although the objective during optimization is to impose a separation between reactants and products of at least f0 = 0.25, some reactions will be more complexifying than others and not all reactions will achieve this separation. In practice, it might be possible to synthesize a compound with score 3.5 in fewer steps than one with score 3.0 if the compound scored 3.5 appears to be synthetically challenging but is actually conducive to useful synthetic steps or a convergent synthesis. The more a compound appears as a reactant, the lower its perceived synthetic complexity will tend to be.
2. There is no explicit knowledge of what starting materials are available (i.e., purchasable). This is learned implicitly by the model as it gradually learns what types of structures tend to be present in reactants and what structures tend to be present in products. If a certain motif appears in both the reactants and products of a single reaction example, then a gradient step trying to increase the separation of their scores will not change how that motif contributes to the overall SCScore. Conceptually, the learned SCScore thus mimics how a chemist might identify a certain substructure as something that can be purchased and installed, rather than created. This can result in large discrepancies between the SCScore and past synthetic accessibility scores when complex scaffolds are perceived to be readily available due to the prevalence of similar compounds as reactants. As an example, deoxycholic acid is perceived to be a “low complexity molecule” because it appears significantly more often at the start of a synthetic step than at the end; a linear route to deoxycholic acid is shown later in Figure 7.
3. There is no explicit consideration of how complicated a particular reaction is in terms of its necessary reagents, catalysts, procedural difficulty, or even its yield. This makes the learned synthetic complexity score more suitable for application to a research setting (e.g., a medicinal chemistry laboratory) rather than a development setting, because process chemistry necessitates stricter considerations of what constitutes a good synthesis. 20,32 These considerations are correspondingly less able to be captured by a single metric. In Sheridan et al.’s study, this mismatch manifested itself as a relatively poor correlation between estimated synthetic complexity and process mass intensity (PMI) compared to its correlation with SA_Score. 17 No attempt was made in this study to categorize reactions from the Reaxys database into research-relevant and process-relevant subsets.
4. The SCScore model will be biased by the types of reactants and products that appear in the data (Reaxys). This means that complicated natural products (of which there are relatively few in Reaxys) may “saturate” our scoring function and be assigned a maximum value of 5. We see this with several examples from Sheridan et al. later in Figure S3, where chemicals are assigned a value close to 5 (corresponding to the maximum complexity).

Results

Quantitative Evaluation

Most reactant-product reaction pairs are easily separated during training, as indicated by the steep loss curves shown in Figure 2 and the rapid increase in the fraction of examples with a separation greater than f0 = 0.25. Validation performance was used to select the best performing model state, which was then evaluated using the test set and subjected to further qualitative analysis. The similar performance between the training data and the validation data suggests that the model is not being significantly overfit, which is consistent with the comparable test set performance achieved after final model selection. Additional scatterplots showing correlation between training/validation losses and validation/testing losses are shown in Figure S2.

Figure 2: Profiles showing the (a) average hinge loss, (b) fraction of reactant-product pairs with a positive SCScore difference, and (c) fraction of reactant-product pairs with a SCScore difference exceeding 0.25. Each epoch represents one iteration of training on the entire data set. The validation performance (blue) was used to determine the model state to use for testing (red) and further application.

There are 2,214,928 reactant-product pairs in the testing set. Learned complexity scores are shown in aggregate for these compounds in Figure 3. Of these pairs, 91.3% have a positive SCScore difference and 78.9% have an SCScore difference greater than f0 = 0.25. The median increase in SCScore is 0.77, although it is important to keep in mind that these statistics are for all reactant-product pairs. A significant number of reactants are trivial salt fragments or counterions that are assigned a low SCScore very close to 1 (as can be seen in Figure 3a).
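The three summary statistics quoted above are straightforward to compute from a list of per-pair score differences. The following is a hedged sketch with a hypothetical function name; the differences used to exercise it are toy values, not data from the paper.

```python
from statistics import median
from typing import Dict, Sequence


def pair_statistics(differences: Sequence[float], f0: float = 0.25) -> Dict[str, float]:
    """Summarize product-minus-reactant score differences for a set of pairs."""
    if not differences:
        raise ValueError("at least one reactant-product pair is required")
    n = len(differences)
    return {
        # Fraction of pairs where the product outscores the reactant.
        "fraction_positive": sum(d > 0 for d in differences) / n,
        # Fraction of pairs separated by more than the hinge offset f0.
        "fraction_above_f0": sum(d > f0 for d in differences) / n,
        "median_difference": median(differences),
    }
```

Applied to the toy differences `[-0.1, 0.1, 0.3, 0.8]`, this gives a positive fraction of 0.75, a fraction above f0 of 0.5, and a median of about 0.2.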

Figure 3: Histograms showing (a) the SCScore assigned to each reactant and product in the test set, and (b) the difference in SCScores for each reactant-product pair. Many reactant fragments are trivial salts or counterions and are assigned low scores near 1.

Visualization of Synthetic Complexity

To visualize the distribution of assigned synthetic complexity scores as a function of molecular structure, principal component analysis was applied to the fingerprints of a random selection of 50,000 products appearing in the test set. The first three principal components are used to produce the plot shown in Figure 4a, where each point is colored by its SCScore (red: high, blue: low). Because the learned model uses the molecular fingerprint as its only input, we expect a correlation to exist. However, the trend observed globally does not hold upon closer examination of a narrow region (−0.1 < PCA3 < 0.1), shown in Figure 4b. The same visualizations using the SMILES3/2 heuristic function are shown in Figures 4c and 4d; the SMILES3/2 score exhibits a similar global trend but does not exhibit the same degree of local variation. This is explained by the neural network’s ability to capture effects that are nonlinear with respect to the original fingerprint. Not only does the model have the capacity to capture nonlinear effects, but Figure 4 shows that this nonlinearity is necessary to capture the nuances of synthetic complexity, as it is defined here, beyond what is reflected by the first three principal components.

Comparison to Chemist Scores

Sheridan et al. 17 provide a large open-source set of structures (divided into several subsets) and complexity scores assigned by a large panel of expert chemists. Out of 1,775 compounds, 44 cannot be parsed by the most recent release of RDKit 29 and were excluded from the comparison. We assign complexity scores to the remaining 1,731 compounds using our learned SCScore model as well as the Synthetic Accessibility score (SA_Score); 20 these are shown in Figures S3 and S4, respectively. The full data set, including the 44 unparseable molecules, can be found in the Supporting Information. Several molecules from Sheridan et al.’s MDDR (MDL Drug Data Report) subset are substantially more complex than those most commonly found in Reaxys, which results in a “saturation” of the SCScore at the maximum complexity of 5. Histograms showing the distributions of scores can be found in Figure 5, demonstrating that our model consistently ranks molecules as more synthetically complex than Sheridan et al.; this is particularly true for the MDDR subset (Figure S6).

Figure 4: Visualization of complexity for 50,000 random products from the Reaxys test set. (a, c) Scatterplot of the first three principal components, colored by assigned complexity score and (b, d) Zoomed-in view of a slice with −0.1 < PCA3 < 0.1 for (a, b) the learned SCScore and (c, d) the heuristic SMILES3/2 score, where in all cases red signifies a higher score than blue. Both the SCScore and heuristic SMILES score show a similar global trend in complexity as a function of the first three principal components. However, the SCScore shows a high degree of local variation due to the nonlinearity of the trained model, while the heuristic SMILES3/2 score shows very little variation as this simple scoring heuristic does not capture any nuance in synthetic complexity.

Figure 5: Distributions of the SCScore and the meanComplexity score as assigned by chemists in Sheridan et al. 17 for the entire data set. Several molecules from the MDDR subset are substantially more complex than what is most commonly found in Reaxys, which results in a “saturation” of the learned score at the maximum complexity of 5.

A statistically significant correlation exists between Sheridan et al.’s meanComplexity score and the SCScore assigned here for all data subsets except the small Kjell 32 set, consisting of just 15 molecules and previously used in a study of process mass intensity (PMI); within the Kjell et al. subset, the correlation to meanComplexity is stronger using SA_Score. This is not entirely surprising, as the compounds used for training the SA_Score and the compounds used in Sheridan et al. are both intended to be drug-like. Interestingly, even the naive SMILES3/2 heuristic complexity score correlates strongly with the chemist consensus meanComplexity, arguably more closely than both the SCScore and SA_Score (Figure S5).

The lackluster correlation with Sheridan et al.’s meanComplexity scores should not be taken as an indication of a poor model for synthetic complexity. In their crowdsourcing study, “explicit instructions stated that the goal was to score ‘complexity’ and not specifically synthetic difficulty.” 17 The intent of the study was to analyze an aggregated definition of molecular complexity based on subjective chemist scores, rather than to study one specific definition of complexity. Similarly, the SA_Score model from Ertl and Schuffenhauer 20
was developed as a complexity metric based on the popularity of fragments in the PubChem database, which is correlated with but perhaps not directly aligned with synthetic complexity. The disparity between meanComplexity scores and SCScores is related to the SCScore’s focus on synthetic complexity, rather than the broader definition of molecular complexity that was captured by Sheridan et al.’s survey. Several compounds with exceptionally large discrepancies are shown in Figure 6. The first compound (6a), glucosamine, is perceived by the model as synthetically simple due to its prevalence as a starting material, which implies that it is straightforward to obtain (in this case, to purchase); the chemist meanComplexity score was 3.105 ± 1.314, the largest standard deviation of the entire dataset, suggesting that some survey respondents recognized it as a common building block while some did not. As Sheridan et al. explain, “a molecule that is complex as seen by one method (e.g., with many chiral centers) may appear very synthetically accessible in a retrosynthetic view if most of the chiral centers are contained in a single preexisting reagent”. 17 Similarly, 1,2,3,4,5,6-cyclohexanol (6b) is trivial to purchase and also received a wide range of chemist scores (reported standard deviation of 1.26). Ouabain (6c) is not inexpensive, but is commercially available and is more commonly employed as a reactant than a product. The fourth example (6d) strongly resembles the penicillin core structure and is perceived as being a straightforward derivative of such an available starting material. The remaining four compounds (6e-6h) are perceived by the model as synthetically complex but by chemist respondents as structurally simple. Compounds 6e and 6f both contain stereoselectively-substituted tetralin scaffolds, which accounts for their perceived complexity.

Analysis of Synthetic Routes

A major goal for our learned synthetic complexity score is to reflect when progress is being made in a multi-step synthesis; this is a result of its design to correlate with the number of synthetic steps required to produce a molecule. To evaluate its performance in this regard,


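[Figure 6 image: structures of compounds a-h, each annotated with its SCScore and meanComplexity. Panel a: SCScore 1.249, meanComplexity 3.105. Remaining (SCScore, meanComplexity) pairs recovered from the image, with the assignment to individual panels not recoverable: panels b-d: (1.423, 2.442), (1.794, 4.778), (1.740, 3.216); panels e-h: (4.262, 1.413), (4.998, 2.222), (4.444, 1.778), (3.636, 1.583).]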

Figure 6: Compounds from Sheridan et al. 17 with substantial differences between the SCScore (blue) and chemist consensus meanComplexity score (red), both between 1 and 5. (a-d) Compounds that the model perceives to be simple that were scored by chemists as complex, perhaps because the chemists did not consider or recognize what is buyable; (e-h) compounds that the model perceives to be complex that were scored by chemists as simple.

we turn to a recent paper by Flick et al. 33 describing likely synthetic routes to the 29 new chemical entities (NCEs) approved in 2015, divided into 41 linear syntheses. Many steps actually consist of multiple parts; we refer the reader to the original paper for details. We calculate the synthetic complexity of every starting material, intermediate, and final product in these 41 linear syntheses using (a) our SCScore, (b) SA_Score, 20 as implemented in RDKit, and (c) a simple heuristic SMILES^{3/2} based on the length of a compound’s SMILES string representation. A comparison of our learned scoring function against SA_Score is shown in Figure 7 as a function of reaction step; a similar plot showing the SMILES^{3/2} score is shown in Figure S7. An ideal trend would be a monotonic increase in synthetic complexity with reaction step number, operating under the assumption that every reaction step in these syntheses helps make progress towards the goal of synthesizing the final product molecule. The SMILES strings and scores for all 205 structures can be found in the Supporting Information.

Qualitatively, SCScores (blue) tend to reflect a monotonic increase in synthetic complexity throughout these linear syntheses more often than the SA_Score (orange), which was designed to measure synthetic complexity but was trained on substances rather than


Figure 7: Synthetic complexity scores assigned to each reactant, intermediate, and product in 41 linear syntheses of the 29 new chemical entities (NCEs) approved in 2015 as reported by Flick et al. 33 SCScores are shown in blue, while SA_Scores are shown in orange. X-axes correspond to the reaction step number with starting materials at x = 0.


reactions. Selected syntheses are shown in Figure 8.
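The SMILES-length baseline used in the comparison above is simple enough to sketch directly. The snippet below assumes that the heuristic is the length of the SMILES string raised to the power 3/2; the function name and the two example molecules are illustrative choices and are not taken from the paper.

```python
def smiles_length_score(smiles: str, exponent: float = 1.5) -> float:
    """Heuristic complexity: SMILES string length raised to `exponent`.

    The exponent of 3/2 is an assumption about how the SMILES^{3/2}
    baseline is defined; no chemistry toolkit is needed to evaluate it.
    """
    return len(smiles) ** exponent


# Illustrative molecules (hypothetical inputs chosen for this sketch).
examples = {
    "ethanol": "CCO",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}

scores = {name: smiles_length_score(smi) for name, smi in examples.items()}
for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

Because the score depends only on string length, it rewards any step that lengthens the SMILES string, including protecting-group additions, which is one reason to treat it only as a baseline.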

[Figure 8 image: reaction schemes showing the score of each structure and the change in score above each forward reaction arrow. Scores recovered from the image, listed from starting material to final product:
(a) Polmacoxib. SCScore: 1.411, 1.706, 2.143, 2.434, 2.632, 2.855, 3.767. SA_Score: 3.320, 3.143, 2.634, 2.700, 3.418, 2.664, 2.404.
(b) Lenvatinib frag 198. SCScore: 1.311, 2.087, 2.392, 2.734, 3.046. SA_Score: 3.128, 2.527, 2.196, 2.008, 2.065.
(c) Omarigliptin frag 147. SCScore: 1.610, 2.129, 2.469, 2.653, 2.770. SA_Score: 2.231, 2.798, 4.317, 2.620, 2.872.
(d) Flibanserin. SCScore: 1.953, 3.786, 3.250, 4.260. SA_Score: 2.420, 2.321, 2.184, 2.204.]

Figure 8: Exemplary syntheses from Flick et al. 33 where there is a significant difference in the complexity trends as evaluated by the learned SCScore model (blue) and the SA_Score (red). The change in complexity is shown above each forward reaction arrow.

Perhaps the most striking difference is observed in the synthesis of Polmacoxib (Figure 8a), where the SA_Score perceives a decrease in synthetic complexity for four of the six reaction steps. If the products of these reactions were actually more synthetically accessible than their reactants, then the overall synthesis could be trivially improved by starting with the product and forgoing that reaction. The SCScore model perceives a monotonic increase in complexity as the linear synthesis proceeds, while the SA_Score perceives several steps as


simplifying. Similar trends can be seen for the synthesis of Lenvatinib fragment 198 (8b), where 3/4 and 0/4 reaction steps are seen as simplifying by SA_Score and our learned model, respectively.

There are still several cases in Figure 7 where the SCScore model believes a reaction to be simplifying, one of which can be seen in the second reaction step in the synthesis of Flibanserin (8d). The removal of the isopropyl group is seen as simplifying; the model does not recognize that substructure as a common protecting group (as it might for, e.g., Boc). The starting material is a convenient source of the central scaffold with broken symmetry and appears to be available from several dozen vendors; the commercial availability of this starting material may have informed the development of the route, which is not captured by the model.

Part of the power of using a flexible neural network model to learn synthetic complexity (rather than relying on heuristic calculations or expert-defined descriptors) is its ability to capture nonlinear or potentially counter-intuitive trends. An example of this is shown in Figure 9 for the final step in the synthesis of Eluxadoline. In this reaction step, a methyl ester is hydrolyzed to the acid and a secondary R−NHBoc is deprotected to form the primary R−NH2 amine. Although this step might reduce the structural complexity as perceived by SA_Score, the fact that the synthesis of Eluxadoline relies on this protected intermediate indicates that the intermediate is necessarily less synthetically complex. The deprotected Eluxadoline requires strictly more synthetic steps when proceeding through the reported route. The ability of our model to perceive when protections or deprotections are productive is critically important to its potential application to automated retrosynthesis; this retro deprotection step would still be perceived as a “downhill” step, making progress towards reaching simpler, more available starting materials.
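The step-by-step comparison for Polmacoxib and Flibanserin described above can be reproduced with a few lines of code. The following is a minimal sketch that uses the per-structure scores printed in Figure 8, ordered from starting material to final product; the helper names `step_changes` and `count_simplifying` are hypothetical and are not part of any released code.

```python
from typing import Dict, List


def step_changes(scores: List[float]) -> List[float]:
    """Change in score across each forward reaction step."""
    return [after - before for before, after in zip(scores, scores[1:])]


def count_simplifying(scores: List[float]) -> int:
    """Number of steps whose product scores lower than its precursor."""
    return sum(1 for change in step_changes(scores) if change < 0)


# Scores read from Figure 8, starting material first, final product last.
routes: Dict[str, List[float]] = {
    "Polmacoxib / SCScore": [1.411, 1.706, 2.143, 2.434, 2.632, 2.855, 3.767],
    "Polmacoxib / SA_Score": [3.320, 3.143, 2.634, 2.700, 3.418, 2.664, 2.404],
    "Flibanserin / SCScore": [1.953, 3.786, 3.250, 4.260],
    "Flibanserin / SA_Score": [2.420, 2.321, 2.184, 2.204],
}

for name, scores in routes.items():
    n_steps = len(scores) - 1
    print(f"{name}: {count_simplifying(scores)} of {n_steps} steps simplifying")
```

A route is monotonically increasing exactly when `count_simplifying` returns zero, which is the ideal trend described for Figure 7.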

Analysis of Reaction Types

To determine the perceived complexity of different reaction types, we turn to the open source ca. 50k reaction dataset previously used for the task of reaction role assignment 34 and


[Figure 9 image: final step in the synthesis of Eluxadoline. Protected intermediate: SCScore 4.312, SA_Score 3.870. Eluxadoline: SCScore 4.552, SA_Score 3.691. Change across the step: SCScore +0.239, SA_Score -0.179.]

Figure 9: Final step in the synthesis of Eluxadoline, 33 showing scores assigned by the SCScore model (blue) and SA_Score (red). Our model recognizes that the protected form is less synthetically complex, despite the presence of additional functionalities, which is necessarily true due to its use as an intermediate during the preparation of Eluxadoline.

retrosynthesis prediction, 35,36 derived from a larger collection from the US patent literature. 37 The reactions of this particular subset have been classified by Schneider et al. 38 into 10 reaction classes described in Table S1. We refer the reader to Schneider et al. 38 for more information on the classification methodology. We use the pre-cleaned data from Coley et al., 36 whereby examples with multiple products are split into multiple distinct examples. A second set of 14k reactions from 14 subclasses within class 1 (cf. Table S1) was taken from Schneider et al. 38 for additional analysis. Rather than dividing examples into reactant-product pairs, here we define the complexity of a reaction to be the difference in complexity of the product and the maximum reactant complexity. That is,

$$\Delta f(R_1 + \dots + R_n \longrightarrow P) = f(P) - \max_i f(R_i) \tag{4}$$
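The reaction complexity in Equation 4 can be written as a small function. The sketch below assumes that a scoring function mapping a molecule identifier to a score is available; the lookup table of scores is hypothetical and stands in for the trained model.

```python
from typing import Callable, Iterable


def reaction_complexity(
    score: Callable[[str], float],
    reactants: Iterable[str],
    product: str,
) -> float:
    """Product score minus the largest reactant score (Equation 4)."""
    return score(product) - max(score(r) for r in reactants)


# Hypothetical scores for a convergent reaction A + B -> P.
toy_scores = {"A": 1.5, "B": 2.0, "P": 3.0}

delta = reaction_complexity(toy_scores.__getitem__, ["A", "B"], "P")
print(f"Reaction complexity: {delta:+.2f}")
```

Because only the most complex reactant is subtracted, a step that joins two moderately complex building blocks is credited with the full jump to the product, which is why convergent reactions tend to receive large positive values under this definition.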

Histograms showing the distributions of reaction complexity assigned to reactions within each class are shown in Figure 10.

Figure 10 reveals that reactions of the first four classes (heteroatom alkylation and arylation, acylation and related processes, C-C bond formation, and heterocycle formation) tend to introduce the most complexity as perceived by the model. This is a consequence of having defined reaction complexity using the maximum of reactant complexities, as this definition favors convergent syntheses where two building blocks of moderate complexity may be brought together to form a product of significantly higher complexity. Reactions within the first four classes tend to be convergent. Heterocycle formation reactions, by definition, form heterocycle motifs that can appear challenging to synthesize in addition to bringing two similarly complex building blocks together, which accounts for their slightly higher average complexity of +0.76; recognizing highly complexifying heterocycle-forming reactions is important for automated retrosynthesis, as these reactions can be overlooked during manual retrosynthetic analysis.

Figure 10: Quantification of the extent to which different reaction types are complexifying as perceived by the SCScore. Histograms depict the differences in scores between products and reactants, where the complexity of multiple reactants is equal to the maximum of the set of reactant complexities. Reaction data is from Schneider et al. 38

The remaining reaction classes, described in the legend of Figure 10, are more centrally distributed around ∆f = 0 but all have an average increase in synthetic complexity. The fact that both protections and deprotections have positive average SCScore differences is another testament to the model’s understanding of when the addition or removal of protecting groups is productive; a fragment-based approach would necessitate that protection and deprotection steps have opposing effects on synthetic complexity. Some types of reactions, particularly protections and oxidations, are largely perceived as “lateral” steps leading to small changes in synthetic complexity. This is consistent with how certain reactions are used in route planning:


protections and deprotections are almost always a means to an end, but they themselves may not have a strong effect on synthetic complexity. Likewise, oxidation reactions are often minor functional group manipulations (e.g., alcohol to ketone). A detailed analysis of subclasses within class 1 (heteroatom alkylation and arylation) can be found in Figure S9 with subclass descriptions in Table S3.

An analogous figure to Figure 10 using SA_Score is shown in Figure S8. With SA_Score, seven of the ten reaction classes are perceived to be simplifying (i.e., have a negative average change in complexity), which illustrates that the SCScore better captures the increase in synthetic complexity from reactants to products.
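The class-level statements about the SCScore can be checked against the per-class averages printed in the panels of Figure 10. The snippet below is a minimal sketch that transcribes those ten averages by hand; the variable names are illustrative.

```python
# Average product-minus-reactant SCScore per reaction class, as printed in
# the Figure 10 panels (class numbering follows the figure legend).
class_avg = {
    1: 0.62,   # Heteroatom alkylation and arylation
    2: 0.58,   # Acylation and related processes
    3: 0.63,   # C-C bond formation
    4: 0.76,   # Heterocycle formation
    5: 0.12,   # Protections
    6: 0.23,   # Deprotections
    7: 0.22,   # Reductions
    8: 0.11,   # Oxidations
    9: 0.18,   # Functional group interconversion
    10: 0.23,  # Functional group addition
}

# Every class has a positive average change in SCScore.
all_positive = all(avg > 0 for avg in class_avg.values())

# The class with the largest average and the two with the smallest.
most_complexifying = max(class_avg, key=class_avg.get)
most_lateral = sorted(class_avg, key=class_avg.get)[:2]

print(all_positive, most_complexifying, most_lateral)
```

The two smallest averages belong to oxidations and protections, matching the description of these classes as largely lateral steps.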

Conclusion

We have developed a methodology for quantifying synthetic complexity based directly on published reaction data. By reformulating the problem, we shift from the traditional regression methods trained on (molecule, value) pairs to a method that imposes inequality constraints on (molecule, molecule) pairs. The trained SCScore model exhibits justifiable discrepancies with previous definitions of “complexity”, namely the crowdsourced definition from Sheridan et al., as we are specifically focusing on synthetic complexity. SCScores are applicable to lead prioritization in that even if a compound is perceived as complex by expert chemists, a low SCScore indicates that there may be an easier way of accessing that compound than initially recognized (Figure 6). SCScores also describe actual multistep syntheses well (Figures 7 and 8), including protections/deprotections (Figures 9 and 10), in a manner that enables their use as a guiding metric for retrosynthesis to identify options for rapidly introducing complexity. We believe that the SCScore has the potential to become an important additional metric in virtual screening pipelines and/or de novo molecular design.

We describe our workflow in full detail and open source our code to enable use of other datasets as knowledge bases, for example in-house electronic lab notebook data. We also include the trained model as a “deployed” model with minimal dependencies for straightforward


integration into Python-based workflows.
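The idea of imposing inequality constraints on (molecule, molecule) pairs, rather than regressing on (molecule, value) pairs, can be illustrated with a pairwise hinge penalty. The sketch below is one common way to express such a constraint and is not claimed to be the exact training objective; the margin of 0.25 and the toy scores are assumptions made for illustration.

```python
from typing import Sequence


def pairwise_hinge_loss(
    reactant_scores: Sequence[float],
    product_scores: Sequence[float],
    margin: float = 0.25,  # assumed value, for illustration only
) -> float:
    """Mean penalty for pairs where the product does not exceed the
    reactant score by at least `margin`."""
    penalties = [
        max(0.0, r - p + margin)
        for r, p in zip(reactant_scores, product_scores)
    ]
    return sum(penalties) / len(penalties)


# Toy (reactant, product) score pairs, invented for this sketch.
reactants = [1.0, 2.0, 3.0]
products = [2.0, 2.1, 2.5]

loss = pairwise_hinge_loss(reactants, products)
print(f"Mean hinge penalty: {loss:.3f}")
```

Only the ordering within each pair is constrained, so no absolute complexity label is ever needed for an individual molecule.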

Acknowledgement

This work was supported by the DARPA Make-It program under contract ARO W911NF16-2-0023. CWC received additional funding from the NSF Graduate Research Fellowship Program under Grant No. 1122374. The authors thank Wengong Jin for help with the code.

Supporting Information Available

Supplementary Information available: Additional figures and discussion of code and model hyperparameters. All code used for model training can be found at https://github.com/connorcoley/scscore. This material is available free of charge via the Internet at http://pubs.acs.org/.

References

(1) Méndez-Lucio, O.; Medina-Franco, J. L. The Many Roles of Molecular Complexity in Drug Discovery. Drug Discovery Today 2017, 22, 120–126.
(2) Bleicher, K. H.; Böhm, H.-J.; Müller, K.; Alanine, A. I. Hit and Lead Generation: Beyond High-Throughput Screening. Nat. Rev. Drug Discovery 2003, 2, 369.
(3) Baber, J.; Feher, M. Predicting Synthetic Accessibility: Application in Drug Discovery and Development. Mini-Rev. Med. Chem. 2004, 4, 681–692.
(4) Lee, D.; Sello, J. K.; Schreiber, S. L. Pairwise Use of Complexity-Generating Reactions in Diversity-Oriented Organic Synthesis. Org. Lett. 2000, 2, 709–712.


(5) Szymkuc, S.; Gajewska, E. P.; Klucznik, T.; Molga, K.; Dittwald, P.; Startek, M.; Bajczyk, M.; Grzybowski, B. A. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew. Chem., Int. Ed. 2016, 55, 5904–5937.
(6) Warr, W. A. A Short Review of Chemical Reaction Database Systems, Computer-Aided Synthesis Design, Reaction Prediction and Synthetic Feasibility. Mol. Inform. 2014, 33, 469–476.
(7) Proudfoot, J. R. Molecular Complexity and Retrosynthesis. J. Org. Chem. 2017, 82, 6968–6971.
(8) Bertz, S. H. The First General Index of Molecular Complexity. J. Am. Chem. Soc. 1981, 103, 3599–3601.
(9) Bertz, S. H. Convergence, Molecular Complexity, and Synthetic Analysis. J. Am. Chem. Soc. 1982, 104, 5801–5803.
(10) Bertz, S. H. On the Complexity of Graphs and Molecules. Bull. Math. Biol. 1983, 45, 849–855.
(11) Barone, R.; Chanon, M. A New and Simple Approach to Chemical Complexity. Application to the Synthesis of Natural Products. J. Chem. Inf. Comput. Sci. 2001, 41, 269–272.
(12) Takaoka, Y.; Endo, Y.; Yamanobe, S.; Kakinuma, H.; Okubo, T.; Shimazaki, Y.; Ota, T.; Sumiya, S.; Yoshikawa, K. Development of a Method for Evaluating Drug-Likeness and Ease of Synthesis Using a Data Set in which Compounds are Assigned Scores Based on Chemists’ Intuition. J. Chem. Inf. Comput. Sci. 2003, 43, 1269–1275.
(13) Ihlenfeldt, W.-D.; Gasteiger, J. Computer-Assisted Planning of Organic Syntheses: The Second Generation of Programs. Angew. Chem., Int. Ed. 1996, 34, 2613–2633.


(14) Huang, Q.; Li, L.-L.; Yang, S.-Y. RASA: a Rapid Retrosynthesis-Based Scoring Method for the Assessment of Synthetic Accessibility of Drug-Like Molecules. 2011.
(15) Boda, K.; Seidel, T.; Gasteiger, J. Structure and Reaction Based Evaluation of Synthetic Accessibility. J. Comput. Aided Mol. Des. 2007, 21, 311–325.
(16) Podolyan, Y.; Walters, M. A.; Karypis, G. Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods. J. Chem. Inf. Model. 2010, 50, 979–991.
(17) Sheridan, R. P.; Zorn, N.; Sherer, E. C.; Campeau, L.-C.; Chang, C.; Cumming, J.; Maddess, M. L.; Nantermet, P. G.; Sinz, C. J.; O’Shea, P. D. Modeling a crowdsourced definition of molecular complexity. J. Chem. Inf. Model. 2014, 54, 1604–1616.
(18) Hendrickson, J. B.; Huang, P.; Toczko, A. G. Molecular Complexity: a Simplified Formula Adapted to Individual Atoms. J. Chem. Inf. Comput. Sci. 1987, 27, 63–67.
(19) Allu, T. K.; Oprea, T. I. Rapid Evaluation of Synthetic and Molecular Complexity for in silico Chemistry. J. Chem. Inf. Model. 2005, 45, 1237–1243.
(20) Ertl, P.; Schuffenhauer, A. Estimation of Synthetic Accessibility Score of Drug-Like Molecules Based on Molecular Complexity and Fragment Contributions. J. Cheminform. 2009, 1, 8.
(21) Fukunishi, Y.; Kurosawa, T.; Mikami, Y.; Nakamura, H. Prediction of Synthetic Accessibility Based on Commercially Available Compound Databases. J. Chem. Inf. Model. 2014, 54, 3259–3267.
(22) Bottcher, T. An Additive Definition of Molecular Complexity. J. Chem. Inf. Model. 2016, 56, 462–470.
(23) Proudfoot, J. R. A Path Based Approach to Assessing Molecular Complexity. Bioorg. Med. Chem. Lett. 2017, 27, 2014–2017.


(24) Heifets, A. Automated Synthetic Feasibility Assessment: A Data-driven Derivation of Computational tools for Medicinal Chemistry. Ph.D. thesis, 2014.
(25) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
(26) Lajiness, M. S.; Maggiora, G. M.; Shanmugasundaram, V. Assessment of the Consistency of Medicinal Chemists in Reviewing Sets of Compounds. J. Med. Chem. 2004, 47, 4891–4896.
(27) Kutchukian, P. S.; Vasilyeva, N. Y.; Xu, J.; Lindvall, M. K.; Dillon, M. P.; Glick, M.; Coley, J. D.; Brooijmans, N. Inside the Mind of a Medicinal Chemist: the Role of Human Bias in Compound Prioritization During Drug Discovery. PloS One 2012, 7, e48476.
(28) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
(29) Landrum, G. RDKit: Open-Source Cheminformatics. http://www.rdkit.org, Accessed on 2016-11-20.
(30) Abadi, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR 2016, abs/1603.04467.
(31) Lawson, A. J.; Swienty-Busch, J.; Géoui, T.; Evans, D. The Future of the History of Chemical Information; ACS Publications, 2014; pp 127–148.
(32) Kjell, D. P.; Watson, I. A.; Wolfe, C. N.; Spitler, J. T. Complexity-Based Metric for Process Mass Intensity in the Pharmaceutical Industry. Org. Process Res. Dev. 2013, 17, 169–174.
(33) Flick, A. C.; Ding, H. X.; Leverett, C. A.; Kyne, R. E.; Liu, K. K. C.; Fink, S. J.; O’Donnell, C. J. Synthetic Approaches to the New Drugs Approved During 2015. J. Med. Chem. 2017, 60, 6480–6515.
(34) Schneider, N.; Stiefl, N.; Landrum, G. A. What’s What: The (Nearly) Definitive Guide to Reaction Role Assignment. J. Chem. Inf. Model. 2016, 56, 2336–2346.
(35) Liu, B.; Ramsundar, B.; Kawthekar, P.; Shi, J.; Gomes, J.; Luu Nguyen, Q.; Ho, S.; Sloane, J.; Wender, P.; Pande, V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Cent. Sci. 2017, 3, 1103–1113.
(36) Coley, C. W.; Rogers, L.; Green, W. H.; Jensen, K. F. Computer-Assisted Retrosynthesis Based on Molecular Similarity. ACS Cent. Sci. 2017,
(37) Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. Thesis, University of Cambridge, 2012.
(38) Schneider, N.; Lowe, D. M.; Sayle, R. A.; Landrum, G. A. Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity. J. Chem. Inf. Model. 2015, 55, 39–53.


Graphical TOC Entry


[Figure image, panels a-c: a set of example reactions (A+B → C, C+D → E, B+D → F, E+F → G, G+H → I, G+C → J) and the compounds A-J arranged along an axis of increasing SCScore.]

[Figure image, panels a-c: Loss [-], Fraction > 0.0 [-], and Fraction > 0.25 [-] plotted against Epoch [-] (0 to 30) for the training, validation, and testing sets.]



[Figure 7 image: one panel per linear synthesis. X axis: reaction step. Y axes: SCScore [1-5] and SA_Score [1-10]. Panel titles: Aripiprazole Lauroxil; Brexpiprazole; Brexpiprazole frag 65; Cangrelor; Cariprazine; Cobimetinib; Deoxycholic Acid; Eluxadoline; Eluxadoline frag 85; Esflurbiprofen; Evogliptin; Evogliptin frag 125; Evogliptin frag 136; Flibanserin; Isavuconazole frag 8; Isavuconazonium; Ixazomib; Lenvatinib; Lenvatinib frag 198; Lesinurad; Lusutrombopag; Lusutrombopag frag 113; Olanexidine; Omarigliptin; Omarigliptin frag 147; Omarigliptin frag 148; Osimertinib; Ozenoxacin; Ozenoxacin frag 27; Palbociclib; Palbociclib frag 212; Panobinostat; Polmacoxib; Rolapitant; Rolapitant frag 92; Rolapitant frag 94; Sacubitril; Safinamide; Selexipag; Trelagliptin; Uridine triacetate.]


[Figure 10 image: histograms of Prod-Reac Score (x axis, -1 to 2) against Count (y axis), one panel per reaction class, with the class average printed in each panel:
1. Heteroatom alkylation and arylation: avg = 0.62
2. Acylation and related processes: avg = 0.58
3. C-C bond formation: avg = 0.63
4. Heterocycle formation: avg = 0.76
5. Protections: avg = 0.12
6. Deprotections: avg = 0.23
7. Reductions: avg = 0.22
8. Oxidations: avg = 0.11
9. Functional group interconversion: avg = 0.18
10. Functional group addition: avg = 0.23]