Automatized Assessment of Protective Group Reactivity: A Step


Article

Automatized Assessment of Protective Group Reactivity: A Step Toward Big Reaction Data Analysis
Arkadii I. Lin, Timur Ismailovich Madzhidov, Olga Klimchuk, Ramil I. Nugmanov, Igor S. Antipin, and Alexandre Varnek
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00319 • Publication Date (Web): 26 Oct 2016
Downloaded from http://pubs.acs.org on November 1, 2016



Journal of Chemical Information and Modeling

Automatized Assessment of Protective Group Reactivity: A Step Toward Big Reaction Data Analysis

Arkadii I. Lin[a],[b], Timur I. Madzhidov[a], Olga Klimchuk[b], Ramil I. Nugmanov[a], Igor S. Antipin[a] and Alexandre Varnek*[a],[b]

[a] Laboratory of Chemoinformatics and Molecular Modeling, Department of Organic Chemistry, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, Kazan, Russia, 420008
[b] Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, rue Blaise Pascal 1, Strasbourg, France, 67000
Corresponding Author: [email protected]

ABSTRACT: We report a new method to assess protective group (PG) reactivity as a function of reaction conditions (catalyst, solvent) using raw reaction data. It is based on an intuitive similarity principle for chemical reactions: similar reactions proceed under similar conditions. Technically, reaction similarity can be assessed using the Condensed Graph of Reaction (CGR) approach, which represents an ensemble of reactants and products as a single molecular graph, i.e., as a pseudomolecule for which molecular descriptors or fingerprints can be calculated. CGR-based in-house tools were used to process data for 142111 catalytic hydrogenation reactions extracted from the Reaxys database. Our results reveal some contradictions with the well-known Greene’s Reactivity Charts, which are based on manual expert analysis. Models developed in this study show high accuracy (ca. 90%) in predicting optimal experimental conditions for protective group deprotection.


ACS Paragon Plus Environment


Keywords: Condensed Graph of Reaction, reaction similarity search, protective group reactivity.

Introduction

A priori assessment of optimal reaction conditions for a given transformation is the holy grail of synthetic organic chemistry. Usually, the choice of reaction conditions is essentially empirical: the chemist relies either on his/her own experience or on information about similar reactions retrieved from the literature. However, the exponential growth of chemical information makes manual analysis and generalization extremely difficult, and special approaches and tools are needed to extract such knowledge efficiently from raw data.

Although there have been numerous efforts to develop computer-assisted organic synthesis approaches 1, very few studies related to the theoretical assessment of reaction conditions have been reported so far. Struebing et al. 2 described a mixed quantum mechanics/linear free energy relationship approach to predict the optimal solvent for bimolecular reactions. Marcou et al. 3 used machine learning methods to build models predicting a suitable solvent and catalyst for different types of Michael reaction. Both studies were performed on small sets of manually curated data. Existing approaches 2, 3, however, can hardly be applied to large databases because of data quality, diversity, and size problems. Most reaction equations in widely used databases such as CAS REACT or Reaxys are not stoichiometrically balanced; some important information related to yield or experimental conditions is missing; the data are poorly annotated; and standard names for catalysts are not established. Although some efforts concerning reaction balancing 4 or normalization and duplicate checking 5 have been reported, an automatized workflow for reaction curation and standardization still needs to be developed. Note that reaction data stored in databases are extremely diverse owing to the variability of reaction classes and chemotypes and to the dependence of yield on experimental conditions, which makes knowledge extraction very difficult. Thus, reaction data fit the “Big Data” definition in Wikipedia, a term for data sets “… that are so large or complex that traditional data processing applications are inadequate” 6.

One of the most respected compilations of expert knowledge on protective group (PG) chemistry is the book “Greene’s Protective Groups in Organic Synthesis” 7, the 5th edition of which was published recently. It provides information on the reactivity of more than 1000 protective groups collected from some 11 000 publications. A protective group 8 is a specific chemical group introduced into a molecule by chemical modification of a functional group (FG) in order to achieve chemoselectivity in a subsequent chemical reaction. A protective group prevents the undesirable transformation of a functional group while another structural moiety is modified (Scheme 1). The reactivity of a protective group depends both on the reaction conditions (catalyst, solvent, additives) and on its chemical environment, as illustrated in Figure 1.

Scheme 1. A typical organic synthesis involving protection/deprotection steps. Here, [-OH] is a functional group to be protected

Figure 1. Variation of PG reactivity as a function of catalyst and the chemical environment 9. If Pd/C is used as a catalyst, the benzyl group 2 protecting the OH group is cleaved whereas protective group 1 remains. If Raney Ni is used as a catalyst, both groups 1 and 2 are cleaved.


Greene’s book summarizes conclusions on protective group reactivity under selected conditions in tables called Reactivity Charts. For each combination “PG, FG and catalyst”, these tables assign a reactivity label: “H” (highly reactive in deprotection reactions), “M” (medium reactivity), “L” (low reactivity) or “R” (the PG undergoes chemical transformations). Thus, Reactivity Charts are supposed to guide a chemist in choosing conditions (catalyst or reagent) leading to the desired outcome: deprotection or retention of a given protective group. Greene’s book contains no clear explanation of how these reactivity labels were assigned, nor does it provide any information about the amount of data used to assign them. A significant drawback of the Reactivity Charts is that they do not account for the chemical environment of the PG. As stated on p. 1268 of the latest edition of the book 7, "the reactivities in the charts refer only to the protected functionality, not to atoms adjacent to the functional group … Reactivity of the entire substrate must be evaluated by the chemist."

Thus, two questions arise in relation to Greene’s Reactivity Charts: (i) are they consistent with the reactivity data available in large chemical databases, and (ii) is there an alternative means of reactivity assessment that accounts for the PG’s chemical environment? Here, we try to answer these questions for the particular case of deprotection reactions running under hydrogenation conditions. To this end, we have developed an automatized workflow based on the Condensed Graph of Reaction approach 10, which includes reaction standardization and curation steps followed by data analysis and modeling.

Automatized extraction of Reactivity Charts from reaction databases

Automated analysis of PG reactivity using chemical reaction databases is a challenging task for the following reasons: (i) automatic detection of cleavage or retention of a PG is not always reliable if a substructure search procedure is used, (ii) a quantitative assessment of reaction similarity is hardly possible for a chemical reaction represented in its conventional form (reactants and products separated by an arrow), (iii) reaction condition parameters (catalyst, solvent, additives) are often omitted, and (iv) no unique standardized name is used for a given catalyst. For simplicity, only three popular functional groups have been considered: phenols, aliphatic alcohols, and amine groups in amides and carbamates (Figure 2).


Figure 2. Functional groups considered in this study.

Data treatment and analysis can be significantly simplified in the framework of the Condensed Graph of Reaction (CGR) approach 10. A CGR condenses an ensemble of molecular graphs describing a chemical reaction (reactants and products) into a single molecular graph (see the example in Figure 3). A CGR can be considered as a pseudomolecule characterized by both conventional chemical bonds (single, double, aromatic, etc.) and special “dynamic” bonds characterizing chemical transformations (created single, broken single, single to double, etc.). Consequently, all chemoinformatics approaches developed for individual molecules can be applied to chemical reactions encoded by a CGR. Earlier, we reported the application of CGRs to the identification of erroneous atom-to-atom mapping 11, reaction similarity searching 12, and structure-reactivity modeling 3, 13, 14. In this study, the CGR approach was used to develop a number of tools for reaction classification and analysis.

An automatized workflow involving several steps is described in Figure 3. First, hydrogenation reactions were selected from the Reaxys database 15, which supports the extraction of a limited amount of data by authorized users through its standard interface. Data extraction was performed using a query describing one-step reactions with a defined yield running in the presence of hydrogen specified as a reagent or a catalyst: “RXD.STP='1' AND RXD.NYD >= 0 AND ( RXD.RGTCAT = 'hydrogen' OR RX.RCT = 'hydrogen')”. The dataset, extracted as a Reaction Data File, contained 142111 chemical transformations proceeding under ca. 271000 reaction conditions. The latter number is larger because a given reaction may take place under different conditions. The downloaded dataset contained many incomplete and erroneous records, as well as reactions not involving protective groups. For example, the yield and the catalyst were reported for 67.8% and 95.6% of the reactions, respectively. Reactions with no information about catalyst or yield were discarded. The remaining 75230 reactions were treated in four steps: (1) structure standardization 16 and atom-to-atom mapping 17, (2) selection of reactions involving PGs, (3) classification of reactions involving PGs, and (4) catalyst annotation. Step (2) was performed using a substructure search on reactions encoded as CGRs. First, SMARTS queries describing two possible scenarios (the PG leaves or the PG remains) were prepared for each PG linked to a given FG. Then both the queries and the database reactions were transformed into CGRs, followed by the related substructure search. Steps 1-3 resulted in the selection of 72230 reaction conditions under which a protective group is either cleaved (CPG) or retained (RPG). In this way, 31676 CPG and 40554 RPG reactions were selected for further analysis.

The distribution of protective groups in the substrates of CPG and RPG reactions was different. Reactions of the CPG class involve cleavage of benzyl ethers of aliphatic alcohols (35% of CPG reactions), benzyl carbamates (30%), and benzyl ethers of phenols (20%); any other PG occurs in less than 1% of reactions. Thus, the benzyl group in different environments occurs in more than 85% of CPG reactions, which shows its high reactivity under hydrogenation conditions. Substrates of RPG reactions contain methyl ethers of phenols (45%), tert-butyl carbamates (23%), acetamide protection (8%), tert-butyldimethylsilyl ethers of alcohols (8%), methyl ethers of alcohols (6%), benzyl ethers of alcohols (3%), and benzyl ethers of phenols (1%); any other PG occurs in less than 1% of reactions. Some 91% of CPG reactions are catalyzed by Pd-containing compounds, and more than 12% of the catalysts used in this reaction class are activated by the addition of acid. In RPG reactions, Pd-based catalysts are still the most popular (70%), whereas other catalysts, such as metallic Ni (3%) and PtO2 (4%), are used much less frequently. Note that catalyst poisoning (Lindlar catalyst, addition of amines, and others) is often used to diminish the activity of the catalysts in reactions of this class.
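The idea of condensing reactants and products into one graph with "dynamic" bonds can be illustrated with a minimal Python sketch. It is not the authors' in-house tool: the bond-table representation (atom-map number pairs mapped to bond orders), the function names `condense` and `dynamic_bonds`, and the toy atom numbering are all assumptions made here for illustration, and a correct atom-to-atom mapping is taken for granted.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Bond = FrozenSet[int]        # unordered pair of atom-map numbers
BondTable = Dict[Bond, int]  # bond -> order (1 = single, 2 = double, ...)
Label = Tuple[Optional[int], Optional[int]]


def condense(reactants: BondTable, products: BondTable) -> Dict[Bond, Label]:
    """Merge reactant and product bonds into a single labelled graph.

    Each edge carries (order_before, order_after); None means the bond
    does not exist on that side of the reaction.
    """
    return {
        bond: (reactants.get(bond), products.get(bond))
        for bond in set(reactants) | set(products)
    }


def dynamic_bonds(cgr: Dict[Bond, Label]) -> Dict[Bond, Label]:
    """Keep only the edges whose order changes during the reaction."""
    return {bond: label for bond, label in cgr.items() if label[0] != label[1]}


# Toy hydrogenolysis fragment (hypothetical numbering): the O(1)-C(2) and
# H(3)-H(4) bonds are broken, O(1)-H(3) and C(2)-H(4) are created, and the
# C(2)-C(5) bond is unchanged.
reactants = {frozenset({1, 2}): 1, frozenset({3, 4}): 1, frozenset({2, 5}): 1}
products = {frozenset({1, 3}): 1, frozenset({2, 4}): 1, frozenset({2, 5}): 1}

cgr = condense(reactants, products)
changes = dynamic_bonds(cgr)
```

In this representation a "PG leaves" query corresponds to looking for a broken bond between the functional-group atom and the protective-group atom, whereas a "PG remains" query requires that bond to be unchanged.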


Figure 3. Automatized data processing workflow: from raw reaction data to annotated subsets of reactions involving cleaved or retained Protective Groups. The crossed line and circle in the Condensed Graph of Reaction correspond to broken and created single bonds, respectively.

Next, for each combination “catalyst - PG - FG”, the numbers of reactions in the CPG and RPG folders were counted and a Cleavage Rate CR = CPG/(CPG+RPG) was estimated. Since Greene’s book contains no clear explanation of how the reactivity labels “H”, “M” and “L” were assigned, we assigned them on the basis of CR values. For each combination “catalyst - PG - FG”, the label “H” was assigned for CR ≥ 80%, “L” for CR ≤ 20%, and “M” for all others. The catalyst names were standardized according to the procedure described in the Methods section (see below). The results of this analysis, together with the information from Greene’s Reactivity Charts, are given in Table 1 for phenol protection and in the Supplementary Material (Tables S2-S4) for alcohol and amine protection.

Surprisingly, the majority of PG reactivity assignments in Greene’s Reactivity Charts are poorly supported statistically. For almost half of the combinations “catalyst - PG - FG” considered in Greene’s Reactivity Charts, no data were found in the Reaxys data used (highlighted in grey in Table 1). In some cases (highlighted in yellow in Table 1), fewer than 10 data points are available. For some 7-10% of cases, the Reactivity Chart assignments are in conflict with our statistical analysis (highlighted in red in Table 1). Only some 25% of the assignments are in line with our analysis (in green, Table 1).

To check these results, a substructure search for protective groups was performed in Reaxys for the combinations “PG - FG” highlighted in grey in Table 1, followed by manual analysis of the associated reactions. A few cases were retrieved; they had not been detected automatically because of errors in the catalyst name. A similar search performed with SciFinder Scholar produced some additional reactions, most of which were published in journals not covered by Reaxys. This demonstrates that the reactivity analysis strongly depends on the data sources used. Since our approach is completely automated, it could be used to update statistically supported Reactivity Charts as soon as new data enter the database.

Table 1. Reactivity analysis of deprotection of different groups protecting the hydroxyl group in phenols[a]. Each cell gives the GRC label followed by the CR value in %; “-” marks a cell with no CR value.

PG[b]   | Raney (Ni) | Pt, pH 2-4[c] | Pd/C    | Lindlar | Rh/C or Rh/Al2O3
Me      | L / 0      | L / 0         | L / 0   | L / 0   | L / 0
MOM     | L / -      | M / -         | L / 1   | L / 0   | L / 0
MEM     | L / -      | M / -         | L / 4   | L / -   | L / -
Cy      | L / -      | L / -         | L / 0   | L / -   | L / -
t-Bu    | L / -      | L / -         | L / 0   | L / 0   | L / -
Bn      | H / 75     | H / 17        | H / 99  | L / 37  | H / 28
TBDMS   | L / -      | H / -         | L / 1   | L / 0   | L / 0
Ac      | L / 0      | M / 0         | L / 1   | L / 0   | L / 0
piv     | L / -      | L / -         | L / 6   | L / -   | L / -
Bz      | L / 0      | L / -         | L / 50  | L / -   | L / -
-CO2CH3 | L / -      | H / -         | L / 0   | L / -   | L / -
Cbz     | H / -      | H / -         | H / 100 | L / -   | L / -
Ms      | R / 0      | L / -         | L / 8   | L / -   | L / -

[a] The reactivity labels “H”, “M” and “L” are taken from Greene’s Reactivity Charts (GRC). The color code shows agreement (green) or disagreement (red) of our statistical analysis with GRC. Subsets that are too small (≤ 10 reactions) are shown in yellow, while combinations “PG, FG, catalyst” not found in the studied dataset are highlighted in grey. The numbers correspond to Cleavage Rate (CR) values in %. CR = 100% means that the PG is cleaved in all reactions; CR = 0% means that PG cleavage is never observed.
[b] See Table S1 in Supplementary Materials for protective group names.
[c] In our catalyst dictionary, this corresponds to the “Pt/C + acid” annotation.
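The labelling rule described above (CR = CPG/(CPG+RPG), “H” for CR ≥ 80%, “L” for CR ≤ 20%, “M” otherwise) can be written as a short Python sketch. The function names and the reaction counts in the example are hypothetical; only the formula and the two thresholds come from the text.

```python
from typing import Optional


def cleavage_rate(n_cpg: int, n_rpg: int) -> Optional[float]:
    """Cleavage Rate in %, or None when the combination has no data."""
    total = n_cpg + n_rpg
    if total == 0:
        return None
    return 100.0 * n_cpg / total


def reactivity_label(cr: Optional[float]) -> Optional[str]:
    """Map a Cleavage Rate to a label using the 80% / 20% thresholds."""
    if cr is None:
        return None  # combination not found in the dataset
    if cr >= 80.0:
        return "H"
    if cr <= 20.0:
        return "L"
    return "M"


# Hypothetical counts for one "catalyst - PG - FG" combination:
# 75 reactions with the PG cleaved and 25 with the PG retained.
cr = cleavage_rate(75, 25)
label = reactivity_label(cr)
```

With these illustrative counts CR is 75%, which falls just below the 80% threshold and is therefore labelled “M” rather than “H”.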

Partial disagreement between our analysis and the Reactivity Charts comes from the Cleavage Rate thresholds that we established for assigning reactivity labels. For example, for the “Raney Ni – Bn – OH phenol” case, CR = 75% (Table 1) is not far from the threshold of CR ≥ 80% chosen for “H”. However, for the other cases highlighted in red in Table 1, the reported CR values show a real disagreement between our results and the Reactivity Chart assignments. This might be explained both by the difference in the data sources used for this analysis and by the relatively small number of reactions considered in the book 7.

Although Reactivity Charts describe the most probable behavior of protective groups as a function of reaction conditions, this may not be sufficient to assess PG reactivity in particular cases. For instance, in 98.7% of the 5781 reactions involving phenols protected by a benzyl group, the protective group leaves, and it is retained in only 73 reactions (1.3%). If the chemist’s query is related to these “rare” reactions, the Reactivity Charts would provide the wrong recommendations. This means that a tool helping the chemist to select optimal reaction conditions should account for the particular structural environment of the protecting group rather than be based solely on statistical analysis.

Reaction similarity based approach

The obvious limitations of Greene’s Reactivity Charts motivated us to design a prototype of an expert system for PG reactivity assessment based on the reaction similarity principle. The similarity principle 18 states that “similar compounds possess similar properties or biological activities”. In the context of chemical reactivity it can be reformulated as “similar reactions proceed under similar conditions”. Thus, for a given query (Q), one should search the database for one or several reactions similar to Q. The experimental conditions (catalyst, solvent, additives, etc.) under which the retrieved reactions proceed are then assigned to the query reaction. The similarity searching technique widely used for individual molecules cannot be applied directly to chemical reactions represented in their conventional form (molecular graphs for reactants and products separated by an arrow). Encoding a reaction as a single graph (CGR) overcomes this problem. Indeed, a CGR can be considered as a pseudomolecule for which either a descriptor vector or a fingerprint can be generated. Thus, the similarity of any two chemical reactions represented by their respective CGRs can be quantitatively estimated by any conventional measure of chemical similarity, e.g., the Tanimoto coefficient (Tc) 19.

For each substrate under given reaction conditions, two possible scenarios were considered: the PG either leaves or stays. Thus, a query Q is submitted in parallel to the CPG and RPG folders, followed by selection of the reactions most similar to Q as defined by the highest TcCPG and TcRPG values (see Figure 4). We supposed that, for a given catalyst, the PG is cleaved if the nearest neighbor of the query reaction belongs to the CPG class, whereas the PG remains if its nearest neighbor belongs to the RPG class. To quantify this, the following value is calculated: ∆Tc = TcCPG - TcRPG.
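The quantities TcCPG, TcRPG and ∆Tc introduced above can be sketched in Python as follows. Representing a CGR fingerprint as a set of "on" bit indices is an assumption made here for brevity, and the toy fingerprints and helper names (`tanimoto`, `nearest_tc`, `delta_tc`) are hypothetical; the text does not specify which fingerprints were used.

```python
from typing import Iterable, Optional, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def nearest_tc(query: Set[int], folder: Iterable[Set[int]]) -> Optional[float]:
    """Highest Tc between the query and any reaction in a folder."""
    return max((tanimoto(query, fp) for fp in folder), default=None)


def delta_tc(query: Set[int], cpg: Iterable[Set[int]],
             rpg: Iterable[Set[int]]) -> Optional[float]:
    """Delta Tc = Tc(CPG) - Tc(RPG); None if either folder is empty."""
    tc_cpg = nearest_tc(query, cpg)
    tc_rpg = nearest_tc(query, rpg)
    if tc_cpg is None or tc_rpg is None:
        return None
    return tc_cpg - tc_rpg


# Toy fingerprints (hypothetical bit indices).
query = {1, 2, 3, 4}
cpg_folder = [{1, 2, 3, 4, 5}, {7, 8}]   # nearest neighbour shares 4 of 5 bits
rpg_folder = [{1, 2, 9, 10}]             # shares 2 of 6 bits

d = delta_tc(query, cpg_folder, rpg_folder)
```

A positive value of `d` means that the query lies closer to a reaction in which the protective group was cleaved than to any reaction in which it was retained.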

Figure 4. Workflow of the similarity-based expert system for protective group reactivity. A query reaction Q prepared by the user is standardized and mapped (1) and transformed into a Condensed Graph of Reaction (2). This graph is submitted for similarity searching (3) against two distinct datasets containing reactions with cleaved (CPG) or retained (RPG) protective groups, in which each reaction is annotated with respect to the catalysts A, B, C, etc. (see Figure 3). In this procedure, a similarity score (the Tanimoto coefficient, Tc) between Q and each dataset reaction is calculated. For each catalyst, the program selects the reactions most similar to Q from the CPG and RPG subsets (4), compares them, and selects the one with the largest similarity score Tc (5). In this way, the program identifies whether a given catalyst leads to PG cleavage (the reaction most similar to Q is of CPG type) or not (the reaction most similar to Q is of RPG type).

A nonnegative threshold T0 reflecting the confidence of the prediction is introduced. If ∆Tc ≥ T0, the PG is predicted to be cleaved with the given catalyst; if ∆Tc ≤ -T0, it is predicted to remain. The set of catalysts recommended for the reaction Q is then returned (Figure 4). Some additional information (solvent, list of nearest-neighbor reactions, citations) can also be returned to the user. If a query is almost equally similar to its nearest neighbors in the CPG and RPG folders (-T0 < ∆Tc < T0), the prediction is considered non-confident. Note that the nearest CPG or RPG neighbors are considered only if Tc > 0.5. Thus, if a query is dissimilar (Tc < 0.5) to both CPG and RPG reactions, no recommendation of experimental conditions is provided.

To have a sufficient number of reactions per catalyst in the RPG and CPG folders, a generalized catalyst definition was used (Level 1, see Figure 7 in the Methods section). Thus, the following catalyst classes were considered: Pd-containing, Ni-containing, Pt-containing and Rh-containing catalysts. The rather frequently used Lindlar catalyst was treated as a separate class. Reactions employing mixtures of catalysts were ignored.

An optimal value of the threshold T0 was found in a Leave-One-Out (LOO) cross-validation procedure on the subsets corresponding to individual protective groups occurring in at least 40 reactions of both the CPG and RPG classes. In total, 27356 reactions were considered. In this procedure, each reaction was used as a query Q for which the program automatically detected the protective and functional group types, following the workflow described in Figure 4. The catalyst predictions were then compared with the known experimental data for Q.

The calculations show that T0 = 0.05 provides a reasonable balanced accuracy of predictions, varying from 85% to 95% as a function of the protective and functional groups (see Tables S5-S7 in Supplementary Material). A Receiver Operating Characteristic (ROC) curve was built for every PG considered by varying the T0 threshold. The area under the ROC curves (AUC) varied in the range 0.94 - 0.98, demonstrating the good predictive performance of the similarity-based approach (see Figure S1 in Supplementary Material).
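The decision rule described above can be summarised in a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: the function name and return strings are hypothetical, and treating a missing or discarded neighbour (Tc not above 0.5) as contributing Tc = 0 is an assumption, since the text does not say how a one-sided match is scored.

```python
from typing import Optional


def predict_outcome(tc_cpg: Optional[float], tc_rpg: Optional[float],
                    t0: float = 0.05, min_tc: float = 0.5) -> str:
    """Predict the fate of a protective group for one catalyst class."""
    # Nearest neighbours are considered only if Tc > min_tc (0.5 in the text).
    cpg = tc_cpg if tc_cpg is not None and tc_cpg > min_tc else None
    rpg = tc_rpg if tc_rpg is not None and tc_rpg > min_tc else None
    if cpg is None and rpg is None:
        return "no recommendation"
    # Assumption: a missing or discarded neighbour contributes Tc = 0.
    delta = (cpg if cpg is not None else 0.0) - (rpg if rpg is not None else 0.0)
    if delta >= t0:
        return "cleaved"
    if delta <= -t0:
        return "retained"
    return "non-confident"


# Tc pairs quoted in the text for the two protective groups of Figure 5.
pg1 = predict_outcome(1.0, 0.651)
pg2 = predict_outcome(0.716, 0.767)
```

The default `t0=0.05` corresponds to the threshold selected in the Leave-One-Out procedure; passing other values of `t0` reproduces the kind of threshold scan used to build a ROC curve.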


Figure 5. Decision-making procedure in the reaction similarity-based approach. Two different scenarios were considered for each protective group: it stays or it leaves. The nearest neighbor analysis shows that a Pd-containing catalyst and methanol can be recommended for the desired reaction.

If a substrate contains several protective groups, the program analyses the reactivity of each of them and then intersects the reaction conditions recommended for each PG. This is illustrated by the example given in Figure 5, in which the substrate contains two protective groups: a benzylcarbamate amine-protecting group (PG1) and a benzyl OH-protecting group (PG2). The goal is to identify optimal experimental conditions for the reaction in which PG1 leaves and PG2 stays. With a Pd-containing catalyst and methanol, the PG1 behavior is closer to the reaction from the CPG folder, in which the benzylcarbamate group is cleaved (Tc = 1.0), than to that from the RPG folder (Tc = 0.651), in which the benzylcarbamate group remains. A similar consideration shows that the PG2 behavior is closer to that of RPG (Tc = 0.767) than to CPG (Tc = 0.716) with the same catalyst and solvent. Combining the results for PG1 and PG2, one can conclude that a Pd-containing catalyst and methanol are the optimal conditions for the desired selectivity, as indeed appears to be the case in the experimental data from reference 20 (Pd/C, ethanol). Note that one could never reach this conclusion using Greene’s Reactivity Charts, which state that both protective groups are easily cleaved (“H”) under Pd/C conditions.

External validation of the ability of the reaction similarity based expert system to suggest an optimal catalyst was performed on two datasets collected from the literature. The first set contained seven reactions involving only one PG (Figure 6), whereas the second set contained five reactions with two concurrent protective groups (Figure 7). Note that these reactions were not present in the dataset initially extracted from Reaxys. For four reactions from the first set (1.4 – 1.7, Table 2), the expert system recommended a catalyst exactly corresponding to the experiment. For reaction 1.3, the catalyst type (Pd-containing) was correctly predicted. Similarity searching for reaction 1.2 returned four database reactions with similar Tc values, one of which was carried out using the same catalyst type (Rh-containing) as the query reaction. Only for reaction 1.1 was no firm recommendation obtained, because of the very low similarity (Tc < 0.5) between the query and its nearest CPG or RPG neighbors. Note that for 3 out of 7 reactions (1.5 – 1.7, Table 2), Greene’s Reactivity Charts failed to assess PG reactivity correctly.

For all reactions involving two concurrent PGs (Figure 7), the expert system correctly predicts the optimal catalyst (Pd/C, Table 3, Table S8 in Supplementary Materials). For three reactions, the calculations reveal the possibility of using alternative catalysts: Ni for reactions 2.1 and 2.3, and Ni or Lindlar for reaction 2.5. Interestingly, Greene’s Reactivity Charts predicted all protective groups in 2.1 – 2.5 to be cleaved under the given experimental conditions, whereas in each reaction only one of the two groups is cleaved (Table 3). This demonstrates a clear limitation of Reactivity Charts in handling reaction selectivity issues.
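The intersection of recommended conditions for a substrate with several protective groups, as in the Figure 5 example, can be sketched in Python. The data layout and the function name `recommend` are assumptions; the Pd entries mirror the outcome described for Figure 5, whereas the Ni and Pt entries are invented purely to show how unsuitable catalysts are filtered out.

```python
from typing import Dict, Set


def recommend(predictions: Dict[str, Dict[str, str]],
              desired: Dict[str, str]) -> Set[str]:
    """Return catalysts for which every PG is predicted to behave as desired.

    predictions: {pg: {catalyst_class: predicted_outcome}}
    desired:     {pg: wanted_outcome}
    """
    per_pg = []
    for pg, wanted in desired.items():
        suitable = {cat for cat, outcome in predictions.get(pg, {}).items()
                    if outcome == wanted}
        per_pg.append(suitable)
    return set.intersection(*per_pg) if per_pg else set()


# PG1 = benzylcarbamate (should leave), PG2 = benzyl ether (should stay).
# Only the "Pd" outcomes follow the text; "Ni" and "Pt" are hypothetical.
predictions = {
    "PG1": {"Pd": "cleaved", "Ni": "cleaved", "Pt": "non-confident"},
    "PG2": {"Pd": "retained", "Ni": "cleaved", "Pt": "retained"},
}
desired = {"PG1": "cleaved", "PG2": "retained"}

catalysts = recommend(predictions, desired)
```

Here only the Pd-containing class satisfies both requirements, so it is the single recommendation returned for the desired selectivity.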


Figure 6. Reactions used for the validation of the similarity-based approach to optimal catalyst assessment. Red and blue circles correspond to cleaved and retained PGs, respectively.


Table 2. Protective group reactivity assessment and catalyst recommendations for the reactions shown in Figure 6.

Reaction number a | PG reactivity class, experimental b | PG reactivity class, Greene’s assessment c | Catalyst, experimental d | Catalyst, predicted e | Similarity to CPG/RPG nearest neighbor (Tc) f
1.1 | H | H | Pd/C 21 | no recommendations f |
1.2 | L | L | Rh/Al2O3 21 | PtO2 | 0.66 / 0.74
    |   |   |             | Lindlar | 0.63 / 0.73
    |   |   |             | Ni | 0.65 / 0.70
    |   |   |             | RhCl(PPh3)3 | 0.51 / 0.67
1.3 | H | H | Pd(OH)2/C 22 | Pd/C | 0.61 / 0.41
1.4 | H | H | Pd/C 23 | Pd/C | 1.00 / 0.69
    |   |   |         | Pt | 0.94 / 0.76
    |   |   |         | Raney Ni | 0.91 / 0.65
1.5 | L | H | Pd/C 20 | Pd/C | 0.80 / 1.00
    |   |   |         | PtO2 | 0.46 / 0.98
    |   |   |         | RhCl(PPh3)3 | 0.36 / 0.66
1.6 | L | H | Pd/C 20 | Pd/C | 0.71 / 1.00
    |   |   |         | Pt/C | 0.48 / 0.71
    |   |   |         | Lindlar | 0.49 / 0.61
1.7 | L | H | Rh/Al2O3 21 | Rh/Al2O3 | none / 0.55 g
    |   |   |             | Pd/C | 0.20 / 0.66
    |   |   |             | Pt/C | none / 0.58 g
    |   |   |             | Lindlar | none / 0.64 g
    |   |   |             | Ni | none / 0.52 g

a See the reactions in Figure 6. b For comparison purposes, the corresponding Greene’s labels H and L were used instead of the reactivity attributes CPG and RPG. c Assessed from Greene’s Reactivity Charts. d References are given next to the catalyst. e Catalyst(s) reported for the nearest neighbor reaction(s). f The nearest neighboring reactions from both the CPG and RPG classes were too dissimilar to the query reaction (Tc < 0.5). g No CPG reactions were found for the given catalyst.

Figure 7. Reactions involving substrates with two concurrent protective groups, used to validate the ability of the similarity-based approach to assess an optimal catalyst. The numbers correspond to concurrent PGs; red and blue circles correspond to cleaved and retained PGs, respectively.

Table 3. Protective groups reactivity assessment and catalyst recommendations for the reactions given in Figure 7 (a, b)

| Reaction number (c) | PG number | PG reactivity class: experimental | PG reactivity class: Greene’s assessment | Catalyst: experimental | Catalyst: predicted (c) |
|---|---|---|---|---|---|
| 2.1 | 1 | H | H | Pd/C [20] | Pd (d), Ni |
| | 2 | L | H | | |
| 2.2 | 1 | H | H | Pd/C [24] | Pd/C |
| | 2 | L | H | | |
| 2.3 | 1 | H | H | Pd/C [25] | Pd/C, Ni |
| | 2 | L | H | | |
| 2.4 | 1 | H | H | Pd/C [26] | Pd/C |
| | 2 | L | H | | |
| 2.5 | 1 | H | H | Pd/C [27] | Pd/C, Ni, Lindlar |
| | 2 | L | H | | |

(a) See footnotes for Table 2. (b) Similarity measures (Tc) between query reaction and its nearest neighbors in CPG and RPG subsets are provided in Table S8 of Supplementary Materials. (c) See reactions in Figure 7. (d) Pd-containing catalyst.

Conclusions
This study demonstrates the efficiency of the Condensed Graph of Reaction (CGR) methodology for knowledge extraction from raw reaction data. A CGR significantly simplifies the representation of a reaction by reducing it to a single graph for which molecular descriptors or binary fingerprints can be generated. This approach has been used to perform a statistical analysis of protective group reactivity for a large dataset of 142,111 catalytic hydrogenation reactions. Comparison of our results with the manually prepared Greene’s Reactivity Charts [7] shows that the reactivity assignments for some PGs in Greene’s Charts are erroneous or statistically poorly supported. For a given reaction database, the ensemble of developed tools is capable of building an analogue of Greene’s Reactivity Charts in a fully automatic and customizable manner. The PG reactivity can be re-evaluated without any human intervention as soon as new data enter the database.

An approach to PG reactivity analysis that offers an alternative to Greene’s Reactivity Charts has been suggested. It is based on a quantitative assessment of reaction similarity and accounts for the chemical environment of the protective groups considered. As we demonstrated, this approach can be applied efficiently to predict optimal experimental conditions for deprotection reactions involving one or several PGs.

Although this study focused on PG reactivity in catalytic hydrogenation reactions, the developed methodology can be applied to any other class of deprotection reactions. Moreover, the similarity-based approach can easily be extended to any type of reaction and may thus become a useful tool for treating the totality of reaction data stored in public and proprietary databases. It should be noted that the prediction performance of the similarity-based methodology strongly depends on the size and diversity of the reaction data used. The more examples of different types of reactions are present in the database, the more accurate the reactivity assessment for novel reactions. Thus, this approach appears to be a promising way of efficiently handling the ever-increasing volume of experimental data.

METHODS
Reaction structures curation and encoding. ChemAxon JChem tools [28] were used to prepare and standardize chemical structures and to perform atom-to-atom mapping and substructure and superstructure searches. Condensed Graphs of Reactions were prepared with the in-house ISIDA Condenser-2015 program by superposing atoms of reactants and products bearing the same numbers. Encoding a CGR as a bitstring (hashed fingerprint) was performed in two steps. First, for each CGR, an ensemble of atom/bond sequences containing from 1 to 6 atoms (including all intermediate sequences) was prepared using the ISIDA Fragmentor program [29]. Second, the md5 [30] and sha1 [31] Python 3 libraries were used to generate two hash codes for each type of fragment occurring in a CGR. Each of the two hash codes was used to identify the position of an active bit in a bitstring of 2048 bits: the hash code was divided by the length of the bitstring, and the remainder (modulo) was taken as the active bit position.

Each reaction was considered in the context of two possible scenarios: the protective group is either cleaved or retained. In order to identify the reaction class, CPG (Cleaved PG) or RPG (Retained PG), fingerprints were generated for queries corresponding to these two reaction classes (see Figure 4). The class of each reaction in the database was assigned by comparing its fingerprint with those of the CPG and RPG queries. A reaction class was assigned to a given reaction if its fingerprint contained active bits at exactly the same positions as the fingerprint of the query representing that class.

Catalyst annotation. The catalyst descriptions in Reaxys are not directly exploitable because different names are used for the same catalyst. For instance, the three annotations “10% Pd on carbon(eggshell)”, “10% palladium on charcoal” and “10% palladium/C” describe the same catalyst, “10% Pd/C” (Figure 8). Overall, more than 25 000 different catalyst and reagent names were detected in the considered dataset, 550 of which occurred in about 90% of reactions. In order to standardize these data, a dictionary of synonyms was prepared. This helped to reduce the 550 original catalyst names to 419 unique representations. In order to compare our results with Greene’s book [7], we used a limited number of standard representations indicating the metal or metal-containing compound, the support and the additive (Level 2 annotation in Figure 8). In the reaction-similarity-based approach, an even more general annotation (Level 1) was used. Inorganic and organic additives were classified into general classes: “acid”, “alkali”, “base”, “organophosphorus compound”, etc.
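The two-step fingerprint encoding described under "Reaction structures curation and encoding" can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that each CGR fragment is available as a text string (the fragment strings below are made up and do not follow the ISIDA Fragmentor format), and it reads the class-assignment rule as "every active bit of the query is also active in the reaction", which is one interpretation of the text. The function names are hypothetical; only the use of MD5 and SHA-1, the modulo step and the 2048-bit length come from the source.

```python
import hashlib

FP_LENGTH = 2048  # bitstring length stated in the text


def fragment_bits(fragment, length=FP_LENGTH):
    """Return the bit positions (one from MD5, one from SHA-1) for a fragment."""
    data = fragment.encode("utf-8")
    md5_code = int(hashlib.md5(data).hexdigest(), 16)
    sha1_code = int(hashlib.sha1(data).hexdigest(), 16)
    # The remainder of the division by the bitstring length is the bit position.
    return {md5_code % length, sha1_code % length}


def fingerprint(fragments, length=FP_LENGTH):
    """Hashed fingerprint of a CGR, stored as the set of its active bit positions."""
    bits = set()
    for fragment in fragments:
        bits |= fragment_bits(fragment, length)
    return bits


def matches_query(reaction_fp, query_fp):
    """True if every active bit of the query is also active in the reaction.

    Assumed reading of the class-assignment rule (CPG or RPG query).
    """
    return query_fp <= reaction_fp


# Hypothetical fragment strings, for illustration only.
reaction_fp = fingerprint(["C-O", "C-O-C", "C=O"])
query_fp = fingerprint(["C-O"])
print(matches_query(reaction_fp, query_fp))
```

Because two hash functions set up to two bits per fragment, a chance collision of one hash code with an unrelated fragment is less likely to produce a false match than with a single bit per fragment.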

Figure 8. Example of catalyst name standardization in a database and in the expert system. Notice that Level 2 annotation corresponds to that in the Greene’s Reactivity Charts. The most general Level 1 annotation is used in reaction similarity based analysis.
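The synonym dictionary used for catalyst name standardization can be illustrated with a minimal sketch. The three raw annotations and the standard name “10% Pd/C” are taken from the text; the function names, the case- and whitespace-insensitive matching, and the choice to leave unknown names unchanged are assumptions made for this example.

```python
import re

# Raw Reaxys-style annotations (lower-cased) mapped to one standard name.
# Only these three synonyms are given in the text.
SYNONYMS = {
    "10% pd on carbon(eggshell)": "10% Pd/C",
    "10% palladium on charcoal": "10% Pd/C",
    "10% palladium/c": "10% Pd/C",
}


def normalize(raw_name):
    """Lower-case the name and collapse runs of whitespace (assumed matching rule)."""
    return re.sub(r"\s+", " ", raw_name.strip().lower())


def standardize(raw_name):
    """Return the standard catalyst name, or the stripped input if it is unknown."""
    return SYNONYMS.get(normalize(raw_name), raw_name.strip())


raw_names = [
    "10% Pd on carbon(eggshell)",
    "10% palladium on charcoal",
    "10%  Palladium/C",
]
unique_names = {standardize(name) for name in raw_names}
print(unique_names)
```

Applying the same lookup to every annotation in a dataset and counting the distinct results gives the number of unique catalyst representations after standardization.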

ACKNOWLEDGMENT
We thank Drs. G. Marcou and D. Horvath for their help with data analysis and software development and Prof. J. Harrowfield for valuable advice. We thank the Reaxys database (Elsevier, the Netherlands) for providing us with the experimental reaction data and ChemAxon for the software license. The Russian Science Foundation (Contract No. 14-43-00024) is acknowledged for supporting these studies.

ABBREVIATIONS
PG, protective group; CGR, condensed graph of reaction; FG, functional group; H, highly reactive in deprotection reactions; L, low reactivity; M, medium reactivity; R, PG undergoes chemical transformations; CPG, cleaved protective group; RPG, retained protective group; CR, cleavage rate; Bn, benzyl; Me, methyl; MOM, methoxymethyl; MEM, 2-methoxyethoxymethyl; Cy, cyclohexyl; t-Bu, t-butyl; TBDMS, t-butyldimethylsilyl; Ac, acetate; piv, pivalate; Bz, benzoate; Cbz, benzyl carbonate; Ms, methanesulfonate; GRC, Greene’s Reactivity Charts; Q, a given query; Tc, Tanimoto coefficient; LOO, Leave-One-Out cross-validation.

ASSOCIATED CONTENT
Supporting Information. This material is available free of charge via the Internet at http://pubs.acs.org. Additional data (as described in the text) are provided in supporting Tables S1-S8 and Figure S1.

REFERENCES
1. Barone, R.; Chanon, M., Computer-Assisted Synthesis Design (CASD). Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes 2008, 1428-1456.
2. Struebing, H.; Ganase, Z.; Karamertzanis, P. G.; Siougkrou, E.; Haycock, P.; Piccione, P. M.; Armstrong, A.; Galindo, A.; Adjiman, C. S., Computer-Aided Molecular Design of Solvents for Accelerated Reaction Kinetics. Nature Chemistry 2013, 5, 952-957.
3. Marcou, G.; Aires de Sousa, J.; Latino, D. A.; de Luca, A.; Horvath, D.; Rietsch, V.; Varnek, A., Expert System for Predicting Reaction Conditions: The Michael Reaction Case. J. Chem. Inf. Model. 2015, 55, 239-250.
4. Patel, H.; Bodkin, M. J.; Chen, B.; Gillet, V. J., Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 2009, 49, 1163-1184.
5. HazELNut, NextMove software: Cambridge, 2015.
6. Big data. https://en.wikipedia.org/wiki/Big_data (Sept 2 2016).
7. Wuts, P. G. M., The Role of Protective Groups in Organic Synthesis. In Greene's Protective Groups in Organic Synthesis, fifth edition; Wuts, P. G. M., Ed.; John Wiley & Sons, Inc.: Hoboken, New Jersey, USA, 2014; pp 1-17.
8. Kocienski, P. J., Death, Taxes, and Protecting Groups. In Protecting Groups, third edition; Georg Thieme Verlag: Stuttgart, Germany, 2005; pp 1-49.
9. Llàcer, E.; Romea, P.; Urpí, F., Studies on the Hydrogenolysis of Benzyl Ethers. Tetrahedron Lett. 2006, 47, 5815-5818.
10. Varnek, A.; Fourches, D.; Hoonakker, F.; Solov’ev, V. P., Substructural Fragments: An Universal Language to Encode Reactions, Molecular and Supramolecular Structures. J. Comput. Aided Mol. Des. 2005, 19, 693-703.
11. Muller, C.; Marcou, G.; Horvath, D.; Aires-de-Sousa, J.; Varnek, A., Models for Identification of Erroneous Atom-to-Atom Mapping of Reactions Performed by Automated Algorithms. J. Chem. Inf. Model. 2012, 52, 3116-3122.
12. de Luca, A.; Horvath, D.; Marcou, G.; Solov’ev, V.; Varnek, A., Mining Chemical Reactions Using Neighborhood Behavior and Condensed Graphs of Reactions Approaches. J. Chem. Inf. Model. 2012, 52, 2325-2338.
13. Hoonakker, F.; Lachiche, N.; Varnek, A.; Wagner, A., Condensed Graph of Reaction: Considering a Chemical Reaction as One Single Pseudo Molecule. Intern. J. Artificial Intelligence Tools 2011, 20, 253-270.
14. Madzhidov, T. I.; Polishchuk, P. G.; Nugmanov, R. I.; Bodrov, A. V.; Lin, A. I.; Baskin, I. I.; Varnek, A. A.; Antipin, I. S., Structure-Reactivity Relationships in Terms of The Condensed Graphs of Reactions. Russ. J. Org. Chem. 2014, 50, 459-463.
15. Reaxys. www.reaxys.com (accessed January 20, 2012).
16. Fourches, D.; Muratov, E.; Tropsha, A., Trust, but Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Model. 2010, 50, 1189-1204.
17. Chen, W. L.; Chen, D. Z.; Taylor, K. T., Automatic Reaction Mapping and Reaction Center Detection. Wiley Interdisciplinary Reviews: Computational Molecular Science 2013, 3, 560-593.
18. Concepts and Applications of Molecular Similarity. Johnson, M. A. and Maggiora, G. M., Eds.; John Wiley & Sons: New York, USA, 1990.
19. Rogers, D. J.; Tanimoto, T. T., A Computer Program for Classifying Plants. Science 1960, 132, 1115-1118.
20. Sajiki, H.; Hirota, K., A Novel Type of Pd/C-Catalyzed Hydrogenation Using a Catalyst Poison: Chemoselective Inhibition of the Hydrogenolysis for O-Benzyl Protective Group by the Addition of a Nitrogen-Containing Base. Tetrahedron 1998, 54, 13981-13996.
21. Bindra, J. S.; Grodski, A., An Efficient Route to Intermediates for the Synthesis of 11-Deoxyprostaglandins. J. Org. Chem. 1978, 43, 3240-3241.
22. Jarowicki, K.; Kocienski, P., Protecting Groups. J. Chem. Soc.-Perkin Trans. 1 2000, 2495-2527.
23. Heathcock, C. H.; Ratcliffe, R., Stereoselective Total Synthesis of the Guaiazulenic Sesquiterpenoids .Alpha.-Bulnesene and Bulnesol. J. Am. Chem. Soc. 1971, 93, 1746-1757.
24. Surfraz, M. B.-U.; Akhtar, M.; Allemann, R. K., Bis-Benzyl Protected 6-Amino Cyclitols are Poisonous to Pd/C Catalysed Hydrogenolysis of Benzyl Ethers. Tetrahedron Lett. 2004, 45, 1223-1226.
25. Jansson, A. M.; Grøtli, M.; Halkes, K. M.; Meldal, M., Palladium on Carbon Encapsulated in POEPOP1500: A Resin-Supported Catalyst For Hydrogenation Reactions. Organic Lett. 2002, 4, 27-30.
26. Meienhofer, J.; Kuromizu, K., Catalytic Hydrogenolysis in Liquid Ammonia: Stability and Cleavage of Some Protecting Groups Used in Peptide Synthesis. Tetrahedron Lett. 1974, 15, 3259-3262.
27. Ikawa, T.; Hattori, K.; Sajiki, H.; Hirota, K., Solvent-Modulated Pd/C-Catalyzed Deprotection of Silyl Ethers and Chemoselective Hydrogenation. Tetrahedron 2004, 60, 6901-6911.
28. JChem 15.3.16. www.chemaxon.com (accessed May 16, 2016).
29. Ruggiu, F.; Marcou, G.; Varnek, A.; Horvath, D., ISIDA Property-Labelled Fragment Descriptors. Molecular Informatics 2010, 29, 855-868.
30. Rivest, R., The MD5 Message-Digest Algorithm; MIT, Rivest, USA, April 1992.
31. Eastlake 3rd, D.; Jones, P., US Secure Hash Algorithm 1 (SHA1); NSA, Maryland, USA, September 2001.

