Generation of tautomers using micro-pKa's - Journal of Chemical

2 days ago - We present examples of our algorithm when applied to organic and drug-like molecules, with a focus on novel structures where traditional ...
0 downloads 0 Views 1MB Size
Subscriber access provided by ALBRIGHT COLLEGE

Computational Chemistry

Generation of tautomers using micro-pKa's Mark A. Watson, Haoyu S. Yu, and Art D. Bochevarov J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00955 • Publication Date (Web): 09 May 2019 Downloaded from http://pubs.acs.org on May 10, 2019

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Generation of tautomers using micro-pKa’s Mark A. Watson, Haoyu S. Yu, and Art D. Bochevarov∗ Schr¨odinger, Inc., 120 West 45th St, New York, New York 10036 E-mail: [email protected]

Abstract Solutions of organic molecules containing one or more heterocycles with a number of conjugated bonds may exist as a mixture of tautomers, but typically only a very few will be significantly populated even though the potential number grows combinatorially with the number of protonation and deprotonation sites. Generating the most stable tautomers from a given input structure is therefore an important and challenging task and numerous algorithms to tackle it have been proposed in the literature. This work describes a novel approach for tautomer prediction, which involves a combined use of molecular mechanics, semi-empirical quantum chemistry, and density functional theory. The key idea in our method is to identify the protonation and deprotonation sites using estimated micro-pKa ’s for every atom in the molecule, as well its nearest protonated and deprotonated forms. To generate tautomers in a systematic way with minimal bias, we then consider the full set of tautomers that arise from the combinatorial distribution of all such mobile protons among all protonatable sites, with efficient post-processing to screen away high-energy species. To estimate the micro-pKa ’s, we present a new method designed for the current task, but emphasize that any alternative method can be used in conjunction with our basic algorithm. Our approach is therefore grounded in the computational prediction of physical properties in aqueous solution, in contrast to other approaches that may rely on the use of hard-coded rules of proton distribution, previously observed tautomerization patterns from a known chemical space, or human input. We present

1

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 52

examples of our algorithm when applied to organic and drug-like molecules, with a focus on novel structures where traditional methods are expected to perform worse.

1 INTRODUCTION Tautomerism is a fascinating phenomenon. 1–3 It has gained a considerable traction in diverse areas of chemical research, for example cheminformatics, 4–7 medicinal chemistry, 8–11 electronic spectroscopy, 12,13 gas-phase spectrometry, 14,15 and X-ray crystallography. 16,17 When it comes to computational modeling of such properties as logD and pKa one often has to generate tautomers and compute their relative energies (score them). 18–23 These computational operations are necessary because failing to account for the existence of different tautomer forms or failing to recognize the lowest energy tautomer in a given solvent can result in serious mispredictions or disagreements with experiment. 24,25 As computational prediction of molecular properties becomes increasingly more common, and as screening applies to larger and larger sets of organic molecules, automated generation of tautomers becomes practically indispensable. Organic molecules with certain arrangements of double bonds in aliphatic chains and/or conjugated heterocycles may be prone to tautomerism. The number of plausible tautomers that could be considered for such medium-sized and large molecules can be prohibitively large, numbering in dozens and hundreds. In practice, however, chemists tend to rely on previous practical experience to consider only a handful of tautomers deemed energetically favorable. Needless to say, “manual” generations of tautomers are unacceptable if one is building algorithms and workflows that are meant to process molecules in an automated fashion. There have been numerous proposed solutions for tautomer generation, 6,19,21,23,26 although a complete solution to the problem, applicable in all contexts of chemical research appears to be elusive. A number of software programs utilize these and other developments to offer tautomer generation (see, for example, Refs. 27 and 28 ). Any algorithm for tautomer generation requires knowledge about which protons and which molecular sites can be involved in tautomeric transformations.

2

ACS Paragon Plus Environment

Page 3 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

This knowledge can come from built-in rules, such as predefined SMARTS patterns 29 encoding patterns of transformations, 6,21,30,31 or from the transformations built into InChI. 26 It can also come from human developers using experience or chemical intuition to favor certain motifs, typically accompanied by quantum chemical calculations to verify the energetics. 19 These approaches are not completely satisfactory, however. The rules-based approaches may fail to generate relevant tautomers if the necessary transformation rule is missing (such as in cases of novel or unusual heterocycles or uncommon type of tautomerism). The second alternative relies on imperfect human intuition and is a barrier to automation. More interesting, therefore, are approaches that determine protonation sites in an unbiased, automated way, using physics-based atom-level descriptors. A recent work by the Grimme group 31 reports on one such approach. It utilizes a semi-empirical localization of molecular orbitals as a means of locating protonation sites. The method of Grimme and co-workers is currently geared toward generating gas phase tautomers, although multiple applications have to deal with tautomers that exist in the solution phase. While in some cases the same tautomeric structure has been predicted to have the lowest energy in both gas phase and solution, 32,33 in general the relative energetics of tautomers is phase-dependent. 1,34,35 The aim of this work, therefore, was to develop a novel method to generate tautomers in an automated, unbiased fashion, without dependence on explicit rules for shifting protons around the molecule. The method should not require user input other than the initial structure, and it should generate tautomers relevant to the solution phase, particularly those that exist in aqueous solutions. Our central idea to achieve these goals is to use micro-pKa ’s as the key ingredient for a physics-based approach. Therefore, we first estimate the micro-pKa of every atom in the molecule with a rapid algorithm. Then, we use these atomic data to determine the protons’ capacity for dissociation, and, reciprocally, the heavy atoms’ capacity for being protonated or deprotonated. Having identified the atoms we expect to be active in the tautomeric equilibria, we then generate a full combinatorial set of tautomers involving all these atoms in an unbiased manner. A close relationship between solution phase micro-pKa ’s and tautomeric equilibria has been noted before, 6,19 but the idea of tautomer generation driven by a complete set of micro-pKa ’s has not been explored,

3

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 52

Molecule Rapidly evaluate micro-pKa at each atom in initial molecule and its protonation forms H

-

H

-9.8 1.6

N

2.1

4.1 N

-20.3 N

3.8

H

A = active I = inactive

A

-

N

11.2

H

7.3

= large value

Assign activity of atoms with respect to protonation and deprotonation H

A

N

A

IO

H

I

I

I

9.2 -

-6.7 N

H

I

3.8 -

-

N

N

10.3 H

O

H

-

-

-3.9 N -

25.3

5.4

O

N N

I H

A

I

N

I

H

A

I

Generate tautomers by combinatorial permutation of active atoms

PM7

Filter tautomers with a semi-empirical method

Low energy tautomers

Figure 1: A general organization of the tautomer generation workflow. assignments were made by the “sorting” method described in Section 4.2.

4

ACS Paragon Plus Environment

The micro-pKa

Page 5 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

to the best of our knowledge. a A small digression about the micro-pKa terminology, as used in this work, is necessary. Micro-pKa is traditionally defined as characterizing a proton exchange equilibrium between two individual tautomers. Statistical averaging over micro-pKa ’s that includes all the relevant tautomers for the given system yields macro-pKa which is directly comparable to the equilibria constants measured in experiment. For more information about the distinction between micro- and macro-constants see the early text Ref. 36 as well as the recent works. 37,38 Tautomers may possess various conformations, so, equipped with a computational method that operates on 3D structures, one can compute micro-pKa ’s for every different conformations for a given pair of the protonated and the deprotonated tautomers. Thus, a finer gradation, of micro-pKa ’s, reflecting conformational effects, is possible. In an earlier work, 39 we introduced such a gradation with the notion of nano-pKa that is defined as the pKa computed for a particular pair of the protonated and deprotonated conformations. Conformational statistical averaging over nano-pKa ’s, as described in Ref. 39 leads to micro-pKa . In this work, every time we compute micro-pKa we technically compute nano-pKa , since we deal with a particular pair of conformations, and do not perform conformational averaging afterwards. However, in order to not complicate the discussion with the less familiar terminology, we continue referring to these nano-pKa ’s as micro-pKa ’s. Obviously, nano-pKa ’s is equivalent to micro-pKa if we assume the existence of a single conformation for the protonated form and a single conformation for the deprotonated form. We can certainly make this assumption in this work, as it is immaterial for the construction of our pKa prediction methods, and allows for simpler terminology. The choice of method to evaluate the micro-pKa ’s is arbitrary (if sufficiently accurate) but for practical utility, it is desirable to be able to compute them as quickly as possible. Methods for rapid micro-pKa predictions have been reported before. 19,40,41 These methods, however, have been designed for, or at least tested on, narrow chemical spaces (typically common functional groups like amines or carboxylic acids). For our purposes we are seeking a method that in principle would be applicable to any type of atom or functional group, including those not typically perceived as prone to protonation/deprotonation.

5

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2: An analogy between the complete active space (CAS) of determinants from electronic structure theory and the CAS of tautomers from this work. The mobile (active) units to be permuted to create a CAS are illustrated within the dashed or solid lines (for determinants and tautomers, respectively). In this picture, hydrogens and heavy atoms marked by circles are analogous to electrons and orbitals, respectively. For detailed notation of active atoms see Figure 3. Our workflow uses a data-driven approach to efficiently evaluate the necessary micro-pKa ’s as a weighted average over known reference micro-pKa ’s comprising a training set, or what we shall refer to in the following as the interpolation set. See Section 4.2 for full details. It turns out that naively inspecting the micro-pKa ’s of every atom in the input molecule is insufficient to adequately classify whether the atoms are involved in the tautomeric equilibria or not, and this is not simply an artifact of inaccurate pKa ’s. In Section 3, we therefore explore in detail two algorithms for classifying tautomerically active atoms according to pKa data. To avoid bias, we then generate a combinatorial space of all tautomeric forms from all active tautomeric atoms, which we refer to as a complete active space (CAS) of tautomers, as discussed in Section 2. However, this space is typically much larger than the space of low-energy, or relevant, tautomers. We therefore also explain how we can efficiently “filter” them using a semi-empirical quantum chemical method that is reliable enough to flag very high energies and therefore certainly irrelevant tautomers. Figure 1 gives a high-level illustration of our workflow and Section 5 reports the results of our algorithm when used in practice. 6

ACS Paragon Plus Environment

Page 6 of 52

Page 7 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Here we would like to stress that the micro-pKa predictor should be viewed as a “pluggable” component of the tautomer generation method. Micro-pKa ’s can be predicted by approaches alternative to those described in this work, without compromising the main idea behind the scheme shown in Figure 1. We hope that it is clear from the above discussion that our approach is indeed physics- rather than rules-based, and should provide a less biased foundation for tautomer generation than other options. It is true that some human expertise goes into constructing and balancing the interpolation set, which in turn potentially biases which atoms are identified to be active in the tautomeric equilibria. However, the tautomer generation itself is carried out in an objective way. This situation can be likened to the development of new density-functional approximations in quantum chemistry. Although a training set of molecules is typically required to fit the parameters in the functional and is chosen in a more or less subjective way to “represent” chemical space, the subsequent calculations utilizing these functionals are generally viewed as entirely objective.

2 COMPLETE ACTIVE SPACE As explained in the Introduction, our method is designed to achieve an unbiased generation of tautomers, and to help with this we introduce the concept of a complete active space (CAS) of structures. The idea is borrowed from electronic structure theory, where the exact many-electron wave function, for a given set of one-electron orbitals, may be expressed as a linear combination of all possible Slater determinants constructed from a combinatorial distribution of all the electrons among all the orbitals. The set of determinants thus generated may be systematically truncated by selecting only a subset of the electrons or orbitals to combine. The resulting combinatorial subset of determinants is known as a complete active space of determinants. The truncation allows calculations to be performed that would otherwise be impractical due to the large number of determinants in the exact expansion. Compared to other approximations, the great advantage of this truncation is that it avoids putting bias towards any one determinant, but instead treats a

7

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

selected group on equal footing. 42 Similarly, one can imagine the (whole) space of all possible tautomers for a given molecule as the set of all ways in which the molecule’s protons can be combinatorially distributed among the molecule’s heavy atoms. (Attachments of protons to heavy atoms can only be limited by the heavy atoms’ maximal valencies, by the octet rules, such as four for nitrogen.) Like the full space of determinants in electronic structure theory, this full space of tautomers may be too large to be manageable in practice, and in any case, of little or no utility since it is only the relatively low-energy structures that we invariably care about. (For the drug-like molecule with brutto formula C21 H18 F3 N3 O5 , more than one billion tautomers can be generated this way, while only a handful or so will be energetically interesting.) The key question is how to systematically truncate this space of tautomers with minimal bias, and our answer is obvious from the analogy we have just described. Equating protons to electrons and heavy atoms to orbitals, we choose some physically-motivated subset of each and form a CAS of tautomers. See Figure 2 for an illustration of the concept. Indeed, as already mentioned, our physics-based approach to tautomer generation is to use the micro-pKa ’s of each atom to classify which protons and heavy atoms should be active in the tautomeric equilibria. At this stage of the discussion it is appropriate to note that our definition of tautomers purposefully avoids considering energetic barriers between any pair of tautomers. There are at least three reasons for that. First, there is no established or agreed upon hard divide on the energy scale between two proton distributions that will classify the pair as either two tautomers or, more generally, as two isomers. Obviously, the setting of such a divide would be made practically useless by a change in temperature. Second, if we explicitly considered energy barriers separating one protomer distribution from another we would have to know or compute the potential energy surface (PES) that encompasses the structures, which is certainly an impossible task, especially for a simple algorithm for tautomer generation that we seek. And third, even if some approximate estimate of the energy barrier could be found without constructing the whole PES, for example by finding a transition state on a linear path of proton transfer in internal coordinates, complex mechanisms of

8

ACS Paragon Plus Environment

Page 8 of 52

Page 9 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 3: (a): The graphical notation of active protons and active sites used in other figures in this work. X symbolizes any element other than H, and is in practice usually C, N, O, or S. (b): Examples of double and triple active sites. proton exchange in water would undermine the simplistic estimate. Even though our discussion will gloss over the estimation of energy barriers, the construction of our algorithm will be based on chemical principles. As such, it is expected to result in generation of tautomers likely to interconvert in solution.

Figure 4: Illustration of the concepts of the RpH range, and classification of heavy atoms into proton donors and proton acceptors.

Now it is necessary to discuss some particulars of the distribution of a molecule’s protons among its other atoms and to formalize the terminology that we are going to use. It will be convenient to refer to a set of the molecule’s detachable protons and protonatable sites as “active

9

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 52

atoms”. And in order to count the number of possible tautomers that the molecule can have we will need a special nomenclature for the types of sites among which the protons can be distributed. We distinguish between three types of sites: singlet, doublet, and triplet (not to be confused with the analogous terms referring to spin multiplicity). Singlet sites can exist in only one state. They cannot accept or lose protons. The carbon atom in CCl4 , as well as every chlorine atom in that molecule is a singlet site. Doublet sites can exist in two possible states that differ by a single proton. A good example of a doublet site is the carbon atom in trinitromethane CH(NO2 )3 : it can be either in the CH or C− state. Triplet sites can exist in three states: a base state, or with one extra proton added to it, or with two extra protons. A typical example of a triple site is an NH− group, which can exist on its own, can accept one extra proton and turn into NH2 , or can accept two extra protons and become NH+ 3 . Figure 3 provides a graphical definition of doublet and triplet sites. In the following discussion the total number of doublet and triplet sites in a molecule will be denoted by D and T , respectively. Because singlet sites do not change their protonation states they are of no significance when accounting for the total number of possible tautomers, so the total number of singlet sites does not need a special symbol. The total number of active protons in the molecule will be denoted M . If, for simplicity, we ignore the symmetry, stereoisomers, and other three-dimensional (3D) effects, we can write down the total formal number of tautomers Ntaut for the specified values of M , D, and T : Ntaut

  T  X D+T −ζ T , = M − 2ζ ζ ζ=0

(1)

where the round brackets denote the binomial coefficient. In general, given a molecule, it is clearly hard to establish which of its hydrogen atoms should be distributed among which heavy atoms without the danger of spawning a huge number of tautomers that play essentially zero role in actual equilibria, or neglecting potentially important ones. One might consider using pre-defined patterns for known functional groups and heterocycles to truncate the space, but these might miss important tautomers if the patterns are applied to a novel chemical structure (which is often the situation when searching for candidates in drug discovery, 10

ACS Paragon Plus Environment

Page 11 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

for example). Restricting the types of elements to be considered in tautomer generation is also fraught with problems. If one were to consider, for example, only nitrogens and oxygens as possible hydrogen acceptors, there are numerous examples of carbon atoms actively participating in tautomeric equilibria, and, conversely, multiple cases where oxygen and nitrogen atoms are essentially inert to hydrogen attachments, as in the nitro-group. The next section is dedicated to our approach for meeting this challenge; that is, to the task of selecting the active atoms using micro-pKa ’s.

3 SELECTION OF ACTIVE ATOMS VIA pKa Our main idea behind the selection of the active sites and active protons can be outlined as follows. The most likely protonation/deprotonation sites and active protons are those that have favorable micro-pKa ’s. These favorable micro-pKa ’s are associated with the region of common pH values accompanying practically important proton equilibria in water, taken in this work to be between 0 and 14 (range RpH ). In the following, we shall refer to two algorithms for classifying active atoms on the basis of this idea: the depth zero algorithm and the depth one algorithm. The first directly classifies each atom according to its associated micro-pKa , but it turns out to be inadequate in practice and led us to the development of the depth one algorithm. Both are explored in detail below.

3.1 Depth zero algorithm In this simple algorithm, we define heavy atoms as proton acceptors if the micro-pKa of their protonation is larger than the minimum value of RpH . Similarly, heavy atoms can be proton donors if the micro-pKa of their deprotonation is lower than the maximum value of RpH . In turn, the protons that are attached to proton donors should be considered active protons. If there are two or more protons attached to the proton donor, then only one of these protons should be considered an active one. The other ones are going to lead to equivalent tautomers later in the algorithm anyway,

11

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

and can be disregarded. Finally, if the active site can be both donor and acceptor it is classified as a triplet site, and if it is a donor but not an acceptor (or vice versa) it is a doublet site. See Figure 4 for an illustration of these concepts. The name of the algorithm was chosen to emphasize that we are identifying active atoms (hence tautomers) of the input molecule by considering only the properties of the molecule itself, and not its protonated or deprotonated forms in addition. In contrast, we will see in Section 3.2 that the depth one algorithm lifts this restriction. Note, for example, that the depth zero method cannot identify triplet active sites which are formed by consecutive double protonation or deprotonation of a heavy atom. Fortunately, double subsequent protonations and deprotonations are rare in practice, unless the initial structure is an uncommon protonation form. Therefore, if we assume that we can predict the micro-pKa of every atom in the molecule sufficiently accurately for our purpose, then we might hope that the simple depth zero method should offer a reasonable approximation to classify the active atoms.

Figure 5: The molecule of 2-amino-4-hydroxypyridine with its active protons and active sites marked according to the results produced by a micro-pKa predictor (MPP) and using the depth zero tautomer generation algorithm. The active sites are marked by numbers (in italics for triplet sites) and their proton occupations are shown in square boxes. Only one hydrogen of the amino-group needs to be marked as an active hydrogen, the other being equivalent, as explained at the beginning of Section 3. The green square indicates the tautomer and its mobile elements that are ready for generating the final set of tautomers by distributing the active hydrogens across the doublet and triplet sites. To understand better the truth of this statement, we shall first consider an example where the depth zero algorithm successfully produces a reasonable set of tautomers. Next, however, we will explain a situation in which the simple method fails to generate some important classes of 12

ACS Paragon Plus Environment

Page 12 of 52

Page 13 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

tautomers, fueling the need to go beyond the zero order method. Consider the molecule of 2-amino-4-hydroxypyridine for which our micro-pKa predictor (MPP) might select the active protons and active sites as shown in Figure 5. These are actual sites selected by one of the MPP “flavors” described in Section 4. Here we have M = 2, D = 2, and T = 1, which brings about Ntaut = 4 tautomers, also shown in Figure 5 (for the notation see Formula 1). For now we can ignore the possible cis-trans isomerism that is of relevance to one of the tautomers, the one with the imino-group. The cis-trans isomerism is a relatively easy problem which can be handled in post-processing. The four tautomers generated as a result of identification of active protons are a result of an exhaustive distribution of the two mobile protons according to the algorithm outlined in the beginning of this section, and are as expected. However, the molecule which will be discussed in the next example reveals a problem in the zero depth algorithm. The depth zero algorithm misses some tautomers that would be expected by a human, even if the micro-pKa’s of the initial molecule are computed accurately. First we show the example, discuss the reason behind the failure, and then propose an improvement to the algorithm that should remedy the problem. Consider the molecule of 2-amino-4-hydroxypyridinium (which is a protonated form of the molecule from Figure 5). The left part of Figure 6 (corresponding to an application of the depth zero algorithm) indicates the molecule’s active hydrogens and active sites, as might be suggested by an MPP. All these suggestions for which atoms should be proton donors and proton acceptors are seemingly reasonable. The hydroxy-group’s proton is dissociatable although the group’s oxygen cannot be protonated. This agrees with our expectations for phenol-like compounds. The pyridinium nitrogen can only release its proton, but it cannot be protonated. This is also in line with expectations for pyridinium-like structures. Finally, it is imaginable that one of the hydrogens of the amino-group is liable to dissociation (especially while the heterocycle is protonated). The protonation of the amino-group would create a doubly protonated species and is unlikely to happen in a common pH range. Even though all the active atoms of 2-amino-4-hydroxypyridinium are as expected and apparently need not be augmented with other active atoms, these active atoms lead

13

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

to a generation of a single tautomer only, as shown in the left part of Figure 6. There are three protons to distribute among thee proton donor sites, and there is only one combinatorial outcome. It is obvious, however, that a significant tautomer, with the protonated amino group and the deprotonated pyridinium nitrogen, is absent from the depth zero algorithm output (see Figure 6). The missing tautomer would have been generated if the amino group’s nitrogen in the original tautomer had been identified as a triplet, not as a doublet site. We have, however, already admitted as reasonable the fact that the amino group of the original tautomer is only likely to be deprotonated and not protonated, while a proton is attached to the pyridinium nitrogen. If the proton could be lifted from the pyridinium nitrogen the amino group would be recognized (by our MPP) as protonatable. And the pyridinium group could indeed be deprotonated, as already indicated in Figure 6. The problem with the depth zero algorithm is that it does not have enough depth to consider protonation of atoms as the other deprotonatable active sites shed their protons, and, likewise, it does not consider removal of protons as the other protonatable sites gain protons. To resolve this situation, one needs to go one level further in exploration of possible protonation or deprotonation in order to better identify the active atoms. This task is left over to the depth one algorithm. We must conclude that even though depth zero algorithm is a reasonable start for tautomer generation in some situations, such as for 2-amino-4-hydroxypyridine, it is not sufficient in other situations. Application of the depth zero algorithm to diverse tautomerism-prone structures confirms this conclusion.

3.2 Depth one algorithm It appears more pedagogical to explain the operation of the depth one algorithm with an example, and only then formulate it rigorously. The right hand side of Figure 6 illustrates the approach when applied to 2-amino-4-hydroxypyridinium, the molecule for which the zero depth algorithm was shown to be unsatisfactory. In the depth one scheme, we start off with an input molecule and initially identify its active 14

ACS Paragon Plus Environment

Page 14 of 52

Page 15 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

7

Figure 6: An illustration of the depth zero and the depth one algorithms of generation of active atoms. The green borders indicate the tautomer and its mobile elements that are ready for generating the final set of tautomers by distributing the active hydrogens across the doublet and triplet sites.

15

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 52

protons and active sites with the depth zero algorithm. But in contrast to the depth zero algorithm, we also explore the protonation/deprotonation equilibria after additional protonation or deprotonation of the original molecule. For 2-amino-4-hydroxypyridinium, no atoms can be protonated, according to the results of the depth zero algorithm, so we only have the choice of deprotonating atoms. There are three detachable protons and we remove these one by one. We then explore the proton micro-equilibria in the resulting deprotonated structures by repeatedly applying the MPP and identifying any new active protons and active sites as before. In this example, after removing any of the protons, it turns out that the amino nitrogen atom is identified as a proton acceptor by the MPP. Therefore, unlike the depth zero algorithm, this site must be classified as a triplet site instead of a doublet site. Therefore the total number of active atoms is still six, but now consists of three active protons, two doublet sites, and one triplet site. Note that the addition of the triplet site is sufficient to generate the tautomer that was manifestly missing in the depth zero algorithm, showing the utility of the new method. This result is illustrated in Figure 6, where the structure in the green rectangle on the right hand side represents the final classification of active atoms. Distributing the active protons among the active sites yields three tautomers (for M = 3, D = 2, and T = 1) as shown in Figure 7. At this stage one could consider repeating the iteration again on the latest set of active atoms, applying the MPP and searching for any additional changes to the classification of atoms and active sites. The totality of active atoms should then be augmented with any new atoms that are found, and this process should be iterated until the number of active atoms becomes constant. Thus, we go from depth zero to depth one in an iterative fashion until convergence. The formal description of the algorithm can then be outlined as follows: 1. Read in a single molecule X of charge C. 2. Initialize empty sets {S} and {P }, corresponding to the active heavy atoms and active protons, respectively. 3. Generate the micro-pKa ’s for all atoms using an MPP.

16

ACS Paragon Plus Environment

Page 17 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

4. Classify the active protons and active sites for the pKa ’s that fall within the range RpH (typically 0 to 14). 5. Update the sets {S} and {P } with the newly found active sites and active protons. 6. For each proton M in set {P }: (a) Remove M to create molecule XM of charge C − 1. (b) Repeat steps 3 and 4 for XM . (c) Update the sets {S} and {P } with the newly found active sites and active protons, excluding, however the site of attachment of M when updating {S}. 7. For each active site S in set {S}: (a) Protonate S to create molecule XS of charge C + 1. (b) Repeat steps 3 and 4 for XS . (c) Update the sets {S} and {P } with the newly found active sites and active protons, excluding, however, the site S when updating {S}, and the proton that was added to that site when updating {P }. 8. Repeat steps 6 and 7 until {S} and {P } do not change. 9. Distribute all protons from {P } across all the sites from {S} so that the total adds up to Ntaut from Formula 1.

3.3 Possible generalizations and improvements It should be straightforward to generalize the algorithm given in the previous section to one more level of depth. Such an algorithm would not just iterate over all the protons, as in step 6 or iterate over all the active sites, as in step 7, but would also go over all pairs of protons and active sites in a similar fashion. Or, similarly, one might add an extra loop inside steps 6a and 7a to create structures with charge C −2 and C +2. In other types of generalizations one might consider quadruplet active 17

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

sites in addition to triplets and doublets. Another generalization consists in applying either depth zero or depth one algorithms to all the generated tautomers in an iterative fashion, until no new tautomers are produced. We have not attempted to implement and test these generalizations and leave their investigation for future work. An exploration of conformational space might be important for the generation of some tautomers, for instance those with hydroxy-groups attached to aromatic rings. This operation can be included in our tautomer generation workflow by launching a conformational search on every tautomer generated, and selecting one or more of its stable conformers, before proceeding to the next steps.

Figure 7: The set of tautomers of protonated 2-amino-4-hydroxypyridine produced by the depth one algorithm. See Figure 6 for the set of tautomers produced by the inadequate for this case depth zero algorithm. The active sites are marked by numbers (in italics for triplet sites) and their proton occupations are shown in square boxes. The green square indicates the tautomer and its mobile elements that are ready for generating the final set of tautomers by distributing the active hydrogens across the doubled and triplet sites. A few words must be said about the ring-chain tautomerism. 22,43–45 This relatively uncommon but still important type of tautomerism is actually already addressed by the algorithms described above, as far as the identification of the active atoms goes. However, the 3D structures of the generated proton distributions corresponding to ring-chain tautomerism might be unphysical because some of their bond lengths, in the 3D structures generated by force fields might be too long or too short. In order to generate stable, normally-looking structures of these chain-ring tautomers it might be necessary to combine the proton-distributing algorithms with an exploration of the conformational degrees of freedom followed by a quantum chemical geometry optimization. Such workflows would have the possibility to form and break rings, according to the stability of the 18

ACS Paragon Plus Environment

Page 18 of 52

Page 19 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

initial proton distributions. We have checked that such an approach works and successfully produces a ring-chain tautomer with a satisfactory geometry for a single uncomplicated molecule. This effort should be regarded as a proof of principle only. Another solution for ring-chain tautomer generation is detecting certain types of atoms in the generated tautomers (such as protonated carbonyl groups and deprotonated hydroxy-groups) capable of forming five- and six-membered rings upon covalent bonding. This solution is less attractive conceptually because it requires that we handle Lewis structures and atom types, but it might be effective post-processing step in practice that does not threaten the overall integrity of the tautomer generation methods based on micro-pKa ’s. A more thorough exploration and validation of ring-chain tautomer generators is needed in a future work.

4 THE MICRO-PKA PREDICTOR The implementation of the tautomer generation algorithms discussed in the previous section is contingent on the availability of a method that would predict micro-pKa ’s of all the atoms in the molecule, and thus identify its active atoms. For a molecule with i heavy atoms and j hydrogen atoms, we are seeking i + j micro-pKa ’s corresponding to protonation of every heavy atom and removal of every hydrogen atom. The algorithms would still be valid if the MPP could act on just a few of the atoms in the molecule. The protons and sites with the unidentified micro-pKa ’s could then be regarded as non-participating in the protonation/deprotonation equilibria. But when designing our tautomer generation algorithms we tried to avoid this simplification. Additionally, due to the large number of micro-pKa ’s that need to be computed in a typical workflow involving drug-like molecules, it is desirable to have a MPP with a relatively low computational cost. Finally, the method should have sufficient accuracy to reliably classify every atom as either active or inactive for the purposes of tautomer generation. Fortunately, there is some built-in tolerance to the accuracy we require from the MPP. Clearly, the danger of wrongly classifying an atom is larger if the true micro-pKa lies near the minimum or maximum values of RpH . But it

19

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

is worth recognizing that some atoms are more important to classify correctly than others if we are to avoid missing the dominant low-energy tautomers, which is the prime goal of the tautomer generator. Good proton acceptors are expected to have relatively high micro-pKa ’s of protonation while good proton donors are expected to have relatively low micro-pKa ’s of deprotonation. This should be clear from Figure 4. Therefore, it is the weak proton donors or acceptors which are more likely to be misclassified, and these are much less likely to contribute to the low-energy tautomers. In short, unless the errors of the approximate micro-pKa prediction are rather large, the influence of misclassification for the generation of the highest-populated tautomers should be negligible. Let us first consider using our DFT-based Jaguar pKa program as an MPP. The Jaguar pKa program, 38,39,46 being a hybrid DFT-empirical approach 47 that covers a large chemical space, can conveniently predict the micro-pKa ’s of most, if not all atoms in a molecule. The Jaguar pKa program embodies an elaborate workflow that combines DFT calculations (using B3LYP for a density functional and PBF for implicit solvent model) with empirical corrections. The latter are fitted to experimental pKa values for a large number of functional groups and compensate for the deficiencies of DFT energetics in gas and solution phases. The workflow has been shown to give accurate pKa predictions for diverse types of structures (including large and flexible drug-like molecules), with the errors lying typically within 0.5 - 1.0 pKa units. 38,39 However, there are two objections to using Jaguar pKa for the purposes of generating tautomers. First, Jaguar pKa is not going to be useful for routine and practical generation of tautomers, because DFT calculations that it performs (even though they are highly parallelized) may take hours for a single micro-pKa prediction for large drug-like molecules. The second problem is that the parameterization and training sets on which Jaguar pKa relies have often not been designed with a view to accurately handle unusual protonation and deprotonation processes that accompany micro-pKa prediction of all the atoms in a molecule. Therefore, the accuracy will be highly variable and possibly inadequate for all atoms and tautomers, even though it is very good in many situations. 38,39 As an alternative to expensive ab initio methods like DFT, we might consider an empirical

20

ACS Paragon Plus Environment

Page 20 of 52

Page 21 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

rules-based or atom-type based MPP. Earlier in the text we warned about the dangers of these approaches. Judging from the way such methods are realized in practice, they are likely to work satisfactorily on well-known structure types or a subset of micro-pKa ’s, but underperform or even fail on less common ones due to an absence of parameters or rules for some types of atoms or functional groups. A more serious flaw in approaches based on atom-typing is that they struggle with the problem of alternative mesomer (or resonance) structures. Ideally, atom parameters should be assigned regardless of the molecule’s resonance structure (the way the double bonds and formal charges are distributed across the atoms). Figure 8 shows a couple of typical mesomeric complications which might lead to different assignments of parameters (such as atomic radii, atomic charges, or micro-pKa ’s), depending on which of the mesomeric representations the molecule is passed to the atom-typing code. Importantly, there is no “right” or “better” resonance structure from the quantum chemical point of view; it is more appropriate to talk about a more “conventional” or “familiar” resonance structure. The conventionality, or appropriateness of resonance structures can be determined, for example, on the basis of chemical intuition for where formal atomic charged should be placed, or following a computation of partial atomic charges, which are not physical observables. Atom-typing codes deal with the problem of mesomers in several ways, for example, by canonicalizing resonance structures or by making rules for formal equivalence for some of them (see, for example, the complicated rules pertaining to resonance structures in an atom typing algorithm from Ref. 48 ). Although such solutions work for anticipated types of structures, it is usually not difficult to break them for less conventional ones. This can lead to different results for different resonance structures used as input. Because of the problems indicated in this and the previous paragraphs, atom typing approaches do not look attractive to us. For the purposes of the current work, we therefore decided to avoid atom typing and develop our own MPP designed with the following criteria in mind. It should have a relatively low computational cost, be applicable to all atoms (typically for drug-like molecules), and have reasonable (but not necessarily high) accuracy with low variance (i.e. robust to different charge states, elements

21

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 52

and functional groups). Clearly this is a challenging research problem in its own right, so we are leaving the task of perfecting the MPP until future work, and for now we set the objective of constructing a reasonably well-performing engine sufficient to illustrate the utility of the proposed tautomer generation algorithms. In Section 4.2 we describe in detail our new MPP method and in Section 5 we present results when applying it to tautomer generation.

4.1 Local atomic descriptor The MPP used in this work depends crucially on the existence of a similarity metric we can use to compare an unknown pKa process to a training set of reference data. In this Section we therefore describe a key component of this metric, which we refer to as a local atomic descriptor (LAD) which serves the purpose of comparing the 3D geometric environment of dissociating protons in two pKa processes, and also includes some information about the electronic similarity. A description of the LAD along with the motivation for its construction and examples of its use has already been provided in our earlier work. 38 In the present work we utilized a slightly modified variant of LAD, the so-called nested LAD, a brief account of which is given below. The local atomic descriptor for atom A consists of a pair of quantities: h i ~ A , qA . LA = M

(2)

Here, qA is simply the partial charge on atom A, which can be computed by a variety of methods. In the present work it is computed by the molecular mechanics OPLS3e method. The crux of the ~ A , which we refer to as the multidimensional coordination number (MCN), LAD is the vector, M h i ~ ~ ~ ~ MA = CNA (H), CNA (He), CNA (Li), ...

(3)

with one component for each chemical element. In our nested-LAD variant, each of these components is also a vector, resolving the coordination over every individual pair of atoms in the molecule. For

22

ACS Paragon Plus Environment

Page 23 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

example, ~ A (H) = [CNA (H1 ), CNA (H2 ), ..., CNA (HN )] CN

(4)

for every hydrogen atom Hi , not including atom A, and the individual (scalar) coordination number CNA (B) for a given atom pair is defined by    1 k2 (RA + RB ) = 1 + exp −k1 −1 CNA (B) rAB

(5)

The adjustable parameters k1 and k2 are pre-optimized on a training set, RA and RB are preset atomic radii, and rAB is the interatomic distance between A and B. The coordination numbers in the form given by equation (5) were originally developed by Grimme and co-workers 49 and are an important part of their post-energy correction to density functionals known as the D3 correction. The regular LAD used in our previous work is related to the nested LAD by summing up all the scalar coordination numbers for each element. An advantage of the nested LAD is that it provides a higher “resolution” description of the local environment. Importantly, all the components of the LAD have no formal dependence on interatom connectivity (also known as single, double, or triple bonds) or formal charges. This attractive feature makes LAD independent of the Lewis (resonance) structure 50 typically set by the user or a software program. Different representations (Lewis structures) of the compound should produce exactly the same LADs for the structure’s atoms, provided the internal coordinates are conserved. It is worth noting that modern software, if passed different two-dimensional (2D) Lewis representations of the same structure, will normally generate 3D structures with slightly different internal coordinates. This is because the varying connectivity (such as single and double bonds) between the same atoms in 2D leads to different bond lengths and angles in 3D, as assigned by molecular mechanics component of the software. A quantum mechanical or semi-empirical geometry optimization in the gas phase usually generates the same internal coordinates for all the structures of concern and thus resolves the problem. In any case, small deviations in the internal coordinates will lead to small differences in the LADs, which are smooth and differentiable (with respect to atomic

23

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 52

displacements) entities. Having introduced the descriptor itself, we can now define an “overlap” or “similarity” between two descriptors LA and LB , where atoms A and B, designating local environments, can reside either on the same structure or on two different structures. For two LADs LA and LB , which can be associated with atoms from the same structure, but are usually associated with atoms from two different structures, we define the overlap as: i h ~A −M ~ B )2 − wq (qA − qB )2 LAB = exp −wM (M

(6)

where the adjustable parameters wM and wq control the interplay between the geometric and charge components of LA and LB . To evaluate this expression in practice, we sort the vectors of coordination numbers for each element from largest to smallest to remove an arbitrary dependence ˆ let on atom labeling. Denoting the sorting operator by S, ~A −M ~ B )2 = (M

2 X ˆ CN ~ A (E)] − S[ ˆ CN ~ B (E)] S[

(7)

E

where E ∈ {H, He, Li...} and ~ A (E) − CN ~ B (E) = CN

X

|CNA (Ei ) − CNB (Ei )|

(8)

i

where the sum over i runs over the smallest number of components in the two vectors.

4.2 The interpolation weights and pKa estimator As mentioned in the Introduction, the MPP used in this work is based on an empirical data-driven approach to estimate micro-pKa ’s as a weighted average over known reference values. See Eqn.(12). The weights are derived from a similarity metric used to compare the dissociation process of interest to those in the reference data. The metric is defined using both local information from LAD’s as well as non-local descriptors because protonation equilibria are often dependent on an 24

ACS Paragon Plus Environment

Page 25 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 8: Examples of different mesomeric representations of the same molecule [an asymmetric guanidine in (a) and 4-pyridone in (b)] that might lead to different assignments of atom parameters in atom-typing approaches if the resonance structures are not explicitly treated as equivalent with an additional piece of code. interplay between both the local capacity for protonation or deprotonation and non-local energetic stability. While the local descriptor helps establish the similarity of 3D geometries in the vicinity of the dissociating proton, the non-local descriptors capture long distance effects, particularly the resonance effect and stabilization by aromaticity. The influence of the inductive effect is actually captured by the LAD descriptor through the built-in partial atomic charge. Below we describe precisely how this similarity metric is defined and how the final micro-pKa ’s are constructed in practice. Note that we will refer, for example, to a deprotonation process using the notation AH = A− + H+ where AH is the molecule of interest and A is strictly a molecular fragment. For notational convenience, however, we shall also refer to A as being the “heavy atom” attached to the dissociating proton, to be used as the reference atom for LAD generation, for example. The interpretation of A as fragment or atom should be clear from the context.

Figure 9: Two molecules with very similar 3D environments (around the dissociating hydrogen atom shown in orange) and delocalized positive charge cannot be effectively distinguished by LAD alone without the assistance from the additional energy descriptor ∆E and the damped total charges from Eqn.(9). Upon dissociation, the molecule on the left produces an aromatic species and thus is expected to have a large negative pKa whereas the molecule on the right is an aliphatic species with a large positive pKa . 25

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 52

Our reference data, or interpolation set, is explained in detail in the following section. In summary, it includes N data points, where N is currently 20201 (although this number is expected to grow in future versions of the program as we add new data) and each data point describes a deprotonation process AH = A− + H+ for a particular molecule AH and heavy atom A. For some molecules, data points are available for multiple atoms. In addition to a reference micro-pKa value (taken from experiment or high-quality theoretical data) each data point has two descriptors: (1) the LAD for AH computed at the heavy atom A and (2) the gas phase energy difference ∆E = E(AH) − E(A− ) computed at the PM7 semi-empirical level of theory. 51 In our tautomer generation algorithms, we need to compute the micro-pKa for every atom in the input structure, but to describe the implementation of our MPP, it is sufficient to explain how this is done for one heavy atom X. Let us first consider the deprotonation process XH = X− + H+ . In this case we compute the LAD at X (while X is attached to H), denoted Lx . Next we compute the set of overlaps {Lix }i=1..N of Lx with the N LADs from the interpolation set using the overlap metric 6. It turns out that even though the LAD already includes atomic charge, the single partial atomic charge is not very effective at describing the electronic environment when the charge can be delocalized through the resonance effect. For example, LADs alone cannot effectively distinguish the two structures like those shown in Figure 9. Therefore, for the development of our MPP, we introduce two new descriptors to augment the LAD and define the similarity metric. For the first additional descriptor, we use the following damped total charge definition for an arbitrary atom A: QA =

X

1+ B6=A

qB −s e q (R−rAB )

(9)

where the sum goes over all the atoms B in the molecule with partial atomic charges qB . Here, ˚ and rAB is the distance to atom B from the reference atom A. The adjustable parameter R=3A sq is defined below. The idea is that QA contains more non-local information than the simple atomic partial charges. We then define a new quantity ∆Qix using the difference between these charges for atom X in the input molecule and the heavy atom associated with data point i in the 26

ACS Paragon Plus Environment

Page 27 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

interpolation set as ∆Qix = |Qx − Qi |.

(10)

Finally, for an additional global descriptor, we compute the PM7 gas-phase energy difference ∆Ex = E(XH) − E(X− ) and compare it to those in the interpolation set to define the quantities

∆∆Eix = |∆Ex − ∆Ei |

(11)

where ∆Ei is the energy difference E(AH)−E(A− ) stored for every data point i in the interpolation set. We have found that these additional two descriptors are very important for describing changes in stability of molecular species upon deprotonation in cases where the deprotonation leads to significant changes of stability associated with resonance or aromaticity but where the local 3D geometries are very similar. We now have all the components we need to estimate the micro-pKa of the deprotonation XH = X− + H+ using the weighted average over micro-pKa ’s from the interpolation set according to the equation pKa =

PN

wi pKa,i PN i wi

i

(12)

where the pKa,i are micro-pKa ’s from the interpolation set, and the wi are weights, as defined later in the text. For brevity, we dropped the index x in the notation for the final weights for atom X. The denominator accounts for normalization of the weights. The weights decay rapidly and typically only a very small fraction of the data in the interpolation set contributes significantly to the average. This approach is similar to the (weighted) k-nearest-neighbor methods commonly used in the machine-learning community. 52 The weights are defined as a product of three contributions: Q L E wi = wix × wix × wix

27

ACS Paragon Plus Environment

(13)

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 52

where we let L wix =

E wix =

1 1 + e−sL (Lix −cL )

,

1 1+

e−sE (∆∆Eix −cE )

(14) ,

(15)

and Q wix = e−sq ∆Qix .

(16)

The parameters sL , cL , sE , cE and sq were obtained by fitting to 1413 micro-pKa values (known from experiment and previous calculations) for 76 neutral, cationic, and anionic molecules. The optimized parameters are, respectively, 80, 0.7, 10, -10.0, and 100. The non-linear optimization was performed in an iterative process, in which all parameters except one were fixed, and the minimum was found for the single parameter that was varied. For more robust structural matching of the proton dissociation process the overlap Lix may also deprot be defined as Lprot , where Lprot and Ldeprot are computed at the same heavy atom but ix × Lix ix ix

for the protonated and the deprotonated forms, respectively. In practice we found this to be more satisfactory and this is the form we used in the results presented below. Now, after having evaluated the micro-pKa of the deprotonation process XH = X− + H+ at atom X it is necessary to evaluate the micro-pKa of the corresponding protonation process + XH+ 2 = XH + H . To perform the latter operation we additionally protonate the molecule at atom

X (even though X might be already bonded to a proton), thus creating XH+ 2 , and then repeat the whole series of operations explained in the previous paragraph for the protonation form XH. Finally, we classify the atoms for our tautomer generator. If no proton is attached to X to begin with, then X obviously cannot be a proton donor and the only micro-pKa that we can evaluate at this atom will correspond to the dissociation of the additionally protonated X. The pKa of the process XH = X− + H+ will determine if X is a proton donor while the pKa of the process + XH+ 2 = XH + H will define if X is a proton acceptor (see Figure 4).

When the interpolation set becomes sufficiently dense (in a sense that it covers the relevant chemical space well), it is possible to dispense with the interpolation procedure and with the 28

ACS Paragon Plus Environment

Page 29 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 10: The test set of molecules prone to tautomerism. Results of applying different tautomer generation methods to these molecules are summarized in Table 1. calculation of weights 13, and simply select the pKa corresponding to the top match from the interpolation set. For example, we have found the following progressive sorting, or filtering procedure efficient for the interpolation set consisting of over 20000 data points. Sort all matches by the L overlap and select the first hundred top matches (those that have the largest values of L). Within these hundred matches, select thirty that have the smallest ∆Q value associated with Formula 11. Within these thirty matches, select one that has the minimal ∆∆E∆Q/L value. Then take the pKa of this single match as the final result. This approach will be referred to as the “sorting” method. It is easy to interpret the final pKa value, if need arises. The weights wi corresponding to the best matches (or a single best match, in case of the sorting method) can be output alongside the structures and their protonation sites. This way one can see which structures contribute most significantly to the final pKa value. The accuracy with which the “interpolation” and the “sorting” methods classify atoms into active and inactive is comparable. One test set consisted of 76 molecules and 1400 atoms, where the atom activities (active or inactive) were assigned by humans using chemical intuition and, in case of doubt, Jaguar pKa calculations. The amount of true actives (expected active and predicted active), true inactives (expected inactive and predicted inactive), false inactives (expected active

29

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

and predicted inactive) false actives (expected inactive and predicted active) were, respectively, 126, 1169, 84, 21 for the interpolation method and 182, 1131, 28, 59 for the sorting method. As only the false inactive predictions are considered undesirable, this corresponds to 94% and 98% of accuracy, respectively. In view of the rather small size of the test set, and due to the fact that the performance of the algorithms was compared to assignments based on chemical intuition, it would be premature to claim that one method is definitely more accurate than another. The results presented in this work were generated with the interpolation method, unless otherwise indicated. A critical question at this point is the construction of the interpolation set. If chemical similarity between the query molecule and the interpolation set is low (the maximal overlap L being below 0.2 or so) then neither the “interpolation” nor the “sorting” method is expected to yield reasonable pKa values. See the next section for a way to deal with this problem.

4.3 The interpolation set The success of the pKa estimator Eqn. (12) depends on both the evaluation of the weights wi and the completeness of the interpolation set. The previous section dealt with the weights and the current section will go over the construction of the interpolation set. An interpolation set {I} consists of data points Ik . Each data point Ik is composed of the following: (i) a pair of 3D structures the connectivity of which differ by a single proton only; (ii) the micro-pKa value associated with the proton dissociation process; (iii) MCN for each of the two structures computed at the heavy atom X and the proton attached to X; (iv) semi-empirical heats of formations computed for each of the two structures. As our interests for tautomer generation lie with organic and drug-like molecules we used such molecules for building our interpolation set. Construction of interpolation sets for other chemical spaces should be also achievable. A small number of data points Ik had their micro-pKa values taken from experimental measurements (whenever they were available) but for the majority the micro-pKa ’s were computed by the Jaguar pKa program. The computation of MCN and heats of formation was according to the procedure described in the previous section. 30

ACS Paragon Plus Environment

Page 30 of 52

Page 31 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

When constructing an interpolation set one strives to produce a few large weights wk corresponding to data points Ik the local environments of which resemble the local environments of atoms of molecule X, for every molecule X of interest. When all values wk in a particular micro-pKa calculation are small (less than 0.1) this means that the interpolation set does not represent the particular environment of molecule X well. Interpolations in which all weights are small are not likely to yield reasonable micro-pKa predictions. In practice, two things can be done in such situations. First, one can automatically enrich the interpolation set by fragmenting molecule X along a few key single bonds (excluding cycles) and launching Jaguar pKa calculations on the fragments. Fragmentation does not greatly perturb the local environments and renders the DFT-based pKa calculations much faster because such calculations scale as the second to third power of the number of atoms. Then, a subsequent calculation on molecule X or its analogs is going to have weights wk of appropriate magnitude. This approach is not always satisfactory in practice because Jaguar pKa calculations are computationally expensive and usually require several minutes for each atom environment. This will greatly slow down generation of tautomers for molecule X. However, once the enrichment for molecule X has been completed it is likely to be unnecessary for molecules similar to X, if those are to be processed next. It has been already mentioned that Jaguar pKa does not always predict reliable micro-pKa values when applied to highly unusual protonation/deprotonation processes (such as protonation of carbon atoms in heterocycles). There might be also problems with highly unstable structures thus created, which, for instance, fall apart or otherwise “react” during geometry optimizations. In such cases we fall back on the second options which is described in the next paragraph. The second option is pragmatic. If no molecules in a sufficiently enriched interpolation set match a particular atom of X we assume that the atom is inactive. So, we assign a large positive pKa value µ to it if the atom is a hydrogen, and a large negative value −µ otherwise (the actual value of µ is immaterial, as long as it signals an inactive atom). The logic behind the assumption that the atom is inactive is based on the observation that in a typical molecule there are actually very few atoms that are active. Active atoms also tend to be given more prominence in an interpolation

31

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 52

set than their “natural” proportion in molecules would otherwise suggest. This is because the micro-pKa ’s of such atoms are more often measured and predicted computationally and are more readily available for an inclusion in the interpolation set. Thus, if an atom has no good match to anything in the interpolation set it is more likely than not to be inactive.

Figure 11: Tautomers 4a and 4b with relative M06-2X-D3/cc-pVDZ(-f) solution phase (LAD-PBF) energies 0.00 and 1.65 kcal/mol, respectively, are related by a long-distance proton transfer, which might be difficult for local rules-based methods to handle, but is not a problem for approaches such as depth one. There may be a high proportion of neutral molecules (and their singly protonated and deprotonated forms) in the interpolation set. This is natural because researchers tend to deal with neutral molecules and their conjugate acids and bases (as opposed to multiply charged ions). At depth one, given a neutral molecule as an input, unless it lacks active atoms at depth zero, we necessarily have to compute micro-pKa ’s of singly protonated and singly deprotonated molecules. The energy gaps ∆∆E generally depend on the total charges of the protonated/deprotonated pair, even for very similar local environments. This is because it generally costs more energy to protonate a positively charged molecule than a neutral one (the local environment being similar), and similarly for deprotonation. When predicting pKa with one of our proposed approaches, in order to get E several large values wi one would need to have large energy weights wix . In agreement with the

argument made about the dependence of the energy gap on the total charge, if one predicts the pKa ’s of molecule X with the total charge C at depth zero, one needs a good representation of molecules of charge C (as well as their conjugated acids and bases) in the interpolation set. If one uses the depth one algorithm for molecule X with the total charge C one’s interpolation set needs to have a good representation of molecules with the total charges C − 1 and C + 1 (as well as their conjugated acids and bases). Our current interpolation set consists of 15076 data points for 32

ACS Paragon Plus Environment

Page 33 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

charge transitions 0/ − 1 and 0/ + 1, and 5125 data points for charge transitions −1/ − 2 and +1/ + 2. This indicates that our current interpolation set should be more reliable for the generation of tautomers of neutral molecules. In view of this data distribution, the number and quality of our non-µ micro-pKa ’s assignments decreases with transition from neutral to progressively more charged molecules (see Figure 1). Certainly, it is more desirable to get an actual pKa prediction rather than µ because a predicted value based on well matched data gives more assurance in assignment of atom activity. With the addition of new data points to the interpolation set the proportion of µ pKa predictions should decrease.

5 ILLUSTRATIVE RESULTS The arguments earlier in the text make it clear that it is not the sheer number of generated tautomers but their relevance that matters in practice. The criterion for the relevance of a tautomer is its population in the tautomeric equilibrium. At room temperature, a few kcal/mol of energy difference from the lowest energy renders a tautomer’s population insignificant. General knowledge of physical organic chemistry implies that stability and reactivity can be viewed as complimentary and inversely contrasted concepts. 53 A few low-energy tautomers might thus be of interest in studies of covalent reactivity and binding to proteins. 17,24,54 Therefore, when generating tautomers higher-energy structures must also be considered. In the discussion below we will use the window of 5 kcal/mol (corresponding to about 0.02% population in the mixture between two tautomers at room temperature) for distinction between relevant and irrelevant tautomers. Even though this window is quite arbitrary (Greenwood and co-workers, 19 for example, use the window 2.76 kcal/mol) it will not influence the general conclusions about our tautomer generation method and will only serve as a convenient reference metric. Our numerous tests of the depth zero and depth one algorithms have shown that only the latter is satisfactory in terms of the number and quality of tautomers generated. For this reason, we will

33

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

omit the depth zero algorithm from the following discussion.

5.1 Computational details It is unnecessary for the purposes of obtaining a qualitative picture to go into the additional computational expense of computing free energies in solution, as the relative free energies of rigid, asymmetric molecules used in the illustrations should be very similar to their relative solution phase energies. The reference energies are computed at the M06-2X-D3/cc-pVDZ(-f) level of theory in conjunction with a locally modified implicit solvent model PBF, 55–57 as implemented in the quantum chemistry program Jaguar. 46 This variant of PBF, dubbed LAD-PBF, sets the atomic radii and the local non-electrostatic corrections to the solution phase energy using an interpolation scheme based on local atomic descriptor, following the methodology similar to the one delineated in Section 4.1. As the interpolation set we use 8567 atomic environments and their corresponding PBF parameter assignments using uncomplicated molecules on which PBF is known to perform reliably. The overall performance of LAD-PBF for predicting solvation energies is similar to the good performance of PBF (1.91 kcal/mol for LAD-PBF vs. 1.92 kcal/mol of mean unsigned error on a test set of 451 molecules comprising small and large neutral molecules, anions, and cations). Importantly, and thanks to the LAD-based parameter assignments, LAD-PBF energies are a function of Cartesian coordinates only and do not depend on the Lewis structure. The independence of energies on bond and formal charge assignments, especially in structures with charge delocalization, particularly anions and cations, is crucial when comparing relative energetics of generated tautomers. All geometry optimizations preceding M06-2X-D3 single point energy calculations, were performed at the B3LYP-D3/6-31G** level of theory in conjunction with the LAD-PBF model. The semi-empirical filtering with the PM7 method was performed with the computer program MOPAC2016 58 in the solution phase (COSMO solvation model), and the tautomers having relative energy above 20 kcal/mol were discarded. The energy window of 20 kcal/mol, which is substantially larger than the target energy window of 5 kcal/mol, is needed as a protection against a larger margin of error associated with semi-empirical methods. A 34

ACS Paragon Plus Environment

Page 34 of 52

Page 35 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

similar combined DFT/semi-empirical protocol for tautomer filtration has been used by the Varnek group. 59 Let us comment on the choice of theoretical methods that we selected to rank tautomers by energies. DFT functionals, in particular M06-2X, 60 have served as a theoretical method of choice in combined theoretical and experimental studies of tautomer populations in condensed media. 61,62 Benchmarking studies show that M06-2X is especially well suited for tautomer energetics predictions. 19,63 The dispersion-corrected analog M06-2X-D3 should have a similar performance to M06-2X and was recently selected as one of the few recommended functionals in a large-scale assessment of two hundred functionals by Mardirossian and Head-Gordon. 64 Less is known about performance of PM7 on tautomers, but the accuracy of this latest method in the PM series has been compared favorably to DFT, in application to conformers. 65 Since tautomers can be viewed as isomers, it might be also mentioned that the predecessor of PM7, the PM6 method was shown to rank the stability of neutral isomeric species reliably (in agreement with M06-2X) but performed badly for anions. 66 PM6 was also used for ranking tautomers in the recent work. 59 The semiempirical methods MNDO and AM1 were not found to be sufficiently accurate tautomer filters by Cruz-Cabeza and co-authors. 30 Filtration of the initial set of generated tautomers with semi-empirical methods like PM7 needs to be studied further. We have determined that the use of a solvation model was essential for reasonable ranking and for effective filtering. In our experience, this protocol works reliably for tautomers originating from structures like 1-8 (vide infra). However, we have found that for porphyrin-like tautomers such as those recently reported by Kim and co-workers 67 PM7 does not serve as a good filter, discarding low-energy tautomers generated at the depth one level, even when very large energy windows are used. Presently we regard such structures as exceptions for which more efficient filtering techniques need to be found.

35

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 12: A study of the invariance of the depth one/PM7 algorithm with respect to the input tautomer. The structures of input tautomers 6a-6j are shown horizontally. The tautomers generated from each of the initial tautomers 6a-6j are arranged vertically by their relative energies computed at the M06-2X-D3/cc-pVTZ(-f) level of theory in conjunction with the locally modified PBF implicit solvation model. The bar separating relevant from irrelevant tautomers is set at 5 kcal/mol.

36

ACS Paragon Plus Environment

Page 36 of 52

Page 37 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Table 1: The number of tautomers generated by various methods for the structures shown in Figure 10. Relevant tautomers are the structures with DFT energies within ǫ kcal/mol from the lowest energy. The data for ǫ = 2.5 and 5.0 kcal/mol are shown in the format Aǫ=2.5 /Bǫ=5.0 , where appropriate. Since the relevant tautomers were picked from the pool generated by the depth one method, it is posited for the purposes of this table that this method misses no relevant tautomers. Therefore, missing tautomers are those which are present in the list of relevant tautomers, but not found by the specified scheme. The cis-trans R = N − H forms are counted as separate tautomers.

Compound

Relevant

Depth one

1 2 3 4 5 6 7 8

2/5 3/3 2/2 2/2 1/1 2/3 2/2 1/2

9 9 14 4 6 10 10 35

Depth one + PM7 Generated Missing 8 0/0 4 0/0 5 0/0 2 0/0 2 0/0 8 0/0 9 0/0 4 0/0

Tautomerizer Generated Missing 5 0/0 1 2/2 2 2/2 1 1/1 4 0/0 1 2/3 1 2/2 1 0/1

5.2 Discussion We present results of the tautomer generation for a number of organic heterocycles susceptible to tautomerism. The structures themselves are displayed in Figure 10 (only one representative tautomer, the one that served as input to the algorithms, is shown per structure). This test set was constructed from molecules with unusual aspects of tautomerism, as previously discussed in Ref. 1 (i.e. structures 1-6), and from two additional structures 7 and 8 (also known as allopurinol and enprofylline) the tautomerism of which has been analyzed in some detail in the literature. 26,30 It must be noted that simpler and more familiar heterocycles prone to tautomerism such as 2or 4-pyridones did not present problems to either the depth one algorithm or to the rules-based Tautomerizer algorithm (vide infra) and were therefore not particularly interesting for our tests. We compare the results from different tautomer generation methods in Table 1. First, for reference, we report the total number of tautomers generated by the depth one algorithm, and they are seen to be consistent with Formula 1 and the number of cis-trans tautomers created in post-processing. For example, for structure 3 the algorithm identified M = 3, D = 4, T = 1, and there were no cis-trans isomers, which resulted in 14 tautomers. For 2, the active atoms were 37

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

characterized by M = 2, D = 4, T = 0, which produced 6 tautomers. From these 6 tautomers, 3 additional tautomers were generated due to cis-trans isomerization, thus bringing the total to 9 tautomers. Second, we report the number of relevant tautomers on the basis of their DFT energies, as discussed above, after ranking the full set of tautomers taken from the depth one results. Table 1 reveals that structures like those from Figure 10 typically possess a small number of (no more than five) relevant tautomers, according to their DFT energies. Third, we apply PM7 filtering and report the total number of resulting tautomers, as well as the number of “missing” tautomers, being those relevant tautomers that are not returned. Finally, we report the total number of tautomers, as well as the number of missing tautomers, when using an alternative rules-based method developed internally and based on the ideas published in Ref., 27 referred to as “Tautomerizer”. For the purpose of assessing the PM7 filtering, and to compare with the Tautomerizer, we use the results of the depth one algorithm as a reference (i.e. we treat it as an exhaustive tautomer enumerator) and this is our justification for reporting the number of “missing” tautomers. Of course, there is no guarantee that it will find all the highly-populated tautomers, but nevertheless, we believe the sets it generates are typically comprehensive. The full sets of tautomers generated by the depth one algorithm for each structure in Table 1 are shown in the Supporting Information, with the relevant tautomers highlighted. Note that the depth one algorithm generates several times more tautomers than are energetically relevant or expected from a human perspective, without any obvious omissions. If anything, the algorithm includes active atoms too aggressively. The occasional aggressivity of the depth one algorithm should not present a problem in practice, however. When the PM7 filtering is applied to the generated tautomers, we see that only a small multiple of the total number of tautomers is left, and furthermore, no relevant tautomers are missed. In contrast, the rules-based Tautomerizer misses relevant tautomers in all cases but one. We estimate that “non-local” proton jumps characterizing equilibria such as between the two relevant tautomers of 4 (see Figure 11), correctly identified by our physics-based algorithm, will be particularly challenging for rules-based methods which are local in nature. The output of the depth one (as well as depth zero) tautomer generation algorithm is formally

38

ACS Paragon Plus Environment

Page 38 of 52

Page 39 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 13: The number of tautomers generated by the depth one algorithm for drug-like molecules which are actually not expected to have a large number of tautomers. The purpose of the test is to make sure that an application of the method to such molecules produces a fairly small number of tautomers.

39

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

dependent on the initial input tautomer. That is, if a compound possesses N tautomers Tk ; k = 1..N , then the set of tautomers generated from Tk are not guaranteed to be the same for every k. In practice, the formal dependence of the results on the initial tautomer is not important. What is important is the invariant generation of the high-relevance tautomers (those with relative energies within the 5 kcal/mol window). Even though the latter is not guaranteed either, in practice the output of our depth-1 algorithm varies only in very high-energy tautomers, and all the high-relative tautomers are generated without gaps in all our tests. Figure 12 shows an infrequent case in which one of ten input tautomers (the set was initially generated by the depth one algorithm) failed to produce one of the tautomers (the highest energy one) from the relevant 5 kcal/mol region. The remaining nine inputs did generate the same set of 3 relevant tautomers (although the total number of tautomers was indeed dependent on the input). Remarkably, the prediction of the unusual zwitterionic form 6a as the most stable tautomer in this case actually agrees with the prediction by Katrizky and co-workers 1 made at the time when accurate quantum chemical calculations for molecules of this size could not be conducted. Molecules studied in Table 1 are relatively small. As the number of tautomers grows significantly with the number of detected active atoms, it is important to verify that the depth one algorithm does not produce an unmanageable number of tautomers for larger molecules, such as practically important drug-like molecules. This could occur, for example, if the underlying MPP predicted a large number of active atoms for the atoms that are in reality non-active. Figure 13 gives evidence that our depth one algorithm together with our MPP behave reasonably. Four drug-like molecules (among them a singly charged anion and two singly charged cations) are shown as examples. No more than one or two tautomers are expected by a human to be relevant for each of these structures. The number of tautomers generated in each case is small (the largest number is six), and this number can be further reduced upon the PM7 filtering.

40

ACS Paragon Plus Environment

Page 40 of 52

Page 41 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

6 CONCLUSION We have described a new, general, physics-based method to generate tautomers based on the identification of active protonation and deprotonation sites from estimated micro-pKa ’s, in contrast to other approaches that may rely on the use of hard-coded rules of proton distribution, previously observed tautomerization patterns from a known chemical space, or human input. Our approach is grounded in the computational prediction of physical properties in aqueous solution and is built around physical descriptors and precomputed and experimental pKa data in water, which serves as training data. As opposed to other physics-based methods that are more relevant to the gas phase, 31 the use of the pKa data makes our method applicable for generating tautomers relevant in solution phase. Importantly, our method aims to avoid using Lewis structures. Full independence from Lewis structures, regrettably, could not be achieved, as the concept of bonds and formal charges is implicitly present in several stages of our algorithm via intermediate structures generated by molecular mechanics methods. Nevertheless, the troubles related to mesomeric variations are virtually eliminated, and the serious problem of having to account for alternative resonance structures is absent from our scheme. A disadvantage of our method is that it relies on training data, which necessitates preparation and refinement by a human. However, human involvement is common in empirical methods, so this should not be viewed as a significant drawback. Besides, the generation and entry of new training data which requires human participation has been reasonably automated, and by construction it circumvents the need for managing cumbersome and error-prone proton shift patterns present in other methods. Having identified active protonation and deprotonation sites, we proposed two methods for the subsequent generation of tautomers in a systematic way with minimal bias, and to post-process the results to filter out species expected to have low population. Our most basic method (the depth zero algorithm followed by a PM7 filtering) requires up to several minutes of execution time for a typical large drug-like molecule. For example, an execution of the sorting variant of our algorithm 41

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(the interpolation variant being slightly more expensive) took 330 seconds on a single CPU for the drug dasatinib the molecule of which comprises 59 atoms. A depth zero calculation on each of the smaller molecules from Figure 4.2 on a single CPU took about a minute. This is negligible time if the results of such tautomer generation are to be used in a subsequent DFT-based pKa or similar calculation. Most of the computational time is spent for force field (FF) cleanup of the generated protonated and deprotonated structures, computation of FF atomic charges, and computation of the overlap L across the interpolation set. Only a small fraction of time is spent for the semi-empirical calculations. Importantly, the algorithm can be trivially parallelized across atoms since calculations on the individual atoms are independent. We have not yet attempted this implementation, but it is clear that an execution of depth zero method can take mere seconds for even large drug-like molecules on a handful of CPUs. Unfortunately, the depth zero algorithm does not always produce a desirable set of tautomers. A much better method as far as the completeness of tautomer generation goes, is the depth one algorithm, which is several times more expensive. The execution time of the depth zero algorithm is approximately T1 = T0 Nact where T0 is the execution time for the depth one algorithm for the given molecule, and Nact is the number of active atoms at the depth zero level. Assuming Nact is no more than ten for a typical drug-like molecule, the generation of tautomers at the most reliable depth one level (in combination with PM7 filtering) will take several minutes on a multi-CPU system. Even this increase in computational cost of tautomer generation is still a small fraction of the overall cost of a typical DFT-based workflow to which the tautomers serve as an input. Our approach depends critically on the quality of the micro-pKa predictor (MPP) used to identify active sites. In this work we presented a new method designed specifically for this algorithm, to yield a robust yet efficient estimation of the micro-pKa of every atom in a molecule. It should be possible to speed up the generation of tautomers by designing more efficient MPPs, which are not only faster but also more accurate, thus avoiding the inclusion of unnecessary active sites that generate spurious high-energy tautomers, perhaps using trained neural networks. We leave exploration of these possibilities for future work. We note that the first attempts to use neural

42

ACS Paragon Plus Environment

Page 42 of 52

Page 43 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

networks for predicting pKa ’s have been recently reported. 68,69 An exciting application of this work is to the generation of tautomers to be used in the computation of macro-pKa ’s. We have already applied the methods described here to our own in house macro-pKa prediction workflow. The workflow itself will be presented elsewhere but it is already possible to point out that the depth zero and depth one tautomer generation methods have proved to be more promising and more dependable than rules-based methods that we also experimented with.

7 ACKNOWLEDGMENT We would like to thank Drs. John Shelley, Ryne Johnston, Mats Svensson, Leif Jacobson, and Karl Leswing for useful discussions as well as Dr. Daniel Levine for technical assistance with some details of our algorithm implementation.

8 SUPPORTING INFORMATION Tautomers generated by the depth one method structures 1-8 are presented. The low-energy (relevant) tautomers are indicated.

43

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 44 of 52

References (1) Elguero, J.; Marzin, C.; Katritzky, A. R.; Linda, P. The Tautomers of Heterocycles; Academic Press: New York, 1976. (2) Antonov, L., Ed. Tautomerism: Concepts and Applications in Science and Technology; Wiley-VCH: Weinheim, Germany, 2016. (3) Sayle, R. A. So you Think you Understand Tautomerism? J. Comput. Aided Mol. Des. 2010, 24, 485–496. (4) Warr,

W.

A.

Tautomerism

in

Chemical

Information

Management

Systems.

J. Comput. Aided Mol. Des. 2010, 24, 497–520. (5) Guasch, L.; Yapamudiyansel, W.; Peach, M. L.; Kelley, J. A.; Barchi Jr., J. J.; Nicklaus, M. C. Experimental and Chemoinformatics Study of Tautomerism in a Database of Commercially Available Screening Samples. J. Chem. Inf. Model. 2016, 56, 21492161. (6) Milletti, F.; Storchi, L.; Sforna, G.; Cross, S.; Cruciani, G. Tautomer Enumeration and Stability Prediction for Virtual Screening on Large Chemical Databases. J. Chem. Inf. Model. 2009, 49, 68–75. (7) Glavatskikh, M.; Madzhidov, T.; Baskin, I. I.; Horvath, D.; Nugmanov, R.; Gimadiev, T.; Marcou, G.; Varnek, A. Visualization and Analysis of Complex Reaction Data: The Case of Tautomeric Equilibria. Mol. Inf. 2018, 37, 1800056. (8) Pospisil, P.; Ballmer, P.; Scapozza, L.; Folkers, G. Tautomerism in Computer-aided Drug Design. J. Recept. Signal Transduct. 2003, 23, 361–371. (9) Nemeria, N.; Korotchkina, L.; McLeish, M. J.; Kenyon, G. L.; Patel, M. S.; Jordan, F. Elucidation of the Chemistry of Enzyme-bound Thiamin Diphosphate Prior to Substrate Binding: Defining Internal Equilibria Among Tautomeric and Ionization States. Biochemistry 2007, 46, 10739–10744. 44

ACS Paragon Plus Environment

Page 45 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(10) Katritzky, A. R.; Hall, C. D.; El-Dien, B.; El-Gendy, M.; Draghici, B. Tautomerism in Drug Discovery. J. Comput. Aided Mol. Des. 2010, 24, 475–484. (11) Ramakrishnan, A.; Chourasiya, S. S.; Bharatam, P. V. Azine or hydrazone? The dilemma in amidinohydrazones. RSC Adv. 2015, 5, 55938–55947. (12) Waluk, J. Spectroscopy and Tautomerization Studies of Porphycenes. Chem. Rev. 2017, 117, 2447–2480. (13) Rauf, M. A.; Hisaindee, S.; Saleh, N. Spectroscopic Studies of Keto-enol Tautomeric Equilibrium of Azo Dyes. RSC Adv. 2015, 5, 18097–18110. (14) Graton, J.; Berthelot, M.; Gal, J.-F.; Girard, S.; Laurence, C.; Lebreton, J.; Le Questel, J.-Y.; Maria, P.-C.; P. Nauˇs, P. Site of Protonation of Nicotine and Nornicotine in the Gas Phase: Pyridine or Pyrrolidine Nitrogen? J. Am. Chem. Soc. 2002, 124, 1055210562. (15) Colasurdo, D. D.; Pila, M. N.; Iglesias, D. A.; Laurella, S. L.; Ruiz, D. L. Tautomerism of Uracil and Related Compounds: A Mass Spectrometry Study. Eur. J. Mass. Spectrom. 2018, 24, 214–224. (16) Borbulevych, O.; Martin, R. I.; Tickle, I. J.; Westerhoff, L. M. XModeScore: a Novel Method for Accurate Protonation/tautomer-state Determination Using Quantum-Mechanically Driven Macromolecular X-ray Crystallographic Refinement. Acta Cryst. 2016, D72, 586–598. (17) Bax, B.; Chung, C.; Edge, C. Getting the Chemistry Right: Protonation, Tautomers and the Importance of H Atoms in Biological Chemistry. Acta Cryst. 2017, D73, 131–140. (18) Haranczyk, M.; Gutowski, M. Combinatorial- Computational-chemoinformatics (C3) Approach to Finding and Analyzing Low-energy Tautomers. J. Comput. Aided Mol. Des. 2010, 24, 627–638. (19) Greenwood, J. R.; Calkins, D.; Sullivan, A. P.; Shelley, J. C. Towards the Comprehensive,

45

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Rapid, and Accurate Prediction of the Favorable Tautomeric States of Drug-like Molecules in Aqueous Solution. J. Comput. Aided Mol. Des. 2010, 24, 591–604. (20) Will, T.; Hutter, M. C.; Jauch, J.; Helms, V. Batch Tautomer Generation with MolTPC. J. Comp. Chem. 2013, 34, 2485–2492. (21) Kochev, N. T.; Paskaleva, V. H.; Jeliazkova, N. AmbitTautomer: An Open Source Tool for Tautomer Generation. Mol. Inf. 481-504, 32, 481–504. (22) Guasch, L.; Sitzmann, M.; Nicklaus, M. C. Enumeration of Ring-Chain Tautomers Based on SMIRKS Rules. J. Chem. Inf. Model. 2014, 54, 2423–2432. (23) Urbaczek, S.; Kolodzik, A.; Rarey, M. The Valence State Combination Model: A Generic Framework for Handling Tautomers and Protonation States. J. Chem. Inf. Model. 2014, 54, 756–766. (24) Martin, Y. C. Let’s not Forget Tautomers. J. Comput. Aided Mol. Des. 2009, 23, 693–704. (25) Porter, W. R. Warfarin: History, Tautomerism and Activity. J. Comput. Aided Mol. Des. 2010, 24, 553–573. (26) Thalheim, T.; Vollmer, A.; Ebert, R.-U.; K¨uhne, R.; Sch¨uu¨ rmann, G. Tautomer Identification and Tautomer Structure Generation Based on the InChI Code. J. Chem. Inf. Model. 2010, 50, 12231232. (27) Shelley, J. C.; Cholleti, A.; Frye, L.; Greenwood, J. R.; Timlin, M. R.; Uchimaya, M. Epik: a Software Program for pK( a ) Prediction and Protonation State Generation for Drug-like Molecules. J. Comput. Aided Mol. Des. 2010, 21, 681–691. (28) Chemicalize. ChemAxon (http://www.chemaxon.com), 2018. (29) James, C. A.; Weininger, D. Daylight Theory Manual, Version 4.9. Daylight Chemical Information Systems, Inc.: Laguna Niguel, CA, 2011. 46

ACS Paragon Plus Environment

Page 46 of 52

Page 47 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(30) Cruz-Cabeza, A. J.; Schreyer, A.; Pitt, W. R. Annular Tautomerism:

Experimental

Observations and Quantum Mechanics Calculations. J. Comput. Aided Mol. Des. 2010, 24, 575586. (31) Pracht, P.; Bauer, C. A.; Grimme, S. Automated and Afficient Quantum Chemical Determination and Energetic Ranking of Molecular Protonation Sites. J. Comp. Chem. 2017, 38, 2618–2631. (32) Tadi´c, J. M.; Xu, L. Ab Initio and Density Functional Theory Study of KetoEnol Equilibria of Deltic Acid in Gas and Aqueous Solution Phase: A Bimolecular Proton Transfer Mechanism. J. Org. Chem. 2012, 77, 86218626. ¨ Ozdemir, N.;

(33) Arslan, N. B.; Mu˘glu,

H.

Direct

and

Dayan, O.;

Dege, N.;

Solvent-assisted

Koparir, M.;

Thione-thiol

Koparir, P.;

Tautomerism

in

5-(thiophen-2-yl)-1,3,4-oxadiazole-2(3H)-thione: Experimental and Molecular Modeling Study. Chem. Phys. 2014, 439, 1–11. (34) Cieplak, P.; P. Bash, U. C. S.; Kollman, P. A. A Theoretical Study of Tautomerism in the Gas Phase and Aqueous Solution: a Combined Use of State-of-the-art Ab Initio Quantum Mechanics and Free Energy-perturbation Methods. J. Am. Chem. Soc. 1987, 109, 62836289. (35) Schr¨oder, D.; Budˇesˇ´ınsk´y, M.; Roithov´a, J. Deprotonation of p-Hydroxybenzoic Acid: Does Electrospray Ionization Sample Solution or Gas-Phase Structures? J. Am. Chem. Soc. 2012, 124, 1589715905. (36) Perrin, D. D.; Dempsey, B.; Serjeant, E. P. pKa Prediction for Organic Acids and Bases; Chapman and Hall: London, 1981. (37) Fraczkiewicz, R.; Lobell, M.; G¨oller, A. H.; Krenz, U.; Schoenneis, R.; Clark, R. D.; Hillisch, A. Best of Both Worlds: Combining Pharma Data and State of the Art Modeling Technology To Improve in Silico pKa Prediction. J. Chem. Inf. Model. 2015, 55, 389–397.

47

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(38) Yu, H. S.; Watson, M. A.; Bochevarov, A. D. Weighted Averaging Scheme and Local Atomic Descriptor for pKa Prediction Based on Density Functional Theory. J. Chem. Inf. Model. 2018, 58, 271–286. (39) Bochevarov, A. D.; Watson, M. A.; Greenwood, J. R.; Philipp, D. M. Multiconformation, Density Functional Theory-Based pKa Prediction in Application to Large, Flexible Organic Molecules with Diverse Functional Groups. J. Chem. Theory Comput. 2016, 12, 6001–6019. (40) Citra, M. J. Estimating the pKa of Phenols, Carboxylic Acids and Alcohols from Semi-empirical Quantum Chemical Methods. Chemosphere 1999, 38, 191–206. (41) Jensen, J. H.; Swain, C. J.; Olsen, L. Prediction of pKa Values for Druglike Molecules Using Semiempirical Quantum Chemical Methods. J. Phys. Chem. A 2017, 121, 699707. (42) Helgaker, T.; Jørgensen, P.; Olsen, J. Molecular Elecronic-Structure Theory; John Wiley & Sons: Chichester, UK, 2000. (43) Valters, R. E.; Flitsch, W. Ring-Chain Tautomerism; Plenum Press: New York, 1985. (44) Mchedlov-Petrossyan, N. O.; Cheipesh, T. A.; Vodolazkaya, N. A. Acid-base Dissociation and Tautomerism of Two Aminofluorescein Dyes in Solution. J. Mol. Liq. 2017, 225, 696–705. (45) Malde, A. K.; Stroet, M.; Caron, B.; Visscher, K. M.; Mark, A. E. Predicting the Prevalence of Alternative Warfarin Tautomers in Solution. J. Chem. Theory Comput. 2018, 14, 4405–4415. (46) Bochevarov, A. D.; Harder, E.; Hughes, T. F.; Greenwood, J. R.; Braden, D. A.; Philipp, D. M.; Rinaldo, D.; Halls, M. D.; Zhang, Z.; Friesner, R. A. Jaguar: A High-performance Quantum Chemistry Software Program with Strengths in Life and Materials Sciences. Int. J. Quant. Chem. 2013, 113, 2110–2142. (47) Klici´c, J. J.; Friesner, R. A.; Liu, S.-Y.; Guida, W. C. Accurate Prediction of Acidity Constants

48

ACS Paragon Plus Environment

Page 48 of 52

Page 49 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

in Aqueous Solution via Density Functional Theory and Self-Consistent Reaction Field Methods. J. Phys. Chem. A 2002, 106, 1327–1335. (48) Vanommeslaeghe, K.; MacKerell Jr., A. D. Automation of the CHARMM General Force Field (CGenFF) I: Bond Perception and Atom Typing. J. Chem. Inf. Mod. 2012, 52, 3144–3154. (49) Grimme, S.; Anthony, J.; Ehrlich, S.; Krieg, H. A Consistent and Accurate Ab Initio Parameterization of Denisty Functional Dispersion Correction (DFT-D) for the 94 Elements H-Pu. J. Chem. Phys. 2010, 132, 154104. (50) Lewis, G. N. The Atom and the Molecule. J. Am. Chem. Soc. 1916, 38, 762–785. (51) Stewart, J. J. P. Optimization of Parameters for Semiempirical Methods VI: More Modifications to the NDDO Approximations and Re-optimization of Parameters. J. Mol. Modeling 2013, 19, 1–32. (52) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer New York Inc.: New York, NY, USA, 2001. (53) Anslyn, E. V.; Dougherty, D. A. Modern Physical Organic Chemistry; University Science Books: Sausalito, California, 2006. (54) Orgov´an, Z.; Ferenczy, G. G.; Steinbrecher, T.; Szil´agyi, B.; Bajusz, D.; Kaser˝u, G. M. Validation of Tautomeric and Protomeric Binding Modes by Free Energy Calculations. A Case Study for the Structure Based Optimization of D-amino Acid Oxidase Inhibitors. J. Comput. Aided Mol. Des. 2018, 32, 331–345. (55) Marten, B.; Kim, K.; Cortis, C.; Friesner, R. A.; Murphy, R. B.; Ringnalda, M. N.; Sitkoff, D.;

Honig, B. New Model for Calculation of Solvation Free Energies:

Correction of Self-Consistent Reaction Field Continuum Dielectric Theory for Short-Range Hydrogen-Bonding Effects. J. Phys. Chem. 1996, 100, 11775–11788. 49

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 50 of 52

(56) Cortis, C. M.; Friesner, R. A. An Automatic Three-Dimensional Finite Element Mesh Generation System for the Poisson-Boltzmann Equation. J. Comp. Chem. 1997, 18, 1570–1590. (57) Cortis, C. M.; Friesner, R. A. Numerical Solution of the Poisson-Boltzmann Equation Using Tetrahedral Finite-element Meshes. J. Comp. Chem. 1997, 18, 1591–1608. (58) Stewart,

J.

J.

P.

MOPAC2016.

Stewart

Computational

Chemistry.

ChemAxon

(http://www.chemaxon.com): Colorado Springs, 2016. (59) Gimadiev, T. R.; Madzhidov, T. I.; Nugmanov, R. I.; Baskin, I. I.; Antipin, I. S.; Varnek, A. Assessment of Tautomer Distribution Using the Condensed Reaction Graph Approach. J. Comput. Aided Mol. Des. 2018, 32, 401414. (60) Zhao, Y.; Truhlar, D. G. The M06 Suite of Density Functionals for Main Group Thermochemistry, Thermochemical Kinetics, Noncovalent Interactions, Excited States, and Transition Elements: Two New Functionals and Systematic Testing of Four M06-class Functionals and 12 other functionals. Theor. Chem. Acc. 2008, 120, 215–241. (61) Hansen, P.;

S.Kamounah, F.;

Zhiryakova, D.;

1,1’,1”-(2,4,6-Trihydroxybenzene-1,3,5-triyl)triethanone

Manolova, Y.;

Antonov, L.

Tautomerism

Revisited.

Theor. Chem. Acc. 2014, 55, 354–357. (62) Ivanova, D.; Deneva, V.; Nedeltcheva, D.; Kamounah, F. S.; Gergov, G.; Hansen, P. E.; Kawauchi, S.; Antonov, L. Tautomeric Transformations of Piroxicam in Solution: a Combined Experimental and Theoretical Study. RSC Adv. 2015, 5, 31852–31860. (63) Kawauchi, S.; Antonov, L. Description of the Tautomerism in Some Azonaphthols. J. Phys. Org. Chem. 2013, 26, 643–652. (64) Mardirossian, N.; Head-Gordon, M. Thirty Years of Density Functional Theory in

50

ACS Paragon Plus Environment

Page 51 of 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Computational Chemistry:

an Overview and Extensive Assessment of 200 Density

Functionals. Mol. Phys. 2017, 115, 2315–2372. (65) Kanal, I. Y.; Keith, J. A.; Hutchison, G. R. A Sobering Assessment of Small-molecule Force Field Methods for Low Energy Conformer Predictions. Int. J. Quantum Chem. 2018, 118, e25512. (66) Hidalgo, A.; Giroday, T.; Mora-Diez, N. Thermodynamic Stability of Neutral and Anionic PFOAs. Theor. Chem. Acc. 2015, 134, 124. (67) Kim, V.; Piatkowski, L.; Pszona, M.; J¨ager, R.; Ostapko, J.; Sepioł, J.; Meixner, A. J.; Waluk, J. Unusual Effects in Single Molecule Tautomerization:

Hemiporphycene.

Phys. Chem. Chem. Phys. 2018, 20, 26591–26596. (68) Li, M.; Zhang, H.; Chen, B.; Wu, Y.; Guan, L. Prediction of pKa Values for Neutral and Basic Drugs based on Hybrid Artificial Intelligence Methods. Sci. Rep. 2018, 8, 3991. (69) Zhou, T.; Jhamb, S.; Liang, X.; Sundmacher, K.; Gani, R. Prediction of Acid Dissociation Constants of Organic Compounds Using Group Contribution Methods. Chem. Eng. Sci. 2018, 183, 95–105.

51

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

228x101mm (600 x 600 DPI)

ACS Paragon Plus Environment

Page 52 of 52