Generation of Tautomers Using Micro-pKa's | Journal of Chemical

(26) It can also come from human developers using experience or chemical intuition to favor ..... Good proton acceptors are expected to have relativel...
0 downloads 0 Views 2MB Size
Article Cite This: J. Chem. Inf. Model. 2019, 59, 2672−2689

pubs.acs.org/jcim

Generation of Tautomers Using Micro‑pKa’s Mark A. Watson, Haoyu S. Yu, and Art D. Bochevarov* Schrödinger, Inc., 120 West 45th Street, New York, New York 10036, United States

Downloaded via UNIV OF SOUTHERN INDIANA on July 17, 2019 at 07:59:43 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.

S Supporting Information *

ABSTRACT: Solutions of organic molecules containing one or more heterocycles with conjugated bonds may exist as a mixture of tautomers, but typically only a few of them are significantly populated even though the potential number grows combinatorially with the number of protonation and deprotonation sites. Generating the most stable tautomers from a given input structure is an important and challenging task, and numerous algorithms to tackle it have been proposed in the literature. This work describes a novel approach for tautomer prediction that involves the combined use of molecular mechanics, semiempirical quantum chemistry, and density functional theory. The key idea in our method is to identify the protonation and deprotonation sites using estimated micro-pKa’s for every atom in the molecule as well as in its nearest protonated and deprotonated forms. To generate tautomers in a systematic way with minimal bias, we then consider the full set of tautomers that arise from the combinatorial distribution of all such mobile protons among all protonatable sites, with efficient postprocessing to screen away high-energy species. To estimate the micro-pKa’s, we present a new method designed for the current task, but we emphasize that any alternative method can be used in conjunction with our basic algorithm. Our approach is therefore grounded in the computational prediction of physical properties in aqueous solution, in contrast to other approaches that may rely on the use of hard-coded rules of proton distribution, previously observed tautomerization patterns from a known chemical space, or human input. We present examples of the application of our algorithm to organic and drug-like molecules, with a focus on novel structures where traditional methods are expected to perform worse.

1. INTRODUCTION Tautomerism is a fascinating phenomenon.1−3 It has gained considerable traction in diverse areas of chemical research, for example cheminformatics,4−7 medicinal chemistry,8−11 electronic spectroscopy,12,13 gas-phase spectrometry,14,15 and Xray crystallography.16,17 When it comes to computational modeling of such properties as logD and pKa, one often has to generate tautomers and compute their relative energies (i.e., score them).18−23 These computational operations are necessary because failing to account for the existence of different tautomer forms or to recognize the lowest-energy tautomer in a given solvent can result in serious mispredictions or disagreements with experiment.24,25 As computational prediction of molecular properties becomes increasingly more common and as screening applies to larger and larger sets of organic molecules, automated generation of tautomers becomes practically indispensable. Organic molecules with certain arrangements of double bonds in aliphatic chains and/or conjugated heterocycles may © 2019 American Chemical Society

be prone to tautomerism. The number of plausible tautomers that could be considered for such medium-sized and large molecules can be prohibitively large, numbering in the dozens or hundreds. In practice, however, chemists tend to rely on previous practical experience to consider only a handful of tautomers deemed energetically favorable. Needless to say, “manual” generation of tautomers is unacceptable if one is building algorithms and workflows that are meant to process molecules in an automated fashion. Numerous solutions for tautomer generation have been proposed,6,19,21,23,26 although a complete solution to the problem that is applicable in all contexts of chemical research appears to be elusive. A number of software programs utilize these and other developments to offer tautomer generation (see, for example, refs 27 and 28). Any algorithm for tautomer generation requires knowledge about which protons and Received: December 26, 2018 Published: May 9, 2019 2672

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling molecular sites can be involved in tautomeric transformations. This knowledge can come from built-in rules, such as predefined SMARTS patterns,29 encoding patterns of transformations,6,21,30 or from the transformations built into InChI.26 It can also come from human developers using experience or chemical intuition to favor certain motifs, typically accompanied by quantum-chemical calculations to verify the energetics.19 These approaches are not completely satisfactory, however. The rules-based approaches may fail to generate relevant tautomers if the necessary transformation rule is missing (such as in cases of novel or unusual heterocycles or uncommon types of tautomerism). The second alternative relies on imperfect human intuition and is a barrier to automation. More interesting, therefore, are approaches that determine protonation sites in an unbiased, automated way using physicsbased atom-level descriptors. A recent work by the Grimme group31 reports on one such approach. It utilizes a semiempirical localization of molecular orbitals as a means of locating protonation sites. The method of Grimme and coworkers is currently geared toward generating gas-phase tautomers, although multiple applications have to deal with tautomers that exist in the solution phase. While in some cases the same tautomeric structure has been predicted to have the lowest energy in both the gas phase and solution,32,33 in general the relative energetics of tautomers is phase-dependent.1,34,35 The aim of this work, therefore, was to develop a novel method to generate tautomers in an automated, unbiased fashion without dependence on explicit rules for shifting protons around the molecule. The method should not require user input other than the initial structure, and it should generate tautomers relevant to the solution phase, particularly those that exist in aqueous solutions. Our central idea to achieve these goals is to use micro-pKa’s as the key ingredient for a physics-based approach. Therefore, we first estimate the micro-pKa of every atom in the molecule with a rapid algorithm. Then we use these atomic data to determine the protons’ capacity for dissociation and reciprocally the heavy atoms’ capacity for being protonated. Having identified the atoms we expect to be active in the tautomeric equilibria, we then generate a full combinatorial set of tautomers involving all of these atoms in an unbiased manner. A close relationship between solution-phase micro-pKa’s and tautomeric equilibria has been noted before,6,19 but to the best of our knowledge the idea of tautomer generation driven by a complete set of micro-pKa’s has not been explored. A small digression about the micro-pKa terminology as used in this work is necessary. A micro-pKa is traditionally defined as characterizing a proton exchange equilibrium between two individual tautomers. Statistical averaging over micro-pKa’s that includes all of the relevant tautomers for the given system yields the macro-pKa, which is directly comparable to the equilibrium constant measured by experiment. For more information about the distinction between micro and macro constants see the early text (ref 36) as well as the recent works (refs 37 and 38). Tautomers may possess various conformations, and therefore, equipped with a computational method that operates on three-dimensional (3D) structures, one can compute micro-pKa’s for all of the different conformations for a given pair of protonated and deprotonated tautomers. Thus, a finer gradation of micro-pKa’s that reflects conformational effects is possible. In an earlier work,39 we introduced such a

Figure 1. General organization of the tautomer generation workflow. The micro-pKa assignments were made by the “sorting” method described in section 4.2.

gradation with the notion of a nano-pKa, defined as the pKa computed for a particular pair of protonated and deprotonated conformations. Conformational statistical averaging over nanopKa’s, as described in ref 39, leads to the micro-pKa. In this work, every time we compute a micro-pKa, we technically are computing a nano-pKa, since we are dealing with a particular pair of conformations and do not perform conformational averaging afterward. However, in order to avoid complicating the discussion with the less familiar terminology, we continue to refer to these nano-pKa’s as micro-pKa’s. Obviously, a nanopKa is equivalent to a micro-pKa if we assume the existence of a single conformation for the protonated form and a single conformation for the deprotonated form. We can certainly make this assumption in this work, as it is immaterial for the construction of our pKa prediction methods and allows for simpler terminology. 2673

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 2. Analogy between the complete active space (CAS) of determinants from electronic structure theory and the CAS of tautomers from this work. The mobile (active) units to be permuted to create a CAS are illustrated within the dashed and solid lines (for determinants and tautomers, respectively). In this picture, hydrogens and heavy atoms marked by circles are analogous to electrons and orbitals, respectively. Detailed notation for active atoms is shown in Figure 3.

The choice of method to evaluate the micro-pKa’s is arbitrary (if it is sufficiently accurate), but for practical utility it is desirable to be able to compute them as quickly as possible. Methods for rapid micro-pKa predictions have been reported before.19,40,41 However, these methods have been designed for, or at least tested on, narrow chemical spaces (typically common functional groups like amines or carboxylic acids). For our purposes we are seeking a method that in principle would be applicable to any type of atom or functional group, including those not typically perceived as prone to protonation/deprotonation. Our workflow uses a data-driven approach to efficiently evaluate the necessary micro-pKa’s as weighted averages over known reference micro-pKa’s constituting a training set, which in the following we shall call the interpolation set (see section 4.2 for full details). It turns out that naively inspecting the micro-pKa’s of every atom in the input molecule is insufficient to adequately classify whether the atoms are involved in the tautomeric equilibria, and this is not simply an artifact of inaccurate pKa’s. In section 3, we therefore explore in detail two progressively sophisticated algorithms for classifying tautomerically active atoms according to pKa data. To avoid bias, we then generate a combinatorial space of all tautomeric forms from all of the active tautomeric atoms, which we call a complete active space (CAS) of tautomers, as discussed in section 2. However, this space is typically much larger than the space of low-energy (or relevant) tautomers. We therefore also explain how we can efficiently “filter” them using a semiempirical quantumchemical method that is reliable enough to flag very high energies and therefore certainly irrelevant tautomers. Figure 1 gives a high-level illustration of our workflow, and section 5 reports the results of our algorithm when used in practice. Here we stress that the micro-pKa predictor should be viewed as a “pluggable” component of the tautomer generation method. Micro-pKa’s can be predicted by approaches alternative to those described in this work without compromising the main idea behind the scheme shown in Figure 1.

We hope that it is clear from the above discussion that our approach is indeed physics- rather than rules-based and should provide a less biased foundation for tautomer generation than other options. It is true that some human expertise goes into constructing and balancing the interpolation set, which in turn potentially biases which atoms are identified to be active in the tautomeric equilibria. However, the tautomer generation itself is carried out in an objective way. This situation can be likened to the development of new density functional approximations in quantum chemistry. Although a training set of molecules is typically required to fit the parameters in the functional and is chosen in a more or less subjective way to “represent” chemical space, the subsequent calculations utilizing these functionals are generally viewed as entirely objective.

2. COMPLETE ACTIVE SPACE As explained in the Introduction, our method is designed to achieve unbiased generation of tautomers, and to simplify the discussion we introduce the concept of a complete active space of structures. The idea is borrowed from electronic structure theory, where for a given set of one-electron orbitals, the exact many-electron wave function may be expressed as a linear combination of all possible Slater determinants constructed from a combinatorial distribution of all of the electrons among all of the orbitals. The set of determinants thus generated may be systematically truncated by selecting only a subset of the electrons or orbitals to combine. The resulting combinatorial subset of determinants is known as a complete active space of determinants. The truncation allows one to perform calculations that would otherwise be impractical because of the very large number of determinants in the exact expansion. Compared with other approximations, the great advantage of this truncation is that it avoids putting bias toward any one determinant but instead treats a selected group on an equal footing.42 Similarly, one can imagine the (whole) space of all possible tautomers for a given molecule as the set of all ways in which the molecule’s protons can be combinatorially distributed among the molecule’s heavy atoms (attachments of protons to 2674

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

NH2, or can accept two extra protons and become NH3+. Figure 3 provides a graphical definition of doublet and triplet sites.

heavy atoms can be limited only by the heavy atoms’ maximal valencies based on the octet rules, such as four for nitrogen). Like the full space of determinants in electronic structure theory, this full space of tautomers may be too large to be manageable in practice and in any case is of little or no utility since it is only the relatively low-energy structures that we invariably care about (for a drug-like molecule with overall formula C21H18F3N3O5, more than one billion tautomers can be generated this way, but only a handful or so will be energetically interesting). The key question is how to systematically truncate this space of tautomers with minimal bias, and our answer is obvious from the analogy we have just described. Equating protons to electrons and heavy atoms to orbitals, we choose some physically motivated subset of each and form a CAS of tautomers. Figure 2 illustrates the concept. Indeed, as already mentioned, our physics-based approach to tautomer generation is to use the micro-pKa’s of the individual atoms to classify which protons and heavy atoms should be active in the tautomeric equilibria. At this stage of the discussion, it is appropriate to note that our definition of tautomers purposefully avoids considering energetic barriers between any pair of tautomers. There are at least three reasons for that. First, there is no established or agreed upon hard divide on the energy scale between two proton distributions that will classify the pair as either two tautomers or, more generally, as two isomers. Obviously, the setting of such a divide would be made practically useless by a change in temperature. Second, if we were to explicitly consider energy barriers separating one protomer distribution from another, we would have to know or compute the potential energy surface (PES) that encompasses the structures, which is certainly an impossible task, especially for the simple algorithm for tautomer generation that we seek. Third, even if some approximate estimate of the energy barrier could be found without constructing the whole PES, for example by finding a transition state on a linear path of proton transfer in internal coordinates, complex mechanisms of proton exchange in water would undermine the simplistic estimate. Even though our discussion will gloss over the estimation of energy barriers, the construction of our algorithm is based on chemical principles. Thus, it is expected to result in the generation of tautomers that are likely to interconvert in solution. Now it is necessary to discuss some particulars of the distribution of a molecule’s protons among its other atoms and to formalize the terminology that we are going to use. It will be convenient to refer to a set of the molecule’s detachable protons and protonatable sites as “active atoms”. Furthermore, in order to count the number of possible tautomers that the molecule can have, we will need a special nomenclature for the types of sites among which the protons can be distributed. We distinguish between three types of sites: singlet, doublet, and triplet (not to be confused with the analogous terms referring to spin multiplicity). Singlet sites can exist in only one state. They cannot accept or lose protons. The carbon atom in CCl4, as well as every chlorine atom in that molecule, is a singlet site. Doublet sites can exist in two possible states that differ by a single proton. A good example of a doublet site is the carbon atom in trinitromethane, CH(NO2)3: it can be in either the CH or C− state. Triplet sites can exist in three states: a base state or states with one or two extra protons added to it. A typical example of a triplet site is an NH− group, which can exist on its own, can accept one extra proton and turn into

Figure 3. (a) Graphical notation for active protons and active sites used in other figures in this work. X symbolizes any element other than H and is in practice usually C, N, O, or S. (b) Examples of doublet and triplet active sites.

In the following discussion, the total numbers of doublet and triplet sites in a molecule are denoted by D and T, respectively. Because singlet sites do not change their protonation state, they are of no significance in accounting for the total number of possible tautomers, so the total number of singlet sites does not need a special symbol. The total number of active protons in the molecule is denoted by M. If for simplicity we ignore the symmetry, stereoisomers, and other 3D effects, we can write down the formal total number of tautomers, Ntaut, for the specified values of M, D, and T as follows: ij D + T − ζ yzjiT zy zzjj zz zzj ζ z M 2 ζ − ζ= 0 k {k {

∑ jjjj T

Ntaut =

(1)

where the round brackets denote the binomial coefficient. In general, given a molecule, it is clearly hard to establish which of its hydrogen atoms should be distributed among which heavy atoms without the danger of spawning a huge number of tautomers that play essentially zero role in actual equilibria or neglecting potentially important ones. One might consider using predefined patterns for known functional groups and heterocycles to truncate the space, but these might miss important tautomers if the patterns are applied to a novel chemical structure (which is often the situation when searching for candidates in drug discovery, for example). Restricting the types of elements to be considered in tautomer generation is also fraught with problems. For example, considering only nitrogens and oxygens as possible hydrogen acceptors would be problematic because there are numerous examples of carbon atoms actively participating in tautomeric equilibria and, conversely, multiple cases where oxygen and 2675

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 4. Illustration of the concepts of the RpH range and classification of heavy atoms as proton donors and proton acceptors.

depth zero method will offer a reasonable approximation to classify the active atoms. To understand better the truth of this statement, we shall first consider an example where the depth zero algorithm successfully produces a reasonable set of tautomers. Next, however, we will explain a situation in which the simple method fails to generate some important classes of tautomers, fueling the need to go beyond the zeroth-order method. Consider the molecule 2-amino-4-hydroxypyridine, for which our micro-pKa predictor (MPP) might select the active protons and active sites as shown in Figure 5. These are actual sites selected by one of the MPP “flavors” described in section 4. Here we have M = 2, D = 2, and T = 1, giving Ntaut = 4 tautomers, which are also shown in Figure 5 (for the notation, see eq 1). For now we can ignore the possible cis−trans isomerism that is of relevance to one of the tautomers, the one with the imino group. The cis−trans isomerism is a relatively easy problem that can be handled in postprocessing. The four tautomers generated as a result of identification of active protons are a product of an exhaustive distribution of the two mobile protons according to the algorithm outlined in the beginning of this section and are as expected. However, the molecule that will be discussed in the next example reveals a problem in the depth zero algorithm: it misses some tautomers that would be expected by a human, even if the micro-pKa’s of the initial molecule are computed accurately. First we show the example and discuss the reason behind the failure, and then we propose an improvement to the algorithm that should remedy the problem. Consider the molecule 2-amino-4-hydroxypyridinium (which is a protonated form of the molecule from Figure 5). The left part of Figure 6, corresponding to an application of the depth zero algorithm, indicates the molecule’s active hydrogens and active sites, as might be suggested by an MPP. All of these suggestions for which atoms should be proton donors and proton acceptors are seemingly reasonable. The hydroxyl group’s proton is dissociatable, although the group’s oxygen cannot be protonated. This agrees with our expectations for phenol-like compounds. The pyridinium nitrogen can release its proton, but it cannot be protonated. This is also in line with expectations for pyridinium-like structures. Finally, it is imaginable that one of the hydrogens of the amino group is liable to dissociation (especially while the heterocycle is protonated). Protonation of the amino group would create a doubly protonated species, which is unlikely to happen in a common pH range. Even though all of the active atoms of 2amino-4-hydroxypyridinium are as expected and apparently need not be augmented with other active atoms, these active

nitrogen atoms are essentially inert toward hydrogen attachment, as in the nitro group. The next section is dedicated to our approach for meeting this challenge, that is, to the task of selecting the active atoms using micro-pKa’s.

3. SELECTION OF ACTIVE ATOMS VIA pKa Our main idea behind the selection of the active sites and active protons can be stated as follows: the most likely protonation/deprotonation sites and active protons are those that have favorable micro-pKa’s. These favorable micro-pKa’s are associated with the region of common pH values accompanying practically important proton equilibria in water, taken in this work to be between 0 and 14 (range RpH). In the following, we shall refer to two algorithms for classifying active atoms on the basis of this idea: the depth zero algorithm and the depth one algorithm. The first directly classifies each atom according to its associated micro-pKa, but this turned out to be inadequate in practice, which led us to the development of the depth one algorithm. Both are explored in detail below. 3.1. Depth Zero Algorithm. In this simple algorithm, we define a heavy atom as a proton acceptor if the micro-pKa of its protonation is larger than the minimum value of RpH. Similarly, a heavy atom can be a proton donor if the micro-pKa of its deprotonation is lower than the maximum value of RpH. Figure 4 illustrates these concepts. In turn, the protons that are attached to proton donors should be considered active protons. If two or more protons are attached to the proton donor, then only one of these protons should be considered an active one. The other ones will lead to equivalent tautomers later in the algorithm anyway and can be disregarded. Finally, if the active site can be both a donor and an acceptor, it is classified as a triplet site, and if it is a donor but not an acceptor (or vice versa), it is classified as a doublet site. The name of the algorithm was chosen to emphasize that we are identifying active atoms (and hence tautomers) of the input molecule by considering only the properties of the molecule itself and not its protonated or deprotonated forms in addition. In contrast, we will see in section 3.2 that the depth one algorithm lifts this restriction. It should be noted, for example, that the depth zero method cannot identify triplet active sites that are formed by consecutive double protonation or deprotonation of a heavy atom. Fortunately, double subsequent protonations and deprotonations are rare in practice unless the initial structure is an uncommon protonation form. Therefore, if we assume that we can predict the micro-pKa of every atom in the molecule sufficiently accurately for our purpose, then we might hope that the simple 2676

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 5. The molecule 2-amino-4-hydroxypyridine with its active protons and active sites marked according to the results produced by a micropKa predictor (MPP) and using the depth zero tautomer generation algorithm. The active sites are marked by numbers (in italics for triplet sites), and their proton occupancies are shown in square boxes. Only one hydrogen of the amino group needs to be marked as an active hydrogen, with the other being equivalent, as explained at the beginning of section 3. The green square indicates the tautomer and its mobile elements that are ready for generation of the final set of tautomers by distributing the active hydrogens across the doublet and triplet sites.

resulting deprotonated structures by repeatedly applying the MPP and identifying any new active protons and active sites as before. In this example, it turns out that after removal of any of the protons, the amino nitrogen atom is identified as a proton acceptor by the MPP. Therefore, unlike the depth zero algorithm, this site must be classified as a triplet site instead of a doublet site. Thus, the total number of active atoms is still six but now consists of three active protons, two doublet sites, and one triplet site. It should be noted that the addition of the triplet site is sufficient to generate the tautomer that was manifestly missed by the depth zero algorithm, showing the utility of the new method. This result is illustrated in Figure 6, where the structure in the green rectangle on the right side represents the final classification of active atoms. Distributing the active protons among the active sites yields three tautomers (for M = 3, D = 2, and T = 1), as shown in Figure 7. At this stage one could consider repeating the iteration again on the latest set of active atoms, applying the MPP, and searching for any additional changes to the classification of atoms and active sites. The totality of active atoms should then be augmented with any new atoms that are found, and this process should be iterated until the number of active atoms becomes constant. Thus, we go from depth zero to depth one in an iterative fashion until convergence is reached. The formal description of the algorithm can then be outlined as follows: 1. Read in a single molecule X of charge C. 2. Initialize empty sets {S} and {P} corresponding to the active heavy atoms and active protons, respectively. 3. Generate the micro-pKa’s for all of the atoms using an MPP. 4. Classify the active protons and active sites for the pKa’s that fall within the range RpH (typically 0 to 14). 5. Update the sets {S} and {P} with the newly found active sites and active protons, respectively. 6. For each proton M in set {P}: (a) Remove M to create molecule XM of charge C − 1. (b) Repeat steps 3 and 4 for XM. (c) Update the sets {S} and {P} with the newly found active sites and active protons, excluding the site of attachment of M when updating {S}. 7. For each active site S in set {S}: (a) Protonate S to create molecule XS of charge C + 1. (b) Repeat steps 3 and 4 for XS.

atoms lead to the generation of a single tautomer only, as shown in the left part of Figure 6. There are three protons to distribute among three active sites, and there is only one combinatorial outcome. It is obvious, however, that a significant tautomer, with the amino group protonated and the pyridinium nitrogen deprotonated, is absent from the output of the depth zero algorithm (see Figure 6). The missing tautomer would have been generated if the amino group’s nitrogen in the original tautomer had been identified as a triplet site, not as a doublet site. However, we have already admitted as reasonable the fact that the amino group of the original tautomer is likely only to be deprotonated and not protonated while a proton is attached to the pyridinium nitrogen. If the proton could be lifted from the pyridinium nitrogen, the amino group would be recognized (by our MPP) as protonatable, and the pyridinium group could indeed be deprotonated, as already indicated in Figure 6. The problem with the depth zero algorithm is that it does not have enough depth to consider protonation of atoms as the other deprotonatable active sites shed their protons, and likewise, it does not consider removal of protons as the other protonatable sites gain protons. To resolve this situation, one needs to go one level further in exploration of possible protonations or deprotonations in order to better identify the active atoms. This task is left over to the depth one algorithm. We must conclude that even though the depth zero algorithm is a reasonable start for tautomer generation in some situations, such as for 2-amino-4-hydroxypyridine, it is not sufficient in other situations. Application of the depth zero algorithm to diverse tautomerism-prone structures confirms this conclusion. 3.2. Depth One Algorithm. It appears more pedagogical to explain the operation of the depth one algorithm with an example and only then to formulate it rigorously. The right side of Figure 6 illustrates the application of the approach to 2amino-4-hydroxypyridinium, the molecule for which the depth zero algorithm was shown to be unsatisfactory. In the depth one scheme, we start off with an input molecule and initially identify its active protons and active sites with the depth zero algorithm. However, in contrast to the depth zero algorithm, we also explore the protonation/deprotonation equilibria after additional protonation or deprotonation of the original molecule. For 2-amino-4-hydroxypyridinium, no atoms can be protonated according to the results of the depth zero algorithm, so we have only the choice to deprotonate atoms. There are three detachable protons, and we remove these one by one. We then explore the proton microequilibria in the 2677

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 6. Illustration of the (left) depth zero and (right) depth one algorithms for generation of active atoms. The green border in each panel indicates the tautomer and its mobile elements that are ready for generation of the final set of tautomers by distributing the active hydrogens across the doublet and triplet sites.

selecting one or more of its stable conformers before proceeding to the next steps. A few words must be said about ring−chain tautomerism.22,43−45 This relatively uncommon but still important type of tautomerism is actually already addressed by the algorithms described above as far as the identification of the active atoms goes. However, the 3D structures of the generated proton distributions corresponding to ring−chain tautomerism might be unphysical because some of their bond lengths in the 3D structures generated by force fields might be too long or too short. In order to generate stable, normal-looking structures of these ring−chain tautomers, it might be necessary to combine the proton-distributing algorithms with an exploration of the conformational degrees of freedom followed by a quantumchemical geometry optimization. Such workflows would have the possibility to form and break rings according to the stability of the initial proton distributions. We have checked that such an approach works and successfully produces a ring−chain tautomer with a satisfactory geometry for a single uncomplicated molecule. This effort should be regarded as a proof of principle only. Another solution for ring−chain tautomer generation would be to detect certain types of atoms in the generated tautomers (such as protonated carbonyl groups and deprotonated hydroxyl groups) that are capable of forming five- and six-membered rings upon covalent bonding. This solution is less attractive conceptually because it requires that we handle Lewis structures and atom types, but it might be an effective postprocessing step in practice that does not threaten the overall integrity of the tautomer generation methods based

(c) Update the sets {S} and {P} with the newly found active sites and active protons, excluding site S when updating {S} and the proton that was added to that site when updating {P}. 8. Repeat steps 6 and 7 until {S} and {P} do not change. 9. Distribute all of the protons from {P} across all of the sites from {S} so that the total adds up to Ntaut from eq 1. 3.3. Possible Generalizations and Improvements. It should be straightforward to generalize the algorithm given in the previous section to one more level of depth. Such an algorithm would not iterate only over all of the protons (as in step 6) or all of the active sites (as in step 7) but would also go over all pairs of protons and active sites in a similar fashion. One might also add an extra loop inside steps 6a and 7a to create structures with charges C − 2 and C + 2, respectively. In other types of generalizations, one might consider quadruplet active sites in addition to triplets and doublets. Another generalization would consist of applying either the depth zero or depth one algorithm to all of the generated tautomers in an iterative fashion until no new tautomers are produced. We did not attempt to implement and test these generalizations, leaving their investigation for future work. An exploration of conformational space might be important for the generation of some tautomers, for instance those with hydroxyl groups attached to aromatic rings. This operation can be included in our tautomer generation workflow by launching a conformational search on every tautomer generated and 2678

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 7. Set of tautomers of protonated 2-amino-4-hydroxypyridine produced by the depth one algorithm (see Figure 6 for the set of tautomers produced by the depth zero algorithm, inadequate for this case). The active sites are marked by numbers (in italics for triplet sites), and their proton occupancies are shown in square boxes. The green square indicates the tautomer and its mobile elements that are ready for generation of the final set of tautomers by distributing the active hydrogens across the doublet and triplet sites.

on micro-pKa’s. A more thorough exploration and validation of ring−chain tautomer generators is needed in a future work.

all atoms in a molecule. The Jaguar pKa program embodies an elaborate workflow that combines DFT calculations (using B3LYP as the density functional and PBF as the implicit solvent model) with empirical corrections. The latter are fitted to experimental pKa values for a large number of functional groups and compensate for the deficiencies of DFT energetics in the gas and solution phases. The workflow has been shown to give accurate pKa predictions for diverse types of structures (including large and flexible drug-like molecules), with the errors lying typically within 0.5−1.0 pKa units.38,39 However, there are two objections to using Jaguar pKa for the purposes of generating tautomers. First, Jaguar pKa would not be useful for routine and practical generation of tautomers because the DFT calculations that it performs (even though they are highly parallelized) may take hours for a single micropKa prediction for a large drug-like molecule. The second problem is that the parametrization and training sets on which Jaguar pKa relies have often not been designed with a view to accurately handle unusual protonation and deprotonation processes that accompany micro-pKa prediction for all of the atoms in a molecule. Therefore, the accuracy will be highly variable and possibly inadequate for all atoms and tautomers, even though it is very good in many situations.38,39 As an alternative to expensive ab initio methods like DFT, we might consider an empirical-rules-based or atom-type-based MPP. Earlier in the text we warned about the dangers of these approaches. Judging from the way such methods are realized in practice, they are likely to work satisfactorily on well-known structure types or a subset of micro-pKa’s but to underperform or even fail on less common ones because of an absence of parameters or rules for some types of atoms or functional groups. A more serious flaw in approaches based on atom typing is that they struggle with the problem of alternative mesomer (or resonance) structures. Ideally, atom parameters should be assigned regardless of the molecule’s resonance structure (the way the double bonds and formal charges are distributed across the atoms). Figure 8 shows a couple of typical mesomeric complications that might lead to different assignments of parameters (such as atomic radii, atomic charges, or micro-pK a ’s) depending on which of the mesomeric representations of the molecule is passed to the atom-typing code. Importantly, there is no “right” or “better” resonance structure from the quantum-chemical point of view; it is more appropriate to talk about a more “conventional” or “familiar”

4. THE MICRO-pKa PREDICTOR The implementation of the tautomer generation algorithms discussed in the previous section is contingent on the availability of a method that would predict micro-pKa’s for all of the atoms in the molecule and thus identify its active atoms. For a molecule with i heavy atoms and j hydrogen atoms, we are seeking i + j micro-pKa’s corresponding to protonation of every heavy atom and removal of every hydrogen atom. The algorithms would still be valid if the MPP could act on just a few of the atoms in the molecule. The protons and sites with the unidentified micro-pKa’s could then be regarded as not participating in the protonation/ deprotonation equilibria. However, when designing our tautomer generation algorithms, we tried to avoid this simplification. Additionally, because of the large number of micro-pKa’s that need to be computed in a typical workflow involving druglike molecules, it is desirable to have an MPP with a relatively low computational cost. Finally, the method should have sufficient accuracy to reliably classify every atom as either active or inactive for the purposes of tautomer generation. Fortunately, there is some built-in tolerance to the accuracy we require from the MPP. Clearly, the danger of wrongly classifying an atom is larger if its true micro-pKa lies near the minimum or maximum values of RpH. However, it is worth recognizing that some atoms are more important to classify correctly than others if we are to avoid missing the dominant low-energy tautomers, which is the prime goal of the tautomer generator. Good proton acceptors are expected to have relatively high micro-pKa’s for protonation, while good proton donors are expected to have relatively low micro-pKa’s for deprotonation. This should be clear from Figure 4. Therefore, it is the weak proton donors or acceptors that are more likely to be misclassified, and these are much less likely to contribute to the low-energy tautomers. In short, unless the errors of the approximate micro-pKa prediction are rather large, the influence of misclassification for the generation of the highest-populated tautomers should be negligible. Let us first consider using our DFT-based Jaguar pKa program as an MPP. The Jaguar pKa program,38,39,46 being a hybrid DFT−empirical approach47 that covers a large chemical space, can conveniently predict the micro-pKa’s of most, if not 2679

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling LA = [MA , qA ]

(2)

Here qA is simply the partial charge on atom A, which can be computed by a variety of methods. In the present work, it was computed by the molecular mechanics method OPLS3e.49 The crux of the LAD is the vector MA, which we call the multidimensional coordination number (MCN), MA = [CNA(H), CNA(He), CNA(Li), ...]

(3)

in which there is one component for each chemical element. In our nested LAD variant, each of these components is also a vector, resolving the coordination over every individual pair of atoms in the molecule. For example, the vector CNA(H) would be written as

Figure 8. Examples of different mesomeric representations of the same molecule [an asymmetric guanidine in (a) and 4-pyridone in (b)] that might lead to different assignments of atom parameters in atom-typing approaches if the resonance structures are not explicitly treated as equivalent with an additional piece of code.

CNA(H) = [CNA (H1), CNA (H 2), ..., CNA (HN )]

(4)

for every hydrogen atom Hi, not including atom A, where the individual (scalar) coordination number CNA(B) for a given atom pair is defined by ÅÄ ÑÉÑo l o ÑÑ| 1 o ÅÅÅÅ k 2(RA + RB) ÑÑo = 1 + expm − k − 1 Å 1Å o o ÑÑ} o ÅÅÇ CNA (B) r ÑÖo AB (5) n ~

resonance structure. The conventionality, or appropriateness, of resonance structures can be determined, for example, on the basis of chemical intuition about where formal atomic charges should be placed or following a computation of partial atomic charges, which are not physical observables. Atom-typing codes deal with the problem of mesomers in several ways, for example by canonicalizing resonance structures or by making rules for formal equivalence for some of them (see, for example, the complicated rules pertaining to resonance structures in an atom-typing algorithm from ref 48). Although such solutions work for anticipated types of structures, it is usually not difficult to break them for less conventional ones. This can lead to different results for different resonance structures used as input. Because of the problems indicated in this and the previous paragraphs, atom-typing approaches do not look attractive to us. For the purposes of the current work, we therefore decided to avoid atom typing and to develop our own MPP designed with the following criteria in mind: it should have a relatively low computational cost, be applicable to all atoms (typically for drug-like molecules), and have reasonable (but not necessarily high) accuracy with low variance (i.e., it should be robust toward different charge states, elements, and functional groups). Clearly this is a challenging research problem in its own right, so we decided to leave the task of perfecting the MPP until future work and for now set the objective of constructing a reasonably well performing engine that is sufficient to illustrate the utility of the proposed tautomer generation algorithms. In section 4.2 we describe in detail our new MPP method, and in section 5 we present results from its application to tautomer generation. 4.1. Local Atomic Descriptor. The MPP used in this work depends crucially on the existence of a similarity metric that we can use to compare an unknown pKa process to a training set of reference data. In this section we describe a key component of this metric, which we call a local atomic descriptor (LAD), that serves the purpose of comparing the 3D geometric environments of dissociating protons in two pKa processes and also includes some information about the electronic similarity. A description of the LAD along with the motivation for its construction and examples of its use has already been provided in our earlier work.38 In the present work we utilized a slightly modified variant of the LAD, the socalled nested LAD, a brief account of which is given below. The local atomic descriptor for atom A consists of a pair of quantities:

in which the adjustable parameters k1 and k2 are preoptimized on a training set, RA and RB are preset atomic radii, and rAB is the interatomic distance between A and B. The coordination numbers in the form given by eq 5 were originally developed by Grimme and co-workers50 and are an important part of their postenergy correction to density functionals known as the D3 correction. The regular LAD used in our previous work is related to the nested LAD by summing up all of the scalar coordination numbers for each element. An advantage of the nested LAD is that it provides a higher “resolution” description of the local environment. Importantly, all of the components of the LAD have no formal dependence on the interatom connectivity (also known as single, double, or triple bonds) or formal charges. This attractive feature makes the LAD independent of the Lewis (resonance) structure51 typically set by the user or a software program. Different representations (Lewis structures) of the compound should produce exactly the same LADs for the structure’s atoms, provided that the internal coordinates are conserved. It is worth noting that modern software, if passed different two-dimensional (2D) Lewis representations of the same structure, will normally generate 3D structures with slightly different internal coordinates. This is the case because the varying connectivity (e.g., a single or double bond) between the same atoms in the 2D structure leads to different bond lengths and angles in the 3D structure as assigned by the molecular mechanics component of the software. A quantummechanical or semiempirical geometry optimization in the gas phase usually generates the same internal coordinates for all of the mesomeric structures of concern and thus resolves the problem. In any case, small deviations in the internal coordinates will lead to small differences in the LADs, which are smooth and differentiable (with respect to atomic displacements). Having introduced the descriptor itself, we can now define an “overlap” or “similarity” between two descriptors LA and LB, where atoms A and B, designating local environments, can reside either in the same structure or in two different structures. For two LADs LA and LB, which can be associated with atoms from the same structure but are usually associated 2680

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

consider the deprotonation process XH = X− + H+. In this case we compute the LAD at X (while X is attached to H), denoted as Lx. Next we compute the set of overlaps {Lix} (i = 1, ..., N) of Lx with the N LADs from the interpolation set using the overlap metric given in eq 6. It turns out that even though the LAD already includes atomic charge, the single partial atomic charge is not very effective at describing the electronic environment when the charge can be delocalized through the resonance effect. For example, LADs alone cannot effectively distinguish structures like the two shown in Figure 9. Therefore, for the development of our MPP, we introduce two new descriptors to augment the LAD and define the similarity metric.

with atoms from two different structures, we define the overlap as LAB = exp[−wM(MA − MB)2 − wq(qA − qB)2 ]

(6)

where the adjustable parameters wM and wq control the interplay between the geometric and charge components of LA and LB. To evaluate this expression in practice, we sort the vectors of coordination numbers for each element from largest to smallest to remove any arbitrary dependence on atom labeling. Denoting the sorting operator by Ŝ , we have (MA − MB)2 =

∑ (S[̂ CNA(E)] − S[̂ CNB(E)])2 E

(7)

where E ∈ {H, He, Li, ...} and CNA(E) − CNB(E) =

∑ |CNA(Ei) − CNB(Ei)| i

(8)

in which the shorter vector on the right hand side is padded with zeroes. 4.2. Interpolation Weights and the pKa Estimator. As mentioned in the Introduction, the MPP used in this work is based on an empirical-data-driven approach to estimate micropKa’s as weighted averages over known reference values (see eq 12). The weights are derived from a similarity metric used to compare the dissociation process of interest to those in the reference data. The metric is defined using not only local information from LADs but also nonlocal descriptors because protonation equilibria are often dependent on an interplay between the local capacity for protonation or deprotonation and nonlocal energetic stability. While the local descriptor helps establish the similarity of 3D geometries in the vicinity of the dissociating proton, the nonlocal descriptors capture longdistance effects, particularly the resonance effect and stabilization by aromaticity. The influence of the inductive effect is actually captured by the LAD through the built-in partial atomic charge. Below we describe precisely how this similarity metric is defined and how the final micro-pKa’s are constructed in practice. It should be noted that we will refer, for example, to a deprotonation process using the notation “AH = A− + H+” where AH is the molecule of interest and A is strictly a molecular fragment. For notational convenience, however, we shall also refer to A as being the “heavy atom” attached to the dissociating proton, to be used as the reference atom for LAD generation, for example. The interpretation of A as a fragment or atom should be clear from the context. Our set of reference data, or interpolation set, is explained in detail in the following section. In summary, it includes N data points, where N is currently 20 201 (although this number is expected to grow in future versions of the program as we add new data) and each data point describes a deprotonation process AH = A− + H+ for a particular molecule AH and heavy atom A. For some molecules, data points are available for multiple atoms. In addition to a reference micro-pKa value (taken from experiment or high-quality theoretical data), each data point has two descriptors: (1) the LAD for AH computed at the heavy atom A and (2) the gas-phase energy difference ΔE = E(AH) − E(A−) computed at the semiempirical PM7 level of theory.52 In our tautomer generation algorithms, we need to compute the micro-pKa for every atom in the input structure, but to describe the implementation of our MPP, it is sufficient to explain how this is done for one heavy atom X. Let us first

Figure 9. Two molecules with very similar 3D environments (around the dissociating hydrogen atom, shown in orange) for which the delocalized positive charge cannot be effectively distinguished by LAD alone without assistance from the additional energy descriptor ΔE and the damped total charges from eq 9. Upon dissociation, the molecule on the left produces an aromatic species and thus is expected to have a large negative pKa, whereas the molecule on the right is an aliphatic species with a large positive pKa.

As the first additional descriptor, we define the following damped total charge for an arbitrary atom A: qB QA = ∑ −sq(R − rAB) (9) B≠A 1 + e where the sum goes over all atoms B in the molecule having partial atomic charges qB, R is a distance parameter (here, R = 3 Å), and rAB is the distance from the reference atom A to atom B. The adjustable parameter sq is defined below. The idea is that QA contains more nonlocal information than the simple atomic partial charges. We then define a new quantity ΔQix as the magnitude of the difference between these charges for atom X in the input molecule and the heavy atom associated with data point i in the interpolation set: ΔQ ix = |Q x − Q i|

(10)

Finally, for an additional global descriptor, we compute the PM7 gas-phase energy difference ΔEx = E(XH) − E(X−) and compare it to those in the interpolation set to define the quantities ΔΔEix = |ΔEx − ΔEi|

(11) −

where ΔEi is the energy difference E(AH) − E(A ) stored for every data point i in the interpolation set. We have found that the additional descriptors (10) and (11) are very important for describing changes in stability of molecular species upon deprotonation in cases where the deprotonation leads to significant changes in stability associated with resonance or aromaticity but where the local 3D geometries are very similar. We now have all of the components we need to estimate the micro-pKa for the deprotonation process XH = X− + H+ using 2681

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling the weighted average over micro-pKa’s from the interpolation set according to the equation

calculation of the weights using eq 13 and simply select the pKa corresponding to the top match from the interpolation set. For example, we have found the following progressive sorting (or filtering) procedure to be efficient for the interpolation set consisting of over 20 000 data points: (i) sort all matches by the L overlap and select the first 100 top matches (those that have the largest values of L); (ii) within these 100 matches, select the 30 matches that have the smallest ΔQ value associated with eq 10; (iii) within these 30 matches, select the one that has the minimal value of (ΔΔE)(ΔQ)/L; and (iv) take the pKa of this single match as the final result. This approach will be called the “sorting method”. It is easy to interpret the final pKa value, if need arises. The weights wi corresponding to the best matches (or the single best match in the case of the sorting method) can be output alongside the structures and their protonation sites. This way one can see which structures contribute most significantly to the final pKa value. The accuracies with which the interpolation and sorting methods classify atoms as active and inactive are comparable. One test set consisted of 76 molecules and 1400 atoms, where the atom activities (active or inactive) were assigned by humans using chemical intuition and, in case of doubt, Jaguar pKa calculations. The numbers of true actives (expected active and predicted active), true inactives (expected inactive and predicted inactive), false inactives (expected active and predicted inactive), and false actives (expected inactive and predicted active) were 126, 1169, 84, and 21, respectively, for the interpolation method and 182, 1131, 28, and 59, respectively, for the sorting method. As only the false inactive predictions are considered undesirable, these correspond to 94% for the interpolation and 98% for the sorting method. In view of the rather small size of the test set and the fact that the performance of the algorithms was compared to assignments based on chemical intuition, it would be premature to claim that one method is definitely more accurate than the other. The results presented in this work were generated with the interpolation method unless otherwise indicated. A critical question at this point is the construction of the interpolation set. If the chemical similarity between the query molecule and the interpolation set is low (the maximal overlap L being less than 0.2 or so), then neither the interpolation method nor the sorting method is expected to yield reasonable pKa values. The next section describes a way to deal with this problem. 4.3. The Interpolation Set. The success of the pKa estimator defined by eq 12 depends on both the evaluation of the weights wi and the completeness of the interpolation set. The previous section dealt with the weights, and the current section discusses the construction of the interpolation set. An interpolation set {I} consists of data points Ik. Each data point Ik is composed of the following: (i) a pair of 3D structures the connectivities of which differ by a single proton only; (ii) the micro-pKa value associated with the proton dissociation process; (iii) the MCN for each of the two structures computed at the heavy atom X and the proton attached to X; and (iv) semiempirical heats of formation computed for each of the two structures. As our interests for tautomer generation lie with organic and drug-like molecules, we used such molecules in building our interpolation set. Construction of interpolation sets for other chemical spaces should be also achievable. A small number of data points Ik had their micro-pKa values taken from experimental measurements

N

pK a =

∑i = 1 wi pK a, i N

∑i = 1 wi

(12)

where the pKa,i are the micro-pKa’s from the interpolation set and the wi are the weights defined below. For brevity, we have dropped the index X in the notation for the final weights for atom X. The denominator accounts for normalization of the weights. The weights decay rapidly, and typically only a very small fraction of the data in the interpolation set contributes significantly to the average. This approach is similar to the (weighted) k-nearest-neighbor methods commonly used in the machine learning community.53 The weights are defined as products of three contributions: wi = wixL × wixE × wixQ

(13)

where we let wixL =

wixE =

1 1+e

−sL(Lix − cL)

(14)

1 1 + e−sE(ΔΔEix − cE)

(15)

and

wixQ = e−sqΔQ ix

(16)

The parameters sL, cL, sE, cE, and sq were obtained by fitting to 1413 micro-pKa values (known from experiment and previous calculations) for 76 neutral, cationic, and anionic molecules. The optimized parameters are 80, 0.7, 10, −10.0, and 100, respectively. The nonlinear optimization was performed by an iterative process in which all of the parameters except one were fixed and the minimum was found for the single parameter that was varied. For more robust structural matching of the proton dissociation process, the overlap Lix may also be defined as Lprot × Ldeprot , where Lprot and Ldeprot are computed for the ix ix ix ix protonated and deprotonated forms of the same heavy atom, respectively. In practice we found this to be more satisfactory, and this is the form used in the results presented below. Now, after having evaluated the micro-pK a of the deprotonation process XH = X− + H+ for atom X, it is necessary to evaluate the micro-pKa of the corresponding protonation process XH2+ = XH + H+. To perform the latter operation, we additionally protonate the molecule at atom X (even though X might be already bonded to a proton), thus creating XH2+, and then repeat the whole series of operations explained in the previous paragraph for the protonation form XH. Finally, we classify the atoms for our tautomer generator. If no proton is attached to X to begin with, then X obviously cannot be a proton donor and the only micro-pKa that we can evaluate for this atom corresponds to the dissociation of the additionally protonated X. The pKa of the process XH = X− + H+ will determine whether X is a proton donor, while the pKa of the process XH+2 = XH + H+ will determine whether X is a proton acceptor (see Figure 4). When the interpolation set becomes sufficiently dense (in the sense that it covers the relevant chemical space well), it is possible to dispense with the interpolation procedure and the 2682

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

environments. This is the case because it generally costs more energy to protonate a positively charged molecule than a neutral one (the local environment being similar), and similarly for deprotonation. When predicting pKa with one of our proposed approaches, in order to get several large values wi one would need to have large energy weights wEix. In agreement with the argument made about the dependence of the energy gap on the total charge, if one predicts the pKa’s of molecule X with total charge C at depth zero, one needs a good representation of molecules of charge C (as well as their conjugate acids and bases) in the interpolation set. If one uses the depth one algorithm for molecule X with total charge C, one’s interpolation set needs to have a good representation of molecules with total charges C − 1 and C + 1 (as well as their conjugate acids and bases). Our current interpolation set consists of 15 076 data points for charge transitions 0/−1 and 0/+1 and 5125 data points for charge transitions −1/−2 and +1/+2. This indicates that our current interpolation set should be more reliable for the generation of tautomers of neutral molecules. In view of this data distribution, the number and quality of our non-μ micro-pKa assignments decrease with the transition from neutral to progressively more charged molecules. Certainly, it is more desirable to get an actual pKa prediction rather than μ because a predicted value based on well-matched data gives more assurance in the assignment of atom activity. With the addition of new data points to the interpolation set, the proportion of μ pKa predictions should decrease.

(whenever they were available), but for the majority the micropKa’s were computed by the Jaguar pKa program. MCNs and heats of formation were computed according to the procedure described in the previous section. When constructing an interpolation set, one strives to produce a few large weights wk corresponding to data points Ik the local environments of which resemble the local environments of atoms of molecule X, for every molecule X of interest. When all of the wk in a particular micro-pKa calculation are small (less than 0.1), the interpolation set does not represent the particular environment of molecule X well. Interpolations in which all of the weights are small are not likely to yield reasonable micro-pKa predictions. In practice, two things can be done in such situations. First, one can automatically enrich the interpolation set by fragmenting molecule X along a few key single bonds (excluding cycles) and launching Jaguar pKa calculations on the fragments. Fragmentation does not greatly perturb the local environments and renders the DFT-based pKa calculations much faster because such calculations scale as the second to third power of the number of atoms. Then a subsequent calculation on molecule X or its analogues will have weights wk of appropriate magnitude. This approach is not always satisfactory in practice because Jaguar pK a calculations are computationally expensive and usually require several minutes for each atom environment. This greatly slows down the generation of tautomers for molecule X. However, once the enrichment for molecule X has been completed, it is likely to be unnecessary for molecules similar to X, if those are to be processed next. It has been already mentioned that Jaguar pKa does not always predict reliable micro-pKa values when applied to highly unusual protonation/deprotonation processes (such as protonation of carbon atoms in heterocycles). There might also be problems with highly unstable structures thus created, which, for instance, fall apart or otherwise “react” during geometry optimizations. In such cases we fall back on the second option, which is described in the next paragraph. The second option is pragmatic. If no molecules in a sufficiently enriched interpolation set match a particular atom of X, we assume that the atom is inactive. Thus, we assign a large positive pKa value μ to that atom if it is a hydrogen and a large negative value −μ otherwise (the actual value of μ is immaterial, as long as it signals an inactive atom). See an illustration in Figure 1. The logic behind the assumption that the atom is inactive is based on the observation that in a typical molecule there are actually very few atoms that are active. Active atoms also tend to be given more prominence in an interpolation set than their “natural” proportion in molecules would otherwise suggest. This is the case because the micropKa’s of such atoms are more often measured and predicted computationally and are more readily available for inclusion in the interpolation set. Thus, if an atom has no good match to anything in the interpolation set, it is more likely than not to be inactive. There may be a high proportion of neutral molecules (and their singly protonated and deprotonated forms) in the interpolation set. This is natural because researchers tend to deal with neutral molecules and their conjugate acids and bases (as opposed to multiply charged ions). At depth one, given a neutral molecule as an input, unless it lacks active atoms at depth zero, we necessarily have to compute micro-pKa’s of singly protonated and singly deprotonated molecules. The energy gaps ΔΔE generally depend on the total charges of the protonated/deprotonated pair, even for very similar local

5. ILLUSTRATIVE RESULTS The arguments earlier in the text make it clear that it is not the sheer number of generated tautomers but their relevance that matters in practice. The criterion for the relevance of a tautomer is its population in the tautomeric equilibrium. At room temperature, an energy difference of a few kcal/mol from the lowest energy renders a tautomer’s population insignificant. General principles of physical organic chemistry imply that stability and reactivity can be viewed as complementary and inversely contrasted concepts.54 A few low-energy tautomers might thus be of interest in studies of covalent reactivity and binding to proteins.17,24,55 Therefore, in the generation of tautomers higher-energy structures must also be considered. In the discussion below, we will use a window of 5 kcal/mol (corresponding to about 0.02% population in the mixture of two tautomers at room temperature) to distinguish between relevant and irrelevant tautomers. Even though this window is quite arbitrary (Greenwood and co-workers,19 for example, use a window of 2.76 kcal/mol), it will not influence the general conclusions about our tautomer generation method and will only serve as a convenient reference metric. Our numerous tests of the depth zero and depth one algorithms have shown that only the latter is satisfactory in terms of the number and quality of tautomers generated. For this reason, we will omit the depth zero algorithm from the following discussion. 5.1. Computational Details. It is unnecessary for the purposes of obtaining a qualitative picture to incur the additional computational expense of computing free energies in solution, as the relative free energies of rigid, asymmetric molecules used in the illustrations should be very similar to their relative solution-phase energies. The reference energies were computed at the M06-2X-D3/cc-pVDZ(-f) level of theory in conjunction with a locally modified version of the 2683

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 10. Test set of molecules prone to tautomerism. The results of applying different tautomer generation methods to these molecules are summarized in Table 1.

implicit solvent model PBF,56−58 as implemented in the quantum-chemistry program Jaguar.46 This variant of PBF, dubbed LAD-PBF, sets the atomic radii and the local nonelectrostatic corrections to the solution-phase energy using an interpolation scheme based on local atomic descriptors following a methodology similar to the one delineated in section 4.1. As the interpolation set we employed 8567 atomic environments and their corresponding PBF parameter assignments using uncomplicated molecules on which PBF is known to perform reliably. The overall performance of LAD-PBF for predicting solvation energies is similar to the good performance of PBF (mean unsigned error of 1.91 kcal/mol for LADPBF vs 1.92 kcal/mol for PBF on a test set of 451 molecules comprising small and large neutral molecules, anions, and cations). Importantly, thanks to the LAD-based parameter assignments, LAD-PBF energies are a function of Cartesian coordinates only and do not depend on the Lewis structure. The independence of energies on bond and formal charge assignments, especially in structures with charge delocalization, particularly anions and cations, is crucial when comparing the relative energetics of generated tautomers. All of the geometry optimizations preceding M06-2X-D3 single-point energy calculations were performed at the B3LYP-D3/6-31G** level of theory in conjunction with the LAD-PBF model. The semiempirical filtering with the PM7 method was performed with the computer program MOPAC201659 in the solution phase using the COSMO solvation model, and the tautomers having relative energies above 20 kcal/mol were discarded. The energy window of 20 kcal/mol, which is substantially larger than the target energy window of 5 kcal/mol, is needed as a protection against a larger margin of error associated with semiempirical methods. A similar combined DFT/semiempirical protocol for tautomer filtration has been used by the Varnek group.60 Let us comment on our choice of theoretical methods to rank tautomers by their energies. DFT functionals, in particular M06-2X,61 have served as a theoretical method of choice in combined theoretical and experimental studies of tautomer populations in condensed media.62,63 Benchmarking studies have shown that M06-2X is especially well suited for tautomer energetics predictions.19,64 The dispersion-corrected analogue M06-2X-D3 should have similar performance as M06-2X and was recently selected as one of the few recommended functionals in a large-scale assessment of 200 functionals by

Mardirossian and Head-Gordon.65 Less is known about the performance of PM7 on tautomers, but the accuracy of this latest method in the PM series has been compared favorably to DFT in application to conformers.66 Since tautomers can be viewed as isomers, it might be also mentioned that the predecessor of PM7, the PM6 method, was shown to rank the stability of neutral isomeric species reliably (in agreement with M06-2X) but performed badly for anions.67 PM6 was also used for ranking tautomers in recent work.60 The semiempirical methods MNDO and AM1 were not found to be sufficiently accurate tautomer filters by Cruz-Cabeza and coauthors.30 Filtration of the initial set of generated tautomers with semiempirical methods like PM7 needs to be studied further. We have determined that the use of a solvation model was essential for reasonable ranking and effective filtering. In our experience, this protocol works reliably for tautomers originating from structures like 1−8 (vide infra). However, we have found that for porphyrin-like tautomers such as those recently reported by Kim and co-workers,68 PM7 does not serve as a good filter, discarding low-energy tautomers generated at the depth one level even when very large energy windows are used. Presently we regard such structures as exceptions for which more efficient filtering techniques need to be found. 5.2. Discussion. We present tautomer generation results for a number of organic heterocycles susceptible to tautomerism. The structures themselves are displayed in Figure 10 (only one representative tautomer, the one that served as input to the algorithms, is shown per structure). This test set was constructed from molecules with unusual aspects of tautomerism, as previously discussed in ref 1 (i.e., structures 1−6), and two additional structures, 7 and 8 (also known as allopurinol and enprofylline, respectively), the tautomerism of which has been analyzed in some detail in the literature.26,30 It must be noted that simpler and more familiar heterocycles prone to tautomerism, such as 2- or 4-pyridones, did not present problems to either the depth one algorithm or to the rules-based Tautomerizer algorithm (vide infra) and were therefore not particularly interesting for our tests. We compare the results from different tautomer generation methods in Table 1. First, for reference, we report the total numbers of tautomers generated by the depth one algorithm, and they are seen to be consistent with eq 1 and the number of cis−trans tautomers created in postprocessing. For example, 2684

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 11. Tautomers 4a and 4b with relative M06-2X-D3/cc-pVDZ(-f) solution-phase (LAD-PBF) energies of 0.00 and 1.65 kcal/mol, respectively, are related by a long-distance proton transfer, which might be difficult for local rules-based methods to handle but is not a problem for approaches such as depth one.

for structure 3 the algorithm identified M = 3, D = 4, and T = 1, and there were no cis−trans isomers, which resulted in 14 tautomers. For 2, the active atoms were characterized by M = 2, D = 4, and T = 0, which produced six tautomers. From these six tautomers, three additional tautomers were generated from cis−trans isomerization, thus bringing the total to nine tautomers. Second, we report the numbers of relevant tautomers on the basis of their DFT energies, as discussed above, after ranking of the full set of tautomers taken from the depth one results. Table 1 reveals that structures like those from Figure 10 typically possess a small number of relevant tautomers (no more than five) according to their DFT energies. Third, we report the total numbers of tautomers resulting from PM7 filtering as well as the numbers of “missing” tautomers (i.e., relevant tautomers that were not returned). Finally, we report the total numbers of tautomers and the numbers of missing tautomers obtained using an alternative rules-based method developed internally and based on the ideas published in ref 27, called the Tautomerizer. To assess the PM7 filtering and to compare it with the Tautomerizer, we use the results of the depth one algorithm as

structure in Table 1 are shown in the Supporting Information, with the relevant tautomers highlighted. It should be noted that the depth one algorithm generates several times more tautomers than are energetically relevant or expected from a human perspective, without any obvious omissions. If anything, the algorithm includes active atoms too aggressively. The occasional aggressivity of the depth one algorithm should not present a problem in practice, however. When the PM7 filtering is applied to the generated tautomers, we see that only a small multiple of the total number of tautomers is left, and furthermore, no relevant tautomers are missed. In contrast, the rules-based Tautomerizer misses relevant tautomers in all cases but two. We estimate that “non-local” proton jumps characterizing equilibria such as that between the two relevant tautomers of 4 (see Figure 11), which are correctly identified by our physics-based algorithm, will be particularly challenging for rules-based methods, which are local in nature. The output of the depth one (as well as depth zero) tautomer generation algorithm is formally dependent on the initial input tautomer. That is, if a compound possesses N tautomers Tk (k = 1, ..., N), then the sets of tautomers generated from Tk for different k are not guaranteed to be the same. In practice, the formal dependence of the results on the initial tautomer is not important. What is important is the invariant generation of the high-relevance tautomers (those with relative energies within the 5 kcal/mol window). Even though the latter is not guaranteed either, in practice the output of our depth one algorithm varies only in very high energy tautomers, and all of the high-relevance tautomers were generated without gaps in all of our tests. Figure 12 shows an infrequent case in which one of 10 input tautomers (the set was initially generated by the depth one algorithm) failed to produce one of the tautomers (the highest-energy one) from the relevant 5 kcal/mol region. The remaining nine inputs did generate the same set of three relevant tautomers (although the total number of tautomers was indeed dependent on the input). Remarkably, the prediction of the unusual zwitterionic form 6a as the most stable tautomer in this case actually agrees with the prediction by Katrizky and co-workers,1 which was made at a time when accurate quantum-chemical calculations for molecules of this size could not be conducted. The molecules studied in Table 1 are relatively small. As the number of tautomers grows significantly with the number of detected active atoms, it is important to verify that the depth one algorithm does not produce an unmanageable number of tautomers for larger molecules, such as practically important drug-like molecules. This could occur, for example, if the underlying MPP predicted a large number of atoms to be active that are in reality nonactive. Figure 13 gives evidence that our depth one algorithm together with our MPP behave reasonably. Four drug-like molecules (among them a singly charged anion and two singly charged cations) are shown as

Table 1. Numbers of Tautomers Generated by Various Methods for the Structures Shown in Figure 10a depth one + PM7 compound

relevant

depth one

generated

1 2 3 4 5 6 7 8

2/5 3/3 2/2 2/2 1/1 2/3 2/2 1/2

9 9 14 4 6 10 10 35

8 4 5 2 2 8 9 4

Tautomerizer

missing generated 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0

5 1 2 1 4 1 1 1

missing 0/0 2/2 2/2 1/1 0/0 2/3 2/2 0/1

Relevant tautomers are tautomers with DFT energies within ϵ kcal/ mol from the lowest energy. The data for ϵ = 2.5 and 5.0 kcal/mol are shown in the format Aϵ=2.5/Bϵ=5.0, where appropriate. Since the relevant tautomers were picked from the pool generated by the depth one method, it is posited for the purposes of this table that this method misses no relevant tautomers. Therefore, missing tautomers are those that are present in the list of relevant tautomers but not found by the specified scheme. The cis−trans RN−H forms are counted as separate tautomers. a

a reference (i.e., we treat it as an exhaustive tautomer enumerator), and this is our justification for reporting the number of “missing” tautomers. Of course, there is no guarantee that the depth one method will find all of the highly populated tautomers, but nevertheless, we believe that the sets it generates are typically comprehensive. The full sets of tautomers generated by the depth one algorithm for each 2685

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 12. Study of the invariance of the depth one/PM7 algorithm with respect to the input tautomer. The structures of input tautomers 6a−j are shown horizontally. The tautomers generated from each of the initial tautomers 6a−j are arranged vertically by their relative energies computed at the M06-2X-D3/cc-pVTZ(-f) level of theory in conjunction with the locally modified PBF implicit solvation model. The bar separating relevant from irrelevant tautomers is set at 5 kcal/mol.

methods, so this should not be viewed as a significant drawback. Besides, the generation and entry of new training data, which require human participation, have been reasonably automated, and by construction our method circumvents the need to manage cumbersome and error-prone proton shift patterns present in other methods. Having identified active protonation and deprotonation sites, we proposed two methods for the subsequent generation of tautomers in a systematic way with minimal bias and to postprocess the results to filter out species expected to have low population. Our most basic method (the depth zero algorithm followed by PM7 filtering) requires up to several minutes of execution time for a typical large drug-like molecule. For example, execution of the sorting variant of our algorithm (the interpolation variant being slightly more expensive) took 330 s on a single CPU for the drug dasatinib, the molecule of which comprises 59 atoms. A depth zero calculation on each of the smaller molecules from Figure 10 on a single CPU took about a minute. This is negligible time if the results of such tautomer generation are to be used in a subsequent DFT-based pKa or similar calculation. Most of the computation time is spent for force field (FF) cleanup of the generated protonated and deprotonated structures, computation of FF atomic charges, and computation of the overlap L across the interpolation set. Only a small fraction of time is spent for the semiempirical calculations. Importantly, the algorithm can be trivially parallelized across atoms since calculations on the individual atoms are independent. We have not yet attempted this implementation, but it is clear that execution of the depth zero method can take mere seconds for even large drug-like molecules on a handful of CPUs. Unfortunately, the depth zero algorithm does not always produce a desirable set of tautomers. A much better method as

examples. No more than one or two tautomers are expected by a human to be relevant for each of these structures. The number of tautomers generated in each case is small (the largest number is six), and this number can be further reduced upon the PM7 filtering.

6. CONCLUSION We have described a new, general, physics-based method to generate tautomers. It is based on the identification of active protonation and deprotonation sites from estimated micropKa’s, in contrast to other approaches that may rely on the use of hard-coded rules of proton distribution, previously observed tautomerization patterns from a known chemical space, or human input. Our approach is grounded in the computational prediction of physical properties in aqueous solution and is built around physical descriptors and precomputed and experimental pKa data in water, which serve as training data. As opposed to other physics-based methods that are more relevant to the gas phase,31 the use of pKa data makes our method applicable for generating tautomers relevant in solution. Importantly, our method aims to avoid using Lewis structures. Regrettably, full independence from Lewis structures could not be achieved, as the concepts of bonds and formal charges are implicitly present in several stages of our algorithm via intermediate structures generated by molecular mechanics methods. Nevertheless, the troubles related to mesomeric variations are virtually eliminated, and the serious problem of having to account for alternative resonance structures is absent from our scheme. A disadvantage of our method is that it relies on training data, which necessitates preparation and refinement by a human. However, human involvement is common in empirical 2686

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling

Figure 13. Numbers of tautomers generated by the depth one algorithm for drug-like molecules that are actually not expected to have a large number of tautomers. The purpose of the test was to make sure that an application of the method to such molecules produces a fairly small number of tautomers.



far as the completeness of tautomer generation goes is the depth one algorithm, which is several times more expensive. The execution time of the depth one algorithm is approximately T1 = T0Nact, where T0 is the execution time for the depth zero algorithm for the given molecule and Nact is the number of active atoms at the depth zero level. Assuming that Nact is no more than 10 for a typical drug-like molecule, the generation of tautomers at the more reliable depth one level (in combination with PM7 filtering) will take several minutes on a multi-CPU system. Even this increase in computational cost of tautomer generation is still a small fraction of the overall cost of a typical DFT-based workflow to which the tautomers serve as an input. Our approach depends critically on the quality of the micropKa predictor (MPP) used to identify active sites. In this work we presented a new method designed specifically for this algorithm to yield a robust yet efficient estimation of the micro-pKa of every atom in a molecule. It should be possible to speed up the generation of tautomers by designing more efficient MPPs that are not only faster but also more accurate, thus avoiding the inclusion of unnecessary active sites that generate spurious high-energy tautomers, perhaps using trained neural networks. We leave the exploration of these possibilities for future work. We note that the first attempts to use neural networks for predicting pKa’s have been recently reported.69,70 An exciting application of this work is the generation of tautomers to be used in the computation of macro-pKa’s. We have already applied the methods described here to our own in-house macro-pKa prediction workflow. The workflow itself will be presented elsewhere, but it is already possible to point out that the depth zero and depth one tautomer generation methods have proved to be more promising and more dependable than the rules-based methods that we have employed.

ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.8b00955. Tautomers generated by the depth one method for compounds 1−8, with low-energy (relevant) tautomers indicated (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Art D. Bochevarov: 0000-0002-0376-8432 Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We thank Drs. John Shelley, Ryne Johnston, Mats Svensson, Leif Jacobson, and Karl Leswing for useful discussions and Dr. Daniel Levine for technical assistance with some details of our algorithm implementation.



REFERENCES

(1) Elguero, J.; Marzin, C.; Katritzky, A. R.; Linda, P. The Tautomerism of Heterocycles; Academic Press: New York, 1976. (2) Tautomerism: Concepts and Applications in Science and Technology; Antonov, L., Ed.; Wiley-VCH: Weinheim, Germany, 2016. (3) Sayle, R. A. So you Think you Understand Tautomerism? J. Comput.-Aided Mol. Des. 2010, 24, 485−496. (4) Warr, W. A. Tautomerism in Chemical Information Management Systems. J. Comput.-Aided Mol. Des. 2010, 24, 497−520. (5) Guasch, L.; Yapamudiyansel, W.; Peach, M. L.; Kelley, J. A.; Barchi, J. J., Jr.; Nicklaus, M. C. Experimental and Chemoinformatics

2687

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling Study of Tautomerism in a Database of Commercially Available Screening Samples. J. Chem. Inf. Model. 2016, 56, 2149−2161. (6) Milletti, F.; Storchi, L.; Sforna, G.; Cross, S.; Cruciani, G. Tautomer Enumeration and Stability Prediction for Virtual Screening on Large Chemical Databases. J. Chem. Inf. Model. 2009, 49, 68−75. (7) Glavatskikh, M.; Madzhidov, T.; Baskin, I. I.; Horvath, D.; Nugmanov, R.; Gimadiev, T.; Marcou, G.; Varnek, A. Visualization and Analysis of Complex Reaction Data: The Case of Tautomeric Equilibria. Mol. Inf. 2018, 37, 1800056. (8) Pospisil, P.; Ballmer, P.; Scapozza, L.; Folkers, G. Tautomerism in Computer-aided Drug Design. J. Recept. Sig. Transd. 2003, 23, 361−371. (9) Nemeria, N.; Korotchkina, L.; McLeish, M. J.; Kenyon, G. L.; Patel, M. S.; Jordan, F. Elucidation of the Chemistry of Enzymebound Thiamin Diphosphate Prior to Substrate Binding: Defining Internal Equilibria Among Tautomeric and Ionization States. Biochemistry 2007, 46, 10739−10744. (10) Katritzky, A. R.; Hall, C. D.; El-Gendy, B. E.-D. M.; Draghici, B. Tautomerism in Drug Discovery. J. Comput.-Aided Mol. Des. 2010, 24, 475−484. (11) Ramakrishnan, A.; Chourasiya, S. S.; Bharatam, P. V. Azine or hydrazone? The dilemma in amidinohydrazones. RSC Adv. 2015, 5, 55938−55947. (12) Waluk, J. Spectroscopy and Tautomerization Studies of Porphycenes. Chem. Rev. 2017, 117, 2447−2480. (13) Rauf, M. A.; Hisaindee, S.; Saleh, N. Spectroscopic Studies of Keto-enol Tautomeric Equilibrium of Azo Dyes. RSC Adv. 2015, 5, 18097−18110. (14) Graton, J.; Berthelot, M.; Gal, J.-F.; Girard, S.; Laurence, C.; Lebreton, J.; Le Questel, J.-Y.; Maria, P.-C.; Nauš, P. Site of Protonation of Nicotine and Nornicotine in the Gas Phase: Pyridine or Pyrrolidine Nitrogen? J. Am. Chem. Soc. 2002, 124, 10552−10562. (15) Colasurdo, D. D.; Pila, M. N.; Iglesias, D. A.; Laurella, S. L.; Ruiz, D. L. Tautomerism of Uracil and Related Compounds: A Mass Spectrometry Study. Eur. J. Mass Spectrom. 2018, 24, 214−224. (16) Borbulevych, O.; Martin, R. I.; Tickle, I. J.; Westerhoff, L. M. XModeScore: a Novel Method for Accurate Protonation/tautomerstate Determination Using Quantum-Mechanically Driven Macromolecular X-ray Crystallographic Refinement. Acta Crystallogr. 2016, D72, 586−598. (17) Bax, B.; Chung, C.; Edge, C. Getting the Chemistry Right: Protonation, Tautomers and the Importance of H Atoms in Biological Chemistry. Acta Crystallogr. 2017, D73, 131−140. (18) Haranczyk, M.; Gutowski, M. Combinatorial- Computationalchemoinformatics (C3) Approach to Finding and Analyzing Lowenergy Tautomers. J. Comput.-Aided Mol. Des. 2010, 24, 627−638. (19) Greenwood, J. R.; Calkins, D.; Sullivan, A. P.; Shelley, J. C. Towards the Comprehensive, Rapid, and Accurate Prediction of the Favorable Tautomeric States of Drug-like Molecules in Aqueous Solution. J. Comput.-Aided Mol. Des. 2010, 24, 591−604. (20) Will, T.; Hutter, M. C.; Jauch, J.; Helms, V. Batch Tautomer Generation with MolTPC. J. Comput. Chem. 2013, 34, 2485−2492. (21) Kochev, N. T.; Paskaleva, V. H.; Jeliazkova, N. AmbitTautomer: An Open Source Tool for Tautomer Generation. Mol. Inf. 2013, 32, 481−504. (22) Guasch, L.; Sitzmann, M.; Nicklaus, M. C. Enumeration of Ring−Chain Tautomers Based on SMIRKS Rules. J. Chem. Inf. Model. 2014, 54, 2423−2432. (23) Urbaczek, S.; Kolodzik, A.; Rarey, M. The Valence State Combination Model: A Generic Framework for Handling Tautomers and Protonation States. J. Chem. Inf. Model. 2014, 54, 756−766. (24) Martin, Y. C. Let’s not Forget Tautomers. J. Comput.-Aided Mol. Des. 2009, 23, 693−704. (25) Porter, W. R. Warfarin: History, Tautomerism and Activity. J. Comput.-Aided Mol. Des. 2010, 24, 553−573. (26) Thalheim, T.; Vollmer, A.; Ebert, R.-U.; Kü hne, R.; Schüürmann, G. Tautomer Identification and Tautomer Structure Generation Based on the InChI Code. J. Chem. Inf. Model. 2010, 50, 1223−1232.

(27) Shelley, J. C.; Cholleti, A.; Frye, L. L.; Greenwood, J. R.; Timlin, M. R.; Uchimaya, M. Epik: a Software Program for pKa Prediction and Protonation State Generation for Drug-like Molecules. J. Comput.-Aided Mol. Des. 2007, 21, 681−691. (28) Chemicalize; ChemAxon: Budapest, Hungary, 2018; https:// chemaxon.com/products/chemicalize. (29) James, C. A.; Weininger, D. Daylight Theory Manual, version 4.9; Daylight Chemical Information Systems: Laguna Niguel, CA, 2011. (30) Cruz-Cabeza, A. J.; Schreyer, A.; Pitt, W. R. Annular Tautomerism: Experimental Observations and Quantum Mechanics Calculations. J. Comput.-Aided Mol. Des. 2010, 24, 575−586. (31) Pracht, P.; Bauer, C. A.; Grimme, S. Automated and Efficient Quantum Chemical Determination and Energetic Ranking of Molecular Protonation Sites. J. Comput. Chem. 2017, 38, 2618−2631. (32) Tadić, J. M.; Xu, L. Ab Initio and Density Functional Theory Study of Keto−Enol Equilibria of Deltic Acid in Gas and Aqueous Solution Phase: A Bimolecular Proton Transfer Mechanism. J. Org. Chem. 2012, 77, 8621−8626. (33) Burcu Arslan, N.; Ö zdemir, N.; Dayan, O.; Dege, N.; Koparir, M.; Koparir, P.; Muğlu, H. Direct and Solvent-Assisted Thione−Thiol Tautomerism in 5-(Thiophen-2-yl)-1,3,4-oxadiazole-2(3H)-thione: Experimental and Molecular Modeling Study. Chem. Phys. 2014, 439, 1−11. (34) Cieplak, P.; Bash, P.; Singh, U. C.; Kollman, P. A. A Theoretical Study of Tautomerism in the Gas Phase and Aqueous Solution: a Combined Use of State-of-the-art Ab Initio Quantum Mechanics and Free Energy-perturbation Methods. J. Am. Chem. Soc. 1987, 109, 6283−6289. (35) Schröder, D.; Buděsí̌ nský, M.; Roithová, J. Deprotonation of pHydroxybenzoic Acid: Does Electrospray Ionization Sample Solution or Gas-Phase Structures? J. Am. Chem. Soc. 2012, 134, 15897−15905. (36) Perrin, D. D.; Dempsey, B.; Serjeant, E. P. pKa Prediction for Organic Acids and Bases; Chapman and Hall: London, 1981. (37) Fraczkiewicz, R.; Lobell, M.; Göller, A. H.; Krenz, U.; Schoenneis, R.; Clark, R. D.; Hillisch, A. Best of Both Worlds: Combining Pharma Data and State of the Art Modeling Technology To Improve in Silico pKa Prediction. J. Chem. Inf. Model. 2015, 55, 389−397. (38) Yu, H. S.; Watson, M. A.; Bochevarov, A. D. Weighted Averaging Scheme and Local Atomic Descriptor for pKa Prediction Based on Density Functional Theory. J. Chem. Inf. Model. 2018, 58, 271−286. (39) Bochevarov, A. D.; Watson, M. A.; Greenwood, J. R.; Philipp, D. M. Multiconformation, Density Functional Theory-Based pKa Prediction in Application to Large, Flexible Organic Molecules with Diverse Functional Groups. J. Chem. Theory Comput. 2016, 12, 6001− 6019. (40) Citra, M. J. Estimating the pKa of Phenols, Carboxylic Acids and Alcohols from Semi-empirical Quantum Chemical Methods. Chemosphere 1999, 38, 191−206. (41) Jensen, J. H.; Swain, C. J.; Olsen, L. Prediction of pKa Values for Druglike Molecules Using Semiempirical Quantum Chemical Methods. J. Phys. Chem. A 2017, 121, 699−707. (42) Helgaker, T.; Jørgensen, P.; Olsen, J. Molecular ElecronicStructure Theory; John Wiley & Sons: Chichester, U.K., 2000. (43) Valters, R. E.; Flitsch, W. Ring−Chain Tautomerism; Plenum Press: New York, 1985. (44) Mchedlov-Petrossyan, N. O.; Cheipesh, T. A.; Vodolazkaya, N. A. Acid−base Dissociation and Tautomerism of Two Aminofluorescein Dyes in Solution. J. Mol. Liq. 2017, 225, 696−705. (45) Malde, A. K.; Stroet, M.; Caron, B.; Visscher, K. M.; Mark, A. E. Predicting the Prevalence of Alternative Warfarin Tautomers in Solution. J. Chem. Theory Comput. 2018, 14, 4405−4415. (46) Bochevarov, A. D.; Harder, E.; Hughes, T. F.; Greenwood, J. R.; Braden, D. A.; Philipp, D. M.; Rinaldo, D.; Halls, M. D.; Zhang, Z.; Friesner, R. A. Jaguar: A High-performance Quantum Chemistry Software Program with Strengths in Life and Materials Sciences. Int. J. Quantum Chem. 2013, 113, 2110−2142. 2688

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689

Article

Journal of Chemical Information and Modeling (47) Klicić, J. J.; Friesner, R. A.; Liu, S.-Y.; Guida, W. C. Accurate Prediction of Acidity Constants in Aqueous Solution via Density Functional Theory and Self-Consistent Reaction Field Methods. J. Phys. Chem. A 2002, 106, 1327−1335. (48) Vanommeslaeghe, K.; MacKerell, A. D., Jr. Automation of the CHARMM General Force Field (CGenFF) I: Bond Perception and Atom Typing. J. Chem. Inf. Model. 2012, 52, 3144−3154. (49) Roos, K.; Wu, C.; Damm, W.; Reboul, M.; Stevenson, J. M.; Lu, C.; Dahlgren, M. K.; Mondal, S.; Chen, W.; Wang, L.; Abel, R.; Friesner, R. A.; Harder, E. D. OPLS3e: Extending Force Field Coverage for Drug-Like Small Molecules. J. Chem. Theory Comput. 2019, 15, 1863−1874. (50) Grimme, S.; Antony, J.; Ehrlich, S.; Krieg, H. A Consistent and Accurate Ab Initio Parametrization of Density Functional Dispersion Correction (DFT-D) for the 94 Elements H−Pu. J. Chem. Phys. 2010, 132, 154104. (51) Lewis, G. N. The Atom and the Molecule. J. Am. Chem. Soc. 1916, 38, 762−785. (52) Stewart, J. J. P. Optimization of Parameters for Semiempirical Methods VI: More Modifications to the NDDO Approximations and Re-optimization of Parameters. J. Mol. Model. 2013, 19, 1−32. (53) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer: New York, 2001. (54) Anslyn, E. V.; Dougherty, D. A. Modern Physical Organic Chemistry; University Science Books: Sausalito, CA, 2006. (55) Orgován, Z.; Ferenczy, G. G.; Steinbrecher, T.; Szilágyi, B.; Bajusz, D.; Keserű , G. M. Validation of Tautomeric and Protomeric Binding Modes by Free Energy Calculations. A Case Study for the Structure Based Optimization of D-amino Acid Oxidase Inhibitors. J. Comput.-Aided Mol. Des. 2018, 32, 331−345. (56) Marten, B.; Kim, K.; Cortis, C.; Friesner, R. A.; Murphy, R. B.; Ringnalda, M. N.; Sitkoff, D.; Honig, B. New Model for Calculation of Solvation Free Energies: Correction of Self-Consistent Reaction Field Continuum Dielectric Theory for Short-Range Hydrogen-Bonding Effects. J. Phys. Chem. 1996, 100, 11775−11788. (57) Cortis, C. M.; Friesner, R. A. An Automatic Three-Dimensional Finite Element Mesh Generation System for the Poisson−Boltzmann Equation. J. Comput. Chem. 1997, 18, 1570−1590. (58) Cortis, C. M.; Friesner, R. A. Numerical Solution of the Poisson−Boltzmann Equation Using Tetrahedral Finite-element Meshes. J. Comput. Chem. 1997, 18, 1591−1608. (59) Stewart, J. J. P. MOPAC2016; Stewart Computational Chemistry: Colorado Springs, CO, 2016. (60) Gimadiev, T. R.; Madzhidov, T. I.; Nugmanov, R. I.; Baskin, I. I.; Antipin, I. S.; Varnek, A. Assessment of Tautomer Distribution Using the Condensed Reaction Graph Approach. J. Comput.-Aided Mol. Des. 2018, 32, 401−414. (61) Zhao, Y.; Truhlar, D. G. The M06 Suite of Density Functionals for Main Group Thermochemistry, Thermochemical Kinetics, Noncovalent Interactions, Excited States, and Transition Elements: Two New Functionals and Systematic Testing of Four M06-class Functionals and 12 other functionals. Theor. Chem. Acc. 2008, 120, 215−241. (62) Hansen, P. S.; Kamounah, F.; Zhiryakova, D.; Manolova, Y.; Antonov, L. 1,1′,1″-(2,4,6-Trihydroxybenzene-1,3,5-triyl)triethanone Tautomerism Revisited. Tetrahedron Lett. 2014, 55, 354−357. (63) Ivanova, D.; Deneva, V.; Nedeltcheva, D.; Kamounah, F. S.; Gergov, G.; Hansen, P. E.; Kawauchi, S.; Antonov, L. Tautomeric Transformations of Piroxicam in Solution: a Combined Experimental and Theoretical Study. RSC Adv. 2015, 5, 31852−31860. (64) Kawauchi, S.; Antonov, L. Description of the Tautomerism in Some Azonaphthols. J. Phys. Org. Chem. 2013, 26, 643−652. (65) Mardirossian, N.; Head-Gordon, M. Thirty Years of Density Functional Theory in Computational Chemistry: an Overview and Extensive Assessment of 200 Density Functionals. Mol. Phys. 2017, 115, 2315−2372.

(66) Kanal, I. Y.; Keith, J. A.; Hutchison, G. R. A Sobering Assessment of Small-molecule Force Field Methods for Low Energy Conformer Predictions. Int. J. Quantum Chem. 2018, 118, e25512. (67) Hidalgo, A.; Giroday, T.; Mora-Diez, N. Thermodynamic Stability of Neutral and Anionic PFOAs. Theor. Chem. Acc. 2015, 134, 124. (68) Kim, V.; Piatkowski, L.; Pszona, M.; Jäger, R.; Ostapko, J.; Sepioł, J.; Meixner, A. J.; Waluk, J. Unusual Effects in Single Molecule Tautomerization: Hemiporphycene. Phys. Chem. Chem. Phys. 2018, 20, 26591−26596. (69) Li, M.; Zhang, H.; Chen, B.; Wu, Y.; Guan, L. Prediction of pKa Values for Neutral and Basic Drugs based on Hybrid Artificial Intelligence Methods. Sci. Rep. 2018, 8, 3991. (70) Zhou, T.; Jhamb, S.; Liang, X.; Sundmacher, K.; Gani, R. Prediction of Acid Dissociation Constants of Organic Compounds Using Group Contribution Methods. Chem. Eng. Sci. 2018, 183, 95− 105.

2689

DOI: 10.1021/acs.jcim.8b00955 J. Chem. Inf. Model. 2019, 59, 2672−2689