Multiconformation, Density Functional Theory ... - ACS Publications

Nov 10, 2016 - version of our computer program Jaguar pKa, a density functional ... An application of Jaguar pKa to proton sponges, the pKa of which a...
0 downloads 0 Views 2MB Size
Article pubs.acs.org/JCTC

Multiconformation, Density Functional Theory-Based pKa Prediction in Application to Large, Flexible Organic Molecules with Diverse Functional Groups Art D. Bochevarov,* Mark A. Watson, and Jeremy R. Greenwood Schrödinger, Inc., 120 West 45th Street, New York, New York 10036, United States

Dean M. Philipp Schrödinger, Inc., 101 SW Main Street, Suite 1300, Portland, Oregon 97204, United States S Supporting Information *

ABSTRACT: We consider the conformational flexibility of molecules and its implications for micro- and macro-pKa. The corresponding formulas are derived and discussed against the background of a comprehensive scientific and algorithmic description of the latest version of our computer program Jaguar pKa, a density functional theory-based pKa predictor, which is now capable of acting on multiple conformations explicitly. Jaguar pKa is essentially a complex computational workflow incorporating research and technologies from the fields of cheminformatics, molecular mechanics, quantum mechanics, and implicit solvation models. The workflow also makes use of automatically applied empirical corrections which account for the systematic errors resulting from the neglect of explicit solvent interactions in the algorithm’s implicit solvent model. Applications of our program to large, flexible organic molecules representing several classes of functional groups are shown, with a particular emphasis in illustrations laid on drug-like molecules. It is demonstrated that a combination of aggressive conformational search and an explicit consideration of multiple conformations nearly eliminates the dependence of results on the initially chosen conformation. In certain cases this leads to unprecedented accuracy, which is sufficient for distinguishing stereoisomers that have slightly different pKa values. An application of Jaguar pKa to proton sponges, the pKa of which are strongly influenced by steric effects, showcases the advantages that pKa predictors based on quantum mechanical calculations have over similar empirical programs.

I. INTRODUCTION Despite a thorough general understanding of the physical processes that determine the protonation equilibria of organic molecules in aqueous solutions, a widely applicable and robust computational prediction of the negative logarithm of the protonation equilibrium constant, pKa, remains a challenge.1−7 The pKa of a chemical compound is an important characteristic in drug discovery8−10 and materials science11−13 applications. Numerous studies have shown that it is possible to accurately compute pKa using approaches based on either quantum mechanics14−18 or cheminformatics.19,20 However, a large number of such studies focus on a narrow set of functional groups or families of chemical structures,14−16,18,20−29 and it is not clear if the accuracy of pKa prediction claimed in these works would persevere when using the same approaches for other types of structures that inevitably come up in practical applications. There is also a concern about using somewhat different computational protocols for predicting pKa of different classes of compounds14,15 or using different experimental proton solvation free energy values in the models that are meant to generate absolute pKa,2,14,23,30,31 without the need for empirical adjustments. These issues raise doubts about the © 2016 American Chemical Society

availability of a truly universal computational method for pKa prediction. Computational methods that predict pKa can be divided into two broad classes: methods that utilize atomic coordinates (three-dimensional structures) and methods that utilize atom connectivity information (two-dimensional structures). The first class of methods, which we will be calling 3D-methods, deals with conformations explicitly. The use of 3D structures, and consequently conformations, can be an advantage and a disadvantage for pKa predictions. An advantage is that such an approach, by virtue of having access to atomic coordinates, can be made responsive to steric, conformational, and other subtle effects such as intramolecular hydrogen bonds or π−π interactions, all of which might have a significant impact on pKa. A disadvantage is that a thorough and accurate accounting for conformations is difficult and computationally expensive to perform, so that a failure to do this part of the workflow properly or a temptation to use a less computationally expensive conformational search (CS) might lead to large Received: August 15, 2016 Published: November 10, 2016 6001

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

The literature generally converges to the opinion that the accuracy targeted by pKa predictions should be below 1.0 pKa and preferably as low as 0.5 units of mean unsigned error.14,16,37 Is this a reasonable aspiration? While it is claimed that accuracy for modern experimental pKa measurements lies within 0.1 pKa units,20,38,39 such assertions are made when referring to measurements conducted repeatedly in the same laboratory. When analyzing experimental pKa data published in the literature, we often observe that experimental measurements conducted on the same compound in different laboratories produce discrepancies as large as 0.5 pKa units, even in recent publications (see, for example, Supporting Information in ref 39). We believe that differences in measurement techniques, sample purity and sample preparation protocols, solvent purity and content, and even temperature may all contribute to the observed variations. Because the underlying parametrization and comparison of the final results is performed with experimental pKa values collected from various experimental groups, it would be unrealistic to expect to consistently achieve an error of less than about 0.5 units with theoretical pKa predictions, regardless of the inherent accuracy of the theoretical methods or protocols used. This paper has the following organization. The section that comes next provides the basics of the DFT-based pKa prediction method. In Sections III−IV we discuss the main recent computational innovations of our program − the shell model and the treatment of conformations, respectively. In Section VI, we show applications and benchmarks and wrap up the discussion with a summary in Section VII.

pKa errors for conformationally sensitive structures. The second class of methods that deals only with atom connectivities and ignores the conformational issues entirely can be paradoxically more stable when handling conformationally sensitive structures. This can happen either because in such methods the 3D effects are factored in the underlying extensive parametrization, or because the error, even if large, generated by these methods for the same type of structures is not affected by the essentially random initial conformation and tends to be more systematic. Even if pKa can be evaluated accurately on sets of conformationally flexible molecules with diverse functional groups, this might be still insufficient for achieving a good agreement with the experimental pKa in more challenging cases. Such challenging systems that require additional treatment from the point of view of the pKa predictor are molecules containing multiple functional groups which can be protonated or deprotonated, molecules that exist in different tautomeric forms (that is, forms that differ by the localization of a single hydrogen atom or proton), and “exotic” molecules - those containing unusual (for pKa prediction) chemical elements such as transition metal atoms and rare functional groups. There is of course a separate and extensive research field of predicting protonation constants of side chains in proteins.32−35 In these systems a very large number of acidic and basic functional groups coexist, and this creates an entirely different challenge in comparison with all the challenges that we have mentioned above. In this work we will consider only molecules that have a small number (typically fewer than 10) of protonatable and deprotonatable functional groups. From the above discussion, it becomes clear that the number and the scope of issues facing a computational pKa predictor aspiring to be widely or universally applicable are considerable. In the course of many years, we have been pursuing the development of a computational pKa predictor meant to be applicable to numerous types of practically important chemical structures. Our approach is based on the density functional theory (DFT)36 and works with three-dimensional representations of molecules. It is a complex workflow incorporating developments from the fields of cheminformatics, molecular mechanics, implicit solvation models, and, obviously, quantum mechanics. A set of empirical parameters used to partly account for close range effects that are not covered adequately by our solvation model and by our thermodynamic cycle adjust the socalled “raw” pKa values to those directly comparable to the experiment. A scheme that we call the “shell model” adjusts the use of these empirical parameters for different functional groups and actually covers the whole chemical space, albeit with progressively decreasing accuracy for rarer and more exotic species. In this work, we review and discuss the latest developments of our program, the most significant of which are the shell model and an explicit accounting for multiple conformational forms. These improvements enable a much greater coverage of the chemical space and a higher accuracy and stability of pKa predictions. The progress is illustrated by a large number of diverse applications, emphasizing drug-like molecules because those tend to be especially challenging owing to their larger atom count and greater conformational flexibility, and because an accurate pKa prediction of such molecules is known to be particularly valuable. We show that in certain cases the accuracy achieved can be so high as to correctly describe fine steric effects and correctly predict small differences in pKa of stereoisomers.

II. METHOD The basic methodology and the computational details underlying our pKa prediction method have not changed since the work in which the method was introduced.16 Here we briefly review the main points of that methodology and concentrate on the improvements that have built upon the original implementation over the years. All the scientific and technological improvements described below have been implemented in the computational program Jaguar pKa40 which uses the program Jaguar40,41 as its DFT engine. In the present section we assume that there is only one conformation in each protonation state. Our method is grounded in the thermodynamic cycle42 shown in Figure 1, which permits expression of the sought-after free energy of acid dissociation ΔGa through other free energies A− which can be either computed (ΔGg, ΔGAH sol , ΔGsol) or taken + from experiment (ΔGHsol):

Figure 1. Thermodynamic cycle used in our Jaguar pKa program. The subscript (g) denotes species in the gas phase, while the subscript (aq) denotes species in the aqueous phase. 6002

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation −

+

AH A H ΔGa = ΔGg − ΔGsol + ΔGsol + ΔGsol

molecules have been used in an attempt to improve the accuracy of pKa prediction for simple molecules,56,57 but it remains to be worked out how to apply such protocols for all types of chemistries in an automated fashion. We would like to emphasize that, in principle, a high accuracy in pKa prediction might be also achievable with a fully nonempirical approach, if one uses some of the largest basis sets, the most accurate levels of theory and solvation models (possibly in combination with adding explicit water molecules), together with a proper entropic treatment; however, such approaches would be impractical for all but the smallest and simplest chemical structures, due to their very high computational cost. However, recent works that use novel or sophisticated treatments to achieve high accuracy are worth paying attention to.33,58−65 The errors produced in computation of bond dissociation energies and solvation energies in our protocol have been shown to be systematic and to depend linearly on the functional group.16 This observation permitted development of a practical scheme that corrects the “raw” pKa obtained directly from eq 2.2 using a pair of linear parameters a, b optimized for each functional group:

(2.1)

The meaning of the free energies from this equation is made clear by Figure 1. The quantities pKa and ΔGa are linked by the fundamental thermodynamic relation:

pK a =

ΔGa 2.303RT

(2.2)

This thermodynamic cycle has been used in numerous other works dedicated to pKa predictions (see, for example, publications2,37,43 and references therein). Since quantum chemical methods are not designed to compute properties of + systems without electrons, the experimental GHsol value of −259.5 kcal/mol is utilized.42 A number of different and allegedly more accurate values for the solvation of the proton have been reported, (see, for example, refs 44 , 45, and references therein), and there might be a reason for opting for one of these values instead. However, choosing a different value + of GHsol would not change the results produced by our workflow because in our workflow the computed ΔGa, which depends + linearly on ΔGHsol , would be effectively adjusted by linear empirical parameters for the generation of the final results (vide inf ra) and thus would merely translate into an adjustment of these empirical parameters. Therefore, we keep to the value of + GHsol employed in ref 16 for consistency with this earlier work.

pK afinal = a pK a + b

(2.3)

This optimization requires a training set consisting of sample molecules with experimental pKa values for each functional group. As a result, the authors of ref 16 achieved a dramatic reduction of pKa errors across a large area of the practically important chemical space. The idea of adjusting preliminary pKa values through a linear regression has been used in other pKa prediction protocols.66−68 The workflow’s quantum chemical calculations are performed as follows. Geometry optimizations of the protonated and deprotonated forms are carried out in either the gas or the solution phase (following user’s setting) with the B3LYP/631G*level of theory. Once the optimal geometries are available, accurate single point calculations are performed in gas and solution phases using the B3LYP functional, with cc-pVTZ+ on the deprotonated atoms which bear formal negative charge, and cc-pVTZ on the rest of the atoms. In evaluating solvation energies, our program uses the PBF implicit solvation model46−48 which has been shown to produce accurate solvation energies of neutral and ionic species. This solvation model solves the Poisson−Boltzmann equation explicitly, using a set of specially designed atomic radii that are vital for generation of high-quality molecular surfaces. The final solvation energies are corrected by a set of empirical parameters based on SMARTS patterns (see Section III for a discussion of the shell model, which serves as a mechanism for selecting appropriate empirical parameters). These parameters compensate for the first-shell interaction with the solvent which is missing from the implicit solvent model and are absolutely necessary for generating accurate solvation energies for ionic species. The single point energies go into constructing the A− values of ΔGg, GAH sol , and Gsol via the following approximate formulas:



A The values ΔGg, ΔGAH sol , and ΔGsol are evaluated in our protocol using DFT methods. The latter two free energies are evaluated with an implicit solvent model which solves the Poisson− Boltzmann equation.46−48 Experience shows48 that pKa values coming directly from eqs 2.1 and 2.2 can be a few pKa units away from the experimentally measured values, even for the simplest, most rigid molecules. Inherent deficiencies of DFT, basis sets, and the implicit solvation model are blamed for these large errors, which unfortunately cannot be corrected for all types of functional groups simultaneously by replacing or modifying these underlying components. For instance, when we replaced the B3LYP functional, standard to our pKa computation protocol, by the M06-2X functional49 (which has been shown to yield more accurate energetics in many cases)49−51 and reoptimized all the dependent parameters, this has not produced any systematic improvement to the rootmean-square deviations (RMSD’s) of the parameter fitting and to final pKa values across a large spectrum of functional groups (data shown in the Supporting Information). A good way of convincing oneself of the impracticality of using eq 2.2 directly for all functional groups is to realize that at 298 K a mere 1.36 kcal/mol of error in ΔGa translates into 1 pKa unit of error on the logarithmic scale.1,2,40 It is not to be expected that one can achieve an error significantly lower than 1.36 kcal/mol in calculation of ΔGa, using DFT methods, across all types of functional groups and structures, no matter what thermodynamic cycle is used.2 This is because errors of bond dissociation energies produced by DFT methods and solvation energy errors are often above one and sometimes over several kcal/ mol.45,52−55 Thus, eq 2.2, used directly, might yield accurate pKa predictions (with an error as low as 0.1 pKa unit) due to a + cancelation of errors or a special choice of the value of ΔGHsol. However, this typically only works for one or a handful of functional groups at a time, with all other parameters of the computational protocol being equal.14,15 Explicit water



ΔGg ≈ EgA − EgAH + AH AH ΔGsol ≈ Esol −

(2.4) (2.5)



A A ΔGsol ≈ Esol

6003

5 RT − T ΔS(H+) 2

(2.6) DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Figure 2. Functional groups that have an explicit parametrization in Jaguar pKa. The hydrogen atoms marked in red indicate the sites of deprotonation, while the atoms marked in blue indicate the sites of protonation. Alk denotes alkyl groups, Ar denotes aryl groups, and Het stands for heterocycles. G is any group of atoms, and R typically begins with a carbon atom. Unmarked H atoms can be replaced by essentially any substituents without affecting the recognition of the functional group.

In formula 2.4, the terms (5/2)RT and TΔS(H+) are the enthalpic and entropic contributions for H+, respectively, under ideal gas conditions. The first contribution is equal to 1.48 kcal/ mol, and the second is 7.76 kcal/mol, assuming room temperature and 1 atm of pressure. These are the standard approximations and values, and they have been used in a number of other works.14,37,69,70 In all three eqs 2.4−2.6 we neglect the entropic contributions for the species AH and A−, assuming that these contributions cancel out or at least can be partly absorbed by the empirical corrections from formula 2.3. This assumption, deemed reasonable for conformationally rigid molecules, is reconsidered in other context in Section IV. We also avoid computing zero point vibrational energies (ZPVE), and therefore enthalpies, in all our calculations because ZPVE are computationally expensive to generate. The overall ZPVE contributions to ΔGa do not cancel out, but they are approximately constant for the same functional group16 and are therefore conveniently included in the empirical parameters a, b. Within a parametrization for a single functional group, the parameter a can be interpreted as compensating for the deficiencies in the variable components of eq 2.3, namely solvation and gas phase energies, whereas the parameter b effectively absorbs any inaccuracies in the postulated constants

as the approximately constant vibrational and entropic terms neglected in eqs 2.4−2.6. Now that all the free energies are computed, the final pKa can be assembled using eqs 2.2 and 2.3 and then output for each set of the matched parameters a, b. There are approximately 100 specific functional groups (the exact number depends on the way one divides some of the groups into subgroups) in the current version of the Jaguar pKa program for which independent parameters a, b have been trained. Figure 2 shows these functional groups. The total number of experimental pKa data points encompassing all these groups is 1134. Note that in some cases functional groups “overlap”; that is, either the group is a subgroup of another group, or the compound can be thought of as belonging to both groups without clear precedence. As an example of the first case, consider a substituted 4-hydroxycoumarin which belongs to both the 4-hydroxycoumarin group and to the Het−OH group (a heterocycle linked to hydroxyl), with the first being more specific and therefore preferable. As an example of the second case consider 8-quinolinethiol which will match both the Ph-SH and Het-SH patterns, with it not being clear, at least without inspection of the respective training sets, which pattern one would prefer. In practice one can use an algorithm to choose the “best” set of parameters for a given compound. In the following section we will describe one such algorithm

+

GHsol = −259.5 kcal/mol and TΔS(H+) = 7.76 kcal/mol, as well 6004

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Figure 3. An illustration of the macro-pKa, micro-pKa, and nano-pKa terminology. Macro-pKa describes and encompasses (possibly) numerous protonation channels. Micro-pKa describes a single protonation channel characterized by (possibly) numerous conformations. Nano-pKa describes the protonation equilibrium between a single protonated conformation and a single deprotonated conformation. When micro-pKa’s and macro-pKa’s take multiple conformations into consideration (and are thus defined through nano-pKa’s), they are referred to as conformationally aware micro-pKa and macro-pKa, respectively.

need to compute the micro-pKa of all of these sites for their subsequent inclusion into the formula for macro-pKa . Computing micro-pKa for just one such group will suffice, upon the conjecture that the micro-pKa’s of the rest of the groups is the same. Jaguar pKa can identify symmetrically equivalent sites automatically by analyzing the molecule’s SMARTS pattern,71 compute the micro-pKa of one of them, and report the macro-pKa, accounting for the micro-pKa’s of the symmetrically equivalent groups that it did not compute explicitly. This saves computational time and frees the user from the burden of indentifying symmetry “by eye” and assembling the macro-pKa “manually”. In the existing implementation of Jaguar pKa, the user must decide which protonation/deprotonation process is being considered in the pKa prediction and then pick the functional groups thought to be involved in this process. This decision making process should be straightforward to automate in a future version of Jaguar pKa. When the functional groups are selected, the program automatically identifies them, matches them against the SMARTS patterns representing the functional groups from Figure 2, and in this way selects the empirical parameters a, b for use at the end of the calculation. Then it generates the corresponding protonated and deprotonated species and launches all necessary calculations. Conformational treatment of the generated species is optional, although greatly recommended, and is discussed in Section IV.

actually used in Jaguar pKa, although alternative algorithms should not be hard to formulate. The discussion in the above two paragraphs makes it clear that an association of the predicted pKa with the protonation/ deprotonation of a specific functional group in the given molecule plays a major role in our approach with such a value generated for a single process thus being a micro-pKa (for the distinction between micro- and macro-pKa see Figure 3). If the molecule does not contain other protonatable/deprotonatable groups, the micro-pKa predicted by the program should correspond directly to the macro-pKa observed experimentally. If several protonatable/deprotonatable functional groups are present, their micro-pKa’s can be computed independently and then combined into a macro-pKa using an elementary algebraic equation. In practical calculations often only one functional group should be considered, even if several protonatable/ deprotonatable groups are present in the molecule. This is because it can be obvious from the chemical intuition which functional group is primarily responsible for the observed macro-pKa. For instance, if the molecule contains both phenolic hydroxyl and aliphatic carboxyl groups, and we are interested in predicting the pKa1 constant, it is clear that we should only compute the micro-pKa of the carboxyl group. The micro-pKa of the phenolic hydroxyl group should be several logarithmic units away from that of the aliphatic carboxyl group, and as such provides a totally negligible contribution to the total effective pKa1. If there are two or more symmetrically equivalent protonation/deprotonation sites in the molecule, there is no 6005

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Figure 4. Shell model that was used for classification of a protonation of a benzodiazepene derivative. Only a part of the model could be displayed in this image. The shells that could not be displayed, for the economy of space, are marked by the “...” symbol. The full model is available in the Supporting Information. The nitrogen atom to be protonated in the benzodiazepene molecule is shown by the gray circle. The shells marked in gray are those that match the protonated benzodiazepene functional group.

III. SHELL MODEL As we have seen in the previous section, an application of empirical parameters via eq 2.3 is necessary for accurate prediction of pKa. The selection of parameters a, b, specific for each functional group, is carried out with the help of the shell model which is elucidated in this section. The shell model is a tree-like graph that organizes SMARTS patterns and possibly different types of data associated with each of these patterns, in a hierarchical order of “shells”, which are essentially the graph’s nodes. The shell model does not have to be a tree, but in application to our pKa prediction scheme it usually is, and we will assume it to be such in the rest of the text. The tree is organized from top to bottom, with the SMARTS patterns ranging from most generic (at the top) to very specific (at the bottom). The presence of the most generic possible pattern guarantees that any given molecule will match at least one pattern. With the hierarchical organization in place, it is possible to automate the classification of a given molecular structure in terms of its association with a specific type of chemistry. Once the association is established, one can automatically apply settings and parameters belonging to the matched pattern. Therefore, one passes a structure to the computer program as an input and obtains, as an output, settings and parameters tailored for the treatment of this particular kind of molecule. A shortened version of the shell model, together with its application to the classification of a benzodiazepine protonation, is shown in Figure 4. The actual shell model used in the current version of Jaguar pKa is more extensive, as it comprises over a hundred shells based on the patterns from Figure 2, and would be difficult to put on a single page. It is, however, available in the Supporting Information. If the model from Figure 4 is used in conjunction with the pKa prediction described in Section II, then each shell shown in the image has a SMARTS pattern and parameters a, b associated with it. The shells marked in gray are those that matched the benzodiazepine functional group. In our implementation, the shells’ SMARTS patterns correspond to protonated functional groups and are matched to the structure that has already been protonated. We chose the protonated patterns as the greater

number of atoms leaves less room for ambiguity when matching. Each set of parameters a, b corresponding to the matched shell leads to its own pKa prediction using eq 2.3. Of course, as the matching becomes more and more specific, as when going from the top generic atom shell to the benzodiazepenes shell in Figure 4, the parameters a, b are deemed to be progressively more relevant and therefore more accurate. So it is probably the pKa prediction corresponding to the most specific pattern that the user is going to prefer in most cases. Now it is appropriate to discuss the parametrization of the shells. All shells can be classified into elementary and derivative. An elementary shell is the shell that is an end vertex of the treelike structure; it is directly connected only to shells that have a less specific SMARTS pattern. All shells that are not elementary are derivative. Thus, in Figure 4 “sulfonic acid”, “benzodiazepine”, and “conjugated carboxylic acid” are examples of elementary shells, and “sp2-like O”, “imidazole as acid”, and “imine” are examples of derivative shells. Only elementary shells need to be explicitly parametrized through specific training set data points in our implementation. Derivative shells obtain their parameters a, b through a simple automatic model that extends their fitting to all of the training set points for the underlying elementary shells. Again, using Figure 4 as an example, parameters of the “imine” shell are constructed from the training set data for the elementary shells “benzodiazepine”, “hydrazone”, and the other elementary shells belonging to “imine” but not explicitly shown in the figure. The parameters of the higher-lying “sp2-like aliphatic N” shell are constructed from all the training set data of its underlying elementary shells, namely “barbituric acid”, “hydrazone”, “benzodiazepine”, and other elementary shells grouped under “sp2-like aliphatic N”, and not explicitly shown in the figure. The parametrization of an elementary shell is carried out as follows. One chooses several chemical structures (typically more than four and fewer than 20) that satisfy the following criteria: (i) they contain a single functional group matching the elementary shell’s SMARTS pattern 6006

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation (ii) their experimental pKa’s are known with a high degree of trustworthiness, measured in water, at or around 298 K, and preferably in the absence of salts and other solvents (iii) the structures contain either a single functional group or several functional groups but in the latter case under conditions where the other groups are certainly not interfering with the protonation/deprotonation process of the principal functional group (iv) the structures are preferably simple and nonflexible (v) the structures undergo no tautomeric equilibria in their solutions If several similar experimental pKa values exist for the given structure they are averaged, and the result is associated with the structure. The idea behind selection rules (i)−(v) is that we would like to attribute the measured pKa of the compound to the protonation/deprotonation processes around one functional group, free of other complicating factors such as effects from other functional groups or the presence of alternative conformational and tautomeric forms. Once the compounds for the “training” set are prepared in accordance with rules (i)−(v), their “raw” pKa’s are computed following the thermodynamic cycle from Figure 1. The parameters a, b are the coefficients of the linear fit of the raw pKa values to the corresponding experimental values. Typically, there is good to very good correlation (the r2 coefficient is 0.8 and higher), at which point the parameters a, b are associated with the shell and the parametrization is declared complete. However, there are sometimes cases when the correlation is poor. This is a sign that either there is something wrong in the interpretation of the experimental results or in the experimental results themselves or that the physics of the protonation/deprotonation process is more complex than we have heretofore assumed. In such cases we carefully re-examine the experimental pKa’s. Then, if we still have no reason to doubt the correctness of the experimental measurements, we split the shell into two or more other subshells, possibly adding more data points, and repeat the parametrization procedure for each of the subshells, until the satisfactory linear correlation is obtained for each of the subshells. The shell which is a “parent” of the subshells thus becomes a derivative shell, and its parameters a, b are obtained simply from a linear fit of all the individual data points comprising its elementary shells. The correlations thus obtained for derivative shells are necessarily worse than correlations obtained for elementary shells. This is why pKa predictions produced by the parameters of the matched elementary shells are preferable to those generated by the parameters of the matched derivative shells. Derivative shells are, however, indispensable for coverage of all chemical space because there may be cases when a given compound is not matched by any of the elementary shells present in the current shell model. However, a match might be achieved by one of the derivative shells and certainly by the most generic shell, which matches all compounds. As in the situation above, the pKa value coming from the most specific derivative shell must be preferred. The parametrization procedure is somewhat laborious and not necessarily objective. The person who performs the parametrization must decide whether the training set should include a particular data point or not and also how the shell will be split into subshells. In the future, we would like to work toward an implementation of a data-scouting computer algorithm that would ease the burden of the parametrization on the developer and perform the procedure in a more objective way.

IV. TREATMENT OF CONFORMATIONS As the original paper on Jaguar pKa indicated, 16 the conformational flexibility of molecules is a major issue for quantum chemical pKa prediction. This is because the micropKa’s of individual conformations might be dispersed across a broad range of values, and preferring any particular conformation over the other conformations without taking the corresponding energies into account is unjustifiable. The results shown in the following section clearly demonstrate that the accuracy of pKa prediction in the absence of any special treatment of the conformational problem depends strongly on the conformational flexibility of the molecule and on its input conformation. In this section we discuss how this problem can be addressed and describe a solution recently implemented in Jaguar pKa. The simplest way to deal with multiple conformations is to assume that only one conformation, identical for the protonated and the deprotonated forms, is dominant and is therefore responsible for the observed pKa. The next level of approximation is to recognize that the protonated and the deprotonated forms of the same species might have different dominant conformations, as is expected for structures that are capable of forming intramolecular hydrogen bonds. Ultimately, one can accept that more than one protonated and more than one deprotonated conformation might exist in appreciable quantities in solution, and all these species can contribute to the observed pKa. Whichever of these approximations is adopted for practical realization, there should be an independent mechanism to select one or more significant conformations prior to subjecting them to quantum chemical calculations. Since there are numerous conformational sampling methods with settings potentially resulting in different sets of conformations, one can in principle have a large number of ways to implement conformationally aware quantum mechanical pKa predictions. Early versions of Jaguar pKa approached the conformational problem from the conceptually and computationally more accessible side: by assuming that only one conformation per protonation state contributed to pKa. The most energetically favorable protonated and deprotonated states were selected by the associated molecular mechanics program MacroModel.72 This program employs the OPLS2005 force field73−75 with settings that ensure a reasonable compromise between the breadth of the CS and its computational cost. A new and more accurate version of the force field, OPLS3,76 has not been yet used in Jaguar pKa systematically. An investigation into the possible improvements for pKa prediction results due to a transition from OPLS2005 to OPLS3 is reserved for a future work. Our numerous applications of this methodology to large and flexible molecules convinced us that in some cases such a basic treatment of conformations was inadequate for achieving an acceptable error of less than 1 pKa unit. Another problem that we observed was that the final pKa prediction depended, in some cases significantly, on the initial choice of the conformation that was passed to MacroModel. It was possible to target a higher accuracy by generating conformations with a custom protocol, prior to performing the pKa prediction itself. One could then even statistically average the pKa’s computed for individual conformations. However, such practices were not standardized, and their effects on the final results have not been systematically investigated. Clearly, an improvement to the 6007

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

both, in the summations can lead to large errors in prediction of Ka. Eq 4.3 can be a working formula for empirical approaches that do not have to deal with conformations.20,77 It can also be used in quantum mechanical pKa calculations if we assume that the protonation states {P} and {D} are each represented by a single conformation.23 It might be tempting to use eq 4.3 for conformations in the same way one uses it for protonation states, by interpreting {P} as a series of protonated conformations and {D} as a series of deprotonated conformations. However, it is easy to see that the formula essentially breaks down and becomes dysfunctional when applied to conformations. Eq 4.2 can still be applied to conformations, provided we assume that we can come up with a reasonable definition of a “concentration” of a specific conformation when there is a continuum of conformations. However, formula 4.3 ceases to be valid for conformations as soon as we replace Kij in it by the corresponding micro-pKa’s computed on the sets {P} and {D} with the thermodynamic cycle from Figure 1 (or, in fact, with any other similar thermodynamic cycle or empirical formula used in practice). To convince oneself of the inapplicability of the naive interpretation of formula 4.3 for conformations, one can consider a simple case of the dissociation of phenol. The deprotonated form is represented by only one conformation (M = 1), whereas the protonated form is represented in principle by an infinite number of conformations characterized by the differerent dihedral angles CCOH. There is a continuum of these protonated conformations, all having similar energy. Note that nothing compels us to consider only conformations occupying a minimum on the potential energy surface. Since all the conformations have similar energy, their Ki1 constants will be similar, leading to a nonsensical result in which Ka tends to zero (and the respective pKa to infinity) with the increasing number of conformations considered. IV.B. Conformationally Aware Macro-pKa. An appropriate way to treat conformations with our thermodynamic cycle would be to properly take into consideration the conformational flexibility of the molecules in each protonation channel by including the corresponding entropic contributions

handling of conformations was needed, and a more thorough study of the influence of the CS settings on the accuracy of pKa prediction was desirable. IV.A. Traditional Macro-pKa. In order to derive and explain the formulas that we use to compute conformationally aware macro-pKa values, it is best to start with the well-known formula for computing macro-pKa from micro-pKa’s, assuming that each proton dissociation channel is represented by a single protonated and a single deprotonated form. Then we generalize this formula to multiple conformations, which enables us to compute a macro-pKa involving multiple protonation channels with any number of conformations. Suppose a molecule in its aqueous solution exists as a set of “protonated” tautomers {P} composed of N species P1, P2, ..., PN, each bearing total charge C, and a set of “deprotonated” tautomers {D} composed of M species D1, D2, ..., DM, each bearing total charge C − 1. For the simplicity of discussion we will assume, without losing generality, that the ordering of these two sets of protonation states goes from the most populated to the least populated forms. Other protonation states with charges C + 1, C − 2, etc., might exist, but we are not concerned about them here because our goal is to compute the macro-pKa for the transition from C to C − 1. Once this formula is found it can be equally applied to other charge transitions. By definition, for the dissociation equilibrium P ⇌ D + H+

(4.1)

we have Ka =

([D1] + [D2 ] + ... + [DM ])[H+] [D][H+] = [P] [P1] + [P2] + ... + [PN ] (4.2)

where [P] and [D] are the total equilibrium concentrations of the forms for P and D, respectively. Expanding this expression into M terms with the numerators [D1][H+], [D2][H+], and so on, and then dividing both the numerator and the denominator in each of these terms by the numerator, we cast eq 4.2 into an actionable form M

Ka =

∑ j=1

1 N



A in the terms ΔGg, ΔGAH sol , and ΔGsol in eq 2.1. This path would require fundamental and nontrivial changes in the construction of our workflow, and therefore we decided to pursue an alternative and conceptually much simpler approach that would still allow us to treat the problem of the conformations. Let us assume that we are dealing with a single protonation channel, that is, there is only one protonated tautomer and one deprotonated tautomer, each characterized by a number of conformations. Once we discuss the situation with a single protonation channel we will extend the conclusions to multiple protonation channels. For the clarity of the following discussion it appears apposite to introduce the notion of the nano equilibrium constant (and the corresponding nano-pKa’s), which is the equilibrium constant characterizing a protonation equilibrium between a single protonated conformation and a single deprotonated conformation. Thus, nanoconstants characterize protonation equilibria between individual conformations in a single protonation channel. Microconstants describe a single protonation channel (possibly) involving multiple conformations. Finally, macroconstants describe (possibly) multiple proto-

1

∑i = 1 K

ij

(4.3)

where

K ij =

[Dj ][H+] [P] i

(4.4)

is the microdissociation constant for the deprotonation channel Pi ⇌ Dj + H+

(4.5)

Microequilibrium constants 4.4 can be readily computed by Jaguar pKa. Importantly, eq 4.3 “converges” with the number of dissociation channels. It is easy to see from eq 4.3 that negligible dissociation channels, characterized either by very low concentrations of the protonated form or the deprotonated form, will contribute negligibly to Ka. So that for practical purposes we can effectively truncate the summations in eq 4.3 by introducing the “effective” number of deprotonated and protonated forms, Me and Ne, respectively. Obviously, failure to include highly populated forms from the sets {P} and {D}, or 6008

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation N conf

nation channels and should be directly comparable to equilibrium constants derived from the experiment. The pKa from formula 2.3, computed using the thermodynamic cycle that involves a single deprotonated conformation and a single protonated conformation and further adjusted by the empirical parameters a, b, could serve as a definition of nano-pKa. However, as we have seen, the nano-pKa defined through formula 2.3 would have a serious drawback of not accounting for the entropy of the conformations, especially in treatments that consider conformations explicitly. The missing entropic prefactor eΔSconf/R can be actually included through the following ansatz: K ijnano =

λ Dj λ Pi

Here, Kfinal is the equilibrium constant extracted from pKfinal ij a , which is defined in eq 2.3. Next, λDj is the probability of finding the deprotonated conformation Dj in the conformational ensemble {D} comprising M deprotonated conformations. Similarly, λPi is the probability of finding the protonated conformation Pi in the ensemble {P} comprising N protonated conformations. The λ factors account for the different populations of the deprotonated and the protonated conformations associated with the nanoconstant. We will use eq 4.6 as a definition of the nanoequilibrium constant. The corresponding nano-pKa is obviously computed by taking the negative logarithm of Knano ij . One way to define the probabilities λDj and λPi is through the conformational energies:

M conf

K amicro

M

l

(4.8)

Here, k is the Boltzmann constant, T is the absolute temperature, and EDj and EPi are the energies of the conformations Dj and Pi, respectively, computed in either the gas or the solution phase, depending upon the particular computational protocol. An alternative way starts with concentrations: λ Dj =

λ Pi =

[Dj ] [D1] + [D2 ] + ... + [DM ] [P] i [P1] + [P2] + ... + [PN ]

(4.9)

(4.10)

The numerators and the denominators of formulas 4.9 and 4.10 can be multiplied by [H+]/[P1] and [D1]/[H+], respectively, and, after some algebraic rearrangement and equating pKa with pKfinal a , the expressions for λDj and λPi become directly connected with the empirical corrections from formula 2.3: M conf

final

λ Dj = 1/ ∑ 10 pK a1j l

=



1

N conf 1 j = 1 ∑i = 1 K ijnano

(4.13)

Here, Mconf and Nconf, as above, refer to the number of deprotonated and protonated conformations, and the summations go over conformations. Notice that owing to the λ factors the generalized formula 4.13 behaves correctly regardless of the number of conformations one takes. In the case of a single protonated and a single deprotonated conformation in each protonation channel, 4.13 is equivalent to the earlier formula 4.3. When there are several conformations considered for either protonated and/or deprotonated forms, the nanoequilibrium are properly weighed by the entropic λ factors. constants Knano ij This prevents the kind of collapse that formula 4.3 suffered from, when used earlier for conformations in lieu of tautomers. As a simple example of the correct behavior of the conformationally aware formula 4.13, consider Q (possibly symmetrically equivalent) conformations with exactly the same energy making up the deprotonated form (Mconf = Q), and a single protonated conformation (Nconf = 1). In such a case the earlier formula 4.3 for Ka, used for conformations, reduces to K11 + K12 + ... + K1Q, where K1j are all the same because of the isoenergeticity of the deprotonated conformations. This result is nonsensical because Ka depends on the number of equivalent conformations Q. The generalized and conformationally aware formula 4.13, however, does not have such problems, as Kmicro a reduces to Knano + Knano + ... + Knano 11 12 1Q . The λ-factor of each isonenergetic deprotonated conformation is λDj = 1/Q, irrespective of whether one uses formula 4.7 or formula 4.11, final = (1/Q)Kfinal are leading to Knano 1j 1j ; and, as all constants K1j nano nano nano equivalent, the sum K11 + K12 + ... + K1Q turns into Kfinal 11 ,

(4.7)

N

λ Pi = e−E Pi / kT /∑ e−E Pl / kT

(4.12)

The summation limits Mconf and Nconf are the number of deprotonated and protonated conformations, respectively. Formulas 4.11 and 4.12 are particularly convenient in practice as λDj and λPi thus defined are computed directly from pKa values output by the Jaguar pKa program, with the empirical corrections included. We tested both definitions of the λ factors in some of our calculations and concluded that the final pKa’s produced by both approaches are very similar. We chose eqs 4.11 and 4.12 as our working formulas, and therefore they were used in all the multiconformational pKa calculations reported in this work. Now that formulas 4.6, 4.11, and 4.12 can be used for computing the nanoequilibrium constant for a single protonation channel, we can move on to define the conformationally aware microequilibrium constant for a single deprotonation channel, which should be computed from nanoequilibrium constants. For this purpose one can adopt the methodology expressed in relations 4.1−4.3, this time viewing P1, P2, ..., PN not as protonated tautomers but as protonated conformations, and, analogously, viewing D1, D2, ..., DM not as deprotonated tautomers but as deprotonated conformations. This permits us to define the conformationally aware microdissociation constant:

(4.6)

l

− pK a ifinal 1

l

K ijfinal

λ Dj = e−E Dj / kT /∑ e−E Dl / kT

final

λ Pi = 1/ ∑ 10 pK al1

− pK a1final l

(4.11) 6009

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation Table I. pKa Values Predicted by Jaguar pKa Using Different Computational Protocolsa

index

R1

1b 66 67 69 109 89 68 88 111 110 14d 70 1a 14a 112 113 MUE outliers

CN CN CN CN CN CN CN CN CN CN CN CN Cl Cl Cl Cl

R2−R5 R3 = F R2 = F R4 R5 R3 R5 R5 R4 R2 R2

= = = = = = = =

CF3 CF3 F CF3 CF3, R3 = F CF3, R3 = F F, R3 = F F, R3 = F

R2 = F, R3 = F R2 = F, R3 = F

R6

X

exp. pKa

no CS

CS

1 conf

10 conf

CH3 CH3 CH3 CHF2 CH3 CH3 CH2F CH2F CH3 CH3 CH3 CH2F CH3 CH3 CH3 CH3

O O O O O O O O O O O O O O S S

9.7 8.1 7.4 7.3 7.3 7.0 6.7 6.3 5.9 5.8 5.8 5.1 9.8 6.3 9.0 6.1

9.1 6.6 6.0 7.3 6.6 6.8 4.4 4.7 5.5 3.8 5.1 3.2 9.2 5.1 7.8 5.3 1.07 8

9.0 7.7 6.0 7.3 5.4 6.8 5.0 6.1 5.8 5.2 5.8 4.3 9.4 6.2 9.5 6.1 0.56 3

9.4 8.1 7.6 7.4 6.3 7.4 7.2 6.2 5.9 5.9 6.5 5.5 9.5 6.6 8.9 6.2 0.29 1

9.7 8.2 7.4 7.7 7.0 7.3 7.3 6.3 6.0 6.0 6.3 4.7 9.8 6.4 9.0 6.3 0.20 0

“no CS” stands for no conformational search performed, with the calculation being conducted on the input conformation. In “CS”, regular conformational search was performed on the input conformation. In “N conf”, an aggressive CS was performed, N (e.g., N = 1, 10) conformations were selected, and a geometry optimization in solution phase was conducted. The structure indices follow the numbering from ref 78. The predicted pKa is considered an outlier if its difference in value is equal to or larger than 1.0 pKa units, in comparison with the experimental measurement.

a

rendering Kmicro independent of the number of considered a isoenergetic conformations Q. In practical calculations one needs to make a decision about which sets of conformations {P} and {D} to use. The Jaguar pKa program lets the user choose the settings for the MacroModel component, controlling the maximum number of conformations and the energy window from which these conformations will be selected. For the selection of more relevant conformations we also relaxed a number of constraints originally used in the MacroModel conformational search for reasons of efficiency. This increased the overall computational time slightly but resulted in higher accuracy of the final pKa predictions, as the results in the following section indicate. Once the conformations are selected by MacroModel they are passed to Jaguar pKa for further refinement in DFT geometry optimizations. At this point the user has a choice of whether to optimize the initial structures coming from MacroModel in the gas phase, which is computationally less expensive, or in the solution phase. The following section compares the quality of results obtained with both of these types of optimization. Finally, it is time to consider multiple protonation channels and the corresponding formulas for the conformationally aware macro-pKa. This task is trivial, because the formula for the macroequilibrium constant has already been presented in eq 4.3, which now should use conformationally aware microconstants Kijmicro for each protonation channel involving dissociation of the protonated tautomer P i into the deprotonated tautomer Dj:

M taut

K amacro

=



1

N taut 1 j = 1 ∑i = 1 K ijmicro

(4.14)

taut

Here, M is the number of deprotonated tautomers, and Ntaut is the number of protonated tautomers and the summations go over tautomers. Formula 4.14 in conjunction with the conformationally aware microdissociation constant 4.13 is a generalization of the well-known formula 4.3. To summarize, this section defines formulas for computing the conformationally aware macro-pKa from conformationally aware micro-pKa’s (formula 4.14). The conformationally aware micro-pKa is defined through nano-pKa’s (formula 4.13), and the nano-pKa’s, in their turn, are defined through relation 4.6 and are computed from a thermodynamic cycle that involves a pair of individual conformations at a time, with further corrections compensating for the neglect of explicit solvation effects and entropic contributions to free energies.

V. PARALLELIZATION Jaguar pKa calculations, especially those involving larger molecules and several conformations, can be computationally expensive. Multiple individual DFT calculations, including geometry optimizations, are performed in every pKa prediction. Geometry optimizations in the solution phase are particularly time-consuming. Therefore, it is important that one should be able to perform a Jaguar pKa calculation on multiple central processing units (CPU). Many individual DFT calculations in our computational protocols are completely independent (especially when acting on multiple conformations), which helps achieve good distributed parallelization. Conformational 6010

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Figure 5. Differences in pKa values obtained using two distinct conformations A and B as input for cyclic amidines from ref 78. In plot (a) the x-axis refers to the structures listed in Table I, and the y-axis shows the pKa differences for different protocols. The ideal result for every structure would be zero difference in pKa prediction, and the corresponding point on the plot would lie on the horizontal orange line. All protocols involving conformational search included solution phase optimizations. The “no conformational search” protocol (red line) used optimizations in the gas phase. The conformations A and B for all the structures are schematically shown in diagram (b) and have RMSD around 0.90 Å and larger.

preliminary selection of conformations by MacroModel, and the type of the subsequent geometry optimizations performed on these conformations with the DFT method. Most calculations consider a single protonation channel. Exceptions are symmetrically equivalent protonation channels, which are accounted for automatically, and cases where we combined microequilibrium constants for two protonation channels into a macroconstant using eq 4.14. VI.A. Cyclic Amidines. One class of drug-like molecules the pKa’s of which have received close attention in recent publications is cyclic amidines.78−81 These are typically βsecretase (BACE) inhibitors, and their pKa’s lie in a broad range of biologically important values. In Table I we present detailed Jaguar pKa results on a series of 16 cyclic amidines from ref 78. It is evident from this table that the results systematically and dramatically improve upon transition from the simplest protocol, in which no CS was performed, to the sophisticated protocol which involved an aggressive CS, up to 10 conformations per protonation form (Mconf, Nconf = 10), and geometry optimization in implicit solvent with the PBF method. Both the mean unsigned errors (MUE’s) and the number of outliers go down. For stereoisomers 66 and 67 marked by a substantial difference in their experimental pKa values equal to 0.7 units, Jaguar pKa was able to correctly reproduce the individual pKa, as well as the experimental difference. Another pair of stereoisomers, 89 and 109, has a small difference in experimental pKa’s equal to 0.3 units, which approaches the limits of the theoretical pKa prediction. We therefore do not consider it a problem that Jaguar pKa was not able to distinguish this pair of stereoisomers correctly (predicting an inverse difference of 0.3 units with the most accurate protocol). Overall, the extremely accurate result provided for this test set of amidines by the most sophisticated model likely borders on the level of the noise and capabilities of DFT to describe the dissociation equilibria. Besides the diminishing errors, another criterion that could show improvement in the protocols is a lower sensitivity of the results to the input conformation. Ideally, pKa prediction should not depend on the initial conformation specified by the user, and CS should find the lowest energy conformations starting from any input conformation. In practice, however, the

search, which precedes the DFT calculations, can also be performed in parallel using the standard MacroModel options. Recently, we have been able to complete distributed Jaguar pKa calculations maximizing the number of individual DFT calculations, with up to 100 CPUs utilized in one pKa prediction. As an example of timings relevant to practical calculations consider Jaguar pKa calculations performed on the drug atenolol having molecular formula C14H22N2O3. A protocol involving 10 conformations, gas phase geometry optimizations, and an aggressive conformational search applied to this molecule took approximately 1 day, 5 h, and 2 min to complete on 1 CPU from the AMD Opteron Processor 6274 chip. A calculation with exactly the same settings, except carried out on 10 CPUs, took approximately 3 h 45 min to complete, showing a 7.7 times speedup. Finally, an analogous calculation but with 20 CPUs took approximately 2 h 17 min to complete, resulting in a 12.6 times speedup. Our Jaguar DFT benchmarks on geometry optimizations and vibrational frequency calculations performed with the above-mentioned AMD processors and with Intel Xeon E5-160 processors show that the latter is up to a factor of 4.0 faster than the former, although the scaling with the number of CPUs remains approximately the same. We have not measured the performance of the Xeon processors in application to our pKa prediction protocols, but we expect the improvements of the timings by at least a factor of 2 to 3, for the equivalent number of CPUs, in comparison with the abovementioned timings on the AMD chips. All timings given in this section refer to wall clock time.

VI. ILLUSTRATIVE RESULTS A truly universal pKa predictor should achieve a good accuracy on diverse sets of functional groups and molecule types. While we acknowledge that for some special types of structures the results currently produced by Jaguar pKa are less accurate than desirable (see the discussion in Section VII), we believe that the accuracy achieved across the chemical space representing the larger and conformationally flexible drug-like molecules is very good. This section shows Jaguar pKa predictions on several different functional groups in drug-like molecules as a function of the number of conformations taken, the thoroughness of the 6011

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Table II. Standard Deviations Computed for the Set of 15 pKa Values Obtained in 15 Separate Jaguar pKa Calculations That Used Different and Distinct Input Conformationsa

gas phase optimization

solution phase optimization

structure

no CS

reg CS

1 conf

5 conf

10 conf

no CS

1 conf

5 conf

10 conf

(2S,3R)-7a (2R,3R)-7a 3R-6a outliers

2.092 0.692 1.770 9

2.011 0.853 0.343 8

0.423 0.052 0.024 0

0.091 0.048 0.032 0

0.099 0.041 0.034 0

0.972 0.609 0.519 8

0.171 0.563 0.159 1

0.049 0.054 0.051 0

0.056 0.067 0.047 0

a The RMSD range was 0.92 to 4.37 Å for all relevant pairs. The ideal result would be zero deviation. Results for three molecules from ref 80 using different Jaguar pKa protocols are presented. The total number of outliers with each computational protocol is determined for the set of 45 pKa predictions. An outlier is defined as a pKa prediction deviating from the experimental value by 1.0 pKa units or more.

The greatest variability in the pKa predictions is observed with the least sophisticated Jaguar pKa protocol, that is, the protocol avoiding the CS, with the subsequent geometry optimization in the gas phase. In this protocol, standard deviations as large as 2.092 were recorded for the (2S,3R)-7a molecule, which is the result of the scatter from pKa 0.8 to 8.1, depending on the initial conformation, while the experimental pKa is 7.6. The next least sophisticated protocol, the one that avoids the CS but performs geometry optimization with the implicit solvation model, shows a marked reduction in the standard deviations for all three molecules. A further reduction of the deviation is seen for the protocols involving the aggressive CS. These results are in qualitative agreement with the results shown in Figure 5 and in Table I for related cyclic amidines. Even though 1 conformation was sufficient to achieve what appears to be convergent results in Figure 5 and in Table I, the same protocol with 1 conformation is not satisfactory in the more detailed investigation reported in Table II. Take for instance the molecule (2R,3R)-7a, for which the largest standard deviation was recorded in a 1-conformation protocol (0.563). For this molecule, one outlier with pKa = 10.0 was detected (the experimental value is 7.8), even though the remaining 14 predictions in the set lie close to one another and to the experimental value. Another example is the set of 15 pKa predictions for the molecule (2S,3R)-7a, that used a 1-conformation protocol involving a gas-phase optimization. Its significantly large standard deviation of 0.423 follows from the scatter of pKa values between 6.9 and 8.3 (while the experimental value is 7.6) that is common in this set. These large variations disappear when one uses 5 conformations per protonation species, regardless of the subsequent geometry optimization protocol, and give standard deviations that are modest and acceptable. For the largest 5-conformation protocol standard deviation of 0.091 [(2S,3R)-7a, gas phase optimization], the individual pKa values ranged from 7.2 to 7.6, depending on the initial conformation. The stability achieved for 5 conformations

result might be sensitive to the CS parameters used and in some cases fail to arrive at a virtually identical set of conformations. Figure 5 shows that indeed, as we go from the least sophisticated protocol (no CS) to the most sophisticated protocol (aggressive CS and 10 conformations per protonation form), the pKa prediction results become less dependent on the initial structure. The aggressive CS (part of the protocols for the lower two plots in Figure 5) leads to dramatic improvement in the quality of results. From this figure it appears as though the number of conformations taken does not have a pronounced effect on the sensitivity to the initial structure, provided that aggressive CS is performed. This is true for this current test set and is probably attributed to the fact that in this case both the protocols that involve the aggressive CS are already extremely accurate and approach the level of the numerical noise that would be hard (and unnecessary) to transcend. However, the results obtained on other test sets (vide infra) clearly indicate that the greater number of conformations in combination with aggressive CS leads to an increased stability of the results vis-à-vis the choice of the input conformation. A more detailed investigation of the importance of the conformational treatment was undertaken for three cyclic amidines from the work of Rombouts et al.80 Three structures, representative of the BACE inhibitors discussed in that work, were essentially randomly picked for assessment of our pKa prediction protocols. For each of the three structures, Table II shows standard deviations computed for a set of 15 different pKa predictions that used different and distinct conformations as input (RMSDs computed for all pairs of the relevant conformations ranged from 0.92 to 4.37 Å). In an ideal protocol, pKa predictions would not depend on the initial conformation. Then all 15 calculations here would produce exactly the same pKa, and the corresponding standard deviation would be zero. In reality, not all the pKa values are the same. 6012

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Table III. pKa Values Predicted with Various Jaguar pKa Computational Protocols on a Test Set of Tertiary Amines Extracted from Ref 82a

structure index 76 78 46 51 79 80 81 82 83 84 85 86 87 88 89 90 91 94 95 96 97 98 99 100 101 103 104 MUE outliers

gas phase optimization

solution phase optimization

A

B

n

R

no CS

reg CS

1 conf

5 conf

10 conf

no CS

1 conf

5 conf

10 conf

exp. pKa

A9 A8 A9 A9 A8 A8 A8 A5 A5 A5 A6 A6 A11 A11 A10 A10 A12 A13 A2 A2 A7 A14 A1 A1 A3 A4 A4

B2 B7 B7 B6 B5 B4 B3 B2 B5 B5 B2 B1 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2

0 0 1 1 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0

OCH3 OCH3 OCH3 OCH3 OCH3 OCH3 H OCH3 OCH3 H OCH3 OCH3 H OCH3 H OCH3 OCH3 OCH3 OCH3 H OCH3 OCH3 OCH3 H OCH3 OCH3 H

9.43 7.94 8.23 9.02 7.81 8.49 8.68 6.60 7.34 7.80 7.96 8.50 8.33 8.76 8.72 8.61 8.24 8.20 8.15 6.54 8.56 7.78 8.51 8.36 7.98 7.83 7.67 0.69 6

8.13 8.23 8.68 9.30 9.04 8.34 8.62 7.96 7.03 7.22 7.91 8.04 8.12 9.05 8.33 9.17 9.41 8.42 7.69 8.22 8.21 7.07 8.32 8.57 7.88 8.00 7.80 0.56 5

8.98 7.99 8.69 9.17 8.44 8.44 8.64 7.99 7.49 7.59 7.75 7.81 8.60 8.82 8.85 8.38 9.45 8.53 3.87 6.66 7.56 6.98 9.10 8.64 9.58 8.13 7.86 0.69 4

9.00 7.85 8.58 9.01 8.23 8.49 8.72 8.29 7.50 7.86 8.39 7.88 8.48 8.29 8.65 8.32 9.30 7.95 7.00 7.36 7.75 6.94 8.49 8.39 8.55 7.93 7.89 0.56 3

9.11 7.90 8.69 9.13 8.31 9.43 8.66 8.18 7.79 7.78 8.25 7.89 8.61 8.83 8.62 8.32 9.27 8.00 7.11 7.71 6.47 6.93 8.53 8.38 8.35 7.96 7.90 0.51 2

9.38 6.19 8.72 9.65 8.10 9.17 5.20 8.34 6.85 6.95 7.60 8.05 9.42 8.49 8.77 8.17 8.11 7.50 8.44 7.68 8.21 6.92 7.68 6.69 7.74 7.29 6.98 0.93 8

9.08 7.40 8.69 9.13 8.44 8.44 8.64 7.99 7.49 7.11 7.37 8.63 8.60 8.82 8.85 7.87 9.45 8.53 3.87 7.02 7.56 5.18 9.10 8.27 9.58 8.56 7.86 0.76 7

8.73 7.48 8.15 9.43 7.81 8.51 8.49 7.62 7.06 7.42 7.58 8.21 8.21 7.67 7.95 8.68 8.60 7.93 7.83 7.70 7.58 6.17 8.05 7.84 7.76 7.79 7.75 0.73 6

8.70 7.50 8.47 9.33 7.89 8.49 8.64 7.68 7.41 7.48 7.58 8.20 7.67 7.95 8.68 8.05 9.60 7.93 7.83 7.70 7.58 6.17 8.05 7.84 7.36 7.75 7.73 0.70 6

9.9 8.5 9.2 9.1 8.6 9.3 9.3 8.5 8.4 8.4 6.5 7.5 8.9 8.9 9.0 9.0 9.0 8.5 8.0 8.5 8.1 6.0 8.6 8.6 8.4 8.2 8.0

a Mean unsigned errors (MUE) and the number of outliers are also reported for each protocol. An outlier is defined as a pKa prediction deviating from the experimental value by 1.0 pKa units or more.

test set of large and flexible tertiary amines from ref 82. Two micro-pKa calculations were conducted for the protonation of each of the nitrogen atoms of the piperidine ring (compounds with A5 substituent) and then manually combined using formula 4.14 to produce the final macro-pKa’s. The present test set of tertiary amines possesses MUE’s that are somewhat larger than those seen for cyclic amidines. As in the previous tests, the quality of the results (as inferred from the value of the MUE and the number of outliers) improves when performing a more aggressive conformational search and when utilizing a greater number of conformations. However, geometry optimizations in the solution phase produced larger

cannot be said to noticeably improve upon further extension to 10 conformations. This is especially evident in the set characterized by the lowest standard deviation of 0.032 (3R6a, 5 conformations, gas phase optimization), where a remarkable stability was achieved with the variation of the predicted pKa values for 15 individual calculations using different input conformations lying between 9.5 and 9.6 (with the experimental pKa being 9.6). VI.B. Tertiary Amines. The tertiary amino group is a ubiquitous structural motif in drug-like compounds, partly owing to the variability of its micro-pKa near the blood brain barrier. In Table III we present Jaguar pKa predictions using a 6013

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation Table IV. pKa Values Predicted with Various Jaguar pKa Computational Protocols on a Test Set of Proton Spongese

gas phase optimization

solution phase optimization

index

no CS

reg CS

1 conf

5 conf

10 conf

no CS

1 conf

5 conf

10 conf

Epik

Marvin

exp. pKa

S1 S2 S3 S4 S5 S6 MUE outliers

2.01 11.19 7.41 8.95 8.44 10.03 1.44 3

4.11 11.28 4.42 9.83 8.43 10.05 0.50 0

4.28 11.31 4.18 9.80 8.46 10.08 0.52 0

4.28 11.28 4.06 9.83 8.54 10.07 0.55 1

4.28 11.28 4.06 9.83 8.54 10.07 0.55 1

1.90 11.22 7.64 8.82 8.36 4.28 2.44 4

4.86 11.19 4.39 9.43 8.41 10.02 0.53 0

4.86 11.25 4.32 9.45 8.50 9.88 0.55 1

4.86 11.25 4.32 9.46 8.49 9.88 0.55 1

4.46 5.54 5.84 6.02 1.86 4.60 3.86 5

4.25 5.01 4.37 5.44 1.49 5.16 3.61 4

4.61a 12.10b 4.62c 10.27c 7.49d 9.97c

a

Reference 96. bReference 97. cReference 98. dReference 99. eMean unsigned errors (MUE) and the number of outliers are also reported for each protocol. An outlier is defined as a pKa prediction deviating from the experimental value by 1.0 pKa units or more. All experimental pKa data refer to measurements in water at room temperature.

conformational search are similar for all the protocols and show MUE’s of around 0.5 pKa units. As these compounds are expected to possess very few conformations, it is not surprising that even the regular conformational search predicts pKa very well and that an increase in the number of conformations does not lead to further improvement. This good performance aside, what is a truly remarkable achievement of the DFT-based method is that it reproduces the unusual experimental pKa trends perfectly. There is an enormous jump in pKa (7.49 units) when going from compound S1 to a similar but highly strained derivative S2. Jaguar pKa predicts this transition with almost quantitative accuracy (7.00 units using the 10-conformation gas phase optimization protocol and 6.39 units using the 10-conformation solution phase optimization protocol). In contrast, the empirical pKa prediction program Epik89−91 fails to describe the very large experimental difference of 7.49 pKa units, giving instead a difference amounting to only 1.08 pKa units. This modest increase of pKa is presumably due to the simulated electron-donating effect of the methyl groups. If we assume that the electronic effect is described by Epik correctly, the rest of the experimental pKa difference is caused by the steric strain effects, which Epik, and likely all other empirical pKa predictors, cannot account for unless they are specifically trained, perhaps on a case by case basis, to do so. For example, the empirical pKa predictor Marvin92 shows results very similar to Epik’s, failing to recognize the influence of the steric repulsion. It is important to emphasize that Epik in itself is a highly accurate pKa predictor, as has been demonstrated in its application to diverse drug-like compounds.39 Therefore, Epik’s performance on proton sponges is atypical and only serves to highlight the advantages that DFT-based programs may have when predicting pKa of compounds displaying unusual sterical strain effects. An earlier version of Marvin, applied to typical drug-like

MUE’s than those obtained with gas phase optimizations. The number of outliers for geometry optimizations in the solution phase was also quite large and did not go down with an increase in the number of conformations taken, in contrast to geometry optimizations in the gas phase, where the trend was as expected. In all other tests of the conformational protocols that we performed (those shown and not shown in this work), including those on tertiary amines, geometry optimizations in solution produced results that were generally better or at least no worse than gas phase optimizations. Therefore, we conclude that the somewhat disappointing performance of solution phase optimizations in Table III is an uncommon case. A very interesting class of tertiary amines from the point of view of their pK a properties is the so-called proton sponges.83−85 These compounds display unusually high basicities as a consequence of a sterical strain leading to a repulsion of unpaired electrons on tertiary nitrogen atoms, which can be relieved by binding a proton. A proton sponge may possess a pKa of around 12, whereas a related sterically nonhindered amine may have a dramatically different pKa of around 5. DFT-based pKa predictors should excel for these types of compounds, in contrast to empirical pKa predictors which, failing to properly estimate or even recognize sterical strain, should see essentially no difference in pKa for sterically strained versus sterically nonstrained compounds. Proton sponges and their pKa’s have been studied with DFT-based methods before.86−88 Here, we have applied Jaguar pKa to a series of proton sponges, as well as to the related 1,8diaminonaphthalene and one of its derivatives that avoids the repulsion of the lone pairs of electrons. The results are presented in Table IV. The Jaguar pKa predictions performed without conformational search are inaccurate, again stresssing the importance of the necessity to perform conformational search for more stable results. The results obtained with 6014

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Table V. pKa Values Predicted with Various Jaguar pKa Computational Protocols on a Test Set of Guanidine Structures Extracted from Ref 94a

structure index

R1

R2

gas phase optimization R3

R4

1a

Cl

H

H

H

1b

Cl

H

H

CH2CHF2

2j

H

H

H

CH3

2o

H

H

H

CH2CHF2

2p

H

H

H

CH2CF3

3

Cl

Cl

H

CH2CHF2

4

H

H

OCH3

CH2CHF2

6

OCH3

Cl

H

CH2CHF2

7

Cl

H

Cl

CH2CHF2

MUE

outliers

solution phase optimization

no CS

reg CS

1 conf

5 conf

10 conf

no CS

1 conf

5 conf

10 conf

8.62 9.36 9.59 7.02 8.13 8.52 10.47 10.78 10.84 8.42 9.21 9.46 7.36 8.39 8.75 6.17 7.48 7.95 9.53 10.06 10.20 6.68 7.87 8.30 5.00 6.58 7.17 1.60 0.69 0.38 7 3 0

8.70 9.43 9.65 6.93 8.06 8.46 10.58 10.87 10.91 7.98 8.87 9.17 6.65 7.85 8.27 6.23 7.53 7.99 9.07 9.71 9.90 7.19 8.27 8.64 5.26 6.78 7.35 1.68 0.76 0.44 8 2 0

8.70 9.42 9.65 6.95 8.08 8.48 10.75 11.00 11.02 7.98 8.87 9.17 6.69 7.88 8.30 6.21 7.51 7.98 9.10 9.73 9.92 6.56 7.78 8.22 5.29 6.81 7.37 1.75 0.82 0.49 8 2 0

8.70 9.42 9.65 6.85 8.03 8.45 10.49 10.80 10.85 8.04 8.88 9.16 7.64 8.58 8.89 6.09 7.45 7.94 8.99 9.67 9.88 7.41 8.38 8.70 6.00 7.33 7.81 1.50 0.62 0.32 8 1 0

8.70 9.42 9.65 7.43 8.45 8.80 10.49 10.80 10.85 8.61 9.34 9.56 7.65 8.57 8.88 6.80 7.93 8.33 9.23 9.87 10.06 6.65 7.84 8.27 6.11 7.41 7.89 1.34 0.50 0.23 7 1 0

8.77 9.48 9.70 6.01 7.36 7.85 10.50 10.81 10.86 7.73 8.68 9.00 7.75 8.69 9.01 5.00 6.58 7.17 8.23 9.06 9.34 7.46 8.47 8.82 3.26 5.25 6.01 2.11 1.08 0.72 8 5 3

8.80 9.50 9.72 8.01 8.89 9.19 10.61 10.89 10.93 9.37 9.94 10.10 7.97 8.86 9.16 7.35 8.39 8.74 9.11 9.74 9.92 7.81 8.47 9.05 5.72 7.14 7.65 1.00 0.30 0.23 6 0 0

8.80 9.50 9.72 7.54 8.52 8.86 10.28 10.63 10.70 8.75 9.47 9.69 7.93 8.84 9.14 6.93 8.04 8.44 9.09 9.75 9.95 7.34 8.40 8.76 6.10 7.39 7.86 1.22 0.36 0.10 7 0 0

8.80 9.50 9.72 7.69 8.68 9.01 10.28 10.63 10.70 8.86 9.58 9.80 7.84 8.76 9.08 7.09 8.20 8.59 9.15 9.82 10.03 7.68 8.60 8.91 6.13 7.45 7.93 1.13 0.28 0.11 7 0 0

exp. pKa 9.9

8.9

10.6

9.7

9.2

8.5

10.2

8.9

7.8

a

Mean unsigned errors (MUE) and the number of outliers are also reported for each protocol. An outlier is defined as a pKa prediction deviating from the experimental value by 1.0 pKa units or more. The predicted pKa values and their statistical summaries in plain font correspond to the “guanidine” shell of the shell model, while the figures in italics are associated with the “amidine” shell of the shell model, and the figures in bold font refer to the “partly substituted amidine” shell of the shell model. All experimental pKa data refer to measurements in water at room temperature.

structures, has also demonstrated average errors much lower than those shown in Table IV for proton sponges.93 Likewise, we conclude that Marvin’s performance on proton sponges does not reflect its typical accuracy on more common types of structures. Going back to the Table IV entries, there is also an extraordinary change in experimental pKa for two structurally very similar compounds, S2 and S3, which differ only by a CH2 group. The strain in S3, localized in the CH2CH2 group, causes the lone pairs of electrons on the nitrogen atoms to avoid repulsion, so there is not as much energy gain upon protonation as in S2. Again, Jaguar pKa describes the experimental pKa drop correctly qualitatively and quantitatively, while Epik and Marvin are insensitive to this effect. The

influence of the steric effects on pKa for the rest of the compounds from Table IV is similarly correctly reproduced by Jaguar pKa but not by Epik and Marvin. VI.C. Guanidines. The application of Jaguar pKa to a test set of guanidines from ref 94, apart from illustrating the use of Jaguar pKa computational protocols, elucidates a salient point about the use of the shell model in practice. Table V summarizes the results obtained for these guanidines. Predictably, structures with minimal conformational flexibility (1a and 2j in Table V) are either not sensitive or only marginally sensitive to the number of conformations used in the calculations. More flexible structures always display some 6015

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Table VI. pKa Values Predicted with Various Jaguar pKa Computational Protocols on a Test Set of Guanidine Structures of the Type X−N = C(NHR) (NHR′), Where X Is Non-Aryle

gas phase optimization index G1 G2 G3 G4 MUE outliers

solution phase optimization

no CS

reg CS

1 conf

5 conf

10 conf

no CS

1 conf

5 conf

10 conf

exp. pKa

11.39 11.42 4.93 6.53 9.11 9.74 9.75 10.09 0.66 0.90 1 1

11.15 11.23 6.08 7.41 9.86 10.31 10.76 10.86 0.92 1.41 2 3

11.06 11.17 6.16 7.47 9.76 10.24 10.96 11.02 0.94 1.43 2 3

11.15 11.24 6.16 7.47 10.59 10.88 9.45 9.88 0.79 1.32 1 3

11.10 11.21 6.16 7.47 10.61 10.89 9.48 9.92 0.79 1.33 1 3

9.88 10.26 6.77 7.94 10.43 10.75 10.62 12.27 1.44 2.13 4 3

10.34 10.61 5.93 7.30 9.22 9.82 10.60 10.74 0.81 1.27 1 3

10.34 10.61 5.93 7.30 9.25 9.87 9.81 10.18 0.62 1.14 1 3

10.33 10.58 5.93 7.30 9.23 9.86 9.78 10.16 0.61 1.14 1 3

11.00a 5.75b 8.76c 8.68d

a

Reference 100. bReference 101. cReference 102. dReference 103. eMean unsigned errors (MUE) and the number of outliers are also reported for each protocol. An outlier is defined as a pKa prediction deviating from the experimental value by 1.0 pKa units or more. The predicted pKa values and their statistical summaries in plain font correspond to the “guanidine” shell of the shell model, while the figures in italics are associated with the “amidine” shell of the shell model. All experimental pKa data refer to measurements in water at room temperature.

Upon reviewing our training set of guanidines used for generating the a, b parameters of the “guanidine” shell, we observed that the training set was not particularly representative of the structures from Table V, beyond both sets being guanidines. In particular, the training set, comprising 8 molecules, included only one structure with the ArN = C(NHR) (NHR′) moiety, with the rest of the structures having the sp2 nitrogen atom connected to an aliphatic carbon atom or hydrogen atom. From chemical intuition (compare pKa’s of aliphatic and aromatic amines) we expect that the aromatic ring directly bonded to the sp2 nitrogen atom would have a strong influence on the protonation properties of that atom. Our training sets for amidines and partly substituted amidines were even less representative of the guanidines from Table V. So the fact that the amidine shells were producing more accurate results was likely a coincidence. The problem comes down to the scarcity of experimental results that can be used for training the shell model. However, as the shell model is trained using more experimental data, with greater diversity of structures, we expect to encounter fewer cases where a less specific shell yields more accurate pKa’s. It would be important to verify that there is nothing inherently wrong with the parametrization of our “guanidine” shell and that the disappointing result produced with this shell in Table V was due to a bad representation in the training set. For this purpose, we constructed a small test of “ordinary” guanidines and applied Jaguar pKa to it. The results are

variation in the predicted pKa values, although all the structures from Table V contain very few rotatable bonds. Another observation is that the quality of the results depends strongly on the association of the functional group with a particular shell. Five shells match the R″N = C(NHR) (NHR′) functional group: “guanidine”, “amidine”, and “partly substituted amidine”, “protonation of sp2-like aliphatic nitrogen”, and “protonation of generic atom”. The latter two shells are considered overly generic and are of little interest when more specific shells are matched. Therefore, we will only discuss the first three shells. The shell of “guanidine” produces the largest errors, the shell of “amidine” is more accurate, and the shell of “partly substituted amidine” is the most accurate, producing very low errors for multiconformational calculations that employ solution phase optimization. The problem is that the “guanidine” shell possesses the lowest RMSD for the linear fit of its parameters and, as such, is currently automatically preferred by the Jaguar pKa program. The shell that produces the best results (“partly substituted amidine”) actually comes third in the ranking of the shells, because its parameter fitting has an RMSD (0.746) larger than the RMSD of the two other shells, “guanidine” (with RMSD = 0.320) and “amidine” (with RMSD = 0.662). Usually the shell that has the smallest linear fit RMSD is more specific, is more relevant than the other shells to the structure under investigation, and produces the most accurate results. This is, however, not true for the test set from Table V. 6016

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

Failing to do so may have grave consequences for the accuracy of pKa prediction. As we have seen from the illustrative calculations, Jaguar pKa might happen to give very good pKa predictions even without a prior conformational search (Tables III, V, and VI). The results of such calculations are expected to depend heavily on the input conformations (see Figure 5) and are therefore not trustworthy. The use of multiple conformations consistently reduces either the number of outliers, or the mean unsigned error, or both, with the exception of rigid or relatively rigid molecules, where explicit consideration of a single conformation is enough for good results. This is the main achievement of this work, showing how the quality of DFTbased pKa predictions can be improved systematically in challenging cases of drug-like flexible molecules. It cannot be firmly concluded from the present data which Jaguar pKa optimization protocols, those involving gas phase or solution phase optimizations, lead to more accurate results. For guanidines, solution phase optimizations performed clearly better (see Tables V and VI) but not for tertiary amines (see Table III), where gas phase optimizations gave more accurate results. The other test sets (Tables I, II, and IV and Figure 5) were either inconclusive or lacked data for proper comparison. Errors due to an inadequate treatment of conformational space are of course only one type of error that the quantum chemical pKa prediction may suffer from. Reducing this particular error, for example by such means as proposed in this work, is expected to improve the quality of pKa predictions across all chemical space, with the unimportant exception of small molecules where the errors are already low. However, other types of errors not related to the problem of treatment of conformations exist and are not uncommon to show up in practical applications. Errors due to insufficient specificity of the parameters a, b may present themselves when working with exotic structures or even with structures which do not share the essential structural elements with the structures from the relevant training set (as was illustrated by the results in Table V). Regular updating of the shell model with the newest literature data is therefore important; but even if such updates become frequent and the parametrization covers explicitly an ever growing fraction of the chemical space, there is still a problem of creating shells and assigning structures to the training sets, actions which are currently based on human, subjective decisions. Developing an algorithm that would create training sets, set up shells, and decide on the corresponding SMARTS patterns in a more objective manner will be a major improvement. Another type of error appears when acting on zwitterions. The existing shell model does contain shells and parameters specific to common zwitterionic structures, such as NR+3 ...COO−, but the coverage of the zwitterionic chemical space is not extensive, at least in part because of the shortage of the experimental data that could be used for parametrization. Our tests show that the parameters from the generic shell applied to zwitterions not explicitly covered by the shell model yield errors substantially larger than those observed for nonzwitterionic structures. Both the parameters of the implicit solvation model in application to zwitterions and the corresponding parameters of the shell model in the pKa protocol might need to be revisited and improved. Errors arising due to multiple possible tautomers for the given structure are culprits behind a number of outliers: either the user fails to act on the dominant tautomer, or more than one tautomer coexist in solution and contribute to the

assembled in Table VI. Each structure from Table VI matched at least two shells - “guanidine” and “amidine”. This should be sufficient for the intended verification, as the “guanidine” shell was performing significantly worse than the “amidine”, when applied to ArN = C(NHR) (NHR′) compounds. The results obtained with these guanidines, which should be closer in structure to those from the guanidine training set, do confirm that the “guanidine” shell performs clearly better than the “amidine” shell. These results corresponding to the bestperforming shell generally agree with the previous conclusions. The molecules with low conformational flexibility (as G2 in this test set) have pKa’s that are almost independent of the number of conformations considered in the computational protocol, although taking more conformations does gradually lower the error. The protocol that forgoes conformational search and performs geometry optimizations in vacuum gives a surprisingly good result, although the analogous protocol but with optimization in solution performs much worse. The multiconformational protocols with the solution phase optimization give the best results.

VII. CONCLUSIONS AND OUTLOOK This work discusses a computational method based on DFT calculations, aimed at accurate pKa predictions applicable to large, flexible organic molecules. We argue that while empirical methods that employ structure−activity relationships possess undeniable advantages of speed and robustness, methods that depend on quantum chemical energies should capture more physics of complex proton dissociation processes and be more suitable for describing equilibria involving steric and conformational effects. We propose a computational workflow implemented in the program Jaguar pKa which, in comparison with an earlier workflow, treats conformations better on several levels: an initial selection of conformations, a subsequent quantum mechanical optimization of their structures, and a statistical averaging of these conformations contributing to the observed pKa value. Our accounting for conformational entropy and the theory behind its inclusion into micro-pKa’s appears to be novel. As justification of the proposed treatment, we present applications for the new conformationally aware approach to several sets of drug-like molecules. The results show clearly that a better treatment of conformations leads to more accurate results, with mean unsigned errors as low as 0.1−0.2 pKa units in some cases. Other pKa prediction protocols based on quantum chemical methods were able to show a comparable accuracy only for small molecules spanning an almost trivial conformational space14,15,17,37,95 or in application to a narrow chemical space. 37 Errors obtained for more extended, conformationally flexible structures are typically much larger.7,23 Therefore, we believe that the very high accuracy obtained in the present work for large, drug-like molecules with complex conformational landscapes is first of its kind and represents significant progress in the field of quantum chemical pKa prediction. Not only can very low errors be reached, but we also demonstrate for the first time how the errors grow smaller with a better conformational search conducted prior to quantum mechanical calculations and with a greater number of conformations used explicitly in quantum mechanical calculations. We recommend conducting conformational searches prior to performing a Jaguar pKa calculation or using a built-in MacroModel conformational search protocol, even though this procedure requires additional computational resources. 6017

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation

(14) Liptak, M. D.; Shields, G. C. J. Am. Chem. Soc. 2001, 123, 7314− 7319. (15) Liptak, M. D.; Gross, K. C.; Seybold, P. G.; Feldgus, S.; Shields, G. C. J. Am. Chem. Soc. 2002, 124, 6421−6427. (16) Klicić, J. J.; Friesner, R. A.; Liu, S.-Y.; Guida, W. C. J. Phys. Chem. A 2002, 106, 1327−1335. (17) Casasnovas, R.; Fernández, D.; Ortega-Castro, J.; Frau, J.; Donoso, J.; Muñoz, F. Theor. Chem. Acc. 2011, 130, 1−13. (18) Jerome, S. V.; Hughes, T. F.; Friesner, R. A. J. Phys. Chem. B 2014, 118, 8008−8016. (19) Lee, A. C.; Yu, J.-y.; Crippen, G. M. J. Chem. Inf. Model. 2008, 48, 2042−2053. (20) Fraczkiewicz, R.; Lobell, M.; Göller, A. H.; Krenz, U.; Schoenneis, R.; Clark, R. D.; Hillisch, A. J. Chem. Inf. Model. 2015, 55, 389−397. (21) Rebollar-Zepeda, A. M.; Galano, A. Int. J. Quantum Chem. 2012, 112, 3449−3460. (22) Saracino, G. A. A.; Improta, R.; Barone, V. Chem. Phys. Lett. 2003, 373, 411−415. (23) Ho, J.; Coote, M. L.; Franco-Pérez, M.; Gómez-Balderas, R. J. Phys. Chem. A 2010, 114, 11992−12003. (24) Harding, A. P.; Popelier, P. L. A. Phys. Chem. Chem. Phys. 2011, 13, 11264−11282. (25) Harding, A. P.; Popelier, P. L. A. Phys. Chem. Chem. Phys. 2011, 13, 11283−11293. (26) Griffiths, M. Z.; Alkorta, I.; Popelier, P. L. A. Mol. Inf. 2013, 32, 363−376. (27) Ugur, I.; Marion, A.; Parant, S.; Jensen, J. H.; Monard, G. J. Chem. Inf. Model. 2014, 54, 2200−2213. (28) Yu, H.; Wondrousch, D.; Yuan, Q.; Lin, H.; Chen, J.; Hong, H.; Schüürmann, G. Chemosphere 2015, 138, 829−836. (29) Sastre, S.; Casasnovas, R.; Muñoz, F.; Frau, J. Phys. Chem. Chem. Phys. 2016, 18, 11202−11212. (30) Crugeiras, J.; Ríos, A.; Maskill, H. J. Phys. Chem. A 2011, 115, 12357−12363. (31) Feng, S.; Bagia, C.; Mpourmpakis, G. J. Phys. Chem. A 2013, 117, 5211−5219. (32) Alexov, E.; Mehler, E. L.; Baker, N.; Baptista, A. M.; Huang, Y.; Milletti, F.; Nielsen, J. E.; Farrell, D.; Carstensen, T.; Olsson, M. H. M.; Shen, J. K.; Warwicker, J.; Williams, S.; Word, J. M. Proteins: Struct., Funct., Genet. 2011, 79, 3260−3275. (33) Andersson, M. P.; Jensen, J. H.; Stipp, S. L. S. PeerJ 2013, 1, e198. (34) Stanton, C. L.; Houk, K. N. J. Chem. Theory Comput. 2008, 4, 951−966. (35) Meyer, T.; Knapp, E.-W. J. Chem. Theory Comput. 2015, 11, 2827−2840. (36) Koch, W.; Holthausen, M. C. A Chemist’s Guide to Density Functional Theory; Wiley-VCH Verlag: Weinheim, Germany, 2001. (37) Bryantsev, V. S.; Diallo, M. S.; Goddard, W. A., III J. Phys. Chem. A 2007, 111, 4422−4430. (38) Morgenthaler, M.; Schweizer, E.; Hoffmann-Röder, A.; Benini, F.; Martin, R. E.; Jaeschke, G.; Wagner, B.; Fischer, H.; Bendels, S.; Zimmerli, D.; Schneider, J.; Diederich, F.; Kansy, M.; Müller, K. ChemMedChem 2007, 2, 1100−1115. (39) Settimo, L.; Bellman, K.; Knegtel, R. M. Pharm. Res. 2014, 31, 1082−1095. (40) Bochevarov, A. D.; Harder, E.; Hughes, T. F.; Greenwood, J. R.; Braden, D. A.; Philipp, D. M.; Rinaldo, D.; Halls, M. D.; Zhang, Z.; Friesner, R. A. Int. J. Quantum Chem. 2013, 113, 2110−2142. (41) Schrödinger Release 2016-2: Jaguar, version 9.2; Schrödinger, LLC: New York, NY, 2016. (42) Lim, C.; Bashford, D.; Karplus, M. J. Phys. Chem. 1991, 95, 5610−5620. (43) Jang, Y. H.; Sowers, L. C.; Ç aǧin, T.; Goddard, W. A., III J. Phys. Chem. A 2001, 105, 274−280. (44) Tissandier, M. D.; Cowen, K. A.; Feng, W. Y.; Gundlach, E.; Cohen, M. H.; Earhart, A. D.; Coe, J. V.; Tuttle, T. R., Jr. J. Phys. Chem. A 1998, 102, 7787−7794.

observable pKa with the user operating Jaguar pKa maybe aware of only one conformer. Automatically generating, scoring, and acting on different tautomeric forms in one pKa prediction workflow will be another major improvement. Finally, there is a similar problem of multiple protonation channels. In a molecule containing several protonatable or deprotonatable functional groups, it is not always clear a priori, even for the experts, which of those groups is or are responsible for the observable pKa value. Significant mispredictions may result when the user applies Jaguar pKa to an irrelevant protonation channel or fails to account for all relevant protonation channels. The number and scope of the remaining problems associated with quantum chemical pKa prediction suggest that much more scientific and engineering work is needed before very accurate and robust pKa prediction covering all practically important aspects of protonation processes becomes available. We would like to believe that the present work makes an important step in the direction of that necessary progress.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.6b00805. Hierarchical organization of the current shell model used for classification of the functional groups as well as average errors of parametrization produced by different parametrization models (TXT)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Art D. Bochevarov: 0000-0002-0376-8432 Notes

The authors declare no competing financial interest.

■ ■

ACKNOWLEDGMENTS The authors would like to thank ChemAxon, Ltd. for the permission to publish Marvin pKa results. REFERENCES

(1) Alongi, K. S.; Shields, G. C. Annu. Rep. Comput. Chem. 2010, 6, 113−138. (2) Ho, J.; Coote, M. L. Theor. Chem. Acc. 2010, 125, 3−21. (3) Eckert, F.; Diedenhofen, M.; Klamt, A. Mol. Phys. 2010, 108, 229−241. (4) Liao, C.; Nicklaus, M. C. J. Chem. Inf. Model. 2009, 49, 2801− 2812. (5) Ho, J. Aust. J. Chem. 2014, 67, 1441−1460. (6) Settimo, L.; Bellman, K.; Knegtel, R. M. A. Pharm. Res. 2014, 31, 1082−1095. (7) Kaupmees, K.; Trummal, A.; Leito, I. Croat. Chem. Acta 2014, 87, 385−395. (8) Rajapakse, H. A.; et al. Bioorg. Med. Chem. Lett. 2010, 20, 1885− 1889. (9) Lanevskij, K.; Japertas, P.; Didziapetris, R.; Petrauskas, A. J. Pharm. Sci. 2009, 98, 122−134. (10) Sprous, D. G.; Palmer, R. K.; Swanson, J. T.; Lawless, M. Curr. Top. Med. Chem. 2010, 10, 619−637. (11) Cheng, J.; Sprik, M. J. Chem. Theory Comput. 2010, 6, 880−889. (12) Bryantsev, V. S. Chem. Phys. Lett. 2013, 558, 42−47. (13) Gallus, D. R.; Wagner, R.; Wiemers-Meyer, S.; Winter, M.; Cekic-Laskovic, I. Electrochim. Acta 2015, 184, 410−416. 6018

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019

Article

Journal of Chemical Theory and Computation (45) Kelly, C. P.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. B 2006, 110, 16066−16081. (46) Cortis, C. M.; Friesner, R. A. J. Comput. Chem. 1997, 18, 1570− 1590. (47) Cortis, C. M.; Friesner, R. A. J. Comput. Chem. 1997, 18, 1591− 1608. (48) Marten, B.; Kim, K.; Cortis, C.; Friesner, R. A.; Murphy, R. B.; Ringnalda, M. N.; Sitkoff, D.; Honig, B. J. Phys. Chem. 1996, 100, 11775−11788. (49) Zhao, Y.; Truhlar, D. G. Theor. Chem. Acc. 2008, 120, 215−241. (50) Silva, P. J.; Ramos, M. J. Comput. Theor. Chem. 2011, 966, 120− 126. (51) Valdes, H.; Pluhácǩ ová, K.; Pitonák, M.; Ř ezác,̌ J.; Hobza, P. Phys. Chem. Chem. Phys. 2008, 10, 2747−2757. (52) Zhang, I. Y.; Wu, J.; Luo, Y.; Xu, X. J. Comput. Chem. 2011, 32, 1824−1838. (53) Johnson, E. R.; Clarkin, O. J.; DiLabio, G. A. J. Phys. Chem. A 2003, 107, 9953−9963. (54) Chandra, A. K.; Uchimaru, T. Int. J. Mol. Sci. 2002, 3, 407−422. (55) Marenich, A. V.; Cramer, C. J.; Truhlar, D. G. J. Chem. Theory Comput. 2013, 9, 609−620. (56) Kelly, C. P.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. A 2006, 110, 2493−2499. (57) Marenich, A. V.; Ding, W.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. Lett. 2012, 3, 1437−1442. (58) Casasnovas, R.; Ortega-Castro, J.; Donoso, J.; Frau, J.; Muñoz, F. Phys. Chem. Chem. Phys. 2013, 15, 16303−16313. (59) Kromann, J. C.; Larsen, F.; Moustafa, H.; Jensen, J. H. PeerJ 2016, 4, e2335. (60) Churakov, S. V.; Labbez, C.; Pegado, L.; Sulpizi, M. J. Phys. Chem. C 2014, 118, 11752−11762. (61) Uddin, N.; Choi, T. H.; Choi, C. H. J. Phys. Chem. B 2013, 117, 6269−6275. (62) Miguel, E. L. M.; Silva, P. L.; Pliego, J. R. J. Phys. Chem. B 2014, 118, 5730−5739. (63) Jackson, V. E.; Felmy, A. R.; Dixon, D. A. J. Phys. Chem. A 2015, 119, 2926−2939. (64) Marín-Luna, M.; Alkorta, I.; Elguero, J. New J. Chem. 2015, 39, 2861−2871. (65) Xue, X.-S.; Wang, Y.; Yang, C.; Ji, P.; Cheng, J.-P. J. Org. Chem. 2015, 80, 8997−9006. (66) Klamt, A.; Eckert, F.; Diedenhofen, M.; Beck, M. E. J. Phys. Chem. A 2003, 107, 9380−9386. (67) Brown, T. N.; Mora-Diez, N. J. Phys. Chem. B 2006, 110, 9270− 9279. (68) Farràs, P.; Teixidor, F.; Branchadell, V. Inorg. Chem. 2006, 45, 7947−7954. (69) Casasnovas, R.; Frau, J.; Ortega-Castro, J.; Salvà, A.; Donoso, J.; Muñoz, F. Int. J. Quantum Chem. 2010, 110, 323−330. (70) Rossini, E.; Knapp, E.-W. J. Comput. Chem. 2016, 37, 1082− 1091. (71) James, C. A.; Weininger, D.; Delany, J. Daylight Theory Manual, Version 4.9; Daylight Chemical Information Systems, Inc.: Laguna Niguel, CA, 2011. (72) Schrö dinger Release 2016-1: MacroModel, version 11.2; Schrödinger, LLC: New York, NY, 2016. (73) Jorgensen, W. L.; Tirado-Rives, J. J. Am. Chem. Soc. 1988, 110, 1657−1666. (74) Jorgensen, W. L.; Maxwell, D. S.; Tirado-Rives, J. J. Am. Chem. Soc. 1996, 118, 11225−11236. (75) Shivakumar, D.; Williams, J.; Wu, Y.; Damm, W.; Shelley, J.; Sherman, W. J. Chem. Theory Comput. 2010, 6, 1509−1519. (76) Harder, E.; et al. J. Chem. Theory Comput. 2016, 12, 281−296. (77) Csizmadia, F.; Tsantili-Kakoulidou, A.; Panderi, I.; Darvas, F. J. Pharm. Sci. 1997, 86, 865−871. (78) Hilpert, H.; et al. J. Med. Chem. 2013, 56, 3980−3995. (79) Tresadern, G.; Delgado, F.; Delgado, O.; Gijsen, H.; Macdonald, G. J.; Moechars, D.; Rombouts, F.; Alexander, R.; Spurlino, J.; Gool,

M. V.; Vega, J. A.; Trabanco, A. A. Bioorg. Med. Chem. Lett. 2011, 21, 7255−7260. (80) Rombouts, F. J. R.; et al. J. Med. Chem. 2015, 58, 8216−8235. (81) Ginman, T.; et al. J. Med. Chem. 2013, 56, 4181−4205. (82) Johansson, A.; et al. J. Med. Chem. 2016, 59, 2497−2511. (83) Alder, R. W.; Bowman, P. S.; Steele, W. R. S.; Winterman, D. R. Chem. Commun. (London) 1968, 723−724. (84) Alder, R. W. Chem. Rev. 1989, 89, 1215−1223. (85) Llamas-Saiz, A. L.; Foces-Foces, C.; Elguero, J. J. Mol. Struct. 1994, 328, 297−323. (86) Margetić, D.; Ishikawa, T.; Kumamoto, T. Eur. J. Org. Chem. 2010, 2010, 6563−6572. (87) Ozeryanskii, V. A.; Gorbacheva, A. Y.; Pozharskii, A. F.; Vlasenko, M. P.; Tereznikov, A. Y.; Chernov’yants, M. S. Org. Biomol. Chem. 2015, 13, 8524−8532. (88) Belding, L.; Stoyanov, P.; Dudding, T. J. Org. Chem. 2016, 81, 6−13. (89) Schrödinger Release 2016-2: Epik, version 3.6; Schrödinger, LLC: New York, NY, 2016. (90) Shelley, J. C.; Cholleti, A.; Frye, L. L.; Greenwood, J. R.; Timlin, M. R.; Uchimaya, M. J. Comput.-Aided Mol. Des. 2007, 21, 681−691. (91) Greenwood, J. R.; Calkins, D.; Sullivan, A. P.; Shelley, J. C. J. Comput.-Aided Mol. Des. 2010, 24, 591−604. (92) Marvin: 16.7.18.0. ChemAxon, 2016. http://www.chemaxon. com (accessed Nov 14, 2016). (93) Balogh, G. T.; Tarcsay, Á .; Keserű , G. M. J. Pharm. Biomed. Anal. 2012, 67−68, 63−70. (94) Peters, J.-U.; Lübbers, T.; Alanine, A.; Kolczewski, S.; Blasco, F.; Steward, L. Bioorg. Med. Chem. Lett. 2008, 18, 262−266. (95) Trummal, A.; Rummel, A.; Lippmaa, E.; Koppel, I.; Koppel, I. A. J. Phys. Chem. A 2011, 115, 6641−6645. (96) Alexander, A. J.; Kebarle, P. Anal. Chem. 1986, 58, 471−478. (97) Hibbert, F.; Hunte, K. P. P. J. Chem. Soc., Perkin Trans. 2 1983, 1895−1899. (98) Hibbert, F.; Simpson, G. R. J. Chem. Soc., Perkin Trans. 2 1987, 613−615. (99) Hibbert, F.; Simpson, G. R. J. Chem. Soc., Perkin Trans. 2 1987, 243−246. (100) Maillard, M. C.; Perlman, M. E.; Amitay, O.; Baxter, D.; Berlove, D.; Connaughton, S.; Fischer, J. B.; Guo, J. Q.; Hu, L.-Y.; McBurney, R. N.; Nagy, P. I.; Subbarao, K.; Yost, E. A.; Zhang, L.; Durant, G. J. J. Med. Chem. 1998, 41, 3048−3061. (101) Chernyshev, V. M.; Chernysheva, A. V.; Starikova, Z. A. Heterocycles 2010, 81, 2291−2311. (102) Goto, T.; Kishi, Y.; Takahashi, S.; Hirata, Y. Tetrahedron 1965, 21, 2059−2088. (103) Tarasova, E. V.; Chernyshev, V. M.; Chernysheva, A. V.; Abagyan, R. S. Russ. J. Appl. Chem. 2011, 84, 400−406.

6019

DOI: 10.1021/acs.jctc.6b00805 J. Chem. Theory Comput. 2016, 12, 6001−6019