Combining Theoretical and Data-Driven Approaches To Predict Drug

Sep 12, 2017 - Our study assessed these techniques for a previously studied set of relatively simple drug compounds (average molecular weight 300) and...
21 downloads 11 Views 2MB Size
Article pubs.acs.org/crystal

Combining Theoretical and Data-Driven Approaches To Predict Drug Substance Hydrate Formation Carl J. Tilbury,† Jie Chen,*,‡ Alessandra Mattei,‡ Shuang Chen,‡ and Ahmad Y. Sheikh‡ †

Department of Chemical Engineering, University of California Santa Barbara, Santa Barbara, California 93106, United States Solid State Chemistry, AbbVie Inc., 1401 Sheridan Road, North Chicago, Illinois 60064, United States



S Supporting Information *

ABSTRACT: Hydrates represent a very significant fraction of pharmaceutical molecular crystals and can be leveraged to simplify downstream processing for formulations such as wet granulation and hot-melt extrusion. In silico methods to predict hydrate formation can guide experimental screening and evaluate residual risk of selected forms. Both solution mixing thermodynamics and relative propensities of hydrogen bond formation can be used for virtual screening. Our study assessed these techniques for a previously studied set of relatively simple drug compounds (average molecular weight 300) and a new set of more complex AbbVie-pipeline compounds (average molecular weight 550). Although solution thermodynamics have been shown to successfully discriminate hydrate formation for the set of smaller drug molecules, this technique did not provide successful screening for the more complex set of AbbVie compounds tested. A single-differential hydrogen bond propensity (SD-HBP) score, which accounts for only the strongest donoracceptor pairing in both anhydrate and hydrate forms, also provides little utility. We therefore developed a multidifferential hydrogen bond propensity (MD-HBP) score that considers the competitive effect of multiple donoracceptor interactions in each form. Additionally, the MD-HBP score utilizes solid-state conformations (estimated here via COSMO-RS theory) to strengthen the data-driven analysis of the solid-state and ensure more accurate description of possible hydrogen bond networks in anhydrate and hydrate solid forms. This quantitative MD-HBP score performed well at differentiating between hydrate-forming and non-hydrate-forming compounds for both sets of compounds; thus, it can be applied more broadly in solid form development.



INTRODUCTION Discovering as many as possible solid forms and selecting the right form with optimal chemical/physical properties is of paramount importance in drug development. Hydrates, where water molecule(s) crystallize together with the drug molecule in the crystal lattice, represent a common family of solid forms seen in approximately one-third of organic compounds.1 As part of solid-form screening, it is necessary to discover all stable hydrates and understand the process conditions leading to their formation. This has important implications in the solid form selection, the manufacturing process design and control strategy, as well as the drug product design. Hydrates may have significant differences in key physical properties, such as stability and bioavailability.2−4 In some other cases, hydrates can offer advantages to design more robust downstream processes. For example, hot melt extrusion is a platform used to manufacture amorphous solid dispersions, a widely used formulation for poorly water-soluble drugs to improve bioavailability. For many drug compounds, hydrates melt at a lower temperature than the anhydrous/unsolvated form; this reduction in melting temperature allows the hot melt extrusion to operate at a lower temperature to dissolve the drug substance solids in the polymer matrix and reduce the risk of chemical degradation. Hydrates are also preferable for wet granulation © XXXX American Chemical Society

formulations. While experimental hydrate screenings are routinely conducted in industry,3 it is difficult to gauge if sufficient conditions have been tested. A robust in silico tool to predict existence of hydrates can help conduct more targeted experimental screens and increase the probability of discovering the predicted hydrates. There are two existing common approaches to hydrate prediction. One approach is to calculate mixing thermodynamics of active pharmaceutical ingredient (API) water solutions. Abramov showed5 that both excess free energies and enthalpies of mixing (Gex and Hex, respectively), predicted using COSMO-RS theory (COnductor-like Screening MOdel for Real Solvents),6,7 were able to discriminate successfully between hydrate-forming and non-hydrate-forming compounds, for a set of 61 organic drug molecules. Free energies of mixing probe the degree to which APIwater interactions are favorable, with more negative quantities indicating greater probability of hydrate formation. The free energy of mixing approximates close range interactions that may form in an amorphous solid state.5,8 A limitation of this method is the inability to account Received: April 11, 2017 Revised: September 11, 2017 Published: September 12, 2017 A

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

for differences in anhydrate/hydrate crystalline interactions,5,8 where molecular orientation and/or conformation can be principally important. A second approach (available within Mercury9), predicts the propensity of forming specific hydrogen bonds.10−13 This approach is data-driven and consists of analyzing structures within the Cambridge Structural Database14 (CSD) that contain functional groups consistent with the target API molecule. These data are usually used to compare polymorphs,11,15,16 by evaluating patterns of hydrogen bonds and their propensity in relation to the list of predicted supramolecular synthons.17 Since only the molecular structure is required to predict hydrogen bonding propensities, the approach can be employed to determine the likelihood of hydrate formation without knowing the crystal structure.16,18 Compared to the approach of solution mixing thermodynamics, hydrogen bond propensities provide more information on orientation-dependent interactions in the solid state, which are present in most hydrate forms.19 Thus, this approach may describe better the crystalline network of hydrogen bonds, but there is no prediction of dispersive interactions, which represent a significant contribution in organic molecular crystals. Note that although hydrogen bonding propensities do not directly translate into an actual physical quantity, one would expect them to correlate with interaction strength, within a given system. Since the propensity for two (or more) molecules to cocrystallize typically depends on the relative strength of homoand heterosynthons that may form between the constituent molecules,20 any tools developed for the prediction of hydrates can have broader utility for solvates and cocrystals. In fact, constituents of the two aforementioned modeling approaches have been applied previously to study solvate and cocrystal formation.5,8,21 Our investigation focuses on predicting hydrates where water becomes part of the hydrogen bonding network within the lattice. We are not considering cases where dispersive interaction matching (or void stabilization22) may be sufficient to predict formation as can be the case for some solvates and cocrystals. This paper presents an evaluation of both COSMO-RS solution thermodynamics and the typical interpretation of hydrogen bond propensity (HBP) calculations and then introduces a combined approach with greater fidelity that corrects for various shortcomings. This combined approach leverages COSMO-RS to calculate conformations that best represent the solid state and uses the HBP calculation data to approach a network description of the solid-state hydrogen bonding landscape with special emphasis on ensuring hydrogen bonds are correctly accounted (both orientation and donating/ accepting groups). Predictive power is compared via application to two data sets: the 50 nonproprietary molecules reported by Abramov5 (hereafter, “Abramov dataset”) and 26 proprietary AbbVie molecules that underwent full experimental hydrate screening (hereafter, “AbbVie dataset”). On average, the molecular size and complexity of the AbbVie data set are far greater than the Abramov data set, providing a test of whether in silico modeling techniques are useful tools for such compounds.

portions of the molecule, but instead of functional groups these portions are surface segments of defined charge density, derived from a quantum mechanical COSMO calculation.26 The surface charge density distribution is determined by considering the molecule as a whole, which, therefore, accounts for the environment of each functional group. As such, COSMO-RS theory can avoid a principal drawback of GCMs, where molecular environmental effects may not be correctly accounted for; this drawback may be significant for complex drug molecules.8,23 The COSMO calculation26 is both the primary input required for COSMO-RS thermodynamic predictions and the largest computational expense involved. It represents a dielectric continuum solvation model where the molecule is treated as embedded in a perfect conductor, which induces a polarization of the molecule, thus producing a closer approximation to the solid-state electronic environment.23 This virtual conductor calculation also forms a convenient reference state.8 COSMO-RS theory utilizes COSMO calculation data to predict solution thermodynamics for “Real Solvents”. The calculated screening charge density for a molecule is used to construct its σ-profile, which corresponds to the distribution of particular charge densities among various segments comprising a grid over its molecular surface.6 The liquid state is then treated as an ensemble of closely packed molecules (retaining their COSMO structure) and has an overall solution σ-profile corresponding to the combined distribution of the surface charge density within the mixture. To determine chemical potentials, one considers segment−segment interactions, which are divided into various contributions:8 • Electrostatic effects that increase the chemical potential for charge segment mismatch • Hydrogen bonding that decreases the chemical potential for opposite charge segments (above a threshold) • Van der Waal’s interactions that are dependent on surface areas rather than charge densities • Combinatorial effects that represent entropic contributions from molecular size A single segment then has a chemical potential arising from statistical averaging of possible segment−segment interactions, according to the solution σ-profile; the σ-potential describes this affinity as a function of the single segment’s charge density.7 Integrating the solution σ-potential over the σ-profile of a molecule, therefore, provides the liquid-phase chemical potential of that molecule. With such values for each species in solution, macroscopic thermodynamic quantities can be calculated, such as Gex and Hex. Thus, COSMO-RS theory is able to predict macroscopic thermodynamic properties of mixtures containing complex drug molecules, by utilizing statistical mechanics of segment−segment interactions following a COSMO calculation. These calculations are rapid (order seconds to minutes), once the COSMO surface of each molecule in solution has been determined. A significant assumption is that all segments of one molecule are capable of interacting with all segments of another;6 i.e., steric hindrance is not considered, but may be important for large, complex molecules with concave regions. Nonetheless, the approach permits quantitative calculation of thermodynamic quantities and is capable of describing interactions between various functional moieties, being based on a molecular quantum mechanical calculation.



SOLUTION MIXING THERMODYNAMICS USING COSMO-RS THEORY COSMO-RS theory was developed by Klamt et al.6,7,23,24 as an alternative to Group Contribution Methods (GCMs), such as UNIFAC,25 for calculating thermodynamic data of liquids. COSMO-RS calculations rely on contributions from distinct B

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

Figure 1. (a) The number of atoms for each studied molecule, segregated into the two data sets. Blue dots indicate hydrate-forming compounds, and red triangles indicate non-hydrate-forming; (b) the ratio of the molecular hydrogen bond donors and acceptors for each molecule in the two data sets (a ratio of 0 implies no donating groups exist, e.g., Torcetrapib). Example D/A ratios: lactose D/A = 0.73 (8 donors, 11 acceptors); citric acid D/A = 0.57 (4 donors, 7 acceptors).



HYDROGEN BOND PROPENSITY ANALYSIS

regresses a model, using various chemical descriptors, against this subset that predicts the formation propensity of hydrogen bonds, according to observed donoracceptor pairings. The application of an acceptable model to the initial set of functional groups (representing the target API molecule and water) predicts relative propensities of hydrogen bonds for each possible donoracceptor pair. With APIAPI and APIwater propensities predicted for each donor/acceptor group, one can compare results to provide insight on hydrate formation probability. One possibility of comparing propensities is to subtract the maximum API water propensity from the maximum APIAPI propensity; a more negative score indicates more favorable hydrate formation. This single-differential HBP score (subsequently termed SD-HBP score) is essentially a hydrate-specific version of the previously developed16,27 multi-component (MC) score. Note, however, that the differential is inverted (i.e., the SD-HBP score

14

The CSD currently contains over 850 000 crystal structures; as such, it provides a tremendous wealth of data on hydrogen bonds in the solid state. The key determinant in whether an API molecule will crystallize as an anhydrate or hydrate form will typically be a difference in preference between potential APIAPI and APIwater hydrogen bonds. If APIwater hydrogen bonds are significantly more favorable, then the hydrate form should show an increased probability of occurrence. Such relative hydrogen bond propensities (HBP) can be calculated Mercury,9 based on the methodology by Galek et al.10−13 First, the set of functional groups is established from the API molecule of interest (in addition to water). The HBP analysis then determines a subset of crystal structures within the CSD that contain these functional groups and analyzes hydrogen bonds present in those crystal structures. Next, it C

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

COSMOconf) was used to perform a conformer search, geometry optimization based on density functional theory (DFT), and COSMO calculation at various levels of theory (COSMOtherm is parametrized to predict thermodynamic data for the SVP, TZVP, and TZVPD-FINE basis sets).35,36 The conformer search is important when considering potential intramolecular hydrogen bonding, which may have a significant impact on accessible API polar groups, and thus, hydrate formation. This generates a list of favorable molecular conformations (up to 10, or a cutoff in relative energy), alongside a COSMO surface and σ-profile for each conformer. The speed of this conformer search, geometry optimization and COSMO calculation depend on both molecular size and theory level employed. Employing appropriate parallel computing, the SVP theory level takes on the order of minutes to hours, while the TZVP level takes hours to days (the TZVPD-FINE takes minutes to hours and is performed on the output of a TZVP calculation). With COSMO files generated, COSMOthermX is able to quickly (order minutes) calculate thermodynamic properties of solutions, by applying statistical mechanics. Pseudo chemical potentials (μpi ) are reported within the software, to be valid at infinite dilution:

uses maximum homopropensity minus maximum heteropropensity); this altered definition allows the trend of the SD-HBP score to correlate with that of solution thermodynamics (more negative indicates more likely hydrate formation). While the strongest predicted propensities will indeed be the most important,28−31 considering only a single interaction of each type fails to account for the possibility of multiple donor/ acceptor groups being important in forming a stable solidstate structure. For an approach with greater fidelity, the set of relevant interactions to consider should account for all predicted propensities (donoracceptor combinations)16,32 and their distribution of values. HBP models can account for steric hindrance and predict some intramolecular hydrogen bonds,10−13 which may render certain polar groups unavailable for intermolecular hydrogen bonding. The consideration of steric effects and intramolecular hydrogen bonding is important as hydrate formation has been found to correlate with polar surface.19,33 However, the specific molecular conformation is not an input to the HBP calculation. Quantum mechanical COSMO-RS calculations can be performed to generate conformers representative of the solid state, which can guide the development of a more complete description of the intermolecular hydrogen bonding networks possible in anhydrous and hydrated solid forms. Since intramolecular hydrogen bonds and sterically inaccessible polar regions can significantly impact whether intermolecular hydrogen bonds can form in the solid state, it is critical to consider conformations that represent the solid state in order to account for such effects. Finally, for molecules with an imbalance of donating/accepting groups, hydrate formation may be a result of correcting the imbalance and completing hydrogen bonds at certain locations;19 it is the relative sets of all potential interactions in hydrate vs anhydrate crystal forms that may determine which structure is realized. Note that for the case of larger molecules with a higher number of polar groups these complications are magnified.34 The methodology proposed below outlines how to utilize COSMO-RS and HBP calculations, to account for such considerations and produce a higher fidelity model predicting hydrate formation.



μi p = μpure, i + RT ln γi μpure,i is a pure component chemical potential, and γi is its mixture activity coefficient. For an equimolar mixture of API and water (following the same procedure as Abramov5), Gex and Hex are then calculated as follows:

Gex

p = 0.5μHp O + 0.5μAPI − 0.5μpure,H O − 0.5μpure,API 2

2

Hex = 0.5HH2O + 0.5HAPI − 0.5Hpure,H2O − 0.5Hpure,API Here, Hi and Hpure,i are mixture and pure-component enthalpies, respectively. Gex is equivalent to the true mixing free energy minus the ideal entropic component: Gex = Gmix − RT (x1 ln x1 + x2 ln x2). Note the mixing calculation reflects the API molecule in a supercooled liquid state. For compounds where multiple conformations exist, COSMOthermX constructs a Boltzmann-weighted distribution (see COSMOtherm user manual35) of conformers to represent the API components of the mixture (each conformer can have different thermodynamic properties). Since the mixture chemical potentials used to form the Boltzmann weightings depend on solution composition, iteration is required (which the program performs automatically).



HYDROGEN BOND PROPENSITY ANALYSIS The implementation of HBP analysis is described below, using algorithms that detail generation of HBP data, calculation of SD-HBP scores, and calculation of multidifferential HBP (MDHBP) scores. Generation of HBP Data. A single propensity model was built to generate both APIAPI and APIwater interactions, which is important to ensure a fair comparison. (1) Using smiles code for API, add “.[H]-[O]-[H]”, generate structure file (e.g., TmoleX15) and open file in Mercury (2) Autoedit structure to ensure API molecule is represented correctly, run HBP analysis (3) Verify list of donor/acceptor functional groups detected in API molecule/water (4) Generate and select representative sample of CSD crystal structures containing sufficient and even distribution of the functional groups (5) Regress HBP model against selected data set; if needed, adjust sample and repeat to improve model performance (6) Predict propensities using the model Calculation of SD-HBP Score. (1) Determine maximum predicted APIAPI propensity (2) Determine maximum predicted APIwater propensity

METHODS

The Abramov data set (excluding proprietary Pfizer compounds) consists of 50 organic drug molecules. The molecular size ranges from 11 to 70 atoms (average molecular weight of 300); a total of 40 of the molecules form hydrates and 10 are instead non-hydrate forming.5 The AbbVie data set consists of 26 proprietary pipeline compounds that have completed experimental hydrate screening. The molecular size ranges from 34 to 145 atoms (average molecular weight of 550); a total of 14 of the molecules form hydrates and 12 are non-hydrate forming. Figure 1a compares the molecular size distribution between data sets. The ratio of the molecular hydrogen bond donors and acceptors is comparable between the Abramov data set and the AbbVie data set (see Figure 1b for comparison). The number of rotatable (exocyclic) single bonds in a molecule, excluding terminal methyl groups, ranges from 2 to 17, and the main torsion angles within a molecule can range between 179° and −60°. Thus, the AbbVie data set is smaller, but contains more complex and conformational flexible molecules and shows better balance between hydrate-forming and non-hydrate-forming compounds. COSMO-RS. Gex and Hex were calculated for each compound using the COSMOlogic suite of programs.35−38 The first step was to perform COSMO calculations for each molecule. Starting with a molecular SMILES code for each compound, a three-dimensional (3D) structure was generated using TmoleX15, and its molecular geometry was preoptimized at the MOPAC7 AM1 COSMO theory level.37 Next, COSMOthermX (alongside a Linux installation of TURBOMOLE and D

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

Figure 2. Receiver operating characteristic curves for Gex and Hex (TZVP theory level) when used as binary classifiers for each data set, with area under curve (AUC) values indicated; the diagonal line represents zero predictive power, TPR = true positive rate, FPR = false positive rate. The red circles indicate choices for cutoff values that maximize combined sensitivity (TPR) and specificity (1 − FPR); the optimal cutoff is subjective in marginal cases.

(3) SD-HBP score = max-(HBPAPI−API) − max-(HBPAPI−water) (a) More negative SD-HBP score indicates more favorable hydrate formation (note this is analogous to the negative of the previously proposed16,27 multiComponent (MC) score Calculation of MD-HBP Score.

(c) Using solid-state conformation, eliminate unavailable donor/acceptor groups from list (i) Analyze intramolecular hydrogen bonding and steric effects, determine if any donor/ acceptor groups are unavailable for intermolecular hydrogen bonding (ii) Note groups involved in intramolecular hydrogen bonds may still be available if they contain additional hydrogens or lone pairs (3) Consider the top tier of available API donor and acceptor groups from the ranked list (a) One should consider both values of predicted propensities, overlap of their upper/lower bounds, and natural breaks in the list of interaction pairings (4) Estimate the anhydrate set of APIAPI intermolecular hydrogen bonds (a) Each API functional group is assumed able to donate and accept just once, to account for likely steric hindrance between multiple API molecules bonding to the same group (b) Method: select strongest propensity APIAPI interaction for most important API D/A group, eliminate other pairings involving those D and A

(1) Rank API donor and acceptor groups by their predicted APIwater HBPs (water functions as the corresponding acceptor or donor, respectively) (2) Eliminate API donor and acceptor groups that are unavailable for intermolecular hydrogen bonding, by analyzing representative solid-state conformations (a) Possible sources of solid-state conformations: COSMO-RS theory, experimental crystal structures, Gaussian,39 solid-state geometry optimization (e.g., CRYSTAL40) (b) COSMO-RS method: (i) Generate most favorable COSMO-conformations(s) using COSMOconf (ii) Select most favorable hydrate conformation using a COSMOtherm mixture calculation (1:1 API/water) E

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

Figure 3. Gex and Hex values (TZVP theory level) for each data set, divided into hydrate-forming and non-hydrate-forming compounds (AbbVie data set = purple crosses, Abramov data set = open orange diamonds). For each section (hydrate-forming and non-hydrate-forming), data were sorted in descending Gex and Hex values.

overprioritize intramolecular hydrogen bonding at the expense of keeping donors and acceptors available for intermolecular solid-state interactions. This mitigates the risk of forming intramolecular hydrogen bonds during the energy minimization, whose groups would instead form intermolecular hydrogen bonds in a solid form; essentially, we expect the 10 lowest energy COSMO-conformers to represent possible solid-state conformations. To further mitigate this risk, we perform a COSMOtherm mixture calculation (equimolar API/water) and select the dominant conformer. We expect this method of COSMOconf-generation and COSMOtherm-selection to produce conformers with potential for favorable intermolecular hydrogen bonds. Other conformer-generation methods may be used instead, providing they produce conformations that are sufficiently representative of those likely to exist in the solid state. Guiding our calculation of the MD-HBP score according to the accessible groups on this dominant conformer thus reinforces its validity.

groups, repeat until all top tier D/A’s have pairings, if possible. Note this hierarchical approach is similar to that of Musumeci et al.32 (5) MD-HBP score = ∑D,A HBPAPI−API − HBPAPI−water (a) Sum over the top tier API D/A groups (b) More negative MD-HBP score indicates greater probability of hydrate formation The goal of the MD-HBP score is to approach a network description of the possible solid-state hydrogen bond landscape and determine whether an anhydrous or hydrated solid form provides the better network. Utilizing the full scope of calculated HBPs is able to achieve this, but step 2 in the algorithm above reinforces the accuracy of the estimated network by excluding donors/acceptors that are unavailable for intermolecular hydrogen bonds. Since COSMO-RS evaluates thermodynamic statistics of interacting surface segments and molecular surface interactions should influence solid-state conformations, leveraging COSMO-RS theory to estimate conformations offers a better approach than performing the conformational search for an isolated molecule. The virtual conductor calculation tends to enhance polarization compared to that of an isolated molecule, so COSMOconf-generated conformations should not



RESULTS COSMO-RS. Figure 2 indicates receiver operating characteristic (ROC) curves for both Gex and Hex (TZVP theory level) when used to classify each data set. An ROC curve plots the F

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

Figure 4. Receiver operating characteristic curves for SD-HBP and MD-HBP scores when used as binary classifiers for each data set, with area under curve (AUC) values indicated; the diagonal line represents zero predictive power, TPR = true positive rate, FPR = false positive rate. The red circles indicate choices for cutoff values that maximize combined sensitivity (TPR) and specificity (1 − FPR); the optimal cutoff is subjective in marginal cases.

API−API lattice interactions. In effect, water might be trapped as the crystal grows, particularly in void spaces resulting from complex molecular shapes. Crystallization is typically governed by such attachment/detachment kinetics;41,42 while the methods presented here principally provide thermodynamic insight into favorable lattice interactions, they should correlate with formation kinetics. COSMO-RS. Both Gex and Hex quantities show the ability to effectively discriminate between hydrate-forming and nonhydrate-forming compounds for the Abramov data set, as evidenced by the prior work5 and this reproduction. Indeed, results show very close quantitative agreement to the previously calculated values, which confirms the reproducibility of the technique. Certain molecules do show significant differences, however, such as citric acid. We expect such discrepancies to result primarily from a difference in conformation-generation methods, since intramolecular hydrogen bonding may significantly alter the σ-profile of the API molecule by affecting exposed polar regions. As discussed in the Methods on MD-HBP score calculation, it is critical to ensure the conformation is representative of the solid state, which COSMO-RS theory can achieve. Second, predictions were compared at each theory level for which COSMOtherm is parametrized. ROC curves for the SVP and TZVPD-FINE theory levels are contained in the

true positive rate (TPR) against the false positive rate (FPR). The area under curve (AUC) values are indicated, which describe the predictive capacity of each qualifier; AUC = 0.5 corresponds to random chance and AUC = 1.0 a perfect classifier (whose ROC curve would exist as a Heaviside step function). Figure 3 plots the Gex and Hex data for each data set, split into hydrate-forming and non-hydrate-forming compounds. This provides a visual cue on the degree to which each quantity can (or cannot) provide effective discrimination. HBP. Figure 4 shows ROC curves for both SD-HBP and MD-HBP scores calculated across each data set. Figure 5 plots these quantities with the data sets segregated by hydrateforming ability (refer to the Supporting Information for details of these calculations).



DISCUSSION A principal goal of hydrate screening is to determine all possible hydrates. Considering ROC curves, a useful model maximizes the true positive rate; i.e., hydrate-forming compounds are successfully identified; they could then be subjected to more extensive experimental investigation. One general trend that is apparent from the AbbVie data set is that the larger molecules almost exclusively form hydrates. A mechanistic explanation for this effect is a lower degree of desolvation required for an approaching growth unit to form G

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

Figure 5. MD-HBP score values for each data set, divided into hydrate-forming and non-hydrate-forming compounds (AbbVie data set = purple crosses, Abramov data set = open orange diamonds). Optimal cutoff values are indicated for each data set. For each section (hydrate-forming and non-hydrate-forming), data were sorted in descending MD-HBP values.

Table 2. Performance Indicators of Hex and Gex as Binary Classifiers for the AbbVie Dataset, at Each Theory Levela

Supporting Information. Table 1 contains AUC values and optimal classifier cutoffs with the TPR and FPR indicated in

Hex

Table 1. Performance Indicators of Hex and Gex as Binary Classifiers for the Abramov Dataset, at Each Theory Levela Hex

AUC cutoff TPR (%) FPR (%)

Gex

SVP

TZVP

TZVPDFINE

0.75 0.05 98 40

0.70 0.26 98 40

0.85 0.24 93 40

SVP

TZVP

TZVPDFINE

0.77 0.17 78 30

0.73 0.20 65 30

0.79 0.29 80 30

AUC cutoff TPR (%) FPR (%)

Gex

SVP

TZVP

TZVPDFINE

0.63 −0.16 57 33

0.54 −0.08 43 17

0.60 −0.62 50 25

SVP

TZVP

TZVPDFINE

0.63 0.20 57 33

0.69 0.29 71 33

0.64 0.20 64 33

a

AUC = area under curve, cutoff in kcal/mol, TPR = true positive rate, FPR = false positive rate.

a

AUC = area under curve, cutoff in kcal/mol, TPR = true positive rate, FPR = false positive rate.

to complex molecules, the higher theory levels (TZVP and TZVPD-FINE) do not provide improved results for the AbbVie data set; COSMO-RS solution thermodynamics appear incapable of classifying this data set according to hydrate formation. A variety of issues may explain this fundamental failure of Gex and Hex to discriminate between hydrate-forming and nonhydrate-forming compounds for the AbbVie data set; these drawbacks may be more significant with increased molecular complexity and size, supporting the difference in success from the Abramov data set. First, the accuracy of thermodynamic quantities predicted using the COSMOtherm software is stated to be 0.5 kcal/mol.43,44 For Gex, this error is comparable to the spread in calculated values across all compounds. Unless the error is systematic, it casts significant doubt on the ability for Gex to be used effectively. Although entropic effects of hydrate crystallization may be significant, we expect entropy of fusion terms to dominate, rather than mixing in solution; this explains why AUC values are not systematically better for Gex than Hex. The calculated Hex values offer significantly larger variation than 0.5 kcal/mol, which suggests this parameter is more likely to be an effective classifier, given the stated accuracy. Indeed, the

each case (these cutoffs maximize combined sensitivity and specificity, but are subjective in marginal cases according to the inherent trade-off). Note that these values should not be directly compared to those reported by Abramov,5 since the proprietary Pfizer compounds are omitted from the data set; broad agreement remains, however. From AUC values alone, it is difficult to select the most effective parameter and theory level. Interestingly, the cheap SVP calculations provide roughly equivalent discrimination ability to the TZVP level at a fraction of the computational expense; although cutoff values are significantly shifted, a binary classifier only requires relative Gex or Hex values. While this suggests cheap theory level screening could be effective, we caution that as molecular complexity rises, higher theory levels may be necessary for an accurate description of solution thermodynamics. The performance of Gex and Hex for the AbbVie data set, however, is far less satisfying. Table 2 indicates AUC values and classifier cutoffs; these quantities fail to offer useful discrimination and are barely better than random chance. Despite our aforementioned caution against applying cheap SVP screening H

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

Hex AUC value of 0.85 (Abramov data set) at the highest theory level supports Hex superiority. Second, the Gex and Hex calculations assumed an API/water stoichiometry of 1:1. This ensures the mixing thermodynamics reflect the potential for favorable APIwater interactions, but precludes accounting for hydrate formation where stoichiometry is important, such as completing hydrogen bonds on API donor/acceptor groups that could not otherwise be satisfied. This problem may be exacerbated for larger API molecules. We investigated calculating Gex and Hex at different stoichiometry, without any success. There are three key issues to nonequimolar calculations. First, it compromises the comparison between different compounds and impairs the ability to construct a binary classifier for the full data set. Second, it is difficult to determine the stoichiometry to select a priori; indeed, even where the hydrate stoichiometry was known, there was no observable minimum in free energy. Third, the shapes of the Gex and Hex curves against composition can be extremely sensitive to the theory level employed. For these reasons, we limited our calculations to equimolar mixtures, recognizing the potential inability to describe a hydrate solid form where composition is important. Third, the molecular orientation is important in the solid state, especially for polar interactions. This effect is lost in solution where COSMO-RS theory assumes each segment interacts with one another. While this assumption may be applicable for small molecules or those with equally donating/ accepting character, its validity is questionable for complex molecules or where lattice orientation is critical to describe the solid-state energy landscape. Hydrate formation, in particular, is likely to be significantly influenced by the relative potential of hydrogen bond networks in various solid forms, which are by definition orientation-specific. The greater success of COSMOtherm Gex and Hex for coformer prediction5,8,21 may be that the phenomenon of cocrystallization is more influenced by the general degree to which different molecules can favorably interact. Finally, the ability of each segment to interact with each other neglects effects of steric hindrance and intramolecular hydrogen bonding, either of which may cause polar regions of the molecule to be inaccessible for intermolecular interactions. Affecting such polar regions may significantly alter the σ-profile and predicted thermodynamics. Furthermore, polar regions that are inaccessible to fellow API molecules may be open to smaller water molecules; this disparity in accessibility may skew the solid-state reality from information that can be obtained from COSMOtherm predictions. Again, these effects may be more pronounced for larger API molecules, which could explain the failure of Gex and Hex classifiers for the AbbVie data set, while they show promise for the Abramov data set. HBP. As evidenced by poor ROC AUC values in Figure 4, the SD-HBP score does not produce any useful discrimination between hydrate-forming and non-hydrate-forming compounds, for either data set. The MD-HBP score, however, offers powerful discrimination, not only for the Abramov data set, but also for the AbbVie data set that was not well described by the COSMOtherm Gex and Hex predictors. The success of this MD-HBP score likely derives from its ability to capture various important characteristics. Supramolecular synthons or structural motifs are often formed by known intermolecular interactions that express core features of a crystal structure, specifically hydrogen bonding; this MD-HBP method is able to account for all significant donoracceptor

pairings. Analyzing which donor/acceptor groups are available on a representative solid-state conformation (determined using COSMO-RS techniques) further increases the fidelity of this approach. While hydrogen bond propensities do not directly correspond to a quantitative measure of the interaction strength, relative comparisons should correlate, i.e., those pairings with highest predicted propensity appear most regularly throughout similar crystal structures and thus should represent the strongest supramolecular synthons. The optimal cutoff MD-HBP scores are −0.4 for the AbbVie data set (79% TPR, 25% FPR) and −0.2 for the Abramov data set (80% TPR, 30% FPR), as shown by Figures 4 and 5. The fact that these are both negative implies APIwater hydrogen bonds must significantly outcompete APIAPI hydrogen bonds for hydrate formation to be favorable. This may be due to a disruption of dispersive lattice interactions when water molecules are present. Note that if the dispersive APIAPI interactions represent the strongest anhydrate solid-state synthons (instead of APIAPI hydrogen bonds), the MD-HBP score technique may be less effective as a predictor of hydrate formation. While this data-driven method of analyzing hydrogen bond propensities shows great promise at differentiating between hydrate-forming and non-hydrate-forming compounds, there remain issues in its application. First, determining the set of donors/acceptors to include in the top tier (i.e., those that are important to consider) can be difficult where significant overlap exists between predicted propensities and/or their upper and lower bounds. One can mitigate this issue by conducting a sensitivity analysis, i.e., checking to see how the MD-HBP score would change if the next strongest donor/acceptor were accounted for. A more systematic approach of dividing donors/ acceptors into tiers of importance could be to construct a Gaussian mixture model that can describe natural clusters and breaks within the data45 Second, the MD-HBP score requires a set of APIAPI hydrogen bonds in an anhydrate form. While the method proposed in the algorithm above can provide a rough estimate before anhydrate crystal structures are known, it requires physical intuition, and one should check the resulting MD-HBP score’s sensitivity to assumptions about the APIAPI network. Incorporating established models46 of hydrogen-bond coordination might be able to increase accuracy and guide this intuition. Ultimately, however, a revised APIAPI network could be incorporated following discovery of anhydrate crystal structures. Third, this approach implicitly assumed water to be an equally strong hydrogen-bond donor as it is an acceptor. One could feasibly incorporate a different description of water (e.g., the Della Volpe−Siboni scale47,48) and use this to preferentially select top tier API donor/acceptor groups as appropriate. Finally, this method principally aims to describe lattice hydrates, where the water molecules occupy regular positions and the APIwater hydrogen bonds with highest predicted propensities can be realized. The formation of channel hydrates, on the other hand, may be more related to lattice void space.19 For example, AbbVie Compound 22 forms a channel hydrate, which may explain the failure of Hex and MD-HBP score classifiers to predict its behavior correctly. It would be informative to obtain further data on the types of hydrate49 formed for each compound, to ensure appropriate in silico approaches are applied to capture different physical effects. I

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design



Article

ORCID

CONCLUSIONS While COSMOtherm Gex and Hex act as effective discriminators for the Abramov data set, they do not show similar predictive power for the AbbVie data set that represents a more complex class of drug molecules. The data-driven hydrogen bond propensity approach also fails to provide effective discrimination when using the simplistic SD-HBP score. However, the MD-HPB score shows predictive power for both data sets. This technique to interpret HBP output data accounts for important donors/acceptors and competitive effects in hydrogen bond formation; it also analyzes whether groups are accessible, guided by molecular conformations determined from COSMOlogic software calculations. In summary, a combined quantum mechanics and data-driven approach was demonstrated to be successful in predicting hydrate formation of organic molecular crystals. It should not be surprising that a description of hydrogen bonding is required to describe lattice hydrate formation, where APIwater synthons must be important. Furthermore, the presented framework can adopt further physical intuition to increase predictive fidelity by better accounting for steric effects and the most significant polar groups present. Improved and increased data (i.e., types of hydrates, stoichiometry, ease of formation etc.) may be able to guide such intuition and refine optimal cutoffs. With significantly expanded data sets, one could also begin to convert MD-HBP scores into quantitative probabilities of hydrate formation. A collaboration between pharmaceutical companies might provide such high-quality data, considering dead/historic compounds that have undergone extensive experimental hydrate screening. This analysis can be adopted early in drug development workflow, before crystal structures are determined. If a compound is predicted to have a strong likelihood of hydrate formation, a thorough experimental screening can be enacted to sufficiently derisk the eventually selected solid form, leading to a more targeted use of experimental resources. Beyond the initial hydrate screening, MD-HBP analysis can be implemented and updated throughout solid form selection and risk assessment. Once anhydrate crystal structures are determined, one could revise the solid-state conformation from the COSMO-RS predictions to ensure accuracy (though it remains advisable to consider possible conformations beyond those experimentally observed). Analyzing voids present within the lattice might provide insight into potential for channel hydrates.22 Finally, applying Full Interaction Maps50 within Mercury9 could validate or suggest refinements to the selection of anhydrate and hydrate hydrogen bonding networks used to calculate the MD-HBP score for each compound.



Carl J. Tilbury: 0000-0002-4736-1239 Jie Chen: 0000-0003-0035-9649 Notes

All authors are employees or former employees AbbVie. The design, study conduct, and financial support for the clinical trial was provided by AbbVie. AbbVie participated in the interpretation of data, review, and approval of the publication. The authors declare no competing financial interest.



(1) Stahly, G. P. Diversity in Single- and Multiple-Component Crystals. The Search for and Prevalence of Polymorphs and Cocrystals. Cryst. Growth Des. 2007, 7, 1007−1026. (2) Threlfall, T. L. Analysis of organic polymorphs. A review. Analyst 1995, 120 (10), 2435−2460. (3) Hilfiker, R.; Blatter, F.; von Raumer, M. Relevance of solid-state properties for pharmaceutical products. Polymorphism in the Pharmaceutical Industry 2006, 1.110.1002/3527607889.ch1 (4) Pudipeddi, M.; Serajuddin, A. T. M. Trends in solubility of polymorphs. J. Pharm. Sci. 2005, 94, 929−939. (5) Abramov, Y. A. Virtual hydrate screening and coformer selection for improved relative humidity stability. CrystEngComm 2015, 17 (28), 5216−5224. (6) Klamt, A. Conductor-like Screening Model for Real Solvents: A New Approach to the Quantitative Calculation of Solvation Phenomena. J. Phys. Chem. 1995, 99, 2224−2235. (7) Klamt, A.; Jonas, V.; Bürger, T.; Lohrenz, J. C. W. Refinement and Parametrization of COSMO-RS. J. Phys. Chem. A 1998, 102, 5074−5085. (8) Loschen, C.; Klamt, A. New Developments in Prediction of Solid-State Solubility and Cocrystallization Using COSMO-RS Theory. In Computational pharmaceutical Solid State Chemistry; Abramov, Y. A., Ed.; John Wiley & Sons: Hoboken, NJ, 2016; Chapter 9, pp 211−233. (9) Macrae, C. F.; Bruno, I. J.; Chisholm, J. A.; Edgington, P. R.; McCabe, P.; Pidcock, E.; Rodriguez-Monge, L.; Taylor, R.; van de Streek, J.; Wood, P. A. it Mercury CSD 2.0 – new features for the visualization and investigation of crystal structures. J. Appl. Crystallogr. 2008, 41, 466−470. (10) Galek, P. T. A.; Fábián, L.; Motherwell, W. D. S.; Allen, F. H.; Feeder, N. Knowledge-based model of hydrogen-bonding propensity in organic crystals. Acta Crystallogr., Sect. B: Struct. Sci. 2007, 63, 768− 782. (11) Galek, P. T. A.; Allen, F. H.; Fabian, L.; Feeder, N. Knowledgebased H-bond prediction to aid experimental polymorph screening. CrystEngComm 2009, 11 (12), 2634−2639. (12) Galek, P. T. A.; Fábián, L.; Allen, F. H. Persistent hydrogen bonding in polymorphic crystal structures. Acta Crystallogr., Sect. B: Struct. Sci. 2009, 65, 68−85. (13) Galek, P. T. A.; Fabian, L.; Allen, F. H. Truly prospective prediction: inter- and intramolecular hydrogen bonding. CrystEngComm 2010, 12 (7), 2091−2099. (14) Allen, F. H. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallogr., Sect. B: Struct. Sci. 2002, 58, 380−388. (15) Abramov, Y. A. Current Computational Approaches to Support Pharmaceutical Solid Form Selection. Org. Process Res. Dev. 2013, 17, 472−485. (16) Galek, P. T. A.; Pidcock, E.; Wood, P. A.; Feeder, N.; Allen, F. H. Navigating the Solid Form Landscape with Structural Informatics. In Computational Pharmaceutical Solid State Chemistry; Abramov, Y. A., Ed.; John Wiley & Sons: Hoboken, NJ, 2016; Chapter 2, pp 15−35. (17) Desiraju, G. R. Supramolecular Synthons in Crystal EngineeringA New Organic Synthesis. Angew. Chem., Int. Ed. Engl. 1995, 34, 2311−2327.

ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.cgd.7b00517. HBP output data; SD-HBP and MD-HBP score calculations (XLSX) COSMO-RS Gex and Hex values at SVP, TZVP, and TZVPD-FINE theory levels; ROC curves for SVP and TZVPD-FINE; TZVPD-FINE (PDF)



REFERENCES

AUTHOR INFORMATION

Corresponding Author

*Phone: +1 847 937 7883. E-mail: [email protected]. J

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX

Crystal Growth & Design

Article

A program for the ab initio investigation of crystalline solids. Int. J. Quantum Chem. 2014, 114, 1287−1317. (41) Chernov, A. A.; Rashkovich, L. N.; Yoreo, D. a. J. J. ABC of Kink Kinetics and Density in a Complex Solution. AIP Conf. Proc. 2007, 916, 34−47. (42) Li, J.; Tilbury, C. J.; Joswiak, M. N.; Peters, B.; Doherty, M. F. Rate Expressions for Kink Attachment and Detachment During Crystal Growth. Cryst. Growth Des. 2016, 16, 3313−3322. (43) Klamt, A.; Eckert, F.; Diedenhofen, M. Prediction of the Free Energy of Hydration of a Challenging Set of Pesticide-Like Compounds. J. Phys. Chem. B 2009, 113, 4508−4510 PMID: 19275198.. (44) Klamt, A.; Mennucci, B.; Tomasi, J.; Barone, V.; Curutchet, C.; Orozco, M.; Luque, F. J. On the Performance of Continuum Solvation Methods. A Comment on “Universal Approaches to Solvation Modeling. Acc. Chem. Res. 2009, 42, 489−492 PMID: 19222200.. (45) Fraley, C.; Raftery, A. E. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. Comput. J. 1998, 41, 578−588. (46) Galek, P. T. A.; Chisholm, J. A.; Pidcock, E.; Wood, P. A. Hydrogen-bond coordination in organic crystal structures: statistics, predictions and applications. Acta Crystallogr., Sect. B: Struct. Sci., Cryst. Eng. Mater. 2014, 70, 91−105. (47) Della Volpe, C.; Siboni, S. Acid-base surface free energies of solids and the definition of scales in the Good-van Oss-Chaudhury theory. J. Adhes. Sci. Technol. 2000, 14, 235−272. (48) Della Volpe, C.; Siboni, S. Troubleshooting of surface free energy acid-base theory applied to solid surfaces: The case of Good, van Oss and Chaudhury theory. In Acid-Base Interactions: Relevance to Adhesion Science and Technology; Mittal, K. L., Ed.; VSP: Ultrecht, The Netherlands, 2000; Vol. 2, pp 55−91. (49) Mascal, M.; Infantes, L.; Chisholm, J. Water Oligomers in Crystal HydratesWhat’s News and What Isn’t? Angew. Chem., Int. Ed. 2006, 45, 32−36. (50) Wood, P. A.; Olsson, T. S. G.; Cole, J. C.; Cottrell, S. J.; Feeder, N.; Galek, P. T. A.; Groom, C. R.; Pidcock, E. Evaluation of molecular crystal structures using Full Interaction Maps. CrystEngComm 2013, 15 (1), 65−72.

(18) Delori, A.; Galek, P. T. A.; Pidcock, E.; Patni, M.; Jones, W. Knowledge-based hydrogen bond prediction and the synthesis of salts and cocrystals of the anti-malarial drug pyrimethamine with various drug and GRAS molecules. CrystEngComm 2013, 15 (15), 2916−2928. (19) Infantes, L.; Chisholm, J.; Motherwell, S. Extended motifs from water and chemical functional groups in organic molecular crystals. CrystEngComm 2003, 5 (85), 480−486. (20) Gavezzotti, A.; Colombo, V.; Lo Presti, L. Facts and Factors in the Formation and Stability of Binary Crystals. Cryst. Growth Des. 2016, 16, 6095−6104. (21) Abramov, Y. A.; Loschen, C.; Klamt, A. Rational coformer or solvent selection for pharmaceutical cocrystallization or desolvation. J. Pharm. Sci. 2012, 101, 3687−3697. (22) Price, C. P.; Glick, G. D.; Matzger, A. J. Dissecting the Behavior of a Promiscuous Solvate Former. Angew. Chem., Int. Ed. 2006, 45, 2062−2066. (23) Klamt, A. COSMO-RS: From Quantum Chemistry to Fluid PhaseThermodynamics and Drug Design; Elsevier: Amsterdam, The Netherlands, 2005. (24) Klamt, A. The COSMO and COSMO-RS solvation models. Wiley Interdisciplinary Reviews: Computational Molecular Science 2011, 1, 699−709. (25) Fredenslund, A.; Gmehling, J.; Rasmussen, P. Vapor-Liquid Equilibria Using UNIFAC: A Group-Contribution Method; Elsevier: New York, NY, 1977. (26) Klamt, A.; Schuurmann, G. COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J. Chem. Soc., Perkin Trans. 2 1993, 5, 799−805. (27) Wood, P. A.; Feeder, N.; Furlow, M.; Galek, P. T. A.; Groom, C. R.; Pidcock, E. Knowledge-based approaches to co-crystal design. CrystEngComm 2014, 16 (26), 5839−5848. (28) Etter, M. C. Encoding and decoding hydrogen-bond patterns of organic compounds. Acc. Chem. Res. 1990, 23, 120−126. (29) Etter, M. C. Hydrogen bonds as design elements in organic chemistry. J. Phys. Chem. 1991, 95, 4601−4610. (30) Etter, M. C.; Reutzel, S. M. Hydrogen bond directed cocrystallization and molecular recognition properties of acyclic imides. J. Am. Chem. Soc. 1991, 113, 2586−2598. (31) Abramov, Y. A. Theoretical Hydrogen-Bonding Analysis for Assessment of Physical Stability of Pharmaceutical Solid Forms. In Computational Pharmaceutical Solid State Chemistry; John Wiley & Sons: Hoboken, NJ, 2016; pp 37−56. (32) Musumeci, D.; Hunter, C. A.; Prohens, R.; Scuderi, S.; McCabe, J. F. Virtual cocrystal screening. Chem. Sci. 2011, 2 (5), 883−890. (33) Infantes, L.; Fabian, L.; Motherwell, W. D. S. Organic crystal hydrates: what are the important factors for formation. CrystEngComm 2007, 9 (1), 65−71. (34) van de Streek, J.; Motherwell, S. New software for searching the Cambridge Structural Database for solvated and unsolvated crystal structures applied to hydrates. CrystEngComm 2007, 9 (1), 55−64. (35) COSMOtherm, Version C3.0, Release 16.01; COSMOlogic GmbH & Co. KG, http://www.cosmologic.de. (36) Eckert, F.; Klamt, A. Fast solvent screening via quantum chemistry: COSMO-RS approach. AIChE J. 2002, 48, 369−385. (37) TURBOMOLE 7.1 2016, a development of University of Karlsruhe and Forschungszentrum Karlsruhe GmbH, 1989−2007, TURBOMOLE GmbH, since 2007; available from http://www. turbomole.com. (38) COSMOconf, 4.0, COSMOlogic GmbH & Co KG, http://www. cosmologic.de. (39) Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Petersson, G. A.; Nakatsuji, H. Gaussian Inc. 16, revision A. 03; Gaussian Inc.: Wallingford, CT, 2016. (40) Dovesi, R.; Orlando, R.; Erba, A.; Zicovich-Wilson, C. M.; Civalleri, B.; Casassa, S.; Maschio, L.; Ferrabone, M.; De La Pierre, M.; DÁ rco, P.; Noël, Y.; Causà, M.; Rérat, M.; Kirtman, B. CRYSTAL14: K

DOI: 10.1021/acs.cgd.7b00517 Cryst. Growth Des. XXXX, XXX, XXX−XXX