Solvate Prediction for Pharmaceutical Organic Molecules with Machine Learning

Publication Date (Web): February 15, 2019. Copyright © 2019 American Chemical Society. *E-mail: [email protected]. Cite this:Cryst...

Article

Solvate Prediction for Pharmaceutical Organic Molecules with Machine Learning Dongyue Xin, Nina C. Gonnella, Xiaorong He, and Keith Horspool Cryst. Growth Des., Just Accepted Manuscript • DOI: 10.1021/acs.cgd.8b01883 • Publication Date (Web): 15 Feb 2019 Downloaded from http://pubs.acs.org on February 18, 2019




Crystal Growth & Design

Solvate Prediction for Pharmaceutical Organic Molecules with Machine Learning

Dongyue Xin*, Nina C. Gonnella, Xiaorong He, Keith Horspool

Material and Analytical Sciences, Boehringer Ingelheim Pharmaceuticals, Inc., Ridgefield, Connecticut, USA.

ABSTRACT

Methods to predict crystallization behavior for active pharmaceutical ingredients (APIs) can serve as an important guide in small molecule pharmaceutical development. Here we describe the prediction of solvate formation propensity for pharmaceutical molecules via a machine learning approach. Random forests (RF) and support vector machine (SVM) algorithms were trained and tested with datasets extracted from the Cambridge Structural Database (CSD). The machine learning models, which require only 2D structures as input, were able to predict solvate formation propensity for organic molecules with success rates of up to 86 %. Performance of the models was demonstrated with a collection of 20 pharmaceutical molecules.

*E-mail: [email protected]

ACS Paragon Plus Environment


Introduction:

Solid form control and engineering is one of the key steps in small molecule drug development. During crystallization, solvent molecules can be incorporated into a solute crystal lattice and form solvates.1 About a third of organic molecules are estimated to form solvates or hydrates, and pharmaceutical molecules have even higher propensities to form solvates.2-6 Formation of solvates can have multiple effects in pharmaceutical development.7, 8 Solvates, and in some cases metastable solid forms caused by desolvation of solvates, can introduce risks and challenges in solid form control that would impact crystallization process development. In less common cases, solvates can be developed to achieve better physicochemical properties compared with anhydrous forms.9 Current approaches rely heavily on small-scale experimental screening to discover solvate forms, but new solvates may still appear unexpectedly in large-scale production.10 Therefore, prediction of solvate formation is highly desirable from the perspective of risk assessment, identification of new solid forms, crystallization solvent selection and process development.

Multiple energy-based approaches have been reported for solvate prediction. Techniques of crystal structure prediction have advanced significantly in recent years. With better prediction algorithms and more computing power, solvate structure prediction is now possible. However, such predictions are still considered a significant challenge.11, 12 In addition, crystal structure prediction requires specialized expertise and is usually computationally expensive and time consuming.13 Other methods involving the thermodynamics of solute-solvent interactions have also been applied to investigate solvate formation,14-16 but those models usually do not take into account the conformational and molecular interactions in the solid state.17


As an alternative approach, statistical analysis of existing experimental knowledge can lead to useful structural conclusions on solvate formation. Intermolecular hydrogen bond interactions between solvent and solute molecules have been investigated in several studies for multiple solvate systems.18, 19 Key molecular features for solvate formation could be identified by systematic analysis of existing solvate and non-solvated crystal structures.4, 20 The correlation of molecular properties with solvate formation propensity has been utilized to develop statistical models for solvate prediction.4

Machine learning has been applied to solid state property predictions in multiple studies, such as crystallinity prediction21, crystallization solvent selection22, cocrystal formation23 and cocrystal property prediction24. In the area of solvate prediction, a machine learning model led to the discovery of three new solvates, but the scope of the study was limited to carbamazepine.25 In order to develop generally applicable machine learning models, a larger training dataset with high chemical diversity is needed.

The main aim of this study is to explore two machine learning algorithms, random forests (RF)26 and support vector machine (SVM)27, for solvate prediction of pharmaceutical organic molecules using data from the Cambridge Structural Database (CSD)28. Nine organic solvents that are commonly employed in the crystallization of pharmaceuticals and that have a large number of solvates and non-solvated structures in CSD were investigated. A collection of twenty pharmaceutically relevant molecules was randomly selected from the literature to test the performance of the best models. The results show that machine learning models trained with CSD data entries can be applied to predict solvate formation propensities with good performance.


Methods

Data extraction from CSD

Training and testing datasets were extracted from CSD 2018 (version 5.39) using an approach similar to previous work.4 The CSD was first filtered to retain only entries that are organic, have 3D structures, have a molecular weight between 100 and 700 for the heaviest component, and contain only elements within the group {C, H, O, N, P, S, F, Cl, Br, I} but do not consist solely of H and C, using functions implemented in the CCDC Python API version 1.4.0. The remaining entries were further divided into subgroups based on the number of different molecular species in the crystal structure. Each molecular species was counted only once even if more than one copy exists in the asymmetric unit, so the datasets include crystal structures with Z’ > 1 and solvates with any stoichiometry. Within the single component subgroup, entries were further classified by crystallization solvent and used as non-solvated structures. Only entries crystallized from a single solvent were used in this study to minimize the variance introduced by solvent activity in solvent mixtures. Solvates were identified in the two component subgroup as entries in which one component is a solvent molecule. The non-solvent component of each entry was saved as an individual file with a custom-made script for molecular descriptor calculations. Duplicates were removed to ensure that each structure is unique within each dataset. The data collection process excludes all salts, so the models developed are only applicable to neutral molecules. The numbers of solvate and non-solvate-forming entries were balanced by randomly down sampling the class with more structures.

Descriptor calculation


Three thousand eight hundred and fifty 2D molecular descriptors were calculated for each entry with Dragon 7.029. Descriptors that are not applicable to all of the molecules, or that have constant values for all entries in both the solvate and non-solvate-forming groups for a given solvent, were excluded from subsequent machine learning method development. A summary of the calculated molecular descriptors is given in the supporting information.

Training and testing of machine learning algorithms

Machine learning algorithms were implemented with custom-developed scripts using the scikit-learn package30 version 0.18.1. The solvate and non-solvate-forming groups for each solvent were randomly split into training and testing datasets with a ratio of 80/20. A robust scaler, which removes the median and scales the data according to the quantile range as implemented in the scikit-learn package, was applied to the datasets to standardize the input data prior to training and testing. Since the problem of interest is to determine whether a molecule tends to form a solvate in a particular solvent, the machine learning models were trained as classifiers to differentiate solvates from non-solvate-forming molecules. In training the machine learning models, solvates were labeled as 1 while non-solvated structures were labeled as 0. For each machine learning model, training was performed using ten-fold cross validation to optimize the hyperparameters, with the accuracy of the classifier as the score function. The number of trees was optimized for the RF model, the penalty parameter C for the SVM model with a linear kernel, and both C and γ for the SVM model with a radial basis function (RBF) kernel. The best hyperparameters for each model are summarized in Table S1 (supporting information). After hyperparameter optimization, each model was re-trained with the whole training dataset using the optimal hyperparameters to generate a final model for testing.
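The training workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration rather than the scripts used in the study: the data are synthetic, the hyperparameter grids are hypothetical (the optimized values are reported in Table S1), the stratified split is an assumption, and placing the robust scaler inside a pipeline so that it is fitted on training folds only is a design choice made here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in for a balanced descriptor matrix
# (1 = solvate, 0 = non-solvated structure).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=8, random_state=0)

# 80/20 split; stratification is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hypothetical search grids, one per model type.
candidates = {
    "RF": (RandomForestClassifier(random_state=0),
           {"clf__n_estimators": [25, 50]}),
    "SVM(Linear)": (SVC(kernel="linear"),
                    {"clf__C": [0.01, 0.1, 1.0]}),
    "SVM(RBF)": (SVC(kernel="rbf"),
                 {"clf__C": [1.0, 10.0], "clf__gamma": [0.001, 0.01]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    pipeline = Pipeline([("scale", RobustScaler()), ("clf", estimator)])
    # Ten-fold cross validation with accuracy as the score function.
    search = GridSearchCV(pipeline, grid, cv=10, scoring="accuracy")
    # With the default refit=True, the best setting is re-trained
    # on the whole training dataset after the search.
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": search.score(X_test, y_test),
    }
```

Each entry of `results` then holds the selected hyperparameters, the mean cross validation accuracy and the accuracy on the held-out 20 % for one model type.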


Class probabilities of test entries can be computed for each machine learning model investigated, and entries with a predicted solvate probability > 0.5 are classified as solvates. Receiver operating characteristic (ROC) curves can be generated using the probability estimates, and the area under the curve (AUC) can be calculated.

Feature importance analysis

Feature importance of a random forest model can be estimated by averaging the ranking of a feature in each decision tree over the randomized trees. A higher ranking indicates a larger contribution of that feature to the prediction function. These estimates were retrieved from the fitted random forest models as implemented in the scikit-learn package.

Solvate prediction with two descriptor models

Solvate prediction with two descriptor models was performed as described in the original study.4 The molecular descriptor pairs AVS2_H2/nHDon, TRS/nHDon, SM3_H2/Hy and SM3_H2/H-050 were calculated for the ethanol, methanol, dichloromethane and chloroform models, respectively. The descriptor values were then fed into a logistic function, using the intercept and descriptor coefficients reported in the original work, to calculate solvate formation probabilities. The final solvate/non-solvate predictions were obtained by comparing the solvate formation probabilities with the cutoff points described in the original study. The calculated molecular descriptors, probabilities and model performance are summarized in the supporting information.
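The two descriptor calculation can be illustrated with a short sketch of a logistic model. The intercept, coefficients, descriptor values and cutoff below are hypothetical numbers chosen for illustration only; they are not the values published in the original work, and the use of a strict "greater than" comparison with the cutoff is an assumption.

```python
import math


def solvate_probability(intercept, coefficients, descriptors):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, descriptors))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, cutoff):
    """Compare a probability with a model-specific cutoff point."""
    return "solvate" if probability > cutoff else "non-solvate"


# Hypothetical inputs; not the published coefficients or cutoffs.
p = solvate_probability(intercept=-1.0,
                        coefficients=[0.5, 0.8],
                        descriptors=[1.2, 2.0])
label = classify(p, cutoff=0.5)
```

With these illustrative inputs the linear term is 1.2, which gives a probability of about 0.77 and therefore the label "solvate".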

Results and Discussions

The current version of CSD (2018) contains more than 900,000 entries of molecular crystal structures, mostly for organic and organometallic compounds. Entries used in the training and testing of machine learning models were generated via stepwise filtering of CSD (Figure 1) using in-house scripts, in an approach similar to that described in previously published work.4 Compared with previous work, we introduced more filters to generate datasets that are more relevant for pharmaceutical small molecules. Specifically, the molecular weight of the heaviest component was restricted to 100 – 700 and the elemental composition was restricted to be within {C, H, O, N, P, S, F, Cl, Br, I}, a collection of elements typically found in drug molecules.31

[Figure 1 flowchart: CSD → filters (organic; have 3D structure; molecular weight of heaviest component; elemental composition) → number of components in structure → two components, classified based on solvate → solvates; single component, classified by crystallization solvent → non-solvated structures.]

Figure 1. Dataset collection from CSD
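The entry-level filters in Figure 1 can be expressed as a simple predicate. The sketch below is plain Python and does not reproduce the CCDC Python API calls used in the study; the function name, its arguments and the inclusive treatment of the 100 and 700 limits are assumptions made for illustration.

```python
ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I"}


def passes_filters(is_organic, has_3d_structure,
                   heaviest_component_mw, elements):
    """Return True if an entry satisfies the filters described in the text."""
    elements = set(elements)
    return (
        is_organic
        and has_3d_structure
        # Molecular weight window for the heaviest component.
        and 100 <= heaviest_component_mw <= 700
        # Only elements typically found in drug molecules.
        and elements <= ALLOWED_ELEMENTS
        # Exclude entries that consist solely of C and H.
        and not elements <= {"C", "H"}
    )
```

For example, an organic entry with a 3D structure, a molecular weight of about 236 and the elements C, H, N and O passes, whereas a pure hydrocarbon or a silicon-containing molecule does not.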

In developing the machine learning models, the positive and negative entries were balanced by down sampling the category with more entries. For instance, in the case of ethanol, the positive dataset comprises 646 ethanol solvates, while a much larger number of non-solvate-forming structures (6296) exists in CSD. To balance the positive dataset, 650 structures were randomly selected from the EtOH non-solvate-forming group as the negative dataset for EtOH solvate prediction. Training and testing datasets were generated by randomly splitting the positive and negative datasets with an approximately 8:2 ratio. Table 1 shows the total number of structures from data mining as well as the number of structures in the training and testing datasets. Since the numbers of solvate and non-solvate-forming structures in CSD vary significantly between solvents, the number of training entries for each solvent ranges from 370 (THF) to 2754 (MeOH), while 92 (THF) – 683 (MeOH) entries were saved for testing. We generated molecular descriptors with Dragon 7 as input for the machine learning algorithms, and only 2D descriptors were used so that the models do not depend on molecular conformations.
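The balancing and splitting steps for the ethanol case can be sketched as follows. The entry identifiers are placeholders rather than CSD refcodes, and the helper names, the fixed random seed and the rounding of the test set size are assumptions of this sketch.

```python
import random


def downsample(entries, n, seed=0):
    """Randomly keep n entries without replacement."""
    return random.Random(seed).sample(list(entries), n)


def split(entries, test_fraction=0.2, seed=0):
    """Shuffle the entries and return (training, testing) lists."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder identifiers; the counts are the EtOH numbers from the text.
solvates = [f"solvate_{i}" for i in range(646)]
non_solvates = downsample((f"nonsolvate_{i}" for i in range(6296)), 650)

train_pos, test_pos = split(solvates)
train_neg, test_neg = split(non_solvates)
```

With a 0.2 test fraction this yields 517/129 solvates and 520/130 non-solvate-forming structures for training/testing, which matches the EtOH row of Table 1.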

Table 1. Training and Testing Datasets

Solvent    CSD                    Training               Testing
           Solvate   Nonsolvate   Solvate   Nonsolvate   Solvate   Nonsolvate
Acetone    508       1497         407       410          101       100
ACN        521       1090         417       420          104       105
CHCl3      853       1341         683       680          170       170
DCM        902       1487         722       720          180       180
DMF        543       383          310       307          80        76
EtOAc      246       2338         197       200          49        50
EtOH       646       6296         517       520          129       130
MeOH       1717      3803         1374      1380         343       340
THF        231       248          185       185          46        46

In this study, we investigated two machine learning classification algorithms, random forests (RF)26 and support vector machine (SVM)27, to differentiate solvate and non-solvate-forming structures. Hyperparameter optimization for each model was performed by a grid search with ten-fold cross validation, using mean prediction accuracy as the measure of model performance. The final models for testing and further predictions were obtained by re-training each algorithm with the whole training dataset using the optimized hyperparameters. Table 2 lists the mean cross validation scores and standard deviations of the best models for each solvent after hyperparameter optimization.

Table 2. Best mean cross validation scores and standard deviations for different models.

Solvent    RF                      SVM(Linear)             SVM(RBF)
           Mean score   Std dev    Mean score   Std dev    Mean score   Std dev
Acetone    0.81         0.038      0.80         0.049      0.80         0.056
ACN        0.72         0.051      0.71         0.055      0.72         0.057
CHCl3      0.77         0.032      0.74         0.035      0.75         0.025
DCM        0.75         0.019      0.75         0.033      0.76         0.026
DMF        0.78         0.041      0.78         0.041      0.77         0.037
EtOAc      0.79         0.048      0.78         0.065      0.78         0.037
EtOH       0.78         0.024      0.78         0.027      0.74         0.018
MeOH       0.75         0.019      0.75         0.019      0.74         0.023
THF        0.78         0.073      0.77         0.044      0.77         0.057

The performance of the best models on unseen molecules was investigated with the testing datasets. Besides the prediction accuracy score, the receiver operating characteristic (ROC) curve, which plots the true-positive rate versus the false-positive rate at various discrimination thresholds for each model, can also show the model’s ability to differentiate between positive and negative entries. The area under the ROC curve (AUC) can be employed as a single numeric measure to compare the overall performance of different algorithms, and a higher AUC value represents a better classification model.32 Figure 2 shows the ROC curves and AUC values of the three best models for ethanol solvate prediction. A larger AUC, together with an ROC curve closer to the upper left corner of the plot, indicates that the RF and SVM (linear) models perform better than the SVM (RBF) model.
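As a brief illustration of how an ROC curve and its AUC are obtained from probability estimates, the following sketch uses scikit-learn on a four-entry toy example. The labels and scores are made up for illustration and are unrelated to the datasets in this study.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels (1 = solvate) and predicted solvate probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False-positive and true-positive rates at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve.
auc = roc_auc_score(y_true, y_score)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.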

[Figure 2 plot: ROC curves, true positive rate versus false positive rate. SVM(Linear), AUC = 0.87; RF, AUC = 0.86; SVM(RBF), AUC = 0.83.]

Figure 2. ROC curve and AUC for EtOH solvate prediction.

For detailed performance analysis on each class, confusion matrices were generated for each model on each testing dataset. A confusion matrix provides the numbers of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) entries and enables the calculation of success rates for each class. As an example, in the confusion matrix for the RF EtOH solvate prediction model (Table 3), the true negative rate for the non-solvated class is 102 / (102 + 28) = 78.5 % and the true positive rate for the solvate class is 107 / (107 + 22) = 82.9 %. Balanced true negative and true positive rates indicate that this model performs equally well for both solvate and non-solvate-forming molecules. The confusion matrices for all models can be found in the supporting information, and no significant bias towards any particular class was observed for any model.

Table 3. Confusion matrix of RF EtOH solvate prediction model on testing dataset.

                     Predicted non-solvate   Predicted solvate
Actual non-solvate   102 (TN)                28 (FP)
Actual solvate       22 (FN)                 107 (TP)

TN: true negative; FP: false positive; FN: false negative; TP: true positive.
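The per-class rates quoted for Table 3, and the overall accuracy they imply, follow directly from the four counts. A minimal sketch (the function name is an assumption of this example):

```python
def confusion_rates(tn, fp, fn, tp):
    """Return (true negative rate, true positive rate, accuracy)."""
    true_negative_rate = tn / (tn + fp)
    true_positive_rate = tp / (tp + fn)
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return true_negative_rate, true_positive_rate, accuracy


# Counts from Table 3 (RF EtOH model on the testing dataset).
tnr, tpr, acc = confusion_rates(tn=102, fp=28, fn=22, tp=107)
```

This gives a true negative rate of about 0.785, a true positive rate of about 0.829 and an accuracy of about 0.807 over the 259 test entries.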

Prediction accuracy scores and AUC values for all testing datasets are summarized in Table 4. Overall, all three models have classification success rates of approximately 70 % to 86 % for unseen molecules. Of the nine solvents, the acetone, EtOH, EtOAc and DMF solvate prediction models yield the best performance on the testing datasets, with > 80 % success rate and > 0.85 AUC, while the ACN models are less accurate. The testing scores are similar to the cross validation training scores (Table 2), indicating that the models could be applied to unseen molecules with good performance.

Table 4. Testing Score and AUC of Different Models.*

Solvent    RF               SVM(Linear)      SVM(RBF)
           Score   AUC      Score   AUC      Score   AUC
Acetone    0.82    0.90     0.82    0.88     0.79    0.87
ACN        0.70    0.77     0.70    0.77     0.69    0.76
CHCl3      0.78    0.82     0.76    0.82     0.75    0.83
DCM        0.77    0.83     0.75    0.83     0.74    0.82
DMF        0.86    0.93     0.79    0.88     0.82    0.90
EtOAc      0.84    0.89     0.76    0.84     0.79    0.84
EtOH       0.81    0.86     0.80    0.87     0.76    0.83
MeOH       0.78    0.86     0.77    0.85     0.76    0.85
THF        0.75    0.84     0.78    0.80     0.77    0.80

* Models with best AUC for each solvent were highlighted.

RF Model

The random forest algorithm creates an ensemble of fitted decision trees in which each tree is built upon a random subset of descriptors. The overall prediction is an average over all the decision trees, which helps to avoid over-fitting.26 One of the key hyperparameters of RF models is the number of decision trees in the random forest. The optimal number of trees was found to be around 220-560, and a larger number of trees did not further increase the accuracy of the prediction. In this study, RF models outperformed SVM models for 7 out of 9 solvents.

To investigate the importance of each molecular descriptor, we conducted feature importance analysis for each RF model. The top ten most important features listed in Table 5 are colored based on descriptor types. For the EtOH, MeOH and DMF solvate models, hydrogen bond donor descriptors seem to play a significant role, while for the other models the top features are mainly related to the shape and size of the molecule. These observations indicate that the formation of EtOH, MeOH and DMF solvates correlates strongly with intermolecular hydrogen bonds and the hydrophilicity of the molecule. For other solvates, interactions such as dipole-dipole and van der Waals interactions might contribute more significantly to solvate formation than hydrogen bonding, and the solvent molecules may be more likely to act as space fillers. A detailed description of all the descriptors in this table is listed in the SI (Table S20).
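A ranking such as the one in Table 5 can be retrieved from a fitted scikit-learn random forest through its `feature_importances_` attribute, which holds impurity-based importances normalized to sum to one. The sketch below uses synthetic data and hypothetical descriptor names; it illustrates the mechanics only and does not reproduce the models of this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a descriptor matrix.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
names = [f"descriptor_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Sort descriptors from most to least important and keep the top ten.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1][:10]
top10 = [(names[i], float(importances[i])) for i in order]
```

Each element of `top10` pairs a descriptor name with its importance, in descending order.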


Table 5. RF model top 10 key descriptors.

Rank  EtOH           MeOH           EtOAc          THF            ACN            Acetone        DCM            DMF            CHCl3
1     SAdon          Hy             H_Dz(v)        SM04_AEA(bo)   Hy             SM03_AEA(bo)   Pol            MATS1v         SM02_EA(ri)
2     P_VSA_ppp_D    H-050          Eig09_EA(bo)   SM05_AEA(bo)   P_VSA_ppp_D    Eig07_EA(ri)   Wi_H2          P_VSA_ppp_D    SM4_H2
3     Hy             SAdon          Eig06_AEA(bo)  SM02_EA(ed)    SAdon          Eig08_AEA(bo)  EE_H2          Hy             SpAD_EA(ed)
4     CATS2D_00_DD   nHDon          SM05_AEA(bo)   Eig08_AEA(bo)  Eig08_AEA(bo)  Eig07_EA(bo)   ATS4v          SAdon          SRW06
5     nHDon          P_VSA_ppp_D    Eig09_AEA(bo)  SM03_AEA(ed)   Eig09_AEA(ed)  SM04_AEA(bo)   X3             MATS1p         RDSQ
6     H-050          CATS2D_00_DD   SM03_AEA(bo)   SM06_AEA(bo)   ATS4v          Eig07_AEA(bo)  SM2_H2         CATS2D_00_DD   X5
7     SM6_H2         MATS1e         Eig10_EA(bo)   SM6_L          P_VSA_p_2      H_Dz(p)        H_X            H-050          X4
8     SM5_H2         CATS2D_03_DL   Eig09_EA(ri)   SM11_AEA(bo)   GGI5           Hy             Eig14_AEA(ed)  nHDon          MPC02
9     SM02_EA(ed)    CATS2D_04_DL   Eig09_EA       SM04_AEA(ed)   MATS1v         Eig06_AEA(bo)  HDcpx          ATSC1s         SM04_AEA(bo)
10    MWC06          CATS2D_05_DL   SM04_AEA(bo)   SM02_EA(ri)    Eig12_AEA(ed)  Eig08_EA(ri)   SpAD_L         MATS1e         ZM2

Color code: blue, hydrogen bond donor; green, hydrophilic factor; orange, polarity/polarizability; purple, size and branching; brown, connectivity; black, others.

SVM Model


The support vector machine algorithm aims at finding a hyperplane that maximizes class separation in a high dimensional space. Kernel methods map the data into an implicit higher dimensional space to simplify the mathematical treatment of data that are not linearly separable.33 In this study, a linear kernel and a radial basis function (RBF) kernel were investigated, allowing distinct decision boundaries to be learned. The performance of SVM algorithms is generally very sensitive to hyperparameters.33 The penalty parameter C, which trades off classification accuracy on the training dataset against the margin of the decision boundary, is critical for the SVM model with a linear kernel. A small C encourages a larger margin, and more data points at the boundary are involved in generating the hyperplane. All the optimized SVM (linear) models feature a small C, which suggests that there is a significant number of misclassified entries near the decision boundaries. For SVM models with RBF kernels, an additional hyperparameter γ needs to be optimized besides C. A combination of large C and small γ was found to be crucial for optimal model performance, suggesting that the optimized RBF SVM models have a smooth decision boundary and behave similarly to a linear SVM model, which might account for the similar performance of the two kernel functions. Overall, the results indicate that the solvate and non-solvate-forming classes are separable by smooth decision boundaries in SVM models, but there is a high degree of uncertainty for the entries near the boundaries.
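One way to see why a small γ gives a smooth, nearly linear boundary is to compare the RBF kernel, k(x, y) = exp(-γ‖x - y‖²), with its first-order expansion 1 - γ‖x - y‖², in which the only term coupling x and y is the inner product x·y that defines the linear kernel. The sketch below is an illustrative numerical check with arbitrary vectors and an arbitrary γ; it is not part of the original analysis.

```python
import numpy as np


def rbf_kernel_value(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))


def first_order_approximation(x, y, gamma):
    """1 - gamma * (||x||^2 + ||y||^2 - 2 x.y), valid for small gamma."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(1.0 - gamma * (x @ x + y @ y - 2.0 * (x @ y)))


# Arbitrary example vectors and a small gamma.
x, y = [1.0, 2.0], [2.0, 0.0]
exact = rbf_kernel_value(x, y, gamma=0.001)
approx = first_order_approximation(x, y, gamma=0.001)
```

For these inputs the squared distance is 5, and the exact and approximate kernel values agree to within about 1e-5.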

Performance of Solvate Prediction Models on Selected Pharmaceutical Molecules

Twenty pharmaceutically relevant small molecules were collected from the literature to test the performance of the solvate prediction models. We selected these molecules because detailed experimental solvate screening has been reported for most, if not all, of the solvents investigated in this study. In addition, these molecules vary significantly in size, molecular weight, elemental composition, functional groups and molecular flexibility. Several pairs of similar structures (celecoxib/PHA-739521, droperidol/benperidol and LY2806920/LY2624803) were also included to investigate whether the solvate prediction models are sensitive to subtle structural changes. This collection of structures is representative of the diversity and complexity of pharmaceutical organic molecules. A summary of the experimental screening results can be found in the supporting information (Table S3).

[Figure 3 image: chemical structures of nevirapine, ethinylestradiol, celecoxib, carbamazepine, β-resorcylic acid, sulfathiazole, gallic acid, TAK-441, TXP, droperidol, benperidol, aripiprazole, PHA-739521, paracetamol, axitinib, methyl cholate, olanzapine, furosemide, LY2806920 and LY2624803.]

Figure 3. 20 structures randomly selected from the literature.


Solvate formation propensity for each case was predicted with the optimized RF models, and the results were compared with experimental data (Table 6). If experimental solvate screening data are not available, the entry is labeled with ?; otherwise, + and - are used to mark correct and incorrect predictions, respectively. The success rate of each solvate prediction model was calculated after excluding the ambiguous entries. The EtOH model was found to be the most accurate for this collection of molecules, with 17 out of 20 predictions consistent with experimental screening results (85 % success rate), while the DCM and CHCl3 models showed less satisfactory performance (69 % and 59 % success rate). The success rates of the other models are comparable to the corresponding testing scores, with accuracy ranging from 70 % to 82 %. Based on these results, it appears that, to a large extent, the optimized RF models can be applied generally to pharmaceutical molecules with good performance.

Table 6. Performance of RF solvate prediction models on 20 pharmaceutical molecules.

Molecule (ref.)             EtOH  MeOH  EtOAc  THF  ACN  Acetone  DCM  DMF  CHCl3
Nevirapine (34)             +     +     -      +    +    +        -    +    +
Ethinylestradiol (35, 36)   +     +     +      +    +    +        +    +    -
Celecoxib (37)              +     +     -      -    +    +        +    +    +
Carbamazepine (38, 39)      +     +     +      -    +    -        +    +    +
β-Resorcylic acid (40)      +     -     +      +    -    +        +    +    +
Sulfathiazole (5, 41)       +     -     +      -    +    -        ?    +    ?
Gallic acid (42)            +     -     +      +    -    +        +    +    +
TAK-441 (43)                +     -     +      +    -    +        ?    ?    ?
TXP (44)                    +     +     +      +    ?    +        ?    ?    -
Droperidol (45)             +     +     +      +    +    +        -    +    +
Benperidol (46)             +     +     -      +    +    +        +    +    -
Aripiprazole (47)           -     +     +      +    +    +        -    +    +
PHA-739521 (48)             +     -     -      -    +    +        +    -    +
Paracetamol (49)            +     +     +      +    +    +        +    -    +
Axitinib (50)               +     +     -      -    -    -        -    -    -
Methyl cholate (51)         +     +     -      +    +    -        +    +    -
Olanzapine (52, 53)         +     +     +      -    -    +        -    ?    -
Furosemide (54)             -     +     +      +    +    +        ?    +    ?
LY2806920 (55)              +     +     +      +    +    +        +    +    +
LY2624803 (55)              -     +     +      +    +    +        +    +    -
Overall success rate (%)    85    75    70     70   74   80       69   82   59

+, prediction consistent with experimental data; -, prediction not consistent with experimental data; ?, no experimental data available. Shaded entries are solvates found experimentally.
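The success-rate calculation described for Table 6 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' code: the function name `success_rate` and the dictionary `table6_columns` are names chosen here, and the three columns are transcribed from Table 6 in row order (Nevirapine first, LY2624803 last).

```python
def success_rate(marks):
    """Percentage of '+' marks among entries that have experimental data.

    Entries marked '?' (no experimental data) are excluded from the count.
    """
    scored = [m for m in marks if m in {"+", "-"}]
    if not scored:
        raise ValueError("no entries with experimental data")
    return 100.0 * scored.count("+") / len(scored)


# Three columns of Table 6, one character per molecule, in row order.
table6_columns = {
    "EtOH":  "+++++++++++-+++++-+-",
    "DCM":   "-++++?+??-+-++-+-?++",
    "CHCl3": "+-+++?+?-+-+++---?+-",
}

for solvent, marks in table6_columns.items():
    print(f"{solvent}: {round(success_rate(marks))} % success rate")
```

Rounded to whole numbers, the three columns give 85, 69 and 59, which matches the last row of Table 6.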

Solvate Formation Probability Calculations in Selected Case Studies

In RF models, the solvate formation probability of a molecule can be estimated as the mean of the solvate formation probabilities predicted by the individual trees in the random forest. Unlike a simple yes/no classification, a probability is a numeric measure that allows comparisons between molecules and across solvents. To illustrate, solvate formation probabilities were calculated for Ethinylestradiol and Celecoxib and compared with experimental data (Figure 4). For Ethinylestradiol (Figure 4a), high solvate formation probabilities were predicted for EtOH, MeOH, ACN, acetone and DMF, in line with experimental screening results (35, 36). Celecoxib was predicted to form solvates with EtOAc, THF and DMF, but the probabilities for EtOAc and THF are only marginally above 0.5, suggesting high uncertainty for these two predictions (Figure 4b). Indeed, only the DMF solvate is observed experimentally for Celecoxib (37).
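As a minimal sketch of the idea, the forest-level probability can be written as the mean of the per-tree probabilities and then compared with the 0.5 cutoff. Everything below is illustrative: the per-tree values are made up, the solvent names are placeholders, and the `MARGIN` band used to flag "marginal" predictions is an assumption introduced here, not a parameter from the study.

```python
from statistics import fmean

CUTOFF = 0.5   # probability cutoff displayed in Figure 4
MARGIN = 0.05  # hypothetical half-width of the band treated as "marginal"


def forest_probability(tree_probabilities):
    """Average the per-tree solvate probabilities into one forest-level value."""
    return fmean(tree_probabilities)


def classify(probability, cutoff=CUTOFF, margin=MARGIN):
    """Return (predicted_solvate, is_marginal) for a forest-level probability."""
    predicted_solvate = probability > cutoff
    is_marginal = abs(probability - cutoff) <= margin
    return predicted_solvate, is_marginal


# Hypothetical per-tree outputs for one molecule in two solvents (not data from the paper).
tree_outputs = {
    "solvent_A": [0.9, 0.8, 1.0, 0.7, 0.9],
    "solvent_B": [0.6, 0.4, 0.5, 0.6, 0.5],
}

for solvent, probabilities in tree_outputs.items():
    p = forest_probability(probabilities)
    solvate, marginal = classify(p)
    print(f"{solvent}: p = {p:.2f}, predicted solvate = {solvate}, marginal = {marginal}")
```

In this toy input, solvent_A is a clear positive, while solvent_B is classified as a solvate former only marginally, which is the kind of case the text describes for Celecoxib with EtOAc and THF.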

[Figure 4 plots: bar charts of predicted solvate formation probability (vertical axis, 0 to 1) for each solvent (EtOH, MeOH, EtOAc, THF, ACN, acetone, DCM, DMF, CHCl3) in panels a, b and c; panel c shows LY2806920 and LY2624803.]

Figure 4. Predicted solvate formation probabilities for a, Ethinylestradiol; b, Celecoxib and c, structurally similar molecules LY2806920 and LY2624803. A 0.5 probability cutoff line is displayed. Experimentally observed solvates are marked with red boxes.

The collection of models could even distinguish between similar structures. For example, LY2806920 and LY2624803 are structurally similar, but the models predicted generally higher solvate formation probabilities for LY2624803 and low probabilities for LY2806920 (Figure 4c). This is consistent with experimental screening results, in which only LY2624803 was found to form solvates, with EtOH and MeOH.

Comparison with Two-Descriptor Models

The performance of the RF models was further compared with that of the previously reported two-descriptor models (4) for the twenty pharmaceutical molecules. The RF models developed in this study perform significantly better and yield higher success rates for EtOH and MeOH solvate prediction (Figure 5). This might be due to differences in how interactions between molecular features are handled and in the training datasets. RF and SVM models use all molecular features for classification, so any interactions between features are implicitly incorporated in the model. In addition, the training datasets in this study consist exclusively of pharmaceutically relevant organic molecules, as a result of several molecular property filters applied during data collection from the CSD. Since the performance of machine learning models depends strongly on the scope and quality of the training datasets, these filters are critical to ensure that the trained models are applicable to pharmaceutical molecules. To illustrate, 1264 acetonitrile solvates were identified, but only 521 structures remained after the molecular weight and elemental composition filters were applied. In this case, nearly 60 % of the acetonitrile solvates are not relevant to pharmaceutical development, and models developed with these entries might be biased.
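The filtering step and the "nearly 60 %" figure can be illustrated with a short Python sketch. The filter settings below (`ALLOWED_ELEMENTS`, `MAX_MOL_WEIGHT`), the `Entry` record and the example entries are hypothetical placeholders, since this section does not state the thresholds used in the study; only the counts 1264 and 521 come from the text.

```python
from dataclasses import dataclass

# Hypothetical filter settings; the values used in the study are not given here.
ALLOWED_ELEMENTS = frozenset({"C", "H", "N", "O", "S", "P", "F", "Cl", "Br"})
MAX_MOL_WEIGHT = 1000.0  # g/mol


@dataclass(frozen=True)
class Entry:
    refcode: str         # placeholder identifier, not a real CSD refcode
    mol_weight: float    # molecular weight in g/mol
    elements: frozenset  # element symbols present in the molecule


def is_pharma_like(entry):
    """Keep entries within the weight limit that contain only allowed elements."""
    return entry.mol_weight <= MAX_MOL_WEIGHT and entry.elements <= ALLOWED_ELEMENTS


def percent_removed(n_before, n_after):
    """Percentage of entries discarded by the filters."""
    return 100.0 * (n_before - n_after) / n_before


entries = [
    Entry("EXAMPLE01", 310.4, frozenset({"C", "H", "N", "O"})),
    Entry("EXAMPLE02", 1450.2, frozenset({"C", "H", "O"})),        # too heavy
    Entry("EXAMPLE03", 420.1, frozenset({"C", "H", "N", "Ru"})),  # contains a metal
]
kept = [e.refcode for e in entries if is_pharma_like(e)]
print(kept)

# Acetonitrile solvate counts quoted in the text: 1264 before, 521 after filtering.
print(f"{percent_removed(1264, 521):.1f} % of acetonitrile solvates removed")
```

With the counts from the text, (1264 - 521) / 1264 is about 58.8 %, consistent with the "nearly 60 %" stated above.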

[Figure 5 plot: bar chart of success rate (%) for the RF model and the two-descriptor model for EtOH, MeOH, DCM and CHCl3.]

Figure 5. Success rate of RF models (red) compared with two-descriptor models (blue) on 20 pharmaceutical molecules.

Conclusion

In summary, we have developed machine learning models based on the RF and SVM algorithms to predict the solvate formation propensity of organic molecules. The training datasets can be readily extracted from the CSD and filtered to retain structures resembling pharmaceutical molecules. Training and testing show that both RF and SVM models predict solvate formation with good accuracy, with RF models generally performing marginally better. Feature importance analysis of the RF models suggests that the driving forces for the formation of different solvates may be quite distinct. Application of the models was demonstrated with a set of twenty pharmaceutical molecules selected from the literature. Overall, the results show that machine learning models trained with molecules from a publicly available database can be used as practical tools for fast and accurate prediction of solvate formation for pharmaceutical molecules. Future development will focus on expanding the training datasets to include experimental solid form screening data.


ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website.

List of molecular descriptors calculated in this study; optimized hyperparameters for each model; confusion matrices for all models; description of top important features in RF models; summary of experimental screening results for twenty pharmaceutical molecules; calculated molecular descriptors, predicted probabilities and model performance for two-descriptor models; calculated molecular descriptors for all datasets.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]

NOTES

The authors declare no competing financial interests.

ACKNOWLEDGMENT

We gratefully acknowledge Mr. David Craska for support in the setup and management of our High Performance Computing resources and Dr. Hong Guo for help with the Dragon 7 program.


REFERENCES

(1) Brittain, H. G., Polymorphism in Pharmaceutical Solids. 2nd ed.; CRC Press: 2009.

(2) Stahly, G. P., Diversity in Single- and Multiple-Component Crystals. The Search for and Prevalence of Polymorphs and Cocrystals. Cryst. Growth Des. 2007, 7, 1007-1026.

(3) Boothroyd, S.; Kerridge, A.; Broo, A.; Buttar, D.; Anwar, J., Why Do Some Molecules Form Hydrates or Solvates? Cryst. Growth Des. 2018, 18, 1903-1908.

(4) Takieddin, K.; Khimyak, Y. Z.; Fábián, L., Prediction of Hydrate and Solvate Formation Using Statistical Models. Cryst. Growth Des. 2016, 16, 70-81.

(5) Bingham, A. L.; Hughes, D. S.; Hursthouse, M. B.; Lancaster, R. W.; Tavener, S.; Threlfall, T. L., Over one hundred solvates of sulfathiazole. Chem. Commun. 2001, 603-604.

(6) Price, C. P.; Glick, G. D.; Matzger, A. J., Dissecting the behavior of a promiscuous solvate former. Angew. Chem. Int. Ed. Engl. 2006, 45, 2062-2066.

(7) Vippagunta, S. R.; Brittain, H. G.; Grant, D. J., Crystalline solids. Adv. Drug Deliv. Rev. 2001, 48, 3-26.

(8) Datta, S.; Grant, D. J., Crystal structures of drugs: advances in determination, prediction and engineering. Nat. Rev. Drug Discov. 2004, 3, 42-57.

(9) Peterson, M. L.; Hickey, M. B.; Zaworotko, M. J.; Almarsson, O., Expanding the scope of crystal form evaluation in pharmaceutical science. J. Pharm. Pharm. Sci. 2006, 9, 317-326.

(10) Morissette, S. L.; Almarsson, O.; Peterson, M. L.; Remenar, J. F.; Read, M. J.; Lemmo, A. V.; Ellis, S.; Cima, M. J.; Gardner, C. R., High-throughput crystallization: polymorphs, salts, co-crystals and solvates of pharmaceutical solids. Adv. Drug Deliv. Rev. 2004, 56, 275-300.

(11) Price, S. L., Predicting crystal structures of organic compounds. Chem. Soc. Rev. 2014, 43, 2098-2111.


(12) Reilly, A. M.; Cooper, R. I.; Adjiman, C. S.; Bhattacharya, S.; Boese, A. D.; Brandenburg, J. G.; Bygrave, P. J.; Bylsma, R.; Campbell, J. E.; Car, R.; Case, D. H.; Chadha, R.; Cole, J. C.; Cosburn, K.; Cuppen, H. M.; Curtis, F.; Day, G. M.; DiStasio, R. A., Jr.; Dzyabchenko, A.; van Eijck, B. P.; Elking, D. M.; van den Ende, J. A.; Facelli, J. C.; Ferraro, M. B.; Fusti-Molnar, L.; Gatsiou, C. A.; Gee, T. S.; de Gelder, R.; Ghiringhelli, L. M.; Goto, H.; Grimme, S.; Guo, R.; Hofmann, D. W.; Hoja, J.; Hylton, R. K.; Iuzzolino, L.; Jankiewicz, W.; de Jong, D. T.; Kendrick, J.; de Klerk, N. J.; Ko, H. Y.; Kuleshova, L. N.; Li, X.; Lohani, S.; Leusen, F. J.; Lund, A. M.; Lv, J.; Ma, Y.; Marom, N.; Masunov, A. E.; McCabe, P.; McMahon, D. P.; Meekes, H.; Metz, M. P.; Misquitta, A. J.; Mohamed, S.; Monserrat, B.; Needs, R. J.; Neumann, M. A.; Nyman, J.; Obata, S.; Oberhofer, H.; Oganov, A. R.; Orendt, A. M.; Pagola, G. I.; Pantelides, C. C.; Pickard, C. J.; Podeszwa, R.; Price, L. S.; Price, S. L.; Pulido, A.; Read, M. G.; Reuter, K.; Schneider, E.; Schober, C.; Shields, G. P.; Singh, P.; Sugden, I. J.; Szalewicz, K.; Taylor, C. R.; Tkatchenko, A.; Tuckerman, M. E.; Vacarro, F.; Vasileiadis, M.; Vazquez-Mayagoitia, A.; Vogt, L.; Wang, Y.; Watson, R. E.; de Wijs, G. A.; Yang, J.; Zhu, Q.; Groom, C. R., Report on the sixth blind test of organic crystal structure prediction methods. Acta Crystallogr. B Struct. Sci. Cryst. Eng. Mater. 2016, 72, 439-459.

(13) Price, S. L.; Braun, D. E.; Reutzel-Edens, S. M., Can computed crystal energy landscapes help understand pharmaceutical solids? Chem. Commun. 2016, 52, 7065-7077.

(14) Loschen, C.; Klamt, A., Computational Screening of Drug Solvates. Pharm. Res. 2016, 33, 2794-2804.

(15) Loschen, C.; Klamt, A., Solubility prediction, solvate and cocrystal screening as tools for rational crystal engineering. J. Pharm. Pharmacol. 2015, 67, 803-811.


(16) Abramov, Y. A., Virtual hydrate screening and coformer selection for improved relative humidity stability. CrystEngComm 2015, 17, 5216-5224.

(17) Tilbury, C. J.; Chen, J.; Mattei, A.; Chen, S.; Sheikh, A. Y., Combining Theoretical and Data-Driven Approaches To Predict Drug Substance Hydrate Formation. Cryst. Growth Des. 2018, 18, 57-67.

(18) Nangia, A.; Desiraju, G. R., Pseudopolymorphism: occurrences of hydrogen bonding organic solvents in molecular crystals. Chem. Commun. 1999, 605-606.

(19) Brychczynska, M.; Davey, R. J.; Pidcock, E., A study of methanol solvates using the Cambridge structural database. New J. Chem. 2008, 32, 1754-1760.

(20) Infantes, L.; Fábián, L.; Motherwell, W. D. S., Organic crystal hydrates: what are the important factors for formation. CrystEngComm 2007, 9, 65-71.

(21) Wicker, J. G. P.; Cooper, R. I., Will it crystallise? Predicting crystallinity of molecular materials. CrystEngComm 2015, 17, 1927-1934.

(22) Pillong, M.; Marx, C.; Piechon, P.; Wicker, J. G. P.; Cooper, R. I.; Wagner, T., A publicly available crystallisation data set and its application in machine learning. CrystEngComm 2017, 19, 3737-3745.

(23) Wicker, J. G. P.; Crowley, L. M.; Robshaw, O.; Little, E. J.; Stokes, S. P.; Cooper, R. I.; Lawrence, S. E., Will they co-crystallize? CrystEngComm 2017, 19, 5336-5340.

(24) Rama Krishna, G.; Ukrainczyk, M.; Zeglinski, J.; Rasmuson, Å. C., Prediction of Solid State Properties of Cocrystals Using Artificial Neural Network Modeling. Cryst. Growth Des. 2018, 18, 133-144.


(25) Johnston, A.; Johnston, B. F.; Kennedy, A. R.; Florence, A. J., Targeted crystallisation of novel carbamazepine solvates based on a retrospective Random Forest classification. CrystEngComm 2008, 10, 23-25.

(26) Breiman, L., Random Forests. Machine Learning 2001, 45, 5-32.

(27) Burges, C. J. C., A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov. 1998, 2, 121-167.

(28) Allen, F. H., The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallogr. B 2002, 58, 380-388.

(29) Dragon, 7.0; Kode Chemoinformatics.

(30) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E., Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825-2830.

(31) Wu, P.; Nielsen, T. E.; Clausen, M. H., Small-molecule kinase inhibitors: an analysis of FDA-approved drugs. Drug Discov. Today 2016, 21, 5-10.

(32) Bradley, A. P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 1997, 30, 1145-1159.

(33) Ben-Hur, A.; Weston, J., A User’s Guide to Support Vector Machines. In Data Mining Techniques for the Life Sciences, Carugo, O.; Eisenhaber, F., Eds. Humana Press: Totowa, NJ, 2010; pp 223-239.

(34) Caira, M. R.; Stieger, N.; Liebenberg, W.; Villiers, M. M. D.; Samsodien, H., Solvent Inclusion by the Anti-HIV Drug Nevirapine: X-Ray Structures and Thermal Decomposition of Representative Solvates. Cryst. Growth Des. 2008, 8, 17-23.


(35) Guguta, C.; Eeuwijk, I.; Smits, J. M. M.; de Gelder, R., Structural Diversity of Ethinyl Estradiol Solvates. Cryst. Growth Des. 2008, 8, 823-831.

(36) Karpinska, J.; Erxleben, A.; McArdle, P., Applications of Low Temperature Gradient Sublimation in Vacuo: Rapid Production of High Quality Crystals. The First Solvent-Free Crystals of Ethinyl Estradiol. Cryst. Growth Des. 2013, 13, 1122-1130.

(37) Chawla, G.; Gupta, P.; Thilagavathi, R.; Chakraborti, A. K.; Bansal, A. K., Characterization of solid-state forms of celecoxib. Eur. J. Pharm. Sci. 2003, 20, 305-317.

(38) Florence, A. J.; Johnston, A.; Price, S. L.; Nowell, H.; Kennedy, A. R.; Shankland, N., An automated parallel crystallisation search for predicted crystal structures and packing motifs of carbamazepine. J. Pharm. Sci. 2006, 95, 1918-1930.

(39) Fabbiani, F. P. A.; Byrne, L. T.; McKinnon, J. J.; Spackman, M. A., Solvent inclusion in the structural voids of form II carbamazepine: single-crystal X-ray diffraction, NMR spectroscopy and Hirshfeld surface analysis. CrystEngComm 2007, 9, 728-731.

(40) Braun, D. E.; Karamertzanis, P. G.; Arlin, J.-B.; Florence, A. J.; Kahlenberg, V.; Tocher, D. A.; Griesser, U. J.; Price, S. L., Solid-State Forms of β-Resorcylic Acid: How Exhaustive Should a Polymorph Screen Be? Cryst. Growth Des. 2011, 11, 210-220.

(41) Aitipamula, S.; Chow, P. S.; Tan, R. B. H., The solvates of sulfamerazine: structural, thermochemical, and desolvation studies. CrystEngComm 2012, 14, 691-699.

(42) Braun, D. E.; Bhardwaj, R. M.; Florence, A. J.; Tocher, D. A.; Price, S. L., Complex Polymorphic System of Gallic Acid—Five Monohydrates, Three Anhydrates, and over 20 Solvates. Cryst. Growth Des. 2013, 13, 19-23.

(43) Iwata, K.; Kojima, T.; Ikeda, Y., Solid Form Selection of Highly Solvating TAK-441 Exhibiting Solvate-Trapping Polymorphism. Cryst. Growth Des. 2014, 14, 3335-3342.


(44) Caira, M. R.; Bettinetti, G.; Sorrenti, M., Structural relationships, thermal properties, and physicochemical characterization of anhydrous and solvated crystalline forms of tetroxoprim. J. Pharm. Sci. 2002, 91, 467-481.

(45) Bērziņš, A.; Skarbulis, E.; Rekis, T.; Actiņš, A., On the Formation of Droperidol Solvates: Characterization of Structure and Properties. Cryst. Growth Des. 2014, 14, 2654-2664.

(46) Bērziņš, A.; Skarbulis, E.; Actiņš, A., Structural Characterization and Rationalization of Formation, Stability, and Transformations of Benperidol Solvates. Cryst. Growth Des. 2015, 15, 2337-2351.

(47) Braun, D. E.; Gelbrich, T.; Kahlenberg, V.; Tessadri, R.; Wieser, J.; Griesser, U. J., Stability of Solvates and Packing Systematics of Nine Crystal Forms of the Antipsychotic Drug Aripiprazole. Cryst. Growth Des. 2009, 9, 1054-1065.

(48) Sun, C., Solid-state properties and crystallization behavior of PHA-739521 polymorphs. Int. J. Pharm. 2006, 319, 114-120.

(49) Fabbiani, F. P. A.; Allan, D. R.; Dawson, A.; David, W. I. F.; McGregor, P. A.; Oswald, I. D. H.; Parsons, S.; Pulham, C. R., Pressure-induced formation of a solvate of paracetamol. Chem. Commun. 2003, 3004-3005.

(50) Campeta, A. M.; Chekal, B. P.; Abramov, Y. A.; Meenan, P. A.; Henson, M. J.; Shi, B.; Singer, R. A.; Horspool, K. R., Development of a Targeted Polymorph Screening Approach for a Complex Polymorphic and Highly Solvating API. J. Pharm. Sci. 2010, 99, 3874-3886.

(51) Bērziņš, A.; Trimdale, A.; Kons, A.; Zvaniņa, D., On the Formation and Desolvation Mechanism of Organic Molecule Solvates: A Structural Study of Methyl Cholate Solvates. Cryst. Growth Des. 2017, 17, 5712-5724.


(52) Cavallari, C.; Santos, B. P.-A.; Fini, A., Olanzapine Solvates. J. Pharm. Sci. 2013, 102, 4046-4056.

(53) Bhardwaj, R. M.; Price, L. S.; Price, S. L.; Reutzel-Edens, S. M.; Miller, G. J.; Oswald, I. D. H.; Johnston, B. F.; Florence, A. J., Exploring the Experimental and Computed Crystal Energy Landscape of Olanzapine. Cryst. Growth Des. 2013, 13, 1602-1617.

(54) Minkov, V. S.; Beloborodova, A. A.; Drebushchak, V. A.; Boldyreva, E. V., Furosemide Solvates: Can They Serve As Precursors to Different Polymorphs of Furosemide? Cryst. Growth Des. 2014, 14, 513-522.

(55) Braun, D. E.; McMahon, J. A.; Koztecki, L. H.; Price, S. L.; Reutzel-Edens, S. M., Contrasting Polymorphism of Related Small Molecule Drugs Correlated and Guided by the Computed Crystal Energy Landscape. Cryst. Growth Des. 2014, 14, 2056-2072.


For Table of Contents Use Only

Solvate Prediction for Pharmaceutical Organic Molecules with Machine Learning Dongyue Xin*, Nina C. Gonnella, Xiaorong He, Keith Horspool

Synopsis: Machine learning models trained with datasets extracted from Cambridge Structural Database (CSD) can predict solvate formation for pharmaceutical molecules.
