Combined Chemoinformatics Approach to Solvent Library Design

Article pubs.acs.org/jcim

Combined Chemoinformatics Approach to Solvent Library Design Using clusterSim and Multidimensional Scaling

Andrea Johnston,† Rajni Bhardwaj-Miglani,† Rajesh Gurung,‡ Antony D. Vassileiou,† Alastair J. Florence,† and Blair F. Johnston*,†

†EPSRC Centre for Innovative Manufacturing in Continuous Manufacturing and Crystallisation and ‡EPSRC Doctoral Training Centre in Continuous Manufacturing and Crystallisation c/o Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Technology and Innovation Centre, 99 George Street, Glasgow G1 1RD, United Kingdom

ABSTRACT: Reported here is a rational approach for the selection of solvents intended for use in physical form screening, based on a novel chemoinformatics analysis of solvent properties. A comprehensive assessment of eight clustering methods was carried out on a series of 94 solvents described by calculated molecular descriptors using the clusterSim package in R. The effectiveness of the clustering methods was evaluated using a range of statistical measures and by the increase in efficiency of solid form discovery achieved with a cluster-based solvent selection approach. Multidimensional scaling was used to illustrate the cluster analysis on a two-dimensional solvent map. The map presented here is a valuable tool to aid efficient solvent selection in physical form screens. This tool is equally applicable to any scientific area that requires a solubility-dependent decision on solvent choice.



Received: January 20, 2017. Published: June 30, 2017. DOI: 10.1021/acs.jcim.7b00038. J. Chem. Inf. Model. 2017, 57, 1807−1815. © 2017 American Chemical Society

INTRODUCTION

The primary aims of experimental physical form screening are to obtain the maximum number of different crystalline forms, including polymorphs, solvates, salts, or cocrystals, of the compound under study and to identify the most thermodynamically stable form.1,2 This should be achieved as efficiently as possible, using the minimum amount of material (solute and solvent) and the fewest experiments necessary to ensure that all practically relevant forms have been found. Solution crystallization is a key element of any rigorous physical form screen and is also amenable to small-scale multiwell plate experiments, enabling high-throughput approaches to be employed to deliver possibly thousands of experiments covering large regions of crystallization space.3 Solution crystallization in itself covers a broad range of techniques and variables. The most commonly employed crystallizations from solution include cooling crystallization, antisolvent crystallization, evaporative crystallization, and slurry crystallization. Typically, the effects of experimental variables on the resultant solid form are investigated, but the extent of their influence on the crystallization outcome is not always fully understood. Experimental variables may include solvent identity, solution supersaturation, temperature, agitation, and heteronuclei such as impurities or templates.4−7

Solvent choice is a crucial factor in the successful identification of new crystalline forms,8,9 and it is common practice to crystallize the molecule of interest from a wide range of solvents.2,10 Changes in solution chemistry, effected by the use of large numbers of solvents and solvent mixtures, provide access to an extensive variation in critical crystallization conditions, which can in turn lead to nucleation and growth of different crystalline forms. Solvent choice may also be restricted by safety considerations; however, for the purposes of this work, all safety classes of solvents have been considered.

A key objective in the pharmaceutical industry and solid-state science community is the ability to predict crystal structures of small organic molecules. However, predicting the range of thermodynamically feasible structures for a given molecule ab initio remains an exceptionally challenging area,11,12 and so rigorous experimentation remains a prerequisite. In the absence of reliable tools for the prediction of the solid form landscape, it is essential to design screens that maximize the diversity of chemical space covered while minimizing the number of individual experiments required to sample it. To achieve this, chemoinformatics techniques have been applied to assist with experimental design and solvent sampling.8,9,13−15 While the clustering of solvents reported here was designed to assess chemical diversity for this specific application, the resulting solvent map is widely applicable to any situation where the solubility of a given compound in a range of solvents is a consideration.

There have been several reports in the literature of solvent library clustering and sampling based on the analysis of properties used to describe individual solvent molecules, using techniques such as principal components analysis (PCA),14 design of experiments (DoE),13 self-organizing maps (SOMs),8 and hierarchical clustering.14,16 Some studies report the use of physicochemical properties of solvents obtained by a computational approach, which utilize comprehensive sets of calculated molecular descriptors,16 and some have included a combination of calculated and experimentally measured physicochemical properties.8,14,17,18 To date, studies of this type share a common theme: they all use PCA either as a preliminary or major step in cluster identification. The main disadvantage of PCA is that any nonlinear correlation between variables will not be captured, and inevitably a certain percentage of variance is lost. Although often used to identify patterns in data, PCA is not a clustering method, and therefore such studies are comparatively subjective and prone to error.

DoE is an alternative approach for solvent subset selection, utilizing factorial, D-optimal, or D-optimal onion design methods.8 It is used to reduce the number of experiments required while allowing the exploration of crystallization space in a systematic manner. A limitation of DoE, when considering the variation in a single variable (solvent identity, in this case), arises from a statistical randomization approach that does not include any weighting in relation to which descriptors are used and which are redundant. In other words, chemical insight can be lost by retaining many descriptors that describe similar parameters while dismissing parameters that may describe different chemical features. As a result, mainly outliers will be selected and, even with onion designs, only one substance will be offered from each region, or layer, of chemical space. The advantages basic clustering methods have over DoE are that the scientist can choose from alternative substances from each region of chemical space and, if required, identify properties or regions that merit further exploration, therefore providing a choice of alternative solvents.16

Perhaps one of the most important steps when clustering any data set is the selection of an appropriate clustering algorithm.19 This study differs from previous library design reports in that the algorithm used, clusterSim,20 does not rely solely on any one approach but instead uses all available permutations of clustering methods to identify a user-specified number of clusters.20 The resulting clusters are then assessed statistically using five internal cluster indexes. An advantage of this method is that results are presented as clearly defined cluster lists or tables, as opposed to a two-dimensional (2D) visual representation of multidimensional data. This removes any ambiguity that may arise from a 2D plot of a homogeneous spread of solvents with poorly defined boundaries between groups.

Presented is a novel approach to solvent library modeling and solvent selection implemented in four steps: (i) calculation of solvent descriptors, (ii) clustering of solvents, (iii) graphical representation of clusters, and finally (iv) cluster assessment using prior experimental physical form screening results in addition to statistical quantification of performance.

Figure 1. Solvent library comprising 94 solvents.



MATERIALS AND METHODS

Molecular Descriptors. The solvents (Figure 1) presented in this study are currently held within a library for solubility

(S).28,29,31,32 The top ten ranked methods, using the assessment methods detailed above, were output for user assessment, and the remaining 70 were discarded.

Multidimensional Scaling (MDS) of Molecular Descriptors and Clustering. An MDS plot of the best solvent library clustering was created to illustrate the similarity and dissimilarity within and between clusters. The data set was divided into 24 clusters based on the outcomes from clustering. A mean value for the descriptors of all solvents within each cluster was calculated, and MDS was performed on these means to give a central point on the plot for each cluster, the lines from which extend one standard deviation in each direction. This provides a means for assessing similarity between clusters. Subsequently, 24 individual MDS plots were created, which were then overlaid and centered on the means of each cluster to give the positions of individual solvents in relation to the cluster mean. This provides a gauge for similarity within a cluster. All MDS plots were created in R.

Clustering Assessment. The effectiveness of solvent library clustering was assessed qualitatively and quantitatively using prior experimental solution crystallization results for four compounds (Figure 2), namely hydrochlorothiazide17 (HCT), chlorothiazide33 (CT), 10,11-dihydrocarbamazepine34 (DHC), and 3-azabicyclononane-2,4-dione18 (BQT), which were obtained under the auspices of the CPOSS project.35 These four compounds were selected for clustering assessment because their solution crystallization screens yielded a diverse set of types and numbers of physical forms. Furthermore, the range of solvents in which each physical form (i.e., polymorph/solvate) was observed (Table 2) varied. For example, only one nonsolvated form of CT was obtained, whereas three polymorphs of DHC were observed, two of which are readily obtained from a number of solvents and the other from only a single solvent.

A quantitative method was employed whereby the probability of finding all physical forms obtained from the experimental screens (Table 2) was assessed using random selection of 24 solvents from the entire library or selection of one solvent from each of the 24 solvent clusters. The likelihood of observing all physical forms from 24 solvents had they been grouped according to polarity only was also assessed. For each method of selection, 10 000 random samplings of 24 solvents were made and the probability of discovering all form types determined. Solvents which were not used during the experimental screens were not sampled. Where a solvent selection yielded all observed forms, a value of 1 was assigned to that selection. If not all expected forms were observed, a value of 0 was assigned. These values were expressed as a percentage success rate, where 100% would mean that all physical forms were observed from every solvent selection. Only outcomes assigned as forms I, II, or III or solvate were counted when assessing success. A visual representation or qualitative assessment of the effectiveness of solvent library clustering

screening for solution crystallization and physical form searches as part of an academic research program. All solvents were modeled by converting 2D molecular structures to three-dimensional (3D) structures with Chemical Computing Group’s MOE software.21 A total of 1968 thermodynamic, electronic, topological, spatial, and feature-count molecular descriptors were computed for each solvent using MOE and E-dragon.22 These were reduced to 552 by removing any descriptors with zero variance or those correlated at a threshold greater than 90 percent. Zero-variance descriptors were identified using SIMCA-P v1123 and a correlation matrix was generated within R.24 Additionally, PCA carried out in SIMCA-P highlighted a bias in the descriptor set toward size and aromaticity, thus a further 252 descriptors were manually removed to remove any bias. A total of 250 descriptors remained; a list of these descriptors alongside a brief explanation is provided in the Supporting Information (Table S1).

Clustering. Clustering algorithms are used to group objects based on a similarity or distance criterion. The clusterSim package20 in R allows for a brute force approach to clustering, using permutations of various clustering methods to identify optimal clusters within a data set. A total of 80 distinct clustering approaches were applied to the data set consisting of 94 solvents × 250 descriptors. The combinations of methods that were used, including variations of normalization, distance measurements, and clustering methods, are listed in Table 1.

Table 1. Combination of Clustering Methods Used within clusterSim

normalizations        distance measures                               clustering methods
N1: (x − mean)/sd     D1: Manhattan                                   C1: single link
N5: (x − mean)/       D2: Euclidean                                   C2: complete link
    max[abs(x − mean)] D3: Chebyshev                                  C3: average link
                      D4: squared Euclidean                           C4: McQuitty
                      D5: GDM1 (general distance measure interval)    C5: K-medoids (PAM)
                                                                      C6: Ward
                                                                      C7: centroid
                                                                      C8: median
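The combinations in Table 1 can be illustrated with a minimal Python sketch (the study itself used the clusterSim package in R; nothing here calls that package). It enumerates the 2 × 5 × 8 = 80 method combinations and implements the two normalizations and four of the five distance measures as defined in the table. The function names are hypothetical, the use of the sample standard deviation for N1 is an assumption, and GDM1 is deliberately left out because its definition is not given in the text.

```python
from itertools import product
from statistics import mean, stdev

def normalize_n1(column):
    """N1: (x - mean) / sd (sample standard deviation assumed)."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def normalize_n5(column):
    """N5: (x - mean) / max[abs(x - mean)]."""
    m = mean(column)
    scale = max(abs(x - m) for x in column)
    return [(x - m) / scale for x in column]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    return squared_euclidean(a, b) ** 0.5

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Labels taken from Table 1; GDM1 is listed but not implemented here.
NORMALIZATIONS = ["N1", "N5"]
DISTANCES = ["Manhattan", "Euclidean", "Chebyshev", "squared Euclidean", "GDM1"]
METHODS = ["single link", "complete link", "average link", "McQuitty",
           "K-medoids (PAM)", "Ward", "centroid", "median"]

# Every (normalization, distance, clustering method) triple: 2 x 5 x 8 = 80.
combinations = list(product(NORMALIZATIONS, DISTANCES, METHODS))
print(len(combinations))
```

Each descriptor column would be normalized independently before distances between solvents are computed; the brute-force search then amounts to running one clustering per triple in `combinations`.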

The number of clusters to be formed from the data is defined prior to performing the analysis. In this study, 24 clusters were selected to correspond with the capabilities of locally available laboratory hardware for automated parallel crystallization and subsequent X-ray powder diffraction analysis.25 The 80 clustering results were assessed internally in clusterSim using five cluster indexes: Calinski−Harabasz’s pseudo F-statistic (G1),26−30 Baker and Hubert’s adaptation of Goodman and Kruskal’s gamma statistic (G2),27−30 Hubert and Levine’s internal cluster quality index (G3),28−30 Krzanowski and Lai’s index (KL),28−30 and the Rousseeuw Silhouette internal cluster quality index

Figure 2. Molecular diagrams of compounds from left to right: HCT, CT, DHC, and BQT. Results from solution crystallization screens of these compounds17,18,33,34 were used to assess the effectiveness of clustering results.

were not practical in terms of equal sampling across the library: the algorithms had isolated the most diverse chemicals into individual clusters, leaving one or two clusters which comprised the remainder of the library. Statistically reasonable clusters, as assessed by these three indexes, were not chemically sensible. Of the remaining clustering results (outputs 1, 2, 9, and 10), cluster output 1 (Table 3) gave the most reasonable chemical spread of solvents and provided sensible grouping within clusters in the majority of cases. The best clustering result (output 1) was achieved by using normalization method N5 combined with a Euclidean distance measure and PAM clustering, assessed with the G1 metric. Cluster results for output 1 are provided in Table 4 and have been termed Strathclyde24. Details of all cluster outputs listed in Table 3 are provided in Supporting Information Tables S2a−j.

Visualization of Clusters using MDS. A graphical representation of Strathclyde24 is presented in Figure 3. MDS was performed on each individual cluster. The central point of the cross in each cluster represents the cluster mean, with the lines stretching one standard deviation in each direction. The crosses are themselves positioned according to an overarching MDS of all cluster means. This plot is therefore substantially more informative than viewing clusters in tabular format: in addition to identifying similar solvents due to their placement in the same cluster, it allows interpretation of degrees of similarity and dissimilarity between different clusters.

There are three key points of note regarding interpretation of the solvent map. First, the intercluster similarities can be assessed by comparing the distances between the clusters’ center points; the smaller the distance, the more similar the properties of solvents within these clusters are likely to be. Likewise, the farther apart the clusters reside on the map, the more diverse the properties of the solvents within those clusters will be. Second, the intracluster distance between solvents is also representative of their similarity. This is explained by way of example. Cluster 11 (Table 4) contains eight solvents, namely 2-propanol, 1-propanol, ethanol, cyclohexanol, 2-methyl-1-propanol, 2-butanol, 2-butanone, and 1,2-propanediol, and is located in the lower central area of the map (shaded light green). It can be surmised from looking at this cluster that 2-butanol, 2-methyl-1-propanol, 2-propanol, and 1-propanol are particularly similar, as the points for each solvent lie in close proximity; they are also the most representative of the cluster as they are closest to the central point. Also, within this cluster it can be assumed that 1,2-propanediol is least like the remaining solvents and in fact appears almost as an outlier, being not only well separated from the other solvents but also a large distance from the mean. Similarly, triethylamine appears as a relative

Table 2. Number and Types of Forms Observed for Each Compound in Physical Form Screens and Also the Number of Solvents from Which Each Was Obtained

compound/screening condition (a)                               crystalline form   number of solvents
HCT, results representative of condition cosolvent (a,b)       form I             58
                                                               form II            2
                                                               solvate            7
                                                               no outcome         0
CT, results representative of condition Tsat (max) (a,b,c)     form I             30
                                                               solvate            7
                                                               no outcome         30
DHC, results representative of condition Tsat (max) (a,b,c)    form I             34
                                                               form II            34
                                                               form III           1
                                                               solvate            6
                                                               no outcome         3
BQT, results representative of condition Tsat (max) (a,b,c)    form I             60
                                                               form II            2
                                                               solvate            2
                                                               no outcome         3

(a) Physical form screens were performed using 67 solvents and a number of different crystallization conditions for each compound. (b) Results used in MDS mapping are accurate for each condition as reported in cited publications, excluding that for DHC form III, which was obtained from a manual recrystallization from methanol. (c) Tsat (max): solutions were saturated at ∼10 °C below the boiling point of a solvent prior to filtration and crystallization by cooling.

was achieved by using the results from the physical form screens (Table 2) and mapping them onto the final MDS plot. Thus, four new plots were created, which were colored according to crystallization outcome from particular solvents.
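The quantitative assessment described under Clustering Assessment (repeated random selections, each scored 1 if every form type is found and 0 otherwise, reported as a percentage) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `success_rate`, the 12-solvent toy library, its four clusters, and the assumption that each solvent gives exactly one outcome are all hypothetical and do not reproduce the screening data in Table 2.

```python
import random

def success_rate(draw, outcomes, required, trials=10_000, seed=0):
    """Percentage of random solvent selections that yield every required form."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        found = {outcomes[solvent] for solvent in draw(rng)}
        if required <= found:      # score 1 only if all form types were seen
            hits += 1
    return 100.0 * hits / trials

# Hypothetical toy screen: 12 solvents in 4 clusters, one outcome per solvent.
clusters = [[f"s{i}" for i in range(8)], ["s8"], ["s9", "s10"], ["s11"]]
outcomes = {s: "form I" for s in clusters[0] + clusters[3]}
outcomes["s8"] = "form II"
outcomes.update({"s9": "solvate", "s10": "solvate"})

library = [s for cluster in clusters for s in cluster]
required = {"form I", "form II", "solvate"}
k = len(clusters)

# Blind sampling: k solvents drawn at random from the whole library.
blind = success_rate(lambda rng: rng.sample(library, k), outcomes, required)
# Cluster-guided sampling: one solvent drawn at random from each cluster.
guided = success_rate(lambda rng: [rng.choice(c) for c in clusters],
                      outcomes, required)
print(f"blind: {blind:.1f}%  cluster-guided: {guided:.1f}%")
```

In this toy case every rare outcome occupies its own cluster, so the guided selection always succeeds, whereas blind sampling succeeds only when it happens to include the single form II solvent and at least one solvate-forming solvent.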



RESULTS AND DISCUSSION

Clustering Results. clusterSim was used to perform 80 different methods of clustering (Table 1) on the library of 94 solvents (Figure 1), each represented by 250 molecular descriptors. The two statistically best clusterings (one from each normalization method) as assessed by each of the five internal quality indexes in clusterSim are listed in Table 3. It is not feasible to compare index values originating from different metrics; thus, each output listed in Table 3 was manually assessed for chemical similarity within solvent clusters. The importance of a manual assessment was exemplified when it was found that the results of those methods assessed by the KL, G2, and S internal cluster indexes (cluster outputs 3−8)

Table 3. clusterSim Output Showing the 10 Best Clusters As Assessed by the Cluster Quality Indices

cluster output   normalization   distance measure     cluster method   metric   index value
1                N5              Euclidean            PAM              G1       15.87106
2                N1              squared Euclidean    Ward             G1       10.49527
3                N5              Chebyshev            McQuitty         KL       5.542532
4                N1              Chebyshev            average link     KL       5.277713
5                N5              GDM1                 average link     G2       0.953114
6                N1              Chebyshev            average link     G2       0.939211
7                N5              GDM1                 average link     S        0.456094
8                N1              Chebyshev            single link      S        0.424797
9                N1              GDM1                 complete link    G3       0.079828
10               N5              GDM1                 PAM              G3       0.059007
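As an illustration of the G1 metric reported in Table 3, the Calinski−Harabasz pseudo F-statistic is commonly defined as [B/(k − 1)]/[W/(n − k)], where B and W are the between-cluster and within-cluster sums of squared Euclidean distances, n is the number of objects, and k is the number of clusters. The sketch below implements that textbook definition in plain Python; the function name is hypothetical, and the clusterSim implementation used in the study may differ in detail.

```python
def calinski_harabasz(points, labels):
    """Pseudo F-statistic: [B / (k - 1)] / [W / (n - k)]."""
    n = len(points)
    dims = range(len(points[0]))
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    grand_mean = [sum(p[d] for p in points) / n for d in dims]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in dims]
        # Between-cluster term: size-weighted squared distance to the grand mean.
        between += len(members) * sum((centroid[d] - grand_mean[d]) ** 2 for d in dims)
        # Within-cluster term: squared distances of members to their centroid.
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in dims)
    return (between / (k - 1)) / (within / (n - k))

# Two tight, well-separated groups give a large value.
score = calinski_harabasz([(0.0,), (2.0,), (10.0,), (12.0,)], [0, 0, 1, 1])
print(score)
```

Larger values indicate compact, well-separated clusters. Because each index is defined on its own scale, values such as those in Table 3 are comparable only within a single metric, which is why the outputs were also assessed manually.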


all four compounds onto the Strathclyde24 MDS plot (Figure 4). These maps show a good overall alignment between the types of outcomes and solvent clusters. For three of the four compounds, Strathclyde24-guided solvent selection greatly increases the likelihood of observing all types of crystalline form when compared with both random and polarity-based selection. Interestingly, the clustering of solvents by polarity did not show an improvement over random selection for the studied APIs, suggesting that a simple, one-property model for solvent selection was not adequate. In all cases, the chance of observing all forms listed in Table 2 by selecting any one solvent from each of the Strathclyde24 clusters is 95% or greater. For HCT, DHC, and BQT, expected values of observing all forms using Strathclyde24 were increased to 100%, compared to values ranging between 20% and 43% obtained from blind and polarity-based sampling. On inspection of these values, it appears that, by performing 24 solution crystallization experiments for HCT, DHC, and BQT, it is possible to observe all form types of these compounds, and this would hold regardless of which solvent from each cluster had been randomly selected. It is worth noting that solvents closer to the cluster crosshairs are more representative of the cluster as a whole. It is also worth reiterating that the algorithm used does not distinguish between different solvated forms and provides a more generalized quantification of success. However, what this approach does identify are the clusters of solvents that will form solvates, and if the user is particularly interested in solvate formation, then this map will aid in the identification of solvents which should be used for crystallization studies.

In the case of CT, the likelihood of observing all forms identified in the experimental screen by blind sampling alone is already high (estimated at 96%) owing to the relatively low number of physical forms identified. Strathclyde24 guidance, calculated to be 95% effective, was therefore not estimated to increase the efficiency of physical form discovery. However, CT has relatively low solubility across the full range of solvents included in the library, and what this map identifies is the solvents in which CT is soluble and therefore which solvents could be used for recrystallization or as antisolvents.

The maps presented in Figure 4 aid interpretation of the figures quoted in Table 5 and give a clearer view of the effectiveness of clustering the solvent library in terms of reducing the number of experiments required to encounter all form types. These are discussed further for each compound below. Figure 4a represents the results from a solution crystallization screen for HCT. From this screen, two polymorphs and seven solvates were observed, and it is noteworthy that results were obtained from all solvents used in this search. It can be seen that the majority of solvents yielded form I HCT, but importantly the only solvent which produced form II was nitromethane, which forms a single-solvent cluster. Therefore, this form would be observed in a reduced search. The solvated forms are grouped in a small region in the lower right of the map, and in fact four of the seven solvates observed experimentally belong to the same cluster. Generally speaking, if compounds form a variety of solvates, they tend to do so with similar solvents, or solvents which share particular properties, which reside in close proximity on the map. Thus, using sampling guided by Strathclyde24, it would not be expected to find all solvated forms of HCT with only 24 experiments; however, at least one solvated form of HCT

Table 4. Strathclyde24Solvent Library Clustering Obtained from Output 1 in Table 3 clustera

solvents

1

1,5-pentanediol, 2-butoxyethanol, di(2-methoxyethyl) ether, 2ethoxyethanol, 2-methoxyethanol, 1,2-dimethoxyethane, 2-amino1-butanol 1-methylnaphthalene nitrobenzene, furfural, 2-phenylethanol, anisole formamide, acetic acid, methanoic acid dodecane, 2,2,4-triemthylheptane, heptane, hexane N-methyl-2-pyrrolidone, N,N-dimethylacetamide, N,Ndimethylformamide, methyl acetate 1-octanol, 2-octanol, 1-hexanol, 1-pentanol, 3-methyl-1-butanol, 2pentanol, 1-butanol, triethylamine dimethyl sulfoxide, acetone xylene, toluene, benzene, 3-methylthiophene, 3-fluorotoluene, bromobenzene, iodobenzene, 4-fluorotoluene, aniline, pyridine butyric acid, pentyl acetate, butyl acetate, diethyl carbonate, 4methyl-2-pentanone, isobutyl acetate, ethyl acetate, ethyl lactate, butyl lactate 2-propanol, ethanol, cyclohexanol, 2-methyl-1-propanol, 2-butanol, 1-propanol, 2-butanone, 1,2-propanediol dibutyl ether, 2-methoxy-2-methylpropane, diethyl ether, diethyl sulfide tetrachloroethane, trichloroethene 1,4-dioxane, tetrahydrofuran, tetrahydrothiophene, 1,3-dioxane nitromethane water 1,2-dichloroethane, 1-bromo-2-chloroethane, diethyl disulfide acetonitrile, ethanethiol, methyl sulfide, 2-propanethiol, thioacetic acid cyclohexane, cyclopentane 1-chlorobutane, 2-bromobutane, 1-bromobutane, 2-iodobutane carbon tetrachloride, chloroform, dichloromethane, bromoform trifluoroethanol, trifluoroacetic acid methanol iodomethane

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 a

Clusters are illustrated graphically in Figure 3.

outlier in cluster 7, as expected. The third point is less intuitive than the former two: it is not possible to draw any precise meaning from the relative positions of solvents within a single cluster and those in adjacent clusters, since these positions are the result of different MDS plots. This is perhaps the single limitation of this visual approach, although the intent was to provide a map that was immediately practical and free of the ambiguity which would be created by considering all intersolvent distances, subsequently identified by clustering. The impact of the clearly defined clusters would then be lost.

Clustering Assessment. The probabilities of finding each type of crystalline form of HCT, CT, DHC, and BQT, observed from solution crystallization screens as described earlier, were calculated using three approaches: blind sampling, polarity-based clustering, and Strathclyde24-guided sampling. The polarity-based clustering scheme was obtained by splitting solvents into 24 bins according to their calculated log P value. This scheme is intended to represent a more realistic standard for comparison than entirely blind solvent sampling, though the latter remained a useful baseline. The clusters obtained by binning according to polarity are shown in Supporting Information Table S3. The expected chance of finding all observed crystalline forms by selecting 24 solvents via each approach is listed in Table 5. These results are complemented by the qualitative assessment of cluster efficiency based on mapping the experimental solution crystallization outcomes for


Figure 3. MDS plot of Strathclyde24 illustrating similarity and dissimilarity within and between clusters of solvents. Clusters containing one solvent are represented by single points, and those containing two solvents are represented by connected points. The boundaries of clusters of three or more solvents are illustrated by colored hulls. The central point of the cross in each cluster represents the cluster mean with the lines stretching one standard deviation in each direction.
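The embedding step behind a map such as Figure 3 can be sketched with classical (Torgerson) MDS, which recovers low-dimensional coordinates from a matrix of pairwise distances. The text states only that the MDS plots were created in R, so the choice of classical MDS, the function name `classical_mds`, and the power-iteration eigen-solver used here to stay within the Python standard library are all assumptions made for illustration. The four test points are hypothetical stand-ins for cluster means.

```python
import random

def classical_mds(dist, n_components=2, iterations=500, seed=0):
    """Embed n objects in n_components dimensions from an n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J (distance matrix assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * n_components for _ in range(n)]
    for c in range(n_components):
        v = [rng.random() for _ in range(n)]
        for _ in range(iterations):              # power iteration on B
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in w) ** 0.5
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(x * y for x, y in zip(v, bv))   # Rayleigh quotient
        if eigenvalue <= 1e-12:
            break                                 # no further positive axis
        for i in range(n):
            coords[i][c] = eigenvalue ** 0.5 * v[i]
        # Deflate so the next pass finds the next-largest eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords

# Hypothetical example: four points at the corners of a 4 x 2 rectangle.
pts = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
dist = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for (bx, by) in pts]
        for (ax, ay) in pts]
xy = classical_mds(dist)
```

The recovered coordinates match the input geometry only up to rotation, reflection, and translation, which is why distances between points on such a map are meaningful but absolute positions and axis directions are not.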

may be attributed to the lack of one cluster solely forming solvates. Given that there are clusters in the map which only contain form I or no outcome, if there had been a cluster containing only solvates, the guided sampling expected value would be 100%. It is worth noting, however, that the solvents in cluster 6 all form solvates with CT with the exception of methyl acetate. This poses the question: under the correct experimental conditions, would CT form a solvate with methyl acetate? Classification of experimental screening results onto the MDS plot in this manner demonstrates the value of Strathclyde24, not only in experimental design but also for retrospective assessment of screening performance.36

DHC poses the biggest challenge in terms of assessing the effectiveness of clustering and as a measure of how many distinct forms of a compound could realistically be identified from a reduced set of experiments. Aside from no outcome, there were four distinct experimental outcomes associated with DHC: forms I, II, and III and solvates (Table 2). The expected value for finding all forms of DHC based on Strathclyde24 is 100%. This is clearly represented in Figure 4c, where it can be seen that the majority of clusters have only one type of experimental outcome associated with them. The tight clustering for all outcome categories observed in this figure further supports the reliability of Strathclyde24.

The screening results for BQT are illustrated in Figure 4d. The majority of crystallizations produced form I, with only two solvents yielding form II (chloroform and carbon tetrachloride) and two solvents forming solvates (1-methylnaphthalene and acetic acid). It was interesting to evaluate how many of the

Table 5. Probability of Finding All Forms Observed from HCT, CT, DHC, and BQT Screens, as Outlined in Table 2, Using Blind Sampling of 24 Solvents Compared to Sampling Based on Polarity and Randomly Selecting One Solvent from Each Cluster of Strathclyde24

compound   blind sampling (%)   polarity-based sampling (%)   Strathclyde24 (%)
HCT        34                   25                            100
CT         96                   100                           95
DHC        30                   20                            100
BQT        36                   43                            100

would be observed. In practice, therefore, observation of a solvate would trigger subsequent experiments with similar solvents identified from the cluster to explore whether further solvates are formed.

In contrast to HCT, the map for CT (Figure 4b) shows that recrystallization from 30 out of 67 solvents did not yield sufficient sample for analysis. This is principally due to poor solubility of the material. A benefit of using the map to view experimental outcomes, even those considered as “no outcome”, is that it further supports the effectiveness of solvent clustering. Those solvents in which CT exhibited poor solubility are also loosely clustered in the higher region of the map. Additionally, those solvents which produced form I or a solvate are also well grouped on the map. The fact that the expected value for using Strathclyde24-guided sampling was not a significant improvement in comparison to random sampling


Figure 4. Strathclyde24 MDS plot with solvent positions as for Figure 3, for (a) HCT, (b) CT, (c) DHC, and (d) BQT, colored according to experimental outcomes of solution crystallization screens. Solvents represented by small dots were not included in crystallization screens.

simple method to visualize multidimensional solvent space and provide an intuitive route to solvent subset selection. Furthermore, mapped MDS plots highlight a high correlation between solvent type and experimental outcome and provide confidence both in the calculated molecular descriptors selected to model the library and in the final clustering output. Such a high correlation suggests that these descriptors may be useful for subsequent machine learning and data mining techniques applied to polymorph screening results, such as those applied to the dibenzazepine drug carbamazepine.13,36 As a starting point for an experimental approach, it can be concluded that a compound that does not show extensive physical form diversity (polymorphs and solvates), and for which reasonable solubility is achieved, when screened using individual solvents selected from each of the Strathclyde24 clusters is unlikely to show extensive variability in a more extended screen, although further forms cannot of course be ruled out. In crystallizations where solvent identity is not the driving force,37,38 the clustering may not yield an accurate picture. This approach only takes solvent properties into account; other factors such as temperature, supersaturation, and agitation, which can be varied, also have a significant effect on crystallization. However, it is recommended that the descriptors presented here are also included in any informatics analysis, in combination with other experimental parameters, to investigate crystallization processes and what governs a particular physical form outcome.

forms other than form I would be identified from a reduced crystallization screen and the calculated probability was another positive result, 100%. As before, this has the disadvantage that only one solvate would be observed but, unlike the previous examples, it would not be possible to find the additional solvate by simply exploring regions close to cluster 2 in the map. Thus, both the qualitative and quantitative methods of assessment demonstrate successful clustering of the solvent library that shows considerable promise in terms of practical exploitation and application. The calculated probabilities alone indicate that Strathclyde24 is effective, but the mapped MDS plots give a more comprehensive account. Using these plots, it is also possible to identify solute/solvent relationships such as solubility and solvate formation.
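The kind of probability discussed above can be sketched in a few lines of Python. This is an assumed formulation, not necessarily the calculation used in the paper: cluster-guided sampling is modeled as drawing one solvent uniformly at random from each cluster, and random sampling as drawing the same number of solvents without replacement from the whole screened set. The function names and the per-cluster counts are hypothetical and are not taken from the experimental screens.

```python
from math import comb


def p_hit_cluster_guided(clusters):
    """Probability of at least one 'hit' (e.g., a form other than form I)
    when one solvent is drawn uniformly from each cluster.

    clusters: list of (n_screened, n_hits) tuples, one per cluster.
    """
    p_miss = 1.0
    for n_screened, n_hits in clusters:
        if n_screened == 0:
            continue  # nothing screened in this cluster, so no draw
        p_miss *= (n_screened - n_hits) / n_screened
    return 1.0 - p_miss


def p_hit_random(total, hits, draws):
    """Probability that `draws` solvents chosen without replacement
    from `total` screened solvents include at least one of `hits`."""
    return 1.0 - comb(total - hits, draws) / comb(total, draws)


# Made-up counts: three clusters, the second containing only hits.
toy_clusters = [(4, 0), (3, 3), (5, 1)]
guided = p_hit_cluster_guided(toy_clusters)

total = sum(n for n, _ in toy_clusters)   # 12 solvents screened
hits = sum(k for _, k in toy_clusters)    # 4 of them give a hit
random_baseline = p_hit_random(total, hits, draws=len(toy_clusters))
```

With these toy counts the guided probability is 1 because one cluster consists entirely of hits, which is the situation in which a cluster-based screen is guaranteed to find the form, while the random baseline remains below 1.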



CONCLUSIONS

We have focused on generating an efficient means to rationalize solvent selection for effective physical form screening via a novel approach to clustering a chemical solvent library. This approach is equally viable for any application concerning grouping of molecules. The clustering method presented here is only one of many possible techniques for library design or similarity grouping, and as discussed earlier, the advantages of this particular method lie in the range of clustering methods explored. The graphical representation of solvent library clusters, Strathclyde24, provides an effective gauge of similarity and dissimilarity within the library and is a


(8) Alleso, M.; Rantanen, J.; Aaltonen, J.; Cornett, C.; van den Berg, F. Solvent subset selection for polymorph screening. J. Chemom. 2008, 22, 621−631.
(9) Gu, C. H.; Li, H.; Gandhi, R. B.; Raghavan, K. Grouping solvents by statistical analysis of solvent property parameters: Implication to polymorph screening. Int. J. Pharm. 2004, 283, 117−125.
(10) Hasa, D.; Miniussi, E.; Jones, W. Mechanochemical synthesis of multicomponent crystals: One liquid for one polymorph? A myth to dispel. Cryst. Growth Des. 2016, 16, 4582−4588.
(11) Price, S. L. Predicting crystal structures of organic compounds. Chem. Soc. Rev. 2014, 43, 2098−2111.
(12) Neumann, M. A.; van de Streek, J.; Fabbiani, F. P. A.; Hidber, P.; Grassmann, O. Combined crystal structure prediction and high-pressure crystallization in rational pharmaceutical polymorph screening. Nat. Commun. 2015, 6, 7793.
(13) McCabe, J. F. Application of design of experiment (doe) to polymorph screening and subsequent data analysis. CrystEngComm 2010, 12, 1110−1119.
(14) Xu, D.; Redman-Furey, N. Statistical cluster analysis of pharmaceutical solvents. Int. J. Pharm. 2007, 339, 175−188.
(15) Gramatica, P.; Navas, N.; Todeschini, R. Classification of organic solvents and modelling of their physico-chemical properties by chemometric methods using different sets of molecular descriptors. TrAC, Trends Anal. Chem. 1999, 18, 461−471.
(16) Rannar, S.; Andersson, P. L. A novel approach using hierarchical clustering to select industrial chemicals for environmental impact assessment. J. Chem. Inf. Model. 2010, 50, 30−36.
(17) Johnston, A.; Florence, A. J.; Shankland, N.; Kennedy, A. R.; Shankland, K.; Price, S. L. Crystallization and crystal energy landscape of hydrochlorothiazide. Cryst. Growth Des. 2007, 7, 705−712.
(18) Hulme, A. T.; Johnston, A.; Florence, A. J.; Fernandes, P.; Shankland, K.; Bedford, C. T.; Welch, G. W. A.; Sadiq, G.; Haynes, D. A.; Motherwell, W. D. S.; Tocher, D. A.; Price, S. L. Search for a predicted hydrogen bonding motif - a multidisciplinary investigation into the polymorphism of 3-azabicyclo[3.3.1]nonane-2,4-dione. J. Am. Chem. Soc. 2007, 129, 3649−3657.
(19) Hubert, L. Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures. J. Am. Stat. Assoc. 1974, 69, 698−704.
(20) Walesiak, M.; Dudek, A. Clustersim, v0.38-1; 2010.
(21) Molecular operating environment, 2009.10; Chemical Computing Group, 2009.
(22) Tetko, I. V.; Gasteiger, J.; Todeschini, R.; Mauri, A.; Livingstone, D.; Ertl, P.; Palyulin, V.; Radchenko, E.; Zefirov, N. S.; Makarenko, A. S.; Tanchuk, V. Y.; Prokopenko, V. V. Virtual computational chemistry laboratory - design and description. J. Comput.-Aided Mol. Des. 2005, 19, 453−463.
(23) E-dragon; Virtual Computational Chemistry Laboratory, 2005.
(24) R: A language and environment for statistical computing; R Development Core Team: Vienna, Austria, 2009.
(25) Florence, A. J.; Johnston, A.; Fernandes, P.; Shankland, N.; Shankland, K. An automated platform for parallel crystallization of small organic molecules. J. Appl. Crystallogr. 2006, 39, 922−924.
(26) Calinski, R. B.; Harabasz, J. A dendrite method for cluster analysis. Communications in Statistics 1974, 3, 1−27.
(27) Everitt, B. S.; Landau, E.; Leese, M. Cluster analysis; Arnold: London, 2001; p 103−104.
(28) Gatnar, E.; Walesiak, M. Metody statystycznej analizy wielowymiarowej w badaniach marketingowych [multivariate statistical analysis methods in marketing research]; Wydawnictwo AE: Wroclaw, 2004; p 338−339.
(29) Gordon, A. D. Classification; Chapman & Hall/CRC: London, 1999; p 62.
(30) Milligan, G. W.; Cooper, M. C. An examination of procedures for determining the number of clusters in a data set. Psychometrika 1985, 50, 159−179.
(31) Kaufman, L.; Rousseeuw, P. J. Finding groups in data: An introduction to cluster analysis; Wiley: New York, 1990; p 83−88.

It can be concluded that this approach to experimental design, where solvent identity plays a key role, can significantly reduce the number of experiments required for solid form screening. It can also provide insight into the areas of chemical space being searched and whether the solvents sampled therein are sufficiently diverse, since crystallization from diverse solvents may increase the likelihood of finding all solid forms. The approach can also be used in optimizing the crystallization process by providing a guide to molecular solubility from only 24 experiments.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00038. List of calculated descriptors and tables of top 10 clustering approaches to the solvent library (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone: +44 (0) 1415485756.

ORCID

Rajesh Gurung: 0000-0003-1822-5075
Antony D. Vassileiou: 0000-0001-8146-8972
Alastair J. Florence: 0000-0002-9706-8364
Blair. F. Johnston: 0000-0001-9785-6822

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The authors would like to acknowledge the EPSRC Centre for Innovative Manufacturing in Continuous Manufacturing and Crystallization (grant ref EP/I033459/1) and the EPSRC Doctoral Training Centre in Continuous Manufacturing and Crystallization (grant ref EP/K503289/1) for funding this work.



REFERENCES

(1) Florence, A. J. The solid state. In Basic principles and systems; Florence, A. T., Siepman, J., Eds.; Informa Healthcare: New York, 2009; Vol. 1, Chapter 8, pp 253−310.
(2) Hilfiker, R.; Paul, S. M. D.; Szelagiewicz, M. Polymorphism: In the pharmaceutical industry; Hilfiker, R., Ed.; Wiley-VCH: Weinheim, 2006; Chapter 287−308.
(3) Florence, A. J. Approaches to high-throughput physical form screening and discovery. In Polymorphism in pharmaceutical solids; Brittain, H. G., Ed.; Informa Healthcare: New York, 2009; Vol. 192, pp 139−184.
(4) Morissette, S. L.; Almarsson, O.; Peterson, M. L.; Remenar, J. F.; Read, M. J.; Lemmo, A. V.; Ellis, S.; Cima, M. J.; Gardner, C. R. High-throughput crystallization: Polymorphs, salts, co-crystals and solvates of pharmaceutical solids. Adv. Drug Delivery Rev. 2004, 56, 275−300.
(5) Parmar, M. M.; Khan, O.; Seton, L.; Ford, J. L. Polymorph selection with morphology control using solvents. Cryst. Growth Des. 2007, 7, 1635−1642.
(6) Kitamura, M. Strategy for control of crystallization of polymorphs. CrystEngComm 2009, 11, 949−964.
(7) Lang, P.; Kiss, V.; Ambrus, R.; Farkas, G.; Szabo-Revesz, P.; Aigner, Z.; Varkonyi, E. Polymorph screening of an active material. J. Pharm. Biomed. Anal. 2013, 84, 177−183.


(32) Rousseeuw, P. J. Silhouettes - a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53−65.
(33) Johnston, A.; Bardin, J.; Johnston, B. F.; Fernandes, P.; Kennedy, A. R.; Price, S. L.; Florence, A. J. Experimental and predicted crystal energy landscapes of chlorothiazide. Cryst. Growth Des. 2011, 11, 405.
(34) Arlin, J. B.; Johnston, A.; Miller, G. J.; Kennedy, A. R.; Price, S. L.; Florence, A. J. A predicted dimer-based polymorph of 10,11-dihydrocarbamazepine (form iv). CrystEngComm 2010, 12, 64−66.
(35) EPSRC.
(36) Johnston, A.; Johnston, B. F.; Kennedy, A. R.; Florence, A. J. Targeted crystallisation of novel carbamazepine solvates based on a retrospective random forest classification. CrystEngComm 2008, 10, 23−25.
(37) Florence, A. J.; Johnston, A.; Price, S. L.; Nowell, H.; Kennedy, A. R.; Shankland, N. An automated parallel crystallisation search for predicted crystal structures and packing motifs of carbamazepine. J. Pharm. Sci. 2006, 95, 1918−1930.
(38) Bhardwaj, R. M.; Price, L. S.; Price, S. L.; Reutzel-Edens, S. M.; Miller, G. J.; Oswald, I. D. H.; Johnston, B. F.; Florence, A. J. Exploring the experimental and computed crystal energy landscape of olanzapine. Cryst. Growth Des. 2013, 13, 1602−1617.
