Combined Chemoinformatics Approach to Solvent Library Design


Article

A Combined Chemoinformatics Approach to Solvent Library Design using clusterSim and Multi-Dimensional Scaling

Andrea Johnston, Rajni Miglani Bhardwaj, Rajesh Gurung, Antony D Vassileiou, Alastair J. Florence, and Blair F. Johnston

J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00038 • Publication Date (Web): 30 Jun 2017



Journal of Chemical Information and Modeling

A Combined Chemoinformatics Approach to Solvent Library Design using clusterSim and Multi-Dimensional Scaling

Andrea Johnston,† Rajni Bhardwaj-Miglani,† Rajesh Gurung,# Antony D. Vassileiou,† Alastair J. Florence,† and Blair F. Johnston†*

† EPSRC Centre for Innovative Manufacturing in Continuous Manufacturing and Crystallisation, c/o Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Technology and Innovation Centre, 99 George Street, Glasgow G1 1RD, United Kingdom

# EPSRC Doctoral Training Centre in Continuous Manufacturing and Crystallisation, c/o Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Technology and Innovation Centre, 99 George Street, Glasgow G1 1RD, United Kingdom

*Corresponding author: email: [email protected]; phone: +44 (0)1415485756.

ABSTRACT

Reported here is a rational approach for the selection of solvents intended for use in physical form screening based on a novel chemoinformatics analysis of solvent properties. A comprehensive assessment of eight clustering methods was carried out on a series of 94 solvents described by calculated molecular descriptors using the clusterSim package in R. The effectiveness of the clustering methods was evaluated using a range of statistical measures as well as by the increase in efficiency of solid form discovery achieved with a cluster-based solvent selection approach. Multi-dimensional scaling was used to illustrate cluster analysis on a 2-dimensional solvent map. The map presented here is a valuable tool to aid efficient solvent selection in physical form screens. This tool is equally applicable to any scientific area which requires a solubility-dependent decision on solvent choice.

KEYWORDS: solvent library design, informatics, clustering, molecular descriptors, polymorph, screening, solubility, solvent

INTRODUCTION

The primary aims of experimental physical form screening are to obtain the maximum number of different crystalline forms, including polymorphs, solvates, salts or co-crystals, of the compound under study and to identify the most thermodynamically stable form.1, 2 This should be achieved as efficiently as possible, using the minimum amount of material (solute and solvent) and the fewest experiments necessary to ensure that all practically relevant forms have been found. Solution crystallization is a key element of any rigorous physical form screen and is also amenable to small-scale multi-well plate experiments, enabling high-throughput approaches to be employed to deliver possibly thousands of experiments covering large regions of crystallization space.3

Solution crystallization in itself covers a broad range of techniques and variables. The most commonly employed crystallizations from solution include cooling crystallization, anti-solvent crystallization, evaporative crystallization, and slurry crystallization. Typically, effects of experimental variables on the resultant solid form are investigated but the extent of their influence on the crystallization outcome is not always fully understood. Experimental variables may include solvent identity, solution supersaturation, temperature, agitation and heteronuclei such as impurities or templates.4-7 Solvent choice is a crucial factor in the successful identification of new crystalline forms,8, 9 and it is common practice to crystallize the molecule of interest from a wide range of solvents.2, 10 Changes in solution chemistry, effected by the use of large numbers of solvents and solvent mixtures, provide access to an extensive variation in critical crystallization conditions which can in turn lead to nucleation and growth of different crystalline forms. Solvent choice may also be restricted by safety considerations; however, for the purposes of this work, all safety classes of solvents have been considered.

A key objective in the pharmaceutical industry and solid-state science community is the ability to predict crystal structures of small organic molecules. However, predicting the range of thermodynamically feasible structures for a given molecule ab initio remains an exceptionally challenging area11, 12 and so rigorous experimentation remains a prerequisite. In the absence of reliable tools for the prediction of the solid form landscape, it is essential to design screens to maximize the diverse chemical space covered whilst minimizing the number of individual experiments required to sample it. To achieve this, chemoinformatics techniques have been applied to assist with experimental design and solvent sampling.8, 9, 13-15 Whilst this clustering of solvents was designed to assess chemical diversity for a specific application as described, this mapping of solvents is widely applicable to any situation where solubility of a given compound in a range of solvents is a consideration.

There have been several reports in the literature of solvent library clustering and sampling based on the analysis of properties used to describe individual solvent molecules, using techniques such as principal components analysis (PCA),14 design of experiments (DoE),13 self-organizing maps (SOMs)8 and hierarchical clustering.14, 16 Some studies report the use of physicochemical properties of solvents obtained by a computational approach, which utilizes comprehensive sets of calculated molecular descriptors,16 and some have included a combination of calculated and experimentally measured physicochemical properties.8, 14, 17, 18 To date, studies of this type share a common theme: they all use PCA either as a preliminary or major step in cluster identification. The main disadvantage of PCA is that any non-linear correlation between variables will not be captured and inevitably a certain percentage of variance is lost. Although often used to identify patterns in data, PCA is not a clustering method and therefore studies are comparatively subjective and prone to error.


DoE is an alternative approach for solvent sub-set selection, utilizing factorial, D-optimal or D-optimal onion design methods.8 It is used to reduce the number of experiments required whilst allowing the exploration of crystallization space in a systematic manner. A limitation of DoE when considering the variation in a single variable (solvent identity, in this case) arises from a statistical randomization approach that does not include any weighting in relation to which descriptors are used and which are redundant. In other words, chemical insight can be lost by retention of many descriptors which describe similar parameters whilst dismissing parameters that may describe different chemical features. As a result, mainly outliers will be selected and, even with onion designs, only one substance will be offered from each region, or layer, of chemical space. The advantages basic clustering methods have over DoE are that the scientist can choose from alternative substances within each region of chemical space and, if required, identify properties or regions that merit further exploration.16

Perhaps one of the most important steps when clustering any data set is the selection of an appropriate clustering algorithm.19 This study differs from previous library design reports in that the package used, clusterSim,20 does not rely solely on any one approach but instead uses all available permutations of clustering methods to identify a user-specified number of clusters.20 The resulting clusters are then assessed statistically using five internal cluster indexes. An advantage of this method is that results are presented as clearly defined cluster lists or tables, as opposed to a two-dimensional (2-D) visual representation of multi-dimensional data. This removes any ambiguity which may arise from a 2-D plot of a homogeneous spread of solvents with poorly defined boundaries between groups.

Presented is a novel approach to solvent library modelling and solvent selection implemented in four steps: (i) calculation of solvent descriptors, (ii) clustering of solvents, (iii) graphical representation of clusters and finally (iv) cluster assessment using prior experimental physical form screening results in addition to statistical quantification of performance.



MATERIALS AND METHODS

Molecular Descriptors. The solvents (Figure 1) presented in this study are currently held within a library for solubility screening for solution crystallization and physical form searches as part of an academic research program. All solvents were modeled by converting 2-D molecular structures to three-dimensional (3-D) structures with Chemical Computing Group’s MOE software.21 A total of 1,968 thermodynamic, electronic, topological, spatial and feature-count molecular descriptors were computed for each solvent using MOE and E-dragon.22 These were reduced to 552 by removing any descriptors with zero variance or those correlated with another descriptor above a threshold of ninety percent. Zero-variance descriptors were identified using SIMCA-P v11 (ref 23) and a correlation matrix was generated within R.24 Additionally, PCA carried out in SIMCA-P highlighted a bias in the descriptor set towards size and aromaticity, thus a further 252 descriptors were manually removed to reduce this bias. A total of 250 descriptors remained; a list of these descriptors alongside a brief explanation is provided in Supporting Information (Table S1).
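The two automated filtering steps described above (dropping zero-variance descriptors, then dropping descriptors correlated above ninety percent) can be sketched as follows. This is a minimal illustration in Python rather than the SIMCA-P and R workflow used in the study; the function names, the greedy "keep the first of a correlated pair" rule and the toy descriptor values are all assumptions made for the example.

```python
import math
from statistics import mean, pvariance


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def filter_descriptors(columns, threshold=0.9):
    """Return names of descriptors surviving both filters.

    columns maps a descriptor name to its values, one value per solvent.
    """
    # Step 1: drop descriptors that do not vary across the solvents.
    varying = {n: v for n, v in columns.items() if pvariance(v) > 0}
    # Step 2 (assumed rule): keep a descriptor only if it is not strongly
    # correlated with any descriptor that has already been kept.
    kept = []
    for name, values in varying.items():
        if all(abs(pearson(values, varying[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy descriptors for four solvents (not data from the study).
toy = {
    "mass": [18.0, 32.0, 46.0, 60.0],
    "mass_x2": [36.0, 64.0, 92.0, 120.0],  # perfectly correlated with mass
    "n_rings": [0.0, 0.0, 0.0, 0.0],       # zero variance
    "logP": [-1.4, -0.8, 0.3, -0.2],
}
surviving = filter_descriptors(toy)
```

Which member of a correlated pair is retained depends on column order in this sketch; the study does not state the rule that was applied.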


Figure 1. Solvent library comprising 94 solvents.

Clustering. Clustering algorithms are used to group objects based on a similarity or distance criterion. The clusterSim package20 in R allows for a brute force approach to clustering, using permutations of various clustering methods to identify optimal clusters within a data set. A total of 80 distinct clustering approaches were applied to the data set consisting of 94 solvents × 250 descriptors. The combinations of methods that were used, including variations of normalization, distance measurements and clustering methods, are listed in Table 1. The number of clusters to be formed from the data is defined prior to performing the analysis. In this study, 24 clusters were selected to correspond with the capabilities of locally available laboratory hardware for automated parallel crystallization and subsequent X-ray powder diffraction analysis.25 The 80 clustering results were assessed internally in clusterSim using five cluster indexes: Calinski-Harabasz’s pseudo F-statistic (G1),26-30 Baker and Hubert’s adaptation of Goodman and Kruskal’s gamma statistic (G2),27-30 Hubert and Levine’s internal cluster quality index (G3),28-30 Krzanowski and Lai’s index (KL)28-30 and the Rousseeuw Silhouette internal cluster quality index (S).28, 29, 31, 32 The top ten ranked methods, using the assessment methods detailed above, were output for user assessment and the remaining 70 were discarded.
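As an illustration of how an internal index scores a partition, the sketch below computes the Calinski-Harabasz pseudo F-statistic in its standard textbook form: the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k), for n objects in k clusters. It is not the clusterSim implementation; the function name and the four toy points are hypothetical.

```python
def calinski_harabasz(points, labels):
    """Pseudo F-statistic for a partition of points (tuples) into clusters."""
    n, dims = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    grand = [sum(p[d] for p in points) / n for d in range(dims)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dims)]
        # Between-cluster dispersion: size-weighted squared distance of the
        # cluster centroid from the grand mean.
        between += len(members) * sum((c - g) ** 2
                                      for c, g in zip(centroid, grand))
        # Within-cluster dispersion: squared distances to the own centroid.
        within += sum((p[d] - centroid[d]) ** 2
                      for p in members for d in range(dims))
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two tight, well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = calinski_harabasz(toy_points, [0, 0, 1, 1])  # natural grouping
poor = calinski_harabasz(toy_points, [0, 1, 0, 1])  # grouping across the gap
```

A larger value indicates compact, well-separated clusters, which is why the natural grouping scores far higher than the one that pairs distant points. The function divides by the within-cluster dispersion, so it is undefined when every cluster is a single point.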

Table 1. Combination of clustering methods used within clusterSim.

| Normalisations | Distance measures | Clustering methods |
|---|---|---|
| N1: (x-mean)/sd | D1: Manhattan | C1: Single link |
| N5: (x-mean)/max[abs(x-mean)] | D2: Euclidean | C2: Complete link |
| | D3: Chebyshev | C3: Average link |
| | D4: squared Euclidean | C4: McQuitty |
| | D5: GDM1 (General distance measure interval) | C5: K-medoids (PAM) |
| | | C6: Ward |
| | | C7: Centroid |
| | | C8: Median |
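The count of 80 clustering approaches follows from crossing the two normalisations, five distance measures and eight clustering methods in Table 1 (2 × 5 × 8). The sketch below enumerates these combinations and writes out the two normalisation formulas. The labels are taken from Table 1; the function names and the use of the sample standard deviation for N1 are assumptions, and no clusterSim call is reproduced here.

```python
from itertools import product
from statistics import mean, stdev

NORMALISATIONS = ("N1", "N5")
DISTANCES = ("Manhattan", "Euclidean", "Chebyshev",
             "squared Euclidean", "GDM1")
METHODS = ("single link", "complete link", "average link", "McQuitty",
           "PAM", "Ward", "centroid", "median")

# Every (normalisation, distance, method) triple: 2 * 5 * 8 = 80.
combinations = list(product(NORMALISATIONS, DISTANCES, METHODS))


def normalise_n1(values):
    """N1: (x - mean) / sd, assuming the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def normalise_n5(values):
    """N5: (x - mean) / max|x - mean|, so values fall within [-1, 1]."""
    m = mean(values)
    scale = max(abs(v - m) for v in values)
    return [(v - m) / scale for v in values]


n1_example = normalise_n1([1.0, 2.0, 3.0])
n5_example = normalise_n5([1.0, 2.0, 6.0])
```

N5 bounds every descriptor to the same range, whereas N1 gives each descriptor unit variance; the two can therefore weight outlying solvents differently in the subsequent distance calculation.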

Multi-Dimensional Scaling (MDS) of Molecular Descriptors and Clustering. An MDS plot of the best solvent library clustering was created to illustrate the similarity and dissimilarity within and between clusters. The data set was divided into 24 clusters based on the outcomes from clustering. A mean value for the descriptors of all solvents within each cluster was calculated and MDS was performed on these means to give a central point on the plot for each cluster, from which lines extend one standard deviation in each direction. This provides a means for assessing similarity between clusters. Subsequently, 24 individual MDS plots were created, which were then overlaid and centered on the means of each cluster to give the positions of individual solvents in relation to the cluster mean. This provides a gauge for similarity within a cluster. All MDS plots were created in R.

Clustering Assessment. The effectiveness of solvent library clustering was assessed qualitatively and quantitatively using the results of prior experimental solution crystallization screens for four compounds (Figure 2), namely hydrochlorothiazide17 (HCT), chlorothiazide33 (CT), 10,11-dihydrocarbamazepine34 (DHC) and 3-azabicyclononane-2,4-dione18 (BQT), that were carried out under the auspices of the CPOSS project.35 These four compounds were selected for clustering assessment because their solution crystallization screens yielded a diverse set of types and numbers of physical forms. Furthermore, the range of solvents in which each physical form (i.e. polymorph/solvate) was observed (Table 2) varied. For example, only one non-solvated form of CT was obtained whereas three polymorphs of DHC were observed, two of which are readily obtained from a number of solvents and the other from only a single solvent.

Figure 2. Molecular diagrams of compounds from left to right: HCT, CT, DHC and BQT. Results from solution crystallization screens of these compounds17, 18, 33, 34 were used to assess the effectiveness of clustering results.

Table 2. Number and types of forms observed for each compound in physical form screens and also the number of solvents from which each was obtained.

| Compound / Screening condition (a) | Crystalline form | Number of solvents |
|---|---|---|
| HCT. Results representative of condition co-solvent (a, b) | Form I | 58 |
| | Form II | 2 |
| | Solvate | 7 |
| | No outcome | 0 |
| CT. Results representative of condition Tsat (max) (a, b, c) | Form I | 30 |
| | Solvate | 7 |
| | No outcome | 30 |
| DHC. Results representative of condition Tsat (max) (a, b, c) | Form I | 34 |
| | Form II | 34 |
| | Form III | 1 |
| | Solvate | 6 |
| | No outcome | 3 |
| BQT. Results representative of condition Tsat (max) (a, b, c) | Form I | 60 |
| | Form II | 2 |
| | Solvate | 2 |
| | No outcome | 3 |

(a) Physical form screens were performed using 67 solvents and a number of different crystallization conditions for each compound.
(b) Results used in MDS mapping are accurate for each condition as reported in cited publications, excluding that for DHC form III, which was obtained from a manual recrystallization from methanol.
(c) Tsat (max): solutions were saturated at ~10 °C below the boiling point of a solvent prior to filtration and crystallization by cooling.

A quantitative method was employed whereby the probability of finding all physical forms obtained from the experimental screens (Table 2) was assessed using either random selection of 24 solvents from the entire library or selection of 24 solvents across the solvent clusters, one from each. The likelihood of observing all physical forms from 24 solvents had they been grouped according to polarity only was also assessed. For each method of selection, 10,000 random samplings of 24 solvents were made and the probability of discovering all form types determined. Solvents which were not used during the experimental screens were not sampled. Where a solvent selection yielded all observed forms, a value of 1 was assigned to that selection; if not all expected forms were observed, a value of 0 was assigned. These values were expressed as a percentage success rate, where 100% means that all physical forms were observed in every solvent selection. Only outcomes assigned as forms I, II, III or solvate were counted when assessing success.
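The sampling procedure described above can be sketched as a small Monte Carlo routine. The solvent names, outcomes and clusters below are hypothetical toy data, not results from the screens, and the function names are illustrative. Only the logic follows the text: draw a selection, score 1 if every observed form type is covered and 0 otherwise, then report the percentage over 10,000 draws.

```python
import random

# Hypothetical toy data: forms observed from each screened solvent.
OUTCOMES = {
    "s1": {"form I"}, "s2": {"form I"}, "s3": {"form I"},
    "s4": {"form I"}, "s5": {"form II"}, "s6": {"solvate"},
}
# Hypothetical clusters covering the screened solvents.
CLUSTERS = [["s1", "s2", "s3", "s4"], ["s5"], ["s6"]]


def success_rate(outcomes, draw, n_trials=10_000, seed=0):
    """Percentage of random selections that recover every observed form."""
    rng = random.Random(seed)
    all_forms = set().union(*outcomes.values())
    hits = 0
    for _ in range(n_trials):
        found = set()
        for solvent in draw(rng):
            found |= outcomes[solvent]
        hits += found == all_forms  # scores 1 for success, 0 otherwise
    return 100.0 * hits / n_trials


def blind(rng):
    """Pick as many solvents as there are clusters, ignoring clusters."""
    return rng.sample(sorted(OUTCOMES), len(CLUSTERS))


def guided(rng):
    """Pick one solvent at random from each cluster."""
    return [rng.choice(cluster) for cluster in CLUSTERS]


blind_rate = success_rate(OUTCOMES, blind)
guided_rate = success_rate(OUTCOMES, guided)
```

In this toy case the rare forms each sit in their own cluster, so the cluster-guided draw always succeeds while the blind draw succeeds only when it happens to include both rare solvents.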

A visual representation or qualitative assessment of the effectiveness of solvent library clustering was achieved by using the results from the physical form screens (Table 2) and mapping them onto the final MDS plot. Thus four new plots were created, which were colored according to crystallization outcome from particular solvents.

RESULTS AND DISCUSSION



Clustering Results. clusterSim was used to perform 80 different methods of clustering (Table 1) on the library of 94 solvents (Figure 1), each represented by 250 molecular descriptors. The two statistically best clustering results (one from each normalization method) as assessed by each of the five internal quality indexes in clusterSim are listed in Table 3. It is not feasible to compare index values originating from different metrics, thus each output listed in Table 3 was manually assessed for chemical similarity within solvent clusters.

Table 3. clusterSim output showing the 10 best clusters as assessed by the cluster quality indices.

| Cluster output | Normalization | Distance measure | Cluster method | Metric | Index value |
|---|---|---|---|---|---|
| 1 | N5 | Euclidean | PAM | G1 | 15.87106 |
| 2 | N1 | Squared Euclidean | Ward | G1 | 10.49527 |
| 3 | N5 | Chebyshev | McQuitty | KL | 5.542532 |
| 4 | N1 | Chebyshev | Average link | KL | 5.277713 |
| 5 | N5 | GDM1 | Average link | G2 | 0.953114 |
| 6 | N1 | Chebyshev | Average link | G2 | 0.939211 |
| 7 | N5 | GDM1 | Average link | S | 0.456094 |
| 8 | N1 | Chebyshev | Single link | S | 0.424797 |
| 9 | N1 | GDM1 | Complete link | G3 | 0.079828 |
| 10 | N5 | GDM1 | PAM | G3 | 0.059007 |

The importance of a manual assessment was exemplified when it was found that the results of those methods assessed by the KL, G2 and S internal cluster indexes (cluster outputs 3-8) were not practical in terms of equal sampling across the library: the algorithms had isolated the most diverse chemicals into individual clusters, leaving one or two clusters which comprised the remainder of the library. Statistically reasonable clusters, as assessed by these three indexes, were not chemically sensible. Of the remaining clustering results (outputs 1, 2, 9 and 10), cluster output 1 (Table 3) gave the most reasonable chemical spread of solvents and provided sensible grouping within clusters in the majority of cases. The best clustering result (output 1) was achieved by using normalization method N5 combined with a Euclidean distance measure and PAM clustering, assessed with the G1 metric. Cluster results for output 1 are provided in Table 4 and have been termed Strathclyde24. Details of all cluster outputs listed in Table 3 are provided in Supporting Information, Tables S2a-j.



Table 4. Strathclyde24 - Solvent library clustering obtained from output 1 in Table 3.

| Cluster* | Solvents |
|---|---|
| 1 | 1,5-pentanediol, 2-butoxyethanol, di(2-methoxyethyl) ether, 2-ethoxyethanol, 2-methoxyethanol, 1,2-dimethoxyethane, 2-amino-1-butanol |
| 2 | 1-methylnaphthalene |
| 3 | nitrobenzene, furfural, 2-phenylethanol, anisole |
| 4 | formamide, acetic acid, methanoic acid |
| 5 | dodecane, 2,2,4-trimethylheptane, heptane, hexane |
| 6 | N-methyl-2-pyrrolidone, N,N-dimethylacetamide, N,N-dimethylformamide, methyl acetate |
| 7 | 1-octanol, 2-octanol, 1-hexanol, 1-pentanol, 3-methyl-1-butanol, 2-pentanol, 1-butanol, triethylamine |
| 8 | dimethylsulfoxide, acetone |
| 9 | xylene, toluene, benzene, 3-methylthiophene, 3-fluorotoluene, bromobenzene, iodobenzene, 4-fluorotoluene, aniline, pyridine |
| 10 | butyric acid, pentyl acetate, butyl acetate, diethyl carbonate, 4-methyl-2-pentanone, isobutyl acetate, ethyl acetate, ethyl lactate, butyl lactate |
| 11 | 2-propanol, ethanol, cyclohexanol, 2-methyl-1-propanol, 2-butanol, 1-propanol, 2-butanone, 1,2-propanediol |
| 12 | dibutyl ether, 2-methoxy-2-methylpropane, diethyl ether, diethyl sulfide |
| 13 | tetrachloroethane, trichloroethene |
| 14 | 1,4-dioxane, tetrahydrofuran, tetrahydrothiophene, 1,3-dioxane |
| 15 | nitromethane |
| 16 | water |
| 17 | 1,2-dichloroethane, 1-bromo-2-chloroethane, diethyl disulfide |
| 18 | acetonitrile, ethanethiol, methyl sulfide, 2-propanethiol, thioacetic acid |
| 19 | cyclohexane, cyclopentane |
| 20 | 1-chlorobutane, 2-bromobutane, 1-bromobutane, 2-iodobutane |
| 21 | carbon tetrachloride, chloroform, dichloromethane, bromoform |
| 22 | trifluoroethanol, trifluoroacetic acid |
| 23 | methanol |
| 24 | iodomethane |

* Clusters are illustrated graphically in Figure 3.


Visualization of Clusters using MDS. A graphical representation of Strathclyde24 is presented in Figure 3. MDS was performed on each individual cluster. The central point of the cross in each cluster represents the cluster mean, with the lines stretching one standard deviation in each direction. The crosses are themselves positioned according to an overarching MDS of all cluster means. This plot is therefore substantially more informative than viewing clusters in tabular format: in addition to identifying similar solvents due to their placement in the same cluster, it allows interpretation of degrees of similarity and dissimilarity between different clusters.

There are three key points of note regarding interpretation of the solvent map. Firstly, the inter-cluster similarities can be assessed by comparing the distances between the clusters’ center points; the smaller the distance, the more similar the properties of solvents within these clusters are likely to be. Likewise, the farther apart the clusters reside on the map, the more diverse the properties of the solvents within those clusters will be.

Secondly, the intra-cluster distance between solvents is also representative of their similarity. For example, cluster 11 (Table 4) contains eight solvents, namely 2-propanol, 1-propanol, ethanol, cyclohexanol, 2-methyl-1-propanol, 2-butanol, 2-butanone and 1,2-propanediol, and is located in the lower central area of the map (shaded light green). It can be surmised from looking at this cluster that 2-butanol, 2-methyl-1-propanol, 2-propanol and 1-propanol are particularly similar, as the points for each solvent lie in close proximity; they are also the most representative of the cluster as they are closest to the central point. Also, within this cluster it can be assumed that 1,2-propanediol is least like the remaining solvents and in fact appears almost as an outlier, being not only well separated from the other solvents, but also a large distance from the mean. Similarly, triethylamine appears as a relative outlier in cluster 7, as expected.

The third point is less intuitive than the former two: it is not possible to draw any precise meaning from the relative positions of solvents within a single cluster and those in adjacent clusters, since these positions are the result of different MDS plots. This is perhaps the single limitation of this visual approach, although the intent was to provide a map that was immediately practical and bereft of the ambiguity which would be created by considering all inter-solvent distances, with groups subsequently identified by clustering; the impact of the clearly defined clusters would then be lost.
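For readers unfamiliar with MDS, the sketch below shows one common variant, classical (Torgerson) MDS, which places objects in a low-dimensional plane so that their pairwise distances are preserved as far as possible. The text does not state which MDS routine was used in R, so this choice is an assumption (R's `cmdscale` performs classical MDS). The function name, the power-iteration eigen-solver and the four toy points are illustrative only.

```python
import math


def classical_mds(dist, dims=2, iters=500):
    """Embed n objects in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to obtain the Gram matrix B.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the leading eigenpair of the current B.
        v = [math.sin(i + 1 + k) for i in range(n)]  # deterministic start
        eigenvalue = 0.0
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in this direction
                eigenvalue = 0.0
                break
            v = [x / norm for x in w]
            eigenvalue = norm
        for i in range(n):
            coords[i][k] = math.sqrt(eigenvalue) * v[i]
        # Deflate B so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Hypothetical toy objects: corners of a 4 x 3 rectangle.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedded = classical_mds(dist)
```

Under these assumptions, the two-level construction described above would correspond to calling such a routine once on the distances between cluster means and once per cluster on the distances between its member solvents, then translating each per-cluster result to its cluster's position on the overall map. The embedding is defined only up to rotation and reflection, which is one reason positions from separate MDS runs cannot be compared directly.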



Figure 3. MDS plot of Strathclyde24 illustrating similarity and dissimilarity within and between clusters of solvents. Clusters containing one solvent are represented by single points and those containing two solvents are represented by connected points. The boundaries of clusters of three or more solvents are illustrated by colored hulls. The central point of the cross in each cluster represents the cluster mean with the lines stretching one standard deviation in each direction.


Clustering Assessment. The probabilities of finding each type of crystalline form of HCT, CT, DHC and BQT, observed from solution crystallization screens as described earlier, were calculated using three approaches: blind sampling, polarity-based clustering and Strathclyde24-guided sampling. The polarity-based clustering scheme was obtained by splitting solvents into 24 bins according to their calculated log P value. This scheme is intended to represent a more realistic standard for comparison than entirely blind solvent sampling, though the latter remained a useful baseline. The clusters obtained by binning according to polarity are shown in Supporting Information, Table S3. The expected chances of finding all observed crystalline forms by selecting 24 solvents via each approach are listed in Table 5. These results are complemented by the qualitative assessment of cluster efficiency based on mapping the experimental solution crystallization outcomes for all four compounds onto the Strathclyde24 MDS plot (Figure 4). These maps show a good overall alignment between the types of outcomes and solvent clusters.

Table 5. Probability of finding all forms observed from HCT, CT, DHC and BQT screens, as outlined in Table 2, using blind sampling of 24 solvents compared to sampling based on polarity and randomly selecting one solvent from each cluster of Strathclyde24.

| Compound | Blind sampling (%) | Polarity-based sampling (%) | Strathclyde24 (%) |
|---|---|---|---|
| HCT | 34 | 25 | 100 |
| CT | 96 | 100 | 95 |
| DHC | 30 | 20 | 100 |
| BQT | 36 | 43 | 100 |
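One way to see why blind sampling performs poorly when a form arises from very few solvents is a simple hypergeometric calculation. The sketch below is an illustrative aside, not the authors' method: it assumes 24 solvents are drawn without replacement, each equally likely, from the 67 screened solvents, and asks how likely the draw is to include at least one of the solvents that yields a given form. The function name is hypothetical.

```python
from math import comb


def prob_at_least_one(n_total, n_productive, n_picked):
    """P(a blind pick of n_picked solvents from n_total includes at least
    one of the n_productive solvents that yield a given form)."""
    missed = comb(n_total - n_productive, n_picked) / comb(n_total, n_picked)
    return 1.0 - missed


# A form obtained from a single solvent out of 67 (cf. DHC form III, Table 2).
p_single = prob_at_least_one(67, 1, 24)
# A form obtained from two solvents out of 67 (cf. HCT form II, Table 2).
p_pair = prob_at_least_one(67, 2, 24)
```

Under these assumptions the single-solvent case reduces to 24/67, roughly 36%, and the two-solvent case to roughly 59%. These figures bound from above the chance of finding every form, since the rarest form must be among those found; they are not intended to reproduce the simulated values in Table 5, which require all form types to be observed together.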



Figure 4. Strathclyde24 MDS plot with solvent positions as for Figure 3, for: (a) HCT, (b) CT, (c) DHC and (d) BQT, colored according to experimental outcomes of solution crystallization screens. Solvents represented by small dots were not included in crystallization screens. Legend: solvated form; form I; form II; form III; no outcome: insufficient sample for analysis; no outcome: solvent not included in crystallization screen.



It is clear that Strathclyde24-guided solvent selection greatly increases the likelihood of observing all types of crystalline form when compared with both random and polarity-based selection. Interestingly, the clustering of solvents by polarity did not show an improvement over random selection for the studied APIs, suggesting that a simple, one-property model for solvent selection was not adequate. In all cases, the chance of observing all forms listed in Table 2 by selecting any one solvent from each of the Strathclyde24 clusters is ninety-five percent or greater. For HCT, DHC and BQT, the expected values of observing all forms using Strathclyde24 increased to one-hundred percent, compared with values ranging between twenty and forty-three percent obtained from blind and polarity-based sampling. On inspection of these values, it appears that by performing 24 solution crystallization experiments for HCT, DHC and BQT, it is possible to observe all forms of each compound, and this would be consistent regardless of which solvent from each cluster had been randomly selected. It is worth noting that solvents closer to the cluster crosshairs are more representative of the cluster as a whole. It is also worth reiterating that the algorithm used does not distinguish between different solvated forms and provides a more generalized quantification of success. However, this approach does identify the clusters of solvents that will form solvates, and if the user is particularly interested in solvate formation then this map will aid in the identification of solvents which should be used for crystallization studies.
In the case of CT, where the likelihood of observing all forms identified in the experimental screen by blind sampling alone is already high (estimated at ninety-six percent) owing to the relatively low number of physical forms identified for CT, Strathclyde24 guidance was not estimated to increase the efficiency with which physical forms are observed, being calculated as ninety-five percent effective. However, CT has relatively low solubility across the full range of solvents included in the library, and what this map identifies is the solvents in which CT is soluble and therefore which solvents could be used for recrystallization or as anti-solvents.
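The comparison between cluster-guided and blind selection can be illustrated with a small simulation. The sketch below is not the algorithm used in this work; it is a minimal Monte Carlo estimate under stated assumptions. The solvent names, cluster assignments and outcomes in `LIBRARY` are invented for illustration, and, as in the text, all solvates are treated as a single outcome category.

```python
import random

# Hypothetical toy library: solvent -> (cluster id, observed outcome).
# These values are invented for illustration; they are NOT screening data.
LIBRARY = {
    "s01": (1, "form I"), "s02": (1, "form I"), "s03": (1, "form I"),
    "s04": (2, "form I"), "s05": (2, "form I"),
    "s06": (3, "solvate"), "s07": (3, "solvate"),
    "s08": (4, "form II"),
    "s09": (5, "form I"), "s10": (5, "form I"),
    "s11": (5, "form I"), "s12": (5, "form I"),
}


def all_forms(library):
    """Set of distinct outcomes present anywhere in the library."""
    return {outcome for _, outcome in library.values()}


def guided_pick(library, rng):
    """Pick one solvent at random from every cluster."""
    clusters = {}
    for solvent, (cluster, _) in library.items():
        clusters.setdefault(cluster, []).append(solvent)
    return [rng.choice(members) for members in clusters.values()]


def blind_pick(library, n, rng):
    """Pick n solvents at random, ignoring cluster membership."""
    return rng.sample(sorted(library), n)


def success_rate(library, picker, trials, rng):
    """Fraction of trials in which every outcome type is observed."""
    target = all_forms(library)
    hits = 0
    for _ in range(trials):
        seen = {library[s][1] for s in picker(library, rng)}
        if seen == target:
            hits += 1
    return hits / trials


rng = random.Random(0)
n_clusters = len({cluster for cluster, _ in LIBRARY.values()})
guided = success_rate(LIBRARY, guided_pick, 10_000, rng)
blind = success_rate(
    LIBRARY, lambda lib, r: blind_pick(lib, n_clusters, r), 10_000, rng
)
print(f"guided: {guided:.2f}, blind: {blind:.2f}")
```

In this toy library every cluster contains a single outcome type, so guided selection always observes all outcome types, whereas blind selection of the same number of solvents frequently misses the rarer ones.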

The maps presented in Figure 4 aid interpretation of the figures quoted in Table 5 and give a clearer view of the effectiveness of clustering the solvent library in terms of reducing the number of experiments required to encounter all form types. These are discussed further for each compound below.

Figure 4(a) represents the results from a solution crystallization screen for HCT. From this screen, two polymorphs and seven solvates were observed, and it is noteworthy that results were obtained from all solvents used in this search. It can be seen that the majority of solvents yielded form I HCT but, importantly, the only solvent which produced form II was nitromethane, which is a single-solvent cluster. Therefore this form would be observed in a reduced search. The solvated forms are grouped in a small region in the lower right of the map, and in fact four of the seven solvates observed experimentally belong to the same cluster. Generally speaking, if compounds form a variety of solvates they tend to do so with similar solvents, or solvents which share particular properties, which reside in close proximity on the map. Thus, using sampling guided by Strathclyde24, it would not be expected to find all solvated forms of HCT with only 24 experiments; however, at least one solvated form of HCT would be observed. In practice, therefore, observation of a solvate would trigger subsequent experiments with similar solvents identified from the cluster to explore whether further solvates are formed.

In contrast to HCT, the map for CT (Figure 4(b)) shows that recrystallization from 30 out of 67 solvents did not yield sufficient sample for analysis. This is principally due to poor solubility of the material. A benefit of using the map to view experimental outcomes, even those considered as ‘no outcome,’ is that it further supports the effectiveness of solvent clustering. Those solvents in which CT exhibited poor solubility are also loosely clustered in the higher region of the map. Additionally, those solvents which produced form I or a solvate are also well grouped on the map.
The fact that the expected value for Strathclyde24-guided sampling was not a significant improvement over random sampling may be attributed to the lack of a cluster that solely forms solvates. Given that there are clusters in the map which only contain form I or ‘no outcome,’ if there had been a cluster containing only solvates, the guided sampling expected value would be one hundred percent. It is worth noting, however, that the solvents in cluster 6 all form solvates with CT with the exception of methyl acetate. This poses the question: under the correct experimental conditions, would CT form a solvate with methyl acetate? Classification of experimental screening results onto the MDS plot in this manner demonstrates the value of Strathclyde24, not only in experimental design but also for retrospective assessment of the screening performance.36

DHC poses the biggest challenge in terms of assessing the effectiveness of clustering and as a measure of how many distinct forms of a compound could realistically be identified from a reduced set of experiments. Aside from ‘no outcome,’ there were four distinct experimental outcomes associated with DHC: forms I, II and III, and solvates (Table 3). The expected value for finding all forms of DHC based on Strathclyde24 is one-hundred percent. This is clearly represented in Figure 4(c), where it can be seen that the majority of clusters have only one type of experimental outcome associated with them. The tight clustering for all outcome categories observed in this figure further supports the reliability of Strathclyde24.

The screening results for BQT are illustrated in Figure 4(d). The majority of crystallizations produced form I, with only two solvents yielding form II (chloroform and carbon tetrachloride) and two solvents forming solvates (1-methylnaphthalene and acetic acid). It was interesting to evaluate how many of the forms other than form I would be identified from a reduced crystallization screen, and the calculated probability was another positive result: one-hundred percent. As before, this has the disadvantage that only one solvate would be observed but, unlike the previous examples, it would not be possible to find the additional solvate by simply exploring regions close to cluster 2 in the map.

Thus, both the qualitative and quantitative methods of assessment demonstrate successful clustering of the solvent library that shows considerable promise in terms of practical exploitation and application.
The calculated probabilities alone indicate that Strathclyde24 is effective but the mapped MDS plots give a more comprehensive account. Using these plots it is also possible to identify solute/solvent relationships such as solubility and solvate formation.
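Two points from the discussion above lend themselves to a short illustration: a form is certain to be observed under one-solvent-per-cluster sampling only if at least one cluster yields that form exclusively, and a hit in one solvent suggests its cluster mates as follow-up experiments. The sketch below is a minimal illustration of that reasoning, not the procedure used in this work. The solvent labels, cluster numbers and outcomes in `OUTCOMES`, and the function names, are hypothetical.

```python
from collections import defaultdict

# Hypothetical solvent -> (cluster id, outcome) assignments, invented for
# illustration only; they do not reproduce any screen reported here.
OUTCOMES = {
    "a": (1, "form I"), "b": (1, "form I"),
    "c": (2, "solvate"), "d": (2, "solvate"), "e": (2, "form I"),
    "f": (3, "form II"),
    "g": (4, "no outcome"), "h": (4, "no outcome"),
}


def outcomes_by_cluster(outcomes):
    """Group solvent outcomes by cluster id."""
    grouped = defaultdict(dict)
    for solvent, (cluster, outcome) in outcomes.items():
        grouped[cluster][solvent] = outcome
    return dict(grouped)


def guaranteed_forms(outcomes):
    """Forms observed whichever solvent is drawn from each cluster.

    A form is guaranteed when it is the only outcome of at least one cluster.
    """
    guaranteed = set()
    for members in outcomes_by_cluster(outcomes).values():
        distinct = set(members.values())
        if len(distinct) == 1:
            guaranteed |= distinct
    return guaranteed - {"no outcome"}


def follow_up(outcomes, tested_solvent):
    """Cluster mates of a solvent, as candidates for follow-up experiments."""
    cluster = outcomes[tested_solvent][0]
    return sorted(
        s for s, (c, _) in outcomes.items()
        if c == cluster and s != tested_solvent
    )


print(sorted(guaranteed_forms(OUTCOMES)))
print(follow_up(OUTCOMES, "c"))
```

Here the solvate is not guaranteed because its only cluster also contains a solvent giving form I, which mirrors the situation described for a cluster in which all but one solvent form solvates.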

CONCLUSIONS


We have focused on generating an efficient means to rationalize solvent selection for effective physical form screening via a novel approach to clustering a chemical solvent library. This approach is equally viable for any application concerning the grouping of molecules. The clustering method presented here is only one of many possible techniques for library design or similarity grouping and, as discussed earlier, the advantages of this particular method lie in the range of clustering methods explored. The graphical representation of solvent library clusters, Strathclyde24, provides an effective gauge of similarity and dissimilarity within the library, is a simple method to visualize multi-dimensional solvent space and provides an intuitive route to solvent subset selection. Furthermore, mapped MDS plots highlight a high correlation between solvent type and experimental outcome and provide confidence both in the calculated molecular descriptors, which have been selected to model the library, and in the final clustering output. Such a high correlation suggests that these descriptors may be useful for subsequent machine-learning and data-mining techniques applied to polymorph screening results, such as those applied to the dibenzazepine drug carbamazepine.13, 36 As a starting point for an experimental approach, it can be concluded that a compound that does not show extensive physical form diversity (polymorphs and solvates), and for which reasonable solubility is achieved, using individual solvents selected from each of the Strathclyde24 clusters is unlikely to show extensive variability in a more extended screen, although further forms cannot of course be ruled out. In crystallizations where solvent identity is not the driving force,37, 38 the clustering may not yield an accurate picture.

This approach only takes solvent properties into account; other factors such as temperature, supersaturation and agitation, which can be varied, also have a significant effect on crystallization. However, it is recommended that the descriptors presented here are also included in any informatics analysis, in combination with other experimental parameters, to investigate crystallization processes and what governs a particular physical form outcome. It can be concluded that this approach to experimental design, where solvent identity plays a key role, can significantly reduce the number of experiments required for solid form screening. It can also provide insights into the areas of chemical space which are being searched and whether the solvents sampled therein are sufficiently diverse, as crystallization from diverse solvents may increase the success rate of obtaining all solid forms. This approach can be used for optimizing the crystallization process by providing a guide to molecular solubility from only 24 experiments.

ASSOCIATED CONTENT

Supporting Information

A list of calculated descriptors and tables of the top 10 clustering approaches to the solvent library are provided in the SI. This material is available free of charge at http://pubs.acs.org.

ACKNOWLEDGMENTS

The authors would like to acknowledge the EPSRC Centre for Innovative Manufacturing in Continuous Manufacturing and Crystallisation (grant ref. EP/I033459/1) and the EPSRC Doctoral Training Centre in Continuous Manufacturing and Crystallisation (grant ref. EP/K503289/1) for funding this work.

REFERENCES

1. Florence, A. J. The solid state. In Basic principles and systems, Florence, A. T.; Siepman, J., Eds.; Informa Healthcare: New York, 2009; Vol. 1, Chapter 8, pp 253-310.
2. Hilfiker, R.; Paul, S. M. D.; Szelagiewicz, M. Polymorphism: In the pharmaceutical industry. In Hilfiker, R., Ed.; Wiley-VCH: Weinheim, 2006; Chapter 287-308.
3. Florence, A. J. Approaches to high-throughput physical form screening and discovery. In Polymorphism in pharmaceutical solids, Brittain, H. G., Ed.; Informa Healthcare: New York, 2009; Vol. 192, pp 139-184.
4. Morissette, S. L.; Almarsson, O.; Peterson, M. L.; Remenar, J. F.; Read, M. J.; Lemmo, A. V.; Ellis, S.; Cima, M. J.; Gardner, C. R., High-throughput crystallization: Polymorphs, salts, co-crystals and solvates of pharmaceutical solids. Advanced Drug Delivery Reviews 2004, 56, 275-300.
5. Parmar, M. M.; Khan, O.; Seton, L.; Ford, J. L., Polymorph selection with morphology control using solvents. Cryst. Growth Des. 2007, 7, 1635-1642.
6. Kitamura, M., Strategy for control of crystallization of polymorphs. CrystEngComm 2009, 11, 949-964.
7. Lang, P.; Kiss, V.; Ambrus, R.; Farkas, G.; Szabo-Revesz, P.; Aigner, Z.; Varkonyi, E., Polymorph screening of an active material. J Pharmaceut Biomed 2013, 84, 177-183.
8. Alleso, M.; Rantanen, J.; Aaltonen, J.; Cornett, C.; van den Berg, F., Solvent subset selection for polymorph screening. J. Chemometr. 2008, 22, 621-631.
9. Gu, C. H.; Li, H.; Gandhi, R. B.; Raghavan, K., Grouping solvents by statistical analysis of solvent property parameters: Implication to polymorph screening. Int. J. Pharm. 2004, 283, 117-125.
10. Hasa, D.; Miniussi, E.; Jones, W., Mechanochemical synthesis of multicomponent crystals: One liquid for one polymorph? A myth to dispel. Cryst. Growth Des. 2016, 16, 4582-4588.
11. Price, S. L., Predicting crystal structures of organic compounds. Chemical Society Reviews 2014, 43, 2098-2111.
12. Neumann, M. A.; de Streek, J. V.; Fabbiani, F. P. A.; Hidber, P.; Grassmann, O., Combined crystal structure prediction and high-pressure crystallization in rational pharmaceutical polymorph screening. Nat Commun 2015, 6, 7793.
13. McCabe, J. F., Application of design of experiment (DOE) to polymorph screening and subsequent data analysis. CrystEngComm 2010, 12, 1110-1119.
14. Xu, D.; Redman-Furey, N., Statistical cluster analysis of pharmaceutical solvents. Int. J. Pharm. 2007, 339, 175-188.
15. Gramatica, P.; Navas, N.; Todeschini, R., Classification of organic solvents and modelling of their physico-chemical properties by chemometric methods using different sets of molecular descriptors. Trac-Trends Anal. Chem. 1999, 18, 461-471.
16. Rannar, S.; Andersson, P. L., A novel approach using hierarchical clustering to select industrial chemicals for environmental impact assessment. J. Chem. Inf. Model. 2010, 50, 30-36.
17. Johnston, A.; Florence, A. J.; Shankland, N.; Kennedy, A. R.; Shankland, K.; Price, S. L., Crystallization and crystal energy landscape of hydrochlorothiazide. Cryst. Growth Des. 2007, 7, 705-712.
18. Hulme, A. T.; Johnston, A.; Florence, A. J.; Fernandes, P.; Shankland, K.; Bedford, C. T.; Welch, G. W. A.; Sadiq, G.; Haynes, D. A.; Motherwell, W. D. S.; Tocher, D. A.; Price, S. L., Search for a predicted hydrogen bonding motif - a multidisciplinary investigation into the polymorphism of 3-azabicyclo[3.3.1]nonane-2,4-dione. J. Am. Chem. Soc. 2007, 129, 3649-3657.
19. Hubert, L., Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures. J. Am. Stat. Assoc. 1974, 69, 698-704.
20. Walesiak, M.; Dudek, A. clusterSim, v0.38-1; 2010.
21. Molecular operating environment, 2009.10; Chemical Computing Group: 2009.
22. Tetko, I. V.; Gasteiger, J.; Todeschini, R.; Mauri, A.; Livingstone, D.; Ertl, P.; Palyulin, V.; Radchenko, E.; Zefirov, N. S.; Makarenko, A. S.; Tanchuk, V. Y.; Prokopenko, V. V., Virtual computational chemistry laboratory - design and description. Journal of Computer-Aided Mol. Des. 2005, 19, 453-463.
23. E-dragon, VCCLAB, Virtual Computational Chemistry Laboratory: 2005.
24. R: A language and environment for statistical computing, R Development Core Team: Vienna, Austria, 2009.
25. Florence, A. J.; Johnston, A.; Fernandes, P.; Shankland, N.; Shankland, K., An automated platform for parallel crystallization of small organic molecules. J. Appl. Crystallogr. 2006, 39, 922-924.
26. Calinski, R. B.; Harabasz, J., A dendrite method for cluster analysis. Communications in Statistics 1974, 3, 1-27.
27. Everitt, B. S.; Landau, E.; Leese, M., Cluster analysis. Arnold: London, 2001; p 103-104.
28. Gatnar, E.; Walesiak, M., Metody statystycznej analizy wielowymiarowej w badaniach marketingowych [multivariate statistical analysis methods in marketing research]. Wydawnictwo AE: Wroclaw, 2004; p 338-339.
29. Gordon, A. D., Classification. Chapman & Hall/CRC: London, 1999; p 62.
30. Milligan, G. W.; Cooper, M. C., An examination of procedures for determining the number of clusters in a data set. Psychometrika 1985, 50, 159-179.
31. Kaufman, L.; Rousseeuw, P. J., Finding groups in data: An introduction to cluster analysis. Wiley: New York, 1990; p 83-88.
32. Rousseeuw, P. J., Silhouettes - a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53-65.
33. Johnston, A.; Bardin, J.; Johnston, B. F.; Fernandes, P.; Kennedy, A. R.; Price, S. L.; Florence, A. J., Experimental and predicted crystal energy landscapes of chlorothiazide. Cryst. Growth Des. 2010.
34. Arlin, J. B.; Johnston, A.; Miller, G. J.; Kennedy, A. R.; Price, S. L.; Florence, A. J., A predicted dimer-based polymorph of 10,11-dihydrocarbamazepine (form IV). CrystEngComm 2010, 12, 64-66.
35. In; EPSRC.
36. Johnston, A.; Johnston, B. F.; Kennedy, A. R.; Florence, A. J., Targeted crystallisation of novel carbamazepine solvates based on a retrospective random forest classification. CrystEngComm 2008, 10, 23-25.
37. Florence, A. J.; Johnston, A.; Price, S. L.; Nowell, H.; Kennedy, A. R.; Shankland, N., An automated parallel crystallisation search for predicted crystal structures and packing motifs of carbamazepine. J. Pharm. Sci. 2006, 95, 1918-1930.
38. Bhardwaj, R. M.; Price, L. S.; Price, S. L.; Reutzel-Edens, S. M.; Miller, G. J.; Oswald, I. D. H.; Johnston, B. F.; Florence, A. J., Exploring the experimental and computed crystal energy landscape of olanzapine. Cryst. Growth Des. 2013, 13, 1602-1617.



MDS plot of Strathclyde24 illustrating similarity and dissimilarity within and between clusters of solvents. Clusters containing one solvent are represented by single points and those containing two solvents are represented by connected points. The boundaries of clusters of three or more solvents are illustrated by colored hulls. The central point of the cross in each cluster represents the cluster mean with the lines stretching one standard deviation in each direction.
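The plotting conventions described in this caption can be sketched in a few lines. The example below is illustrative only: the two-dimensional coordinates in `CLUSTERS` are invented rather than taken from the MDS solution, the use of the sample standard deviation is an assumption (the caption does not state which estimator was used), and the function names are hypothetical. It computes the crosshair for a cluster, chooses how a cluster would be drawn, and identifies the solvent nearest the cluster mean.

```python
import math
from statistics import mean, stdev

# Hypothetical 2D MDS coordinates grouped by cluster (invented values).
CLUSTERS = {
    1: {"s1": (0.0, 0.0), "s2": (2.0, 1.0), "s3": (1.0, 3.0)},
    2: {"s4": (5.0, 5.0), "s5": (6.0, 5.0)},
    3: {"s6": (-3.0, 1.0)},
}


def crosshair(points):
    """Return the cluster mean and the per-axis standard deviation.

    Assumption: sample standard deviation; a single point has zero spread.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    centre = (mean(xs), mean(ys))
    if len(points) > 1:
        spread = (stdev(xs), stdev(ys))
    else:
        spread = (0.0, 0.0)
    return centre, spread


def glyph(n_solvents):
    """Drawing style for a cluster, following the caption's convention."""
    if n_solvents == 1:
        return "point"
    if n_solvents == 2:
        return "connected pair"
    return "hull"


def most_representative(members):
    """Solvent closest to the cluster mean in the 2D map."""
    centre, _ = crosshair(list(members.values()))
    return min(members, key=lambda s: math.dist(members[s], centre))


for cluster_id, members in CLUSTERS.items():
    centre, spread = crosshair(list(members.values()))
    print(cluster_id, glyph(len(members)), centre, spread,
          most_representative(members))
```

Selecting the solvent nearest the crosshair is one simple way to act on the observation that solvents closer to the cluster mean are more representative of the cluster as a whole.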

