Harnessing Cloud Architecture for Crystal Structure Prediction

Oct 3, 2018 - In this work we present a novel cloud-computing CSP platform with the ... and integrating cutting-edge computational chemistry algorithm...
1 downloads 0 Views 1MB Size
Subscriber access provided by Kaohsiung Medical University

Article

Harnessing Cloud Architecture for Crystal Structure Prediction Calculations Peiyu Zhang, Geoffrey P. F. Wood, Jian Ma, Mingjun Yang, Yang Liu, Guangxu Sun, Yide A. Jiang, Bruno C Hancock, and Shuhao Wen Cryst. Growth Des., Just Accepted Manuscript • DOI: 10.1021/acs.cgd.8b01098 • Publication Date (Web): 03 Oct 2018 Downloaded from http://pubs.acs.org on October 3, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

Harnessing Cloud Architecture for Crystal Structure Prediction Calculations Peiyu Zhang †, Geoffrey P. F. Wood*,‡, Jian Ma†, Mingjun Yang†, Yang Liu†, Guangxu Sun†, Yide A. Jiang†, Bruno C. Hancock‡, and Shuhao Wen*,† † Xtalpi Inc, One Broadway, 9th floor, Cambridge, MA 02142, U.S.A ‡Pfizer Worldwide R&D, Eastern Point Road, Groton, CT 06340, U.S.A.

Corresponding Authors: Geoffrey P. F. Wood ([email protected]) and Shuhao Wen ([email protected])

ACS Paragon Plus Environment

1

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 35

ABSTRACT

Accurate and rapid crystal structure predictions have the potential to transform the development of new materials, particularly in fields with highly complex molecular structures (such as in drug development).

In this work we present a novel cloud-computing CSP platform with the

capability of scheduling hundreds of thousands CPU cores and integrating cutting-edge computational chemistry algorithms. This new cloud-computing based CSP platform has been applied to three crystalline drug substances of increasing complexity. The lattice energies of the experimental crystal structures are all within 4.0 kJ/mol of the lowest energy predicted structures. Based on the results of this work, the algorithm improvement and the mass computational power of cloud computing can reduce the whole CSP process to just 1 - 3 weeks for Z' = 1 systems and less than 5 weeks for significantly more complex systems.

Furthermore, it is possible to

simultaneously perform calculations for multiple molecules if desired. As a result of these improvements, CSP calculations can potentially be applied in conjunction with state-of-the-art experimental screening experiments to reduce the risk of finding new solid forms after product launch provided that a sufficient number of stoichiometries and space-groups are explored.

KEYWORDS crystal structure prediction, cloud computing, polymorphism

ACS Paragon Plus Environment

2

Page 3 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

INTRODUCTION Understanding the polymorph landscape is important in the development of new materials, such as active pharmaceutical ingredients (APIs) [1]. It is important to understand the energetic relationship between different crystalline forms of a compound, as well as to identify the thermodynamically stable polymorph of a material under any given set of conditions. The consequences of not finding a polymorph of an API can be very significant in commercial terms. For example, in 1998 the semi-solid formulation of Norvir™ (ritonavir) had to be removed from the market due to unexpected appearance of a more thermodynamically stable form that altered the properties of the product.

A reformulated version of this product was eventually

reintroduced, but not until after many months of lost sales and an immeasurable impact on patients. Traditionally, an experimental approach has been used to map out the polymorph landscape of new APIs, but such trial-and-error screening approaches are quite inefficient.

A typical

experimental protocol can require significant amounts of material (typically, 50-100g) and time (up to 6 months), and can cost upward of US$100K. Even with careful experimental design it is possible to miss solid structures that form under unusual conditions or which are disfavored by lab-scale kinetics [2]. It is currently possible to reliably generate the potential 3D crystal structures of simple materials from their 2D chemical structures and compare these to the known experimental structures [3,4, 5,6,7,8,9]. To do this, established chemical knowledge is used to predict all of the potential 3D molecular conformations and pack those 3D molecules into putative crystal lattices.

After

energy minimization of all the possible structures, the lowest energy structure is expected to be the most stable observable polymorph.

Other low energy structures that are predicted are

ACS Paragon Plus Environment

3

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 35

potential metastable polymorphs of the material. Usually these calculations require a significant amount of time (weeks or months) even when using high-performance computing (HPC) platforms and they can produce many more low energy structures than are observed experimentally. Clearly there are many drivers for being able to identify potential crystal structures both accurately and quickly.

However, there are significant drawbacks with the current crystal

structure prediction (CSP) methods that prevent their widespread adoption [3-8]. These include: •

Very high computational resource requirements (for conformer generation, packing, and energy ranking)



Practical limits to the size and flexibility of molecules that can be considered (that is, too many possible conformers to be considered in a reasonable timeframe)



Limits to the number of molecules in the unit cell (thus Z’ > 1, solvates, hydrates, cocrystals, etc are very challenging to consider)



Large numbers of theoretical structures are predicted, many of which do not ever come to fruition in real life



Predictions are usually made at zero Kelvin, with no attempt to make corrections due to temperature effects and so the results are not correct for enantiotropic polymorphs



Long calculation times (usually several months per molecule for realistic APIs)

These limitations of current CSP approaches have prevented their widespread usage in drug development where normal cycle times require a response within a matter of days. Instead, CSP methods have been used to confirm the results of experimental screening studies and to de-risk new products as they approach commercial launch [10].

ACS Paragon Plus Environment

4

Page 5 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

In this paper we describe the use of cloud computing for CSP calculations (through the use of Amazon Web Services (AWS) [11] and the associated distributed algorithms for CSP) and show that many of the wall time limitations of traditional CSP approaches can be overcome. By using these modern computing approaches it is now possible to return results within a matter of days to weeks.

Using these methods we have successfully applied CSP to systems with similar

complexity to the most complex molecules in the 6th blind crystal structure prediction test[8]. These advancements should significantly enhance the field of CSP and enable an accurate and timely determination of polymorphic relationships in real materials, such as APIs.

MATERIALS AND METHODS Test Systems. The chemical diagrams of the test systems are given in Figure 1. System (a) represents the most basic test system. It has a molecular weight of 294.27, 2 rotatable bonds and an experimentally determined Z of 1. It was deposited in the Cambridge Structural Database (CSD 12 ) in 2007 and has the ref. code VIPRES [13] (corresponding to the 1R,2S isomer). System (b) has a molecular weight of 171.24 and 3 rotatable bonds. It exists as both an anhydrous and monohydrate structure. The test will primarily focus on the monohydrate in order to demonstrate how the algorithms scale with multicomponent systems. Both the anhydrous and monohydrate are found to be polymorphic with a number of structures deposited in the CSD. Representative structures of the anhydrous polymorphs have CSD ref codes: QIMKIG01, QIMKIG02, and QIMKIG03, whereas representative structures of the monohydrate polymorphs have CSD ref codes: QIMKOM, and QIMKOM02 [14,15,16,17]. System (c) is the most complex molecule considered in this work and is representative of the typical size and flexibility of API molecules. It has a molecular weight of 484.91, 8 rotatable bonds, 2 chiral centers

ACS Paragon Plus Environment

5

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 35

(2R,4R) and an experimentally determined Z of 1. This makes it of comparable complexity to the most challenging molecules attempted in the crystal structure prediction “blind tests” [5-8]. This structure has not been uploaded to the CSD, instead the coordinates have been taken from an internal database held at Pfizer Inc.

Figure 1: Test systems used in the present study (a) Anhydrous neutral structure with 2 rotatable bonds (Z = 1; CSD ref code VIPRES) (b) Polymorphic anhydrous (Z = 1; CSD ref code QIMKIG) and polymorphic mono-hydrate (Z = 2; CSD ref code QIMKOM) structures with 3 rotatable bonds (c) Anhydrous structure with 8 rotatable bonds (Z = 1). All rotatable bonds are defined in the figure.

ACS Paragon Plus Environment

6

Page 7 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

Overview of CSP Algorithm and Cloud-Based Implementation. The polymorph landscape for each compound was predicted using XtalPi’s [18] cloud-computing based crystal structure prediction platform which is described in more detail below.

This integrates several

methodological and technological innovations as shown in Figure 2, including: smart conformation analyses, scoring by multi-set force-field parameterization, cross-validation of crystal structure searching, GPU-based clustering, and high-precision ranking algorithms, to create a versatile, scalable, and fully automated tool. The CSP platform starts by selecting the best overall performing force-field from several force-fields that are trained specifically for the system under consideration (e.g., the target molecule together with any co-formers and/or crystallographic solvent molecules). The force-fields are then used for conformer generation, packing, and initial energy ranking. Crystal structure packing is achieved through two global optimization algorithms, i.e., Monte Carlo and particle swarm optimization that supplement one another in terms of accelerating convergence towards the globally enumerated crystal landscape. Once the landscape has been generated the lattice energies are re-evaluated using a higher accuracy method, which in the current study is the optPBE-vdW[ 19 ] exchange-correlation functional corrected for dispersion effects as implemented in the Vienna ab Initio Simulation Package (VASP) [20,21,22]. The pipeline has been tested extensively both internally (at Xtalpi) and externally (at Pfizer Inc. and SSCI [23]) as an enterprise service on dozens of molecular systems that consist of variety of different organic crystalline materials including: anhydrous structures, hemi-hydrates, monohydrates, salts and co-crystals.

ACS Paragon Plus Environment

7

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 35

Figure 2: Schematic Summarizing the Key Components of Xtalpi’s Cloud-Computing Based Crystal Structure Prediction Platform. For details refer to the text.

Compute Architecture. The CSP platform leverages the flexibility and computational power of cloud computing in order to design a fit-for-purpose HPC system specific for each molecular system under analysis and at each phase of the CSP workflow. The coupled technology and algorithmic approach, in which the software and hardware are optimized for each task, has allowed accurate and efficient processing of many different molecular systems. Specifically, the cloud-based nature of the platform allows a job-specific cluster to be built on-the-fly for each compound and task, thereby optimizing the set of compute resources devoted to each job. Clusters consisting of tens to hundreds of thousands of nodes with 16 - 64 CPU cores per node are built within minutes for each new job. The CSP platform is highly portable and can be deployed on local compute clusters as well as on cloud computing platforms, such as AWS [24],

ACS Paragon Plus Environment

8

Page 9 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

Google cloud [25], Tencent cloud [26] and Aliyun [27]. To protect the intellectual property of users (e.g., molecular and crystal structure information), a virtual private cloud (VPC) is created for each crystal structure prediction task, which isolates the computations and data from public networks and therefore guarantees security. The overall pipeline has a processing capacity of 50 thousand tasks per minute and can deploy a CSP environment consisting of one million cores in less than one hour, which can allow accurate CSP results for a very complex molecular system to be obtained within several days to weeks, depending on the amount of financial resources available. However, this relies on good scaling of the calculations across many 100,000’s of cores, approximate scaling of the current method is given in Table 1. As depicted in Figure 2, Cluster management is performed through Mesos [28] and distribution and deployment of software is achieved through Docker [29]. The main controller of the CSP platform, which is called the “HPC Master”, is used for managing the computational operations during the crystal structure prediction calculation, including: task creation, query and termination, and security validation of requests. Various algorithms (including the clustering, different global searching methods for cross-validation and high-precision ranking etc.) and pre-/post-processing units are built into Docker images, which are stored and managed in the Software Warehouse. During the CSP process, the job with a specific Docker image will be invoked according the predefined workflow. This workflow will simultaneously employ different global search methods to get the energy landscapes for cross-validation. After creation, tasks are queued using the PostgreSQL database management system [30]. The scheduler assigns tasks to the message queue according to priority, creation time, and number of CPU cores requested.

When a

computational task has been completed, the “HPC Master” moves the task and associated data to an archival database. A directed acyclic graph describes the dependency relationships of tasks

ACS Paragon Plus Environment

9

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 35

by the execution engine. This engine manages the hundreds of thousands of tasks required for the complete crystal structure prediction process.

Several clusters are created and managed

according to ‘zones’, which can be defined based on their geographic location or cloud provider (for example, AWS, Google and Tencent cloud). The ‘zones’ can rapidly change the size of the virtual computational cluster and allow resources to be applied according to need. Coordinated scheduling of multiple zones is guaranteed by a unified scheduler, which gets the tasks from a unique task query and distributes them to the zones. Both the load balance and the number of CPU cores of zones are considered in scheduling to improve the computational efficiency by increasing utilization of zones. As an example, Figure 3 shows the computational resources used in two zones on a single day (Nov. 22, 2017). During this day, the number of CPU cores used increases from zero to 35 thousand in one zone, and reaches 60 thousand cores with another zone due to changes in the number of tasks running on that day. This illustrates a major strength of this scalable cloud-computing platform, i.e. only the computational resources that are needed for a particular task are used, and the number of resources requested can be varied to obtain the desired balance between speed and financial constraints. In general, requests for more than 100 thousands CPU cores at the same time lead to an increase in the unit price of computational resources.

ACS Paragon Plus Environment

10

Page 11 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

Figure 3. Number of CPU cores used over time on an arbitrary day (Nov. 22, 2017). Colors represent different ‘zones’ created for the CSP analysis. Zones may be physical or virtual i.e. the geographical location or service provider The whole CSP process generates approximately 200 GB of data. The input/output of tasks and log records are saved in Amazon Simple Storage Service (S3) [30] as unstructured data. Figure 4 depicts the four types of data generated during the CSP process, and persistent data structures are used for each data type. In the conformation analysis, many conformations are produced from enumeration of isomers (e.g. ring conformers and nitrogen configurations) and stepwise scanning along flexible torsion angles. The conformations and relationship are stored in the graph database. Squares and associated connections in the figure represent conformations and their relationships, respectively. The metadata of force field parameters is also stored in the graph database. The relationship between force fields parametrized with different training set and protocols is also recorded in the database. The inheritance relationship of different force field versions is shown by the associated connection in the database. In the structure pool, a large number of crystal structures up to billions are stored in a relational database. The large sparse similarity-matrix of

ACS Paragon Plus Environment

11

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 35

the crystal structure is saved in Amazon S3. The dimers are extracted from low-energy crystal structures. Multiple dimers can be extracted from one single crystal structures, which are stored in a subgroup in the graph database for force field training and validation.

Figure 4. Data generated and their relationships from different stages of a crystal structure prediction calculation using the Cloud-based Platform. For details refer to the text.

Algorithmic Details. Scoring by multi-set force-field parameterization. Classical molecular mechanics force-fields are used to evaluate crystal structures for each system during the initial stages of the calculation. In addition to traditional bonded and non-bonded terms as found in many popular force-fields such as, AMBER [31], CHARMM [32] and OPLS [33], XtalPi’s CSP force-field includes additional terms for modeling complicated conformational and crystal structure spaces, such as anharmonic terms, hydrogen bond corrections, lone-pair corrections, selected non-bonded interactions, separated parameters of inter- and intra-molecular interactions, as described in equation (1).

ACS Paragon Plus Environment

12

Page 13 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

𝑖𝑛𝑡𝑟𝑎 𝑖𝑛𝑡𝑟𝑎 𝑖𝑛𝑡𝑟𝑎 𝑖𝑛𝑡𝑒𝑟 𝑈 = 𝑈𝑏𝑜𝑛𝑑 + 𝑈𝑎𝑛𝑔𝑙𝑒 + 𝑈𝑑𝑖ℎ𝑒 + 𝑈𝑖𝑚𝑝𝑟 + 𝑈𝑐𝑜𝑢𝑝𝑙𝑖𝑛𝑔 + 𝑈𝐻𝐵 + 𝑈𝑒𝑙𝑒𝑐 + 𝑈𝑣𝑑𝑤 + 𝑈𝐻𝐵 + 𝑖𝑛𝑡𝑒𝑟 𝑖𝑛𝑡𝑒𝑟 𝑈𝑒𝑙𝑒𝑐 + 𝑈𝑣𝑑𝑤 + 𝑈𝑠𝑝𝑒𝑐𝑖𝑎𝑙

(1)

For the bond (Ubond) and angle (Uangle) terms, an expanded function to 4th order was used to account for the anharmonic effects. A cosine expansion is used for proper torsion terms, which is also employed by the CHARMM, OPLSAA and AMBER force-fields. The out-of-plane terms (Uimpr) are restrained with a harmonic function with the equilibrium geometry at 0 or 180 degrees according to the definition of the improper torsions. The 12-6 Lenard-Jones potential (Uvdw) is used to model van der Waals (vdW) interactions and the 10-6 form is used for hydrogen bonding interactions (UHB). For both vdW and electrostatic terms (Uelec), different parameters are used for intramolecular and intermolecular interaction pairs (Uintra and Uinter).

The special term

(Uspecial) includes all other remaining corrections, e.g. for some selected interaction pairs of vdW terms, of the force field in order to maximize the accuracy of a given system. Ucoupling is used for terms involving bond−bond and bond−angle couplings. This term was employed to more accurately model ring conformers. System-specific parameters are developed at the beginning of the CSP process and re-trained at various stages in order to describe the target system as accurately as possible. Several sets of parameters are fitted iteratively against different target data in order to create various versions of the force-field, which are simultaneously used in prediction and scoring the of the crystal structures. Conformational analysis of the molecular structures using Quantum Mechanical calculations provides the initial data for fitting the force field. A 2D diagram or 3D molecular structure, e.g.

ACS Paragon Plus Environment

13

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 35

a SMILES string or XYZ/PDB/MOL file is the initial input for the CSP pipeline and provides the information necessary for conformational analysis. Specifically, the connectivity information encoded in the structure is used to drive the conformational analysis of that structure, in which all low-energy isomers are sampled. That is, conformational isomers are sampled through incremental rotations around the sigma bonds and stereoisomers are sampled through enumeration of chiral centers and molecular rings. By scanning all the conformations along these degrees of freedom, the conformational analysis procedure generates thousands of conformers. These conformers (initially in the gas-phase) form a first training set on which the custom force-field parameters are optimized, as described by equation (1). In the later stages of the procedure, the intra- and inter-molecular force field terms are re-parameterized based on the conformational energy, molecular pair interactions, and lattice energies of the gas-phase monomer, dimers, and lattices, respectively. Cross validation through simultaneous crystal structure generation and clustering. After initial force-field parameterization, a global search for crystal structures is performed. The objective function for the search is a linear combination of the custom force field describing the molecular degrees of freedom and a set of functions on variables describing the crystal form as given in equation (2).

⃑⃑1 , 𝑋⃑1𝑜𝑡ℎ𝑒𝑟 , ⋯ , 𝑋⃑𝑛 , 𝜃⃑𝑛 , 𝐷 ⃑⃑𝑛 , 𝑋⃑𝑛𝑜𝑡ℎ𝑒𝑟 ) 𝐸 = 𝑓(𝑆𝑝𝑎𝑐𝑒𝐺𝑟𝑜𝑢𝑝, a, b, c, α, β, γ, 𝑋⃑1 , 𝜃⃑1 , 𝐷

(2)

⃑⃑𝑖 are the molecular position, a, b, c, α, β and γ are the lattice parameters. 𝑋⃑𝑖 , 𝜃⃑𝑖 and 𝐷 orientation, and rotatable dihedrals, respectively. 𝑋⃑𝑖𝑜𝑡ℎ𝑒𝑟 represents the rest of variables, such as bonds, angles and non-flexible dihedrals. To ensure both a broad coverage of the

ACS Paragon Plus Environment

14

Page 15 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

search space as well as a thorough enumeration of all the low energy minima, two global search algorithms are carried out in parallel. Specifically, simultaneous global searches are performed with the particle swarm optimization (PSO) method and the Monte Carlo (MC) algorithm combined with a basin or minima hopping scheme [34, 35]. PSO is used for rapid global searches for crystal structures in low energy regions. For this algorithm, a large number of structures (termed “particles”) are initially generated.

These

particles move (“swarm”) in the search space according to local and global best particles. For the MC search, a small number of crystal structures are initially generated followed by random moves in search space as governed by equation (2). At each MC step the acceptance of the proposed move is governed by the Boltzmann distribution using a scaled energy difference derived from the force-field. The PSO algorithm is a heuristic algorithm36 with the advantage of fast convergence but the disadvantage of early maturity. The MC based algorithm 37 has the advantage of strong capability of global search and the disadvantage of slow convergence. Therefore these two algorithms tend to complement one another. We use both algorithms to independently predict crystal structures. Then the landscapes from the two algorithms within a pre-defined low energy range are compared in order to cross validate the convergence. After putative low energy structures have been generated through the two global searches, structures within this combined set are locally minimized using higher precision methods in order to generate the final landscape. While the global search can produce millions or even billions of putative crystal structures, many of them are structurally identical or nearly so thus the CSP pipeline clusters similar structures in order to reduce redundancies. A clustering algorithm has been developed that computes the distances between 20,000 structure pairs in approximately one second of compute time, greatly

ACS Paragon Plus Environment

15

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 35

decreasing the time traditionally needed for clustering. This algorithm takes advantage of a set of descriptors shown to correlate strongly with inter-structure differences. The descriptors are extracted from the generated crystal structures, such as: density, isomerism, RMSD of molecular conformations, values of flexible dihedrals, and inter-molecular interaction networks. The similarity of two given structures are compared using the difference of these descriptors. These descriptors can be computed as the putative crystals are being generated and can be used to measure similarity as any new structure is generated, thus allowing almost simultaneous clustering and sampling —making it feasible to cluster millions of structures ‘on-the-fly’. Some additional computational techniques used in the clustering stage include: (1) storing the crystal structures using a high compression ratio format to promote data transfer efficiency, (2) use of the K-means algorithm for finding representative structures from each group, and only comparing new structures to the representative structure. (3) Since a huge amount of structural comparisons are required GPUs are typically used for clustering. Each GPU thread executes one calculation of structural comparison. Therefore, thousands of individual comparisons can be run simultaneously on a single GPU card, which can achieve two orders of speed up compared to CPU only calculations. The cloud-based CSP platform supports all 230 space groups. However, it is common to select a sub-set of space groups that give a certain known percentage of all crystal structures present in the CSD [38]. Convergence and high-precision ranking. The number of unique structures for which the energy is lower than a default energy threshold is monitored constantly during the global search and clustering phases of the pipeline, and these structures are considered to be within the low energy region defined by that threshold. When the number of structures within this region converges,

ACS Paragon Plus Environment

16

Page 17 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

we assume that the low energy region has been fully clustered and halt the search and clustering phases of the pipeline. Once the search has completed, a DFT computation with optPBE-vdw functional is used to rank the structures within the low energy regions. The ion–electron interaction is treated by the projector-augment-wave (PAW) technique. A kinetic energy cutoff of 600 eV and smallest allowed spacing between k-points of 0.5 Å-1 were selected. For structures with at least one interplanar spacing shorter than 5 Å, a k-point spacing of 0.3 Å-1 is employed to confirm the energy from the previous step. The job scheduler system is designed for DFT calculations on the cloud with as many as 100,000 nodes. The whole procedure includes initializing the virtual machine with a docker image, sending input files to the nodes, DFT computation, monitoring the computation for the error tracking/results collection, and termination of the nodes. In contrast to traditional HPC with the data stored in the central storage, the data in cloud is saved in the database or S3, and the virtual machine will access the data through the API and saved in local storage for further computation.

RESULTS & DISCUSSION Using the methods described in the previous section crystal structure predictions were performed on the three test systems shown in Figure 1. The crystal structure prediction calculations for Systems (a) and (c) were carried out in the twelve most frequently observed Sohncke space groups as found in the CSD [38]. These account for more than 98.9% of all crystal structures that crystallize in one of the Sohncke groups, namely: space groups 1, 4, 19, 5, 18, 76, 92, 96, 78, 20, 144 and 145. For System (b) the 38 most frequent space groups were used, though in this case chirality was ignored.

ACS Paragon Plus Environment

17

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Accuracy of predictions.

Page 18 of 35

Before considering any potential improvements in CSP speed

provided by the use of cloud computing approaches it is important to establish the accuracy of the predictions. In this section the predicted energy-density landscapes are compared to the known crystal structures to achieve this. The predicted energy density landscapes of Systems (a), (b) and (c) calculated using cloud-based CSP approach (with the optPBE-vdw functional as the final ranking) are shown in Figure 5.

ACS Paragon Plus Environment

18

Page 19 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

Figure 5. Energy-density diagrams as calculated using the optPBE-vdw functional for the test systems used in the present study, see Figure 1 for system chemical diagrams and labeling. All points in the diagrams have symbols and coloring according to their crystallographic space group number as shown in the legend. Labeled points correspond to the known experimental forms as defined by CSD reference codes for systems (a) and (b) (anhydrous and monohydrate) or as “Expt-Form” for the known internally solved Pfizer structure. System (a). The energy landscape for Z' = 1 forms of System (a) is presented in Figure 5(a). In the energy landscape, there are a total of 105 crystal structures within a relative lattice energy of

ACS Paragon Plus Environment

19

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 35

10.0 kJ/mol, fifteen of which were predicted in the most stable energy window[39], i.e., with a lattice energy within 5 kJ/mol of the lowest energy structure (inclusive). Of these, eleven are found in space group 19 (P212121), two in space group 4 (P21), one in space group 1 (P1) and one in space group 92 (P41212). The second ranked structure in the landscape corresponds to the known Form 1 lattice (VIPRES) and is approximately 2 kJ/mol higher in energy that the lowest ranked structure. Further refinements to the energy beyond simply taking the electronic energy as determined from a DFT calculation is out of the scope of the current study but understanding the origin of these discrepancies i.e., having many more predicted lattices than those observed experimentally and understanding the effect of temperature of the landscape are the next major challenges of CSP research[40]. From the predicted crystal structures within a relative energy of 10.0 kJ/mol from the global minimum structure, the conformational populations for ϕ1 and ϕ2 can be examined in more detail (definitions of the dihedrals are given in Figure 1). In Figure 6 we plot the energy as a function of the dihedral angle, as found from an energy scan along with a histogram of the values for these dihedrals in the low energy lattices. The conformation energies are calculated using the B3LYP functional in association with the 6-31G* basis set implemented in PSI4[41]. For ϕ1, the scan indicates a low energy region between 70° and 180°, which coincides with the population of this dihedral in the lattices.

This corresponds well with what has been analyzed for

conformational polymorphs in the CSD[42]. For ϕ2, there are two separate local energy minima, one between -30° and 30° and the other between 150° and -150°. The compound samples two states in low-energy crystal forms, one with an angle around 0° and one with an angle 180°. In this case, the dihedral values observed in the low energy crystal structures are equivalent to the dihedral angles that correspond to low conformation energies. Interestingly this indicates that

ACS Paragon Plus Environment

20

Page 21 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

this molecule could potentially form conformational polymorphs even though only one known lattice exists. Future extensions of the search algorithm could include focusing on angles with low conformation energies, although enough special cases exist to make this endeavor challenging (see below for example in Fig. 8 for ϕ5).

Figure 6. Distribution of torsion angles in System (a). The blue histograms show the number of times each flexible torsion angle was observed within low energy crystal structures. The black curve shows the conformational energy of each torsion angle in the isolated molecule.

System (b). The energy landscape of the monohydrate and anhydrous structures of System (b) are shown in Figures 5(b) and 5(d), respectively. For the monohydrate, nine crystal structures are predicted to be within 5 kJ/mol of the global energy minimum (inclusive). The 1st and 2nd most stable predicted crystal structures in the energy landscape correspond to the two known

ACS Paragon Plus Environment

21

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 35

polymorphs of this system, i.e. QIMKOM and QIMKOM02, respectively. The experimental anhydrous forms QIMKIG01, QIMKIG02 and QIMKIG03 are predicted as the 1st, 4th and 2nd most stable predicted crystals in the energy landscape of anhydrous system. The conformation energies and the histograms of the number of predicted monohydrate structures with each flexible torsion angle are shown in Figure 7. The dihedral scans were produced using the B3LYP/6-31G(d) method in association with the IEFPCM polarizable continuum water solvent model in order to stabilize the zwitterionic structures, which are the predominate ones in the crystal landscapes.

Figure 7. Distribution of torsion angles in System (b). The blue histograms show the number of times each flexible torsion angle was observed within low energy crystal structures. The black curve shows the conformational energy of each torsion angle in the isolated molecule.

On comparing the dihedral scans in Figure 7 with the histograms of the monohydrate structures it demonstrates that the lattices are not as clearly distributed in the low energy regions as for

ACS Paragon Plus Environment

22

Page 23 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

System (a). It is worth noting that the distribution of torsional angle ϕ2 is different between monohydrate and anhydrous lattices (not shown).

For example, 8 of 9 predicted crystal

structures in the monohydrate landscape have the ϕ2 dihedral in the basin surrounding 60°. This conformation corresponds to having the atom C6 in an axial position (see Figure 8). Whereas, for the anhydrous lattices most of predicted crystal structures in the most stable energy window have the atom C6 in an equatorial position of cyclohexane ring. This highlights how the water molecule brings new patterns of hydrogen networks in crystals and changes the conformational distributions.

Figure 8: The conformations of System (b) with axial and equatorial positions of cyclohexanyl for C6 in the carboxylate. Axial conformations (left) predominate in the monohydrate lattices whereas equatorial conformations (right) predominate in the anhydrous lattices.

System (c). For System (c) ten crystal structures were found with a relative lattice energy within 5 kJ/mol of the global minimum energy structure (inclusive). These were found in space groups 19, 4 and 96 as shown in Figure 5(c). The only known experimental form of System (c) was predicted as the 2nd rank crystal structure in the energy landscape. The 1st and 2nd rank

ACS Paragon Plus Environment

23

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 35

structures both have an inter-molecular hydrogen bond between the amide N−H group with an oxygen acceptor as shown in Figure 9. For the rank one structure the oxygen acceptor comes from the pyridine-2-one group whereas it is from the other amide group in the rank 2 structure. Figure 9 shows that the rank one structure also has an intra-molecular hydrogen bond not found in rank 2 structure.

Figure 9: Hydrogen bonding patterns found in the rank 1 structure (left) of System (c) and the rank 2 structure (right), which is the experimentally observed form. The conformation energies and the number of observed structures with each flexible torsion angle are shown in Figure 10. The relative conformational energies are larger than 40 kJ/mol within the range of certain angle for torsion angles ϕ2, ϕ3, ϕ4, ϕ5 and ϕ6, which is relatively rigid comparing to the rest of torsion angles. A majority of the molecular conformations present in the low energy crystal structures are also in the low conformation energy regions for all flexible torsional angles with the exception of ϕ5. Even for the most flexible torsion angles ϕ1 and ϕ8, the predicted crystal structures are in the basin of energy curves rather than in all over the angles. The conformation corresponding to the local minima around -100° of ϕ5 has an intra-molecular hydrogen bond N15-H…O9. The inter-molecular hydrogen bond induces the breaking of intra-

ACS Paragon Plus Environment

24

Page 25 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

molecular hydrogen bond in some structures. This competition of inter- and intra-molecular hydrogen bonds is the cause of larger conformational energies of ϕ5 shown in Figure 10.

Figure 10. Distribution of torsion angles in compound (c). The blue histograms show the number of times each flexible torsion angle was observed within low energy crystal

ACS Paragon Plus Environment

25

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 35

structures. The black curve shows the conformational energy of each torsion angle in the isolated molecule.

In summary, for the systems studied in this work the accuracy of the cloud-based CSP approach was similar to the best performing algorithms used in the most recent CSP blind-test[8]. In each case the experimentally reported forms were amongst the top-four low energy predicted structures. In addition to comparable accuracy to current CSP methods the cloud-based CSP approach should be capable of processing significantly more complex structures due to the scaling of resource as described in Table 1 and also allow for larger number of concurrent projects to be processed. However, the cloud-based CSP approach still predicts many theoretical forms that are not realized experimentally and it over-predicts the complexity of the solid form landscape. In addition, it does not account for environmental influences (such as temperature, relative humidity and processing stresses) and a fixed stoichiometry (Z’ value) has to be assumed at the start of the calculations. It should also be noted that the CSP results do not inform us what material if any, will be isolated experimentally, nor do they predict the experimental conditions required to generate a specific polymorph or which form will have the most desirable physical properties (such as dissolution rate or chemical stability).

Computing resources & speed of calculations. In this section the speed of the cloud-based CSP calculations is reported, along with a summary of the computing resources used for each calculation.

ACS Paragon Plus Environment

26

Page 27 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

The approximate wall times in the present study are shown in Table 1 and it is clear that the use of cloud computing architecture has removed many of the through-put constraints imposed by a traditional HPC cluster [8]. This approach can not only decrease the wall time by 2 - 4 times compared to a traditional HPC cluster but it also has the capacity to carry out tens of projects simultaneously.

Table 1: Approximate wall times and scaling (days) for each crystal structure prediction calculation using the cloud-based platform

System (a)

System (b) anhydrous

System (b) monohydrate

System (c)

12

38

38

12

919

1358

3202

902

Search Parameters Number of Space groups Number of DFT optimizations

Approximate Wall time (days) using peak number of cores of 0.15 million/day Round of predictions FF generation per round Crystal structure searching per round Ranking per round Total CSP

1 1

1 1

3 1

2 1

1 4 6

1 4 6

1 4 18

2 4 14

Estimated Wall time (days) using peak number of cores of 0.3 million/day Round of predictions FF generation per round Crystal structure searching per round Ranking per round Total CSP

1 1

1 1

2 1

2 1

1 2 4

1 2 4

1 2 8

1 3 10

The number of CPU cores used in the crystal structure predictions changes with each process. The most time-consuming process of prediction is energy ranking, in which the peak number of cores is 0.15 million per day in this work. The cloud-computing platform provides a convenient

ACS Paragon Plus Environment

27

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 35

way to scale up the number of cores for resource intensive processes such as this. If the peak number of cores used in the prediction were to be doubled the estimated wall times would be reduced to 4-10 days (Table 1). The complexity of prediction depends on the numbers of flexible dihedrals Nflex, ring conformers, Z' and the total number of crystal structures in the ranking. For the case with Nflex < 5 and Z' = 1, such as System (a) and the anhydrous structure of System (b) it typically takes 0.2 million CPU core hours. For the case with Nflex > 5 and Z' = 1, such as System (c), it typically takes 0.5 million CPU core hours. For the prediction Z' = 2 and Nflex < 5 such as the monohydrate of System (b) it typically takes one million CPU core hours.

CONCLUSIONS In this work, a new cloud-computing based CSP platform has been successfully applied to three crystalline drug substances of increasing complexity. The lattice energies of the experimental crystal structures are all within 4.0 kJ/mol of the lowest energy CSP structures. In addition, based on the results of this work, the algorithm improvement and the mass computational power of cloud computing can reduce the whole CSP process to just 1 - 3 weeks for Z' = 1 systems and less than 5 weeks for more complex systems. Furthermore, it is possible to simultaneously perform calculations for multiple molecules if desired. With current methods, due to cost and speed, CSP methods are only applied in late-stage drug development and primarily serve to confirm experimental results. Here we have presented a novel cloud-computing CSP platform with the capability of scheduling of hundreds of thousands CPU cores and integrating cutting-edge computational chemistry algorithms. As a result of these architecture and algorithm improvements CSP calculations can be applied much earlier in the

ACS Paragon Plus Environment

28

Page 29 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

drug development process and, when employed in conjunction with state-of-the-art experimental approaches, the risk of finding new solid forms after product launch can be substantially mitigated provided that a sufficient number of stoichiometries and space-groups are explored. Similar to the other CSP algorithms, the CSP approach described in this work provides a ‘zeroKelvin’ energy-density landscape and it does not account for temperature contributions, humidity variations, or processing stresses. In reality, experimental crystal forms are produced and stored at finite temperatures and this affects the relative stability of crystal forms. Future work will be focused on improvements to the current CSP method to implement a pseudo-supercritical Path (PSCP) scheme with the goal of capturing the contribution of temperature to the relative free energies of polymorphs [43]. The extensions of that scheme, empowered by the cloud-based design, include an improvement to the robustness of PSCP scheme through replica exchange molecular dynamics.

Supporting Information: None.

Author Contributions The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. The authors are employees of XtalPi Inc. and Pfizer Inc. and declare no conflicts of interest.

ACS Paragon Plus Environment

29

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 35

Funding Sources The work described in this manuscript was funded by XtalPi Inc.

ACS Paragon Plus Environment

30

Page 31 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

Acknowledgements The following colleagues are thanked for their support and for stimulating scientific discussions: Sarah Kelly, Charlotte Allerton, Bob Docherty, Frank Pickard, Kevin Back, Paul Meenan, Tony Wood.

ACS Paragon Plus Environment

31

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 35

REFERENCES

(1) Brittain, H. Methods for the characterization of polymorphs and solvates. In H. G. Brittain (ed.) Polymorphism in Pharmaceutical Solids. Marcel Dekker, Inc., New York, 1999, 227. ( 2 ) Lee, A. Y.; Erdemir, D.; Myerson, A. S. Crystal polymorphism in chemical process development. Annu. Rev. Chem. Biomol. Eng. 2011, 2, 259. (3) Woodley, S. M.; Catlow, R. Crystal structure prediction from first principles. Nat. Mater. 2008, 7, 937. (4) Lommerse, J. P. M.; Motherwell, W. D. S.; Ammon, H. L.; Dunitz, J. D.; Gavezzotti, A.; Hofmann, D. W. M.; Leusen, F. J. J.; Mooij, W. T. M.; Price, S. L.; Schweizer, B.; Schmidt, M. U.; van Eijck, B. P.; Verwer, P.; Williams, D. E. A test of crystal structure prediction of small organic molecules. Acta Crystallogr. B, 2000, 56, 697. (5) Motherwell, W. D. S.; Ammon, H. L.; Dunitz, J. D.; Dzyabchenko, A.; Erk, P.; Gavezzotti, A.; Hofmann, D. W. M.; Leusen, F. J. J.; Lommerse, J. P. M.; Mooij, W. T. M.; Price, S. L.; Scheraga, H.; Schweizer, B.; Schmidt, M. U.; van Eijck, B. P.; Verwer, P.; Williams, D. E. Crystal structure prediction of small organic molecules: a second blind test. Acta Crystallogr. B 2002, 58, 647. (6) Day, G. M.; Motherwell, W. D. S.; Ammon, H. L.; Boerrigter, S. X. M.; Della Valle, R. G.; Venuti, E.; Dzyabchenko, A.; Dunitz, J. D.; Schweizer, B.; van Eijck, B. P.; Erk, P.; Facelli, J. C.; Bazterra, V. E.; Ferraro, M. B.; Hofmann, D. W. M.; Leusen, F. J. J.; Liang, C.; Pantelides, C. C.; Karamertzanis, P. G.; Price, S. L.; Lewis, T. C.; Nowell, H.; Torrisi, A.; Scheraga, H. A.; Arnautova, Y. A.; Schmidt, M. U.; Verwer, P. A third blind test of crystal structure prediction. Acta Crystallogr. B 2005, 61, 511. (7) Day, G. M.; Cooper, T. G.; Cruz-Cabeza, A. J.; Hejczyk, K. E.; Ammon, H. L.; Boerrigter, S. X. M.; Tan, J. S.; Della Valle, R. G.; Venuti, E.; Jose, J.; Gadre, S. R.; Desiraju, G. R.; Thakur, T. S.; van Eijck, B. P.; Facelli, J. C.; Bazterra, V. E.; Ferraro, M. B.; Hofmann, D. W. M.; Neumann, M. A.; Leusen, F. J. J.; Kendrick, J.; Price, S. L.; Misquitta, A. J.; Karamertzanis, P. G.; Welch, G. W. A.; Scheraga, H. A.; Arnautova, Y. A.; Schmidt, M. U.; van de Streek, J.; Wolf, A. K.; Schweizer, B. Significant progress in predicting the crystal structures of small organic molecules - a report on the fourth blind test. Acta Crystallogr. B 2009, 65, 107. (8) Reilly A. M. et al. Report on the sixth blind test of organic crystal structure prediction methods. Acta Crystallogr. B 2016, 72, 439. ( 9 ) Price, S. L.; Brandenburg J. G. Molecular crystal structure prediction. Non-Covalent Interactions in Quantum Chemistry and Physics. 2017. 333. (10) Abramov, Y. A. Computational Pharmaceutical Solid State Chemistry, Wiley. 2016. (11) "Amazon Web Services About Us". Amazon.com, retrieved December 19, 2017. 12

CSD is maintained by the Cambridge Crystallographic Data Center (CCDC) www.ccdc.org

(13) Hu, L.-Y.; Du, D.; Hoffman, J.; Smith, Y.; Fedij, V.; Kostlan, C.; Johnson, T. R.; Huang, Y.; Kesten, S.; Harter, W.; Yue, W. S.; Li, J. J.; Barvian, N.; Mitchella, L.; Lei, H. J.; Lefker, B.;

ACS Paragon Plus Environment

32

Page 33 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

Carroll, M.; Dettling, D.; Krieger-Burke, T.; Samas, B.; Yalamanchili, R.; Lapham, K.; Pocalyko, D.; Sliskovic, D.; Ciotti, S.; Stoller, B.; Hena, M. A.; Ding, Q.; Maiti, S. N.; Stier, M.; Welgus, H. (1R,2S)-4-(2-Cyano-cyclohexyl-oxy)-2-trifluoromethyl-benzonitrile, a potent androgen receptor antagonist for stimulating hair growth and reducing sebum production. Bioorg. Med. Chem. Lett. 2007, 17, 5983. (14) Ibers, J. A. Gabapentin and gabapentin monohydrate. Acta Crystallogr., Sect. C: Cryst. Struct. Commun. 2001, 57, 641. (15) Reece, H. A.; Levendis, D. C. Polymorphs of gabapenitin. Acta Crystallogr., Sect. C: Cryst. Struct. Commun. 2008, 64, O105. (16) Braga, D.; Grepioni, F.; Maini, L.; Rubini, K.; Polito, M.; Brescello, R.; Cotarca, L.; Duarte, M. T.; Andre, V.; Piedade, M. F. M. Polymorphic gabapentin: thermal behaviour, reactivity and interconversion of forms in solution and solid-state. New J. Chem. 2008, 32, 1788. (17) Vasudev, P. G.; Aravinda, S.; Ananda, K.; Veena, S. D.; Nagarajan, K.; Shamala, N.; Balaram, P. Crystal structures of a new polymorphic form of gabapentin monohydrate and the E and Z Isomers of 4-Tertiarybutylgabapentin. Chem. Biol. Drug Des. 2009, 73, 83. (18) XtalPi Inc. URL http://www.xtalpi.com/. (19) Klimeš, Jiří, David R. Bowler, and Angelos Michaelides. "Van der Waals density functionals applied to solids." Physical Review B 83.19 (2011): 195131.

(20) Kresse, G.; Hafner, J. Ab-initio molecular-dynamics for liquid-metals. Phys. Rev. B 1993, 47, 558. ( 21 ) Kresse, G.; Hafner, J. Ab-initio molecular-dynamics simulation of the liquid-metal amorphous-semiconductor transition in germanium. Phys. Rev. B 1994, 49, 14251. (22)Kresse G.; Furthmueller J. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Comput. Mat. Sci. 1996, 6, 15. (23) SSCI a division of Albany Molecular Research Inc. https://www.ssci-inc.com/ (24) Amazon Web Services. URL https://aws.amazon.com/. (25) Google Cloud Platform. URL https://cloud.google.com/. (26) Tencent Cloud. https://cloud.tencent.com/. (27) Alibaba Cloud. https://www.alibabacloud.com/. (28) Mesos. URL https://mesos.apache.org/. (29) Docker. URL https://www.docker.com/. (30) PostgreSQL. https://www.postgresql.org/. (31) Wang, J.; Wang, W.; Kollman, P. A.; Case, D. A. Automatic atom type and bond type perception in molecular mechanical calculations. J. Mol. Graph. Model. 2006, 25, 247. (32) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T.

ACS Paragon Plus Environment

33

Crystal Growth & Design 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 34 of 35

K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 1998, 102, 3586. (33) Jorgensen, W. L.; Maxwell, D. S.; TiradoRives, J. Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 1996, 118, 11225. (34) Call, S. T.; Zubarev, D. Y.; Boldyrev, A. I. Global minimum structure searches via particle swarm optimization. J. Comput. Chem. 2007, 28, 1177. ( 35 ) Hastings, W. K. Monte-Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57, 97. ( 36 ) Trelea, Ioan Cristian. "The particle swarm optimization algorithm: convergence analysis and parameter selection." Information processing letters 85.6 (2003): 317-325.

( 37 ) Frenkel, Daan, and Berend Smit. Understanding molecular simulation: from algorithms to applications. Vol. 1. Elsevier, 2001.

( 38 ) See for example the Space Group Frequency table hosted by the CCDC at https://www.ccdc.cam.ac.uk/support-and-resources/ccdcresources/, first accessed on March 7th 2018. (39) Nyman, J., & Day, G. M. (2015). Static and lattice vibrational energy differences between polymorphs. CrystEngComm, 17(28), 5154-5165. (40) Hoja, J.; Ko, H.-Y.; Neumann, M. A.; Car, R.; DiStasio Jr., R. A.; Tkatchenko A. Reliable and practical computational prediction of molecular crystal polymorphs. arXiv:1803.07503, 2018. (41) Parrish, R. M.; Burns, L. A.; Smith, D. G. A.; Simmonett, A. C.; DePrince, A. E., III; Hohenstein, E. G.; Bozkaya, U.; Sokolov, A. Y.; Di Remigio, R.; Richard, R. M.; Gonthier, J. F.; James, A. M.; McAlexander, H. R.; Kumar, A.; Saitow, M.; Wang, X.; Pritchard, B. P.; Prakash, V.; Schaefer, H. F., III; Patkowski, K.; King, R. A.; Valeev, E. F.; Evangelista, F. A.; Turney, J. M.; Crawford, T. D.; Sherrill, C. D. PSI4 1.1: An open-source electronic structure program emphasizing automation, advanced libraries, and interoperability. J. Chem. Theo. Comput. 2017, 13, 3185. (42) Cruz-Cabeza, A. J.; Bernstein, J. Conformational polymorphism. Chem. Rev. 2014, 114, 2170–2191 ( 43 ) Dybeck, E. C.; Abraham, N. S.; Schieber, N. P.; Shirts, M. R. Capturing entropic contributions to temperature-mediated polymorphic transformations through molecular modeling. Cryst. Growth Des. 2017, 17, 1775.

ACS Paragon Plus Environment

34

Page 35 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Crystal Growth & Design

For Table of Contents Use Only Harnessing Cloud Architecture for Crystal Structure Prediction Calculations Peiyu Zhang †, Geoffrey P. F. Wood*,‡, Jian Ma†, Mingjun Yang†, Yang Liu†, Guangxu Sun†, Yide A. Jiang†, Bruno C. Hancock‡, and Shuhao Wen*,†

Crystal Structure Prediction algorithms have been implemented on a cloud-based platform significantly increasing throughput over tradition HPC clusters.

35 ACS Paragon Plus Environment