J. Chem. Theory Comput., Just Accepted Manuscript. DOI: 10.1021/acs.jctc.6b01119. Publication Date (Web): February 10, 2017.


Combining evolutionary algorithms with clustering toward rational global structure optimization at the atomic scale

Mathias S. Jørgensen, Michael N. Groves, and Bjørk Hammer*

Interdisciplinary Nanoscience Center (iNANO) and Department of Physics and Astronomy, Aarhus University, Denmark. E-mail: [email protected]

Abstract

Predicting structures at the atomic scale is of great importance for understanding the properties of materials. Such predictions are infeasible without efficient global optimization techniques. Many current techniques produce a large amount of idle intermediate data before converging to the global minimum. If this information could be analyzed during optimization, many new possibilities emerge for more rational search algorithms. We combine an evolutionary algorithm (EA) and clustering, a machine-learning technique, to produce a rational algorithm for global structure optimization. Clustering the configuration space of intermediate structures into regions of geometrically similar structures enables the EA to suppress certain regions and favor others. For two test systems, an organic molecule and an oxide surface, the global minimum search proves significantly faster when favoring stable structures in unexplored regions. This clustering-enhanced EA is a step toward adaptive global optimization techniques that can act upon information in accumulated data.

1 Introduction

Technological advancements are causing an increasingly large flow and accumulation of data in all fields of science. With the aid of machine-learning techniques, attention has now turned toward making predictions and decisions based on patterns in the data. Machine learning is a fast-developing discipline that sets up the computer to automatically improve itself through the continuous acquisition of data. 1 In the fields of physics, chemistry, and molecular biology, machine learning has also found its way into otherwise computationally expensive simulations at the atomic scale. 2

In particular, global optimization tasks can be exceedingly difficult when lacking a priori knowledge about the global minimum (GM). Until recently, researchers have relied heavily on manually guessing atomic structures on the basis of chemical and physical intuition. This quickly becomes infeasible due to the high-dimensional energy landscape with a vast number of local minima that could easily be mistaken for the GM. Hence, a move toward automated methods has set in. 3,4 Such methods include variants of molecular dynamics 5 and Monte Carlo, 6,7 particle swarm optimization, 8 random search, 9 evolutionary or genetic algorithms 10–17 (including various implementations hereof 18,19), and neural-network-based methods. 20,21 For problems in need of computationally expensive first-principles methods such as density functional theory (DFT), an evolutionary algorithm (EA) approach is particularly suitable. It is a completely unbiased global optimization technique starting from only the chemical composition of


the system, it can efficiently cover large regions of the energy landscape, and it is readily parallelized. Many different systems have been studied with EAs, including crystals, 22 nanoparticles, 23 surfaces, 24 and polymers. 25

The main components of the EA are the "pairing" and "mutation" operators that are applied to the parent structures, which produce new offspring structures that are then evaluated by local optimization (Figure 1). The local optimization step is by far the most time-consuming step of the EA, and in the present work we address the question of which structures to optimize rather than how to pair or mutate them – a large variety of such operators exist already. In most EAs, the choice of which structure to optimize is a stochastic one based on an evaluated fitness of the parent structures. We will show how clustering, a machine-learning technique, can enable the EA to rationally draw parent structures by analyzing the large number of intermediate, local-minima structures produced by the EA as the search progresses (green path in Figure 1). The analysis will allow suppression of structures from certain regions in configuration space when choosing which parents to use for producing offspring. This turns the intermediate structures into valuable clues instead of unused structures resulting from a structured random walk through the potential energy surface.

Throughout the paper, we focus our study on a system with a chemical composition of C9H7N, for which the GM is the compound quinoline (Figure 1). This is a good test system because the GM is apparent and because the problem is sufficiently challenging that the EA needs many attempts (i.e., many locally optimized offspring structures) to find the GM. Section 2 will present computational details and the clustering methodology, while Section 3 will illustrate how clustering can be used to analyze the EA search space. Section 4 will combine clustering with an EA into a self-consistent algorithm. In the end we will have demonstrated that clustering methods can be used to improve the performance of EAs without any additional a priori information.

[Figure 1 flow chart; box labels: "Generate starting structures", "Construct population", "Draw parents (enhanced by clustering)", "Pair/mutate parents", "Evaluate offspring", "Cluster all structures", "Converged?", and "Terminate". Starting structures and quinoline are shown alongside.]

Figure 1: Flow chart depicting the EA from start to finish, including the cyclic process of evolving a population. This process terminates when some user defined convergence criterion is met. The green paths are for the clustering-enhanced EA. Starting structures for the C9 H7 N system and quinoline (the GM) are shown.
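The loop of Figure 1 can be summarized in a few lines of Python. This is a minimal sketch in which every component is injected as a placeholder callable; none of the names below are taken from the ASE implementation.

```python
def run_ea(generate_random, relax, cut_and_splice, cluster_all,
           draw_parents, is_converged, n_pop=20, max_attempts=10000):
    """Minimal sketch of the (clustering-enhanced) EA loop in Figure 1."""
    # Generate and locally optimize the starting structures.
    all_structures = [relax(generate_random()) for _ in range(n_pop)]
    for attempt in range(max_attempts):
        # Population: the n_pop most stable structures found so far.
        population = sorted(all_structures, key=lambda s: s.energy)[:n_pop]
        if is_converged(population):
            break
        # Green path in Figure 1: cluster everything found so far.
        clusters = cluster_all(all_structures)
        # Draw parents, optionally biased by the clustering.
        mother, father = draw_parents(population, clusters)
        # Pair the parents and evaluate the offspring by local optimization.
        offspring = relax(cut_and_splice(mother, father))
        all_structures.append(offspring)
    return min(all_structures, key=lambda s: s.energy)
```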

2 Methods

Computational details

We employ an EA 26,27 that is currently implemented in the Atomic Simulation Environment (ASE). 28 A number Npop of structures is generated randomly for the starting population by placing atoms at random coordinates (obeying a minimum and a maximum distance to previously placed atoms based on their covalent radii) within the unit cell, followed by local optimization. We employ the cut-and-splice operator 11 for pairing of parent structures and omit the use of mutations. The population always consists of the Npop (we use Npop = 20) most stable structures among those found by the EA and the starting structures. The starting structures and all offspring candidates are locally optimized with the computationally inexpensive density functional tight binding (DFTB) method using the DFTB+ program package 29 to allow for large tests of the EA performance. The loss of accuracy in choosing DFTB over DFT is not a key issue here because our focus is EA performance, not finding true structures. We use the C9H7N bond parameters by Gaus et al. 30 and the TiO2 bond parameters by Dolgonos et al. 31 All C9H7N structures are optimized in


a 25×25×10 Å unit cell, and only forces in the (x,y)-plane are included (resulting in flat, two-dimensional structures for easier visualization). Our clustering-enhanced EA has also been tested on a periodic system: an anatase TiO2(001)-(1×4) surface reconstruction (the "added molecule model" 32). Lattice parameters for anatase TiO2 were optimized with DFTB to a = 3.94 Å and c = 9.47 Å. For the TiO2(001) reconstructed surface structures with two atomic layers, this results in a unit cell of 7.87×15.74×21.45 Å (with periodic boundary conditions in the x- and y-directions), which leaves 8 Å of vacuum below (above) the bottom (top) atomic layer. The uppermost eight oxygen atoms in the 2×4 unit cell of TiO2(001) plus a Ti2O4 unit are included in the EA search to obtain a stoichiometric TiO2 structure, while all other atoms are in their bulk positions in the starting structures. All structures are relaxed until the forces on all atoms are smaller than 0.05 eV/Å.
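For illustration, one reasonable reading of the random starting-structure generation (a new atom must not come too close to any previously placed atom, yet must lie within bonding distance of at least one of them) can be sketched with ASE as follows. The scaling factors and the helper itself are assumptions for the sketch, not the code used in the paper.

```python
import numpy as np
from ase import Atoms
from ase.data import atomic_numbers, covalent_radii

def random_structure(symbols, box, dmin_scale=0.7, dmax_scale=1.3,
                     max_tries=10000, rng=None):
    """Place atoms at random positions inside an orthorhombic box, keeping each
    new atom no closer than dmin_scale * (sum of covalent radii) to every
    previously placed atom and within dmax_scale * (sum of covalent radii)
    of at least one of them."""
    rng = rng or np.random.default_rng()
    box = np.asarray(box, dtype=float)
    radii = np.array([covalent_radii[atomic_numbers[s]] for s in symbols])
    positions = []
    for i, ri in enumerate(radii):
        for _ in range(max_tries):
            trial = rng.random(3) * box
            if not positions:
                positions.append(trial)
                break
            dists = np.linalg.norm(np.array(positions) - trial, axis=1)
            bond_lengths = ri + radii[:len(positions)]
            if np.all(dists > dmin_scale * bond_lengths) and \
               np.any(dists < dmax_scale * bond_lengths):
                positions.append(trial)
                break
        else:
            raise RuntimeError(f"could not place atom {i}")
    return Atoms(symbols, positions=positions, cell=box, pbc=False)

# Example: a random C9H7N starting candidate in a 25 x 25 x 10 Å cell.
candidate = random_structure(["C"] * 9 + ["H"] * 7 + ["N"], box=(25.0, 25.0, 10.0))
```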

Clustering

Several clustering techniques have been used elsewhere to cluster and analyze chemical compounds, 33,34 biological structures, 35,36 and molecular dynamics trajectories. 37,38 Density-based clustering is also used in the USPEX code 39 for EA post-run analysis of crystal structures 40 and for maintaining population diversity. 41 In a self-consistent EA implementation, it is important that the clustering method has only a few parameters and none that restricts the number of clusters, since these are to evolve naturally from the EA progress. We use agglomerative hierarchical clustering (AHC), which, unlike partitional clustering methods such as K-means, does not require a pre-specified number of clusters (for a review of clustering methods, see A. K. Jain 42). AHC starts with one cluster per structure and proceeds by successively merging the most similar pair of clusters to form a cluster hierarchy. This requires a structure similarity metric that quantifies the similarity of features that represent the structures.

The choice of atomic structure features and similarity metric is a difficult one that has been studied on several occasions. 43–47 A structure must be uniquely represented by its features, and the features must be invariant with respect to rotation and translation in the coordinate system. As proposed by Vilhelmsen et al., 27 we achieve this with feature lists f_{A-B} of interatomic distances of atom pairs of type A−B (e.g., carbon-carbon, C−C, and carbon-nitrogen, C−N). Each list is sorted by size to also make the features invariant to permutation of atoms of the same type. This type of feature has later been coined the Bag of Bonds model. 48 The similarity s of two structures is then calculated by

s = \sum_{A-B} \frac{N_A + N_B}{2N} \, \frac{\sum_n \left| f_{A-B}(n) - f'_{A-B}(n) \right|}{\frac{1}{2} \sum_n \left| f_{A-B}(n) + f'_{A-B}(n) \right|} ,   (1)

where n runs over the interatomic distances, and the factor in front of the second sum is just a normalization constant involving the total number N of atoms, the numbers N_A and N_B of atoms of each type, and the average size of the feature lists. Two structures are considered equal if s < 0.02 and

\max_n \left( \left| f(n) - f'(n) \right| \right) < 0.7 \text{ Å}, 27   (2)

where f refers to the interatomic distances of all atom pair types. The EA population can only consist of unique structures in order to maintain diversity. The similarity metric s resembles the simple 1-norm distance between two vectors in Euclidean space.

When comparing clusters consisting of more than one structure, one must decide on a linkage metric to calculate the cluster similarity. There are three main measures: single-linkage, complete-linkage, and average-linkage. These linkages define the similarity of two clusters as the most similar pair of structures between the clusters, the least similar pair of structures, and the average similarity of all structures, respectively. The latter is a tradeoff between the large (small) dispersion of cluster size produced by single-linkage (complete-linkage), 49 and hence we choose average-linkage as the more robust metric (see the Appendix for details).
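A plain NumPy sketch of the feature lists and of eqs 1 and 2 is given below. It assumes ase.Atoms-like inputs, a fixed composition (so that the feature lists of two structures have equal lengths), and element counts passed explicitly by the caller; it is an illustration, not the ASE implementation.

```python
import itertools
import numpy as np

def feature_lists(atoms, pair_types=(("C", "C"), ("C", "N"))):
    """Sorted interatomic-distance lists f_{A-B} ('Bag of Bonds'-type features)."""
    symbols = atoms.get_chemical_symbols()
    pos = atoms.get_positions()
    features = {}
    for a, b in pair_types:
        wanted = {a, b}
        dists = [np.linalg.norm(pos[i] - pos[j])
                 for i, j in itertools.combinations(range(len(symbols)), 2)
                 if {symbols[i], symbols[j]} == wanted]
        features[(a, b)] = np.sort(dists)   # sorting gives permutation invariance
    return features

def similarity(f1, f2, counts, n_atoms):
    """Eq 1: normalized 1-norm-like difference between two feature dictionaries.
    `counts` maps element symbols to the number of atoms of that element."""
    s = 0.0
    for (a, b), fa in f1.items():
        fb = f2[(a, b)]
        weight = (counts[a] + counts[b]) / (2.0 * n_atoms)
        s += weight * np.sum(np.abs(fa - fb)) / (0.5 * np.sum(np.abs(fa + fb)))
    return s

def structures_equal(f1, f2, counts, n_atoms, s_tol=0.02, d_tol=0.7):
    """Eq 2: equal if s < 0.02 and no single distance deviates by more than 0.7 Å."""
    max_dev = max(np.max(np.abs(f1[pair] - f2[pair])) for pair in f1)
    return similarity(f1, f2, counts, n_atoms) < s_tol and max_dev < d_tol
```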


After iteratively merging the most similar pairs of clusters until all clusters have been merged, the level of the hierarchy at which to define the final clusters must be determined. We use the inconsistency coefficient for this purpose (see the Appendix for details). The extent of a cluster in configuration space is measured by its width, which is defined as the similarity of the last two clusters that merge into the respective cluster.
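The AHC step can be reproduced with standard routines. The sketch below uses SciPy's average (UPGMA) linkage and its built-in inconsistency criterion, which is evaluated over a fixed depth of the hierarchy and is therefore close to, but not identical to, the inconsistency coefficient defined in the Appendix. The threshold 4.8 and Nout = 30 are the values quoted for C9H7N; everything else is an illustrative assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_structures(similarity_matrix, threshold=4.8, n_out=30):
    """Agglomerative hierarchical clustering of structures from their pairwise
    similarity matrix (eq 1). Average linkage corresponds to UPGMA; clusters
    with fewer than n_out members are relabeled as outliers (label 0)."""
    condensed = squareform(similarity_matrix, checks=False)
    hierarchy = linkage(condensed, method="average")        # builds the dendrogram
    labels = fcluster(hierarchy, t=threshold, criterion="inconsistent")
    for cluster_id in np.unique(labels):
        if np.sum(labels == cluster_id) < n_out:
            labels[labels == cluster_id] = 0                 # small clusters -> outliers
    return hierarchy, labels
```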

3 EA search space analysis

We now have the tools required to analyze the EA behavior in search of the GM, quinoline. A remark on the dimensionality of the problem is in order here. Including all interatomic distances gives a 136-dimensional feature space. Besides increasing computational costs, it is known that high dimensionality can have a negative impact on the performance of clustering methods such as AHC, 49,50 and hence one should always consider reducing the number of dimensions to a minimum without losing relevant information. This can be done either by dimensionality-reduction techniques or by removing weakly relevant features. We find that, for the case of C9H7N, the interatomic distances involving hydrogen do not improve the performance of the AHC. This is not surprising, since hydrogen mostly serves to satisfy the valency of carbon, and hence the hydrogen positions depend strongly on the carbon positions, which makes interatomic distances involving hydrogen redundant. Thus, the 136-dimensional feature space is reduced to 45 dimensions, and eq 1 only runs over the atom pairs C−C and C−N.

For the analysis we need a mapping of the complete C9H7N configuration space. 5000 representative, unique structures are clustered to map this space. These have been extracted randomly from the most stable half of the structures from each of 800 EA runs. This is to avoid significant noise in the clustering from very unstable, disparate structures, as we are mostly interested in mapping the populated regions of configuration space. The result is shown in Figure 2. Reading the dendrogram (Figure 2a) from the top, there is a distinct division of structures into two major groups of clusters, (i) and (ii) (note that these two groups will indeed be the resulting clusters if a higher inconsistency threshold is used). This is consistent with the multidimensional scaling (MDS) of the features to two dimensions, where the two groups are segregated in differentiated regions of the map (Figure 2b). Structures in subclusters that contain fewer than Nout structures are designated as outliers and colored grey (we have used Nout = 30 to avoid very small clusters that represent insignificant regions of configuration space relative to the 5000 structures used in total).

A visual inspection of the structures in the left group of clusters (i) (colored red) reveals that the majority of these structures are fragmented (Figure 2c). A structure is defined as fragmented if the distance between two structural moieties is greater than that of a chemical bond plus 0.2 Å, where the chemical bond length is defined as the sum of the atomic covalent radii. If the fragments of two distinct structures are of identical composition, the similarity of the structures is calculated fragment-wise. This is to avoid having multiple structures of identical fragment compositions at different relative orientations and relative distances. 68% of the structures in this group (including outliers) are fragmented, as opposed to only 14% in the right group of clusters (ii) (colored blue). Fragmentation of a structure obviously results in very long interatomic distances in the complete C9H7N structure, as is evident in Figure 2d, where the interatomic distances d have been plotted versus their feature index n. Even non-fragmented structures (isomers of C9H7N) with long interatomic distances due to elongated geometric shapes are present in group (i). Thus, the two groups represent very general structure types: (i) elongated, linear structures and (ii) compact, circular structures. With the chosen inconsistency threshold, a total of 11 clusters (colored in tones of red and blue) emerge from different combinations of long and short C−C and C−N distances. Reading the dendrogram left to right, interatomic distances tend to decrease (Figure 2d), with the middle clusters (light tones of red and blue) bridging the far ends of configuration space (Figure 2b). The GM is contained in the darkest blue cluster.
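The fragmentation criterion above translates directly into a connected-components test on the bond graph. The sketch below assumes an ase.Atoms-like input and is illustrative rather than the authors' implementation.

```python
import numpy as np
from ase.data import atomic_numbers, covalent_radii
from scipy.sparse.csgraph import connected_components

def is_fragmented(atoms, tolerance=0.2):
    """True if the structure splits into more than one bonded fragment. Two atoms
    are bonded if their distance is below the sum of their covalent radii plus
    `tolerance` (in Å), matching the fragmentation criterion in the text."""
    pos = atoms.get_positions()
    radii = np.array([covalent_radii[atomic_numbers[s]]
                      for s in atoms.get_chemical_symbols()])
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    cutoffs = radii[:, None] + radii[None, :] + tolerance
    bonded = (dists < cutoffs) & ~np.eye(len(radii), dtype=bool)
    n_fragments, _ = connected_components(bonded)
    return n_fragments > 1
```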



Figure 2: Visualization of 5000 clustered C9H7N structures (a) in a dendrogram, (b) in a multidimensional scaling to two dimensions with cluster centroids marked as x, (c) as a cluster distribution of fragmented structures, and (d) as centroid feature plots. The structure closest to each cluster centroid is shown below the dendrogram. Two main groups of clusters, (i) and (ii), contain five and six clusters, respectively. The density of states of each cluster is shown in (e), with the most stable structure from each cluster below and two representative outlier structures. The yellow horizontal lines mark the average cluster energies. The horizontal axes are scaled according to the scale factor.

Clusters from this complete mapping will be referred to as true clusters, while structures from red and blue clusters will be referred to as red and blue structures, respectively. The average width of the true clusters is 0.11 (in the dendrogram, the width of a cluster is equal to the highest branch point of the cluster). On average, the red structures are higher in energy than the blue structures, but the densities of states overlap significantly across all clusters (Figure 2e). This is evidence of a highly complex energy landscape with many local energy minima, or "funnels", 43 that overlap in energy despite having dissimilar structural characteristics. Using an unsupervised method to identify

such local minima is the first step of enabling the EA to learn about and adapt to its own behavior.

Supervised suppression tests

To investigate the effect of suppressing certain regions of configuration space in the EA, we first make two supervised tests. In the first test, fragmented structures are penalized and will only enter the population if too few non-fragmented structures have been found to make up the population. In the second test, red structures and outliers are penalized in favor of blue structures. Hence, the first test involves a very simple classification of fragmented versus non-fragmented structures,


whereas the second test involves a more careful classification resulting from clustering. The two tests include 800 EA runs each, where each run consists of 10k EA attempts. The tests are compared to 800 benchmark runs, in which no regions are suppressed. Figure 3a shows the cumulative success of the benchmark runs, the fragmented-structure suppression test, and the red-cluster suppression test. Both tests outperform the benchmark runs. The fragment suppression test is effective only in the first few thousand EA attempts, while the more general red-cluster suppression test greatly accelerates the EA search.

To explain these effects, we use the true cluster centroids (averaged feature lists) to track how the EA benchmark runs explore different regions as the search progresses. A run is tracked by assigning parent structures (which in this case represent the population, since they are drawn randomly) to whichever true cluster has the closest centroid as calculated by eq 1. If a structure is farther away from every centroid than the average width of all clusters, it is designated as an outlier. All 800 benchmark runs are analyzed to get an average picture of the EA search. Figure 3b presents this analysis as a stacked histogram of the first 1000 EA attempts, where parent structures from unfinished runs have been assigned to the true clusters in bins with a width of 20 attempts. A run is unfinished if it has not yet found the GM.

In the beginning of an average EA run, 60–70% of all parent structures belong to red clusters, 15–20% to light blue clusters, and 10% are outliers. Very few structures belong to dark blue clusters. After 1000 attempts of the benchmark runs, the dark red structures have diminished completely, 20% of the structures belong to light red clusters, and two blue clusters have become dominant. One of these, the darkest blue cluster, contains structures similar to the GM, and the other, a lighter blue cluster, contains structures with very long C−N distances (see Figure 2), i.e., structures that have nitrogen bound at the edge of the structure. We thus conclude that structures similar to the GM are not present in the beginning of the EA search and only slowly emerge as the EA search progresses. Suppressing either fragmented structures or, in particular, the red clusters will accelerate the EA toward finding structures similar to the GM.

Figure 3: (a) Performance of the supervised suppression tests and the benchmark runs, measured as cumulative success in finding the GM. (b) Cluster analysis of the first 1000 EA attempts of the benchmark runs, shown as stacked histograms with assignment of parent structures to true clusters. Runs that have found the GM (i.e., finished runs) are removed from the cluster analysis, meaning that initially 800 runs are analyzed while after 1000 EA attempts only the 663 unfinished runs are analyzed.
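The centroid assignment used for this run tracking can be sketched as follows. This is a simplified, single-feature-vector version of eq 1 in which the per-pair-type weights are omitted; the function name and interface are illustrative.

```python
import numpy as np

def assign_to_true_cluster(features, centroids, average_width):
    """Assign a structure (flattened feature vector) to the true cluster with the
    closest centroid, using a simplified, unweighted version of eq 1; if the
    smallest distance exceeds the average cluster width, the structure is an
    outlier (returns None)."""
    features = np.asarray(features, dtype=float)
    distances = [np.sum(np.abs(features - c)) / (0.5 * np.sum(np.abs(features + c)))
                 for c in centroids]
    best = int(np.argmin(distances))
    return None if distances[best] > average_width else best
```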

4 Clustering-enhanced evolutionary algorithms

Previously we used a priori knowledge about the configuration space and the GM to show that the performance of the EA can be improved if certain regions of configuration space are suppressed in the population. In an applied situation, however, one knows neither the configuration space nor of course the GM prior to the search. The EA needs to search the configuration space and use the obtained knowledge as it progresses. We present here two simple algorithms that do exactly that by implementing


the clustering method outlined above. The EA initializes with a population of Npop random structures (we use Npop = 20). These starting structures are clustered with AHC to map the first small piece of configuration space. The next offspring structure, produced by applying the cut-and-splice pairing operator 11 to two randomly chosen parents from the population, is then assigned to an existing cluster. If, however, all distances to the centroids exceed the average width of the existing clusters, it is designated as an outlier. In this first clustering-enhanced algorithm, new offspring structures are now produced under the restriction that if an outlier is present in the population, it will always be chosen as the first parent. If more than one outlier is present in the population, one of them is chosen randomly to be the first parent. The second parent is chosen randomly from all other structures in the population. Hence, the algorithm is facilitated only by labeling the structures in the population according to the cluster they belong to.

The cluster centroids are updated every Nc attempts by clustering all unique structures currently found by the EA, including the starting structures. In this way the map of configuration space is continuously expanded as more structures become available. For the large tests that have been carried out here with DFTB, Nc = 20. In an applied situation where DFT or other computationally expensive methods are used to optimize structures (and hence the relative time spent on clustering will be much lower), one should use Nc = 1 for the best possible time resolution of the clustering.

By always choosing outliers in the population (and hence suppressing the structures already belonging to a cluster), the EA search develops faster. Whenever a relatively stable structure is found that is unlike any of the currently found structures, it marks an unexplored region of configuration space that has potential for low-energy structures. The EA then immediately shifts its attention to this region until a cluster is formed or other structures outmatch the outlier in the population. A cluster is formed when a minimum of Nout = 2 structures form AHC branches that are below the inconsistency threshold. Hence, a cluster will quickly form to avoid spending an excessive time using the same parents for pairing.

The performance of this EA test versus the benchmark runs is presented in Figure 4 together with some data from the clustering. C9H7N has been tested for 10k EA attempts (Figure 4a).

Figure 4: Performance of (a) the C9H7N system and (e) the anatase TiO2(001) added molecule model for the benchmark runs and the clustering outlier algorithm. Clustering data include (b, f) the number of clusters, (c, g) the average cluster width, and (d, h) the fraction of paired parent structures that are outliers as a function of the EA search progress (attempts).
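The outlier-first rule described above can be written compactly. The sketch below assumes the population is available as a list of (structure, cluster_label) pairs in which a label of None marks an outlier; all names are illustrative and not taken from the ASE implementation.

```python
import random

def draw_parents(population, rng=random):
    """Outlier-first parent selection: if the population contains structures that
    do not yet belong to any cluster (label None), one of them is always used as
    the first parent; the second parent is drawn from the remaining members."""
    outliers = [s for s, label in population if label is None]
    structures = [s for s, _ in population]
    mother = rng.choice(outliers) if outliers else rng.choice(structures)
    father = rng.choice([s for s in structures if s is not mother])
    return mother, father
```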


It is evident that the outlier algorithm greatly outperforms the benchmark in the first few thousand attempts. 50% of the clustering-enhanced EA runs have found the GM after 1457 attempts, whereas the benchmark runs need 2040 attempts to reach the same success rate. Averaged over all 800 runs, the random starting structures form a single cluster (Figure 4b) with a width of approximately 0.30 (Figure 4c) (compared to the average true cluster width of 0.11). Stable outliers are more likely to be found and used as parents in the beginning of a search (Figure 4d) because most of configuration space is still unexplored. About 30% of all parents are outliers after the first 100 attempts. However, after 1000 attempts, the fraction of outlier parents has dropped to below 5%. Typically, 6 groups have formed by then, with an average width of 0.16.

A different system, the anatase TiO2(001) added molecule model 32 (Figure 4e), has been tested to demonstrate how the algorithm can also be applied to periodic problems involving crystal surfaces. A (2×1) supercell of the reconstruction has been chosen to obtain a sufficiently challenging problem, where the uppermost eight oxygen atoms plus a Ti2O4 unit are included in the EA search to obtain a stoichiometric TiO2 structure. The performance results and clustering data are essentially the same as for C9H7N. One noticeable difference is that the number of stable outliers is significantly lower (Figure 4h), resulting in a smaller increase in performance. Nevertheless, the success rate after 1000 attempts has still increased from 16% to 23%.

For both systems, the EA search progressively changes into the traditional one, where the most stable, explored regions are searched thoroughly. Adequate pairing and mutation operators are still required in this final stage to find the GM. The outlier algorithm is about rationally choosing parents in selected regions of configuration space, not about how to pair or mutate them. Therefore, in situations where structures similar to the GM are easily found but finding the actual GM structure is difficult, the outlier algorithm above will have little effect. In the opposite situation, however, the algorithm will greatly accelerate the EA search.


Figure 5: (a) Performance of the anatase TiO2(001) added molecule model for the benchmark runs, the penalizing algorithm, the outlier algorithm, and the combined penalizing + outlier algorithm. Clustering data include (b) the average number of clusters in the population and (c) the fraction of paired parent structures that are outliers as a function of the EA search progress (attempts).

We propose a second clustering-enhanced algorithm, which has been tested on the TiO2(001) system. This algorithm operates with a fitness of each structure based not only on its potential energy but also on the size of the cluster that it belongs to. The population consists of the Npop = 20 structures of highest fitness. Combining the potential energy Ei and the size xl of the cluster l that the structure belongs to, the total fitness Fi of a candidate structure i is calculated as

F_i = -\left( E_i + \frac{L}{1 + e^{-k(x_l - x_0)}} \right).   (3)

The second term is a logistic function with the asymptotic value L (the maximum penalty). x0 is the cluster size at which half of the maximum penalty is reached, and k controls the steepness with which the penalty increases. All fitness scores are updated after every EA attempt, so that the fitness of each structure is a function of the current size of the cluster it belongs to. For the performance test in Figure 5a on the TiO2(001) system, we have chosen the maximum penalty L to be the fitness variation of the population, L = Fpop,max − Fpop,min, k = 0.1, and x0 = 75.
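A direct transcription of eq 3 with the parameter choices quoted above might look as follows. In this sketch the maximum penalty L is approximated by the spread of the unpenalized fitness values of the population; the function name and array-based interface are illustrative.

```python
import numpy as np

def penalized_fitness(energies, cluster_sizes, k=0.1, x0=75.0):
    """Eq 3: F_i = -(E_i + L / (1 + exp(-k (x_l - x0)))), with the maximum
    penalty L taken as the spread of the (unpenalized) population fitness."""
    energies = np.asarray(energies, dtype=float)
    cluster_sizes = np.asarray(cluster_sizes, dtype=float)
    raw_fitness = -energies
    L = raw_fitness.max() - raw_fitness.min()      # L = F_pop,max - F_pop,min
    penalty = L / (1.0 + np.exp(-k * (cluster_sizes - x0)))
    return -(energies + penalty)
```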


The test compares this penalizing algorithm (yellow curve) with the 800 benchmark runs for 2000 EA attempts. After ∼650 attempts, the effect of the penalty begins to show, and after 2000 attempts the success rate has increased from 28% to 35%. The effect is also evident from the number of clusters in the population (Figure 5b), where the penalizing algorithm prohibits early dominance of a cluster in the population by increasing the number of competitive clusters. The second term of eq 3 enables the EA to penalize structures in large clusters, since their fitness will decrease as the cluster size increases. Eventually, structures in smaller clusters will have a higher fitness and displace them in the population. Again, this means that the EA will focus less on already explored regions of configuration space. In addition, the EA can to some extent counteract stagnation, a frequent EA problem, 51 by "climbing" out of a local minimum in a manner reminiscent of metadynamics 52 when the number of structures in the corresponding cluster increases. We hypothesize that this effect is responsible for the sudden increase in success rate after ∼650 attempts, where structures in the population are being penalized enough that stagnated EA runs can escape local minima. A more advanced version of the algorithm should be able to introduce this effect again later in the search by adjusting the parameters of the penalty function during the search.

Finally, using the penalizing algorithm together with the outlier algorithm (red curve in Figure 5) further increases the success rate to 41%. The outlier algorithm is active in the beginning of the search by favoring outliers, while the penalizing algorithm is active later in the search by penalizing structures in large clusters. The two methods complement each other and together make an effective algorithm, as is evident in Figure 5a. As existing structures incur higher penalties, outliers (which have no penalty) are introduced into the population and are immediately chosen as parent structures. This increases both the number of clusters in the population (Figure 5b) and the fraction of outlier parents (Figure 5c) compared to using only one of the algorithms.

5 Conclusions and outlook

In the present work, an EA for global structure optimization has been combined with a clustering method to improve upon the traditional EA. In traditional EAs, progression is blindly based on a few of the most stable structures (the population), whereas with a clustering method the progression can be based on the entire configuration space of currently found structures. We have developed a combined clustering-EA approach that enables the EA to store information about the explored regions of the search space and to act upon this information in the continued search. In two simple algorithms based on this approach, stable structures in unexplored regions of configuration space are favored to increase the EA performance on two test systems. The first algorithm directly identifies and favors structures in unexplored regions, while the other algorithm penalizes structures in dense, explored regions. These algorithms take advantage of the clustered configuration space at the current state of the search. We envision that more sophisticated versions of the clustering-enhanced EA will be able to analyze the evolution of clusters prior to the current state in order to predictively adjust parameters and alter the strategy.

Furthermore, we have illustrated how clustering can be used for exploration and analysis of configuration space and for efficient post-search analysis of global optimization algorithms. This is an important step not only in analyzing metastable structures for application purposes, but also in the further development of global optimization methods toward more efficient search algorithms.

Appendix

Clustering

We use the unweighted pair group method with arithmetic mean (UPGMA) version of average-linkage. The similarity S of two clusters at a given level of the hierarchy is calculated as the average of all similarities s (eq 1) between pairs of elements of clusters u and v:


S = \sum_{i \in u} \sum_{j \in v} \frac{s(i, j)}{|u| \cdot |v|},   (4)

where |u| and |v| are the cardinalities of clusters u and v. After the complete hierarchy of clusters is found with UPGMA, an independent cluster is defined in the dendrogram when a branch and all its sub-branches have an inconsistency coefficient below 4.8 (3.0 was used for the anatase TiO2 surface test). The inconsistency coefficient IC is given as

IC = \frac{S_0 - \frac{1}{N}\sum_{n=1}^{N} S_n}{\sigma},   (5)

where S_0 is the height of the branch in the dendrogram, S_n is the height of the n'th sub-branch (with N sub-branches in total), and σ is the standard deviation of all sub-branch heights. Note that for only one structure in each cluster, the inconsistency coefficient reduces to the similarity s (eq 1).

In Figure 6, AHC has been applied to seven C9H7N structures to illustrate the practical approach. The dendrogram (Figure 6a) shows the similarities of all clusters in the hierarchy. Two consistent clusters, red and blue, with three structures each, emerge from this analysis. The width of the red cluster is 0.13, and the width of the blue cluster is 0.16 (where the cluster width is defined as the similarity of the last two merging groups of the cluster, corresponding to the highest branch point of the cluster in the dendrogram). The last structure, grey, which in principle is a cluster on its own, is instead labeled as an outlier. The clustering is also visualized in two dimensions using multidimensional scaling (MDS) (Figure 6b), where the centroid (the averaged feature lists) of each cluster is marked by a cross. Some insight into the characteristics of these clusters can be extracted from the features, which are plotted in Figure 6c as the interatomic distance d versus the feature index n. It is clear that structures in the red cluster have both longer C−C and C−N distances than structures in the blue cluster. Furthermore, within the clusters, structure 2 stands out from 3 and 4 because of shorter C−N distances, while structure 5 stands out from 6 and 7 because of longer C−N distances. The outlier, which is split into three fragments, has by far the longest C−C and C−N distances.

Figure 6: Visualization of seven clustered C9H7N structures (a) in a dendrogram, (b) in a multidimensional scaling to two dimensions with cluster centroids marked as x, and (c) as feature plots. The individual structures are shown below the dendrogram.
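Eqs 4 and 5 translate directly into short NumPy routines; the sketch below is illustrative (in practice SciPy's linkage and inconsistency utilities provide closely related quantities, and conventions for degenerate branches may differ).

```python
import numpy as np

def upgma_similarity(cluster_u, cluster_v, pairwise_s):
    """Eq 4: average-linkage (UPGMA) similarity of two clusters, given index
    lists into a full matrix of pairwise structure similarities s(i, j)."""
    return pairwise_s[np.ix_(cluster_u, cluster_v)].mean()

def inconsistency_coefficient(branch_height, subbranch_heights):
    """Eq 5: IC = (S_0 - mean(S_n)) / sigma over the N sub-branch heights."""
    heights = np.asarray(subbranch_heights, dtype=float)
    sigma = heights.std()
    if sigma == 0.0:
        return 0.0   # degenerate branch (all sub-branches at the same height)
    return (branch_height - heights.mean()) / sigma
```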

Acknowledgement

This research has been supported by a grant from the Danish Research Council (grant no. 0602-02566B).

References

(1) Jordan, M. I.; Mitchell, T. M. Science 2015, 349, 255–260.


(2) Behler, J. J. Chem. Phys. 2016, 145, 170901.

(3) Wales, D. J.; Scheraga, H. A. Science 1999, 285, 1368–1372.

(4) Heiles, S.; Johnston, R. L. Int. J. Quantum Chem. 2013, 113, 2091–2109.

(5) Goedecker, S. J. Chem. Phys. 2004, 120, 9911.

(6) Wales, D. J. J. Phys. Chem. A 1997, 101, 5111–5116.

(7) Ozolins, V.; Majzoub, E. H.; Wolverton, C. Phys. Rev. Lett. 2008, 100, 135501.

(8) Call, S. T.; Zubarev, D. Y.; Boldyrev, A. I. J. Comput. Chem. 2007, 1177–1186.

(9) Pickard, C. J.; Needs, R. J. J. Phys. Condens. Matter 2011, 23, 053201.

(10) Hartke, B. J. Phys. Chem. 1993, 97, 9973–9976.

(11) Deaven, D. M.; Ho, K. M. Phys. Rev. Lett. 1995, 75, 288–291.

(12) Johnston, R. L. Dalton Trans. 2003, 4193–4207.

(13) Alexandrova, A. N.; Boldyrev, A. I. J. Chem. Theory Comput. 2005, 1, 566–580.

(14) Oganov, A. R.; Glass, C. W. J. Chem. Phys. 2006, 124, 244704.

(15) Marques, J. M. C.; Pereira, F. B. Chem. Phys. Lett. 2010, 485, 211–216.

(16) Chu, Y.; Heyndrickx, W.; Occhipinti, G.; Jensen, V. R.; Alsberg, B. K. J. Am. Chem. Soc. 2012, 134, 8885–8895.

(17) Bhattacharya, S.; Levchenko, S. V.; Ghiringhelli, L. M.; Scheffler, M. Phys. Rev. Lett. 2013, 111, 135501.

(18) Trimarchi, G.; Zunger, A. Phys. Rev. B 2007, 75, 104113.

(19) Lonie, D. C.; Zurek, E. Comput. Phys. Commun. 2011, 182, 372–387.

(20) Ouyang, R.; Xie, Y.; Jiang, D. Nanoscale 2015, 7, 14817.

(21) Zhai, H.; Alexandrova, A. J. Chem. Theory Comput. 2016, 12, 6213–6226.

(22) Li, Q.; Ma, Y.; Oganov, A. R.; Wang, H.; Wang, H.; Xu, Y.; Cui, T.; Mao, H.-K.; Zou, G. Phys. Rev. Lett. 2009, 102, 175506.

(23) Lysgaard, S.; Mýrdal, J. S. G.; Hansen, H. A.; Vegge, T. Phys. Chem. Chem. Phys. 2015, 17, 28270–28276.

(24) Martinez, U.; Vilhelmsen, L. B.; Kristoffersen, H. K.; Stausholm-Møller, J.; Hammer, B. Phys. Rev. B 2011, 84, 205434.

(25) Sharma, V.; Wang, C.; Lorenzini, R. G.; Ma, R.; Zhu, Q.; Sinkovits, D. W.; Pilania, G.; Oganov, A. R.; Kumar, S.; Sotzing, G. A.; Boggs, S. A.; Ramprasad, R. Nat. Commun. 2014, 5, 4845.

(26) Vilhelmsen, L. B.; Hammer, B. Phys. Rev. Lett. 2012, 108, 126101.

(27) Vilhelmsen, L. B.; Hammer, B. J. Chem. Phys. 2014, 141, 044711.

(28) Bahn, S. R.; Jacobsen, K. W. Comput. Sci. Eng. 2002, 4, 56–66.

(29) Aradi, B.; Hourahine, B.; Frauenheim, T. J. Phys. Chem. 2007, 111, 5678–5784.

(30) Gaus, M.; Goez, A.; Elstner, M. J. Chem. Theory Comput. 2013, 9, 338–354.

(31) Dolgonos, G.; Aradi, B.; Moreira, N. H.; Frauenheim, T. J. Chem. Theory Comput. 2010, 6, 266–278.

(32) Lazzeri, M.; Selloni, A. Phys. Rev. Lett. 2001, 87, 266105.

(33) Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 1996, 36, 572–584.


(34) Li, W. J. Chem. Inf. Model. 2006, 46, 1919–1923.

(35) Eisen, M. B.; Spellman, P. T.; Brown, P. O.; Botstein, D. Proc. Natl. Acad. Sci. USA 1998, 95, 14863–14868.

(36) Enright, A. J.; Dongen, S. V.; Ouzounis, C. A. Nucleic Acids Res. 2002, 30, 1575–1584.

(37) Torda, A. E.; van Gunsteren, W. F. J. Comput. Chem. 1994, 15, 1331–1340.

(38) Shao, J.; Tanner, S. W.; Thomson, N.; Cheatham, T. E. J. Chem. Theory Comput. 2007, 3, 2312–2334.

(39) Glass, C. W.; Oganov, A. R.; Hansen, N. Comput. Phys. Commun. 2006, 175, 713–720.

(40) Valle, M.; Oganov, A. R. Crystal structure classifier for an evolutionary algorithm structure predictor. VAST '08 IEEE Symposium. Columbus, OH, 2008; pp 11–18.

(41) Lyakhov, A. O.; Oganov, A. R.; Stokes, H. T.; Zhu, Q. Comput. Phys. Commun. 2013, 184, 1172–1182.

(42) Jain, A. K. Pattern Recogn. Lett. 2010, 31, 651–666.

(43) Oganov, A. R.; Valle, M. J. Chem. Phys. 2009, 130, 104504.

(44) Montavon, G.; Hansen, K.; Fazli, S.; Rupp, M.; Biegler, F.; Ziehe, A.; Tkatchenko, A.; von Lilienfeld, O. A.; Müller, K. NIPS 2012, 25, 449–457.

(45) Sadeghi, A.; Ghasemi, A.; Schaefer, B.; Mohr, S.; Lill, M. A.; Goedecker, S. J. Chem. Phys. 2013, 139, 184118.

(46) Schütt, K. T.; Glawe, H.; Brockherde, F.; Sanna, A.; Müller, K. R.; Gross, E. K. U. Phys. Rev. B 2014, 89, 205118.

(47) Zhu, L.; Amsler, M.; Fuhrer, T.; Schaefer, B.; Faraji, S.; Rostami, S.; Ghasemi, S. A.; Sadeghi, A.; Grauzinyte, M.; Wolverton, C.; Goedecker, S. J. Chem. Phys. 2016, 144, 034203.

(48) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. J. Phys. Chem. Lett. 2015, 6, 2326–2331.

(49) Wu, J.; Xiong, H.; Chen, J. Neurocomputing 2009, 72, 2319–2330.

(50) Berkhin, P. Grouping Multimedia Data; Springer Berlin Heidelberg, 2002; pp 25–71.

(51) Zhou, Z.; Harris, K. D. M. Phys. Chem. Chem. Phys. 2008, 10, 7262–7269.

(52) Laio, A.; Parrinello, M. Proc. Natl. Acad. Sci. USA 2002, 99, 12562–12566.
