J. Phys. Chem. Lett., Just Accepted Manuscript. DOI: 10.1021/acs.jpclett.8b03527. Publication Date (Web): January 4, 2019.


Creating Machine Learning-Driven Material Recipes Based on Crystal Structure

Keisuke Takahashi* and Lauren Takahashi

Center for Materials research by Information Integration (CMI2), National Institute for Materials Science (NIMS), 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan

E-mail: [email protected]

Abstract

Determining how crystal structures form remains a great mystery within materials science. Potential solutions may be uncovered by revealing hidden patterns within materials data. Data science is therefore implemented in order to link materials data to crystal structure. In particular, unsupervised and supervised machine learning techniques are used: the Gaussian mixture model is employed to understand the data structure of the materials database, while random forest classification is used to predict the crystal structure. As a result, the unsupervised and supervised machine learning techniques reveal descriptors for determining the crystal structure from a materials database. In addition, predicting atomic combinations from a given crystal structure is also achieved using the trained machine, and first-principles calculations confirm the stability of the predicted materials. Thus, the estimation of crystal structure can, in principle, be achieved through the combination of materials data and machine learning, thereby advancing crystal structure prediction.


Identifying the origin of crystal structures has been a mystery within the field of materials science. 1–7 The challenge in determining crystal structures lies not only in the chemical composition but also in numerous uncertainties. In general, characterization of crystal structures is carried out using experimental techniques such as diffraction, spectroscopy, and chemical composition analysis. Although such methods allow the crystal structures of synthesized materials to be identified, predicting a crystal structure before synthesis is a difficult task. To enable such prediction, first-principles calculations are used and have proven to be an effective approach. 1 Within computational materials science, stable material structures are explored by visiting all potential crystal structures within a fixed composition, and numerous computational algorithms have been developed for this purpose. 8–12 However, such computational approaches become difficult as the composition and number of atoms grow more complex, because the number of potential structure candidates increases exponentially. Thus, a more rapid and direct prediction of crystal structures is essential for material design. In particular, with the rapid advancement of high-throughput first-principles calculations, acquisition of materials data has become possible. 13–17 By applying machine learning to a materials database, acceleration of materials discovery has been proposed and demonstrated. 18–22 Thus, the properties of a material can be predicted if the materials descriptors responsible for determining those properties are revealed. 23–26 Similarly, one can consider that the particular descriptors for determining the crystal structure can be identified with the aid of machine learning and materials databases.


Here, prediction of crystal structure is performed through the application of machine learning to a materials database. A materials database containing single and binary components, constructed using first-principles calculations, is investigated. 13,27 Two types of machine learning are implemented to understand and predict the crystal structure from the materials database. Unsupervised machine learning, in particular the Gaussian mixture model in scikit-learn, is implemented in order to reveal the hidden data structure. 28 Classification by supervised machine learning, in particular random forest classification in scikit-learn, is implemented to classify the crystal structure. 28 Within the random forest classification, the number of trees is set to 100. The accuracy of the trained random forest is evaluated by cross validation, where the materials data are randomly split into 20% test data and 80% training data; the average score over 10 random test/training splits is then evaluated. Preprocessing of the materials data is carried out so that the machine learning process can proceed. 4130 single and binary compounds (AmBn) are present within the database. 13,27 Note that elements belonging to the lanthanides (atomic numbers 57 through 71) are not present within the 4130 compound dataset. 493 prototype structures are defined and assigned to their appropriate materials, and a numerical value is assigned to each of the 493 prototype structures. Note that 1 of the 493 possible structures is classified as "none", representing cases where the structure of a material does not fall within the remaining 492 prototype structures. The numerical values with the corresponding prototype structures are collected in the Supporting Information. In addition, the electronegativity and atomic radius corresponding to each atomic number are taken from the periodic table and merged into the database. Note that the electronegativity and atomic radius are multiplied by the number of atoms of their respective element.
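As a rough illustration of this preprocessing step, the sketch below builds such a feature table with pandas. The input file name, the lookup values, and the column names are illustrative assumptions, not the authors' actual data layout.

```python
import pandas as pd

# Hypothetical periodic-table lookup: atomic number -> electronegativity and atomic radius
periodic = pd.DataFrame(
    {"Z": [1, 8, 11, 17], "ele": [2.20, 3.44, 0.93, 3.16], "radius": [0.25, 0.60, 1.80, 1.00]}
)

# Hypothetical raw database: one row per AmBn compound with an integer prototype-structure label
compounds = pd.read_csv("compounds.csv")  # columns: A, Am, B, Bn, prototype

# Merge periodic-table information for elements A and B
df = (
    compounds.merge(periodic.add_prefix("A_"), left_on="A", right_on="A_Z")
    .merge(periodic.add_prefix("B_"), left_on="B", right_on="B_Z")
)

# As described in the text, electronegativity and atomic radius are multiplied by the
# number of atoms of the respective element
df["Aele"] = df["A_ele"] * df["Am"]
df["Bele"] = df["B_ele"] * df["Bn"]
df["Ar"] = df["A_radius"] * df["Am"]
df["Br"] = df["B_radius"] * df["Bn"]

features = df[["A", "Am", "B", "Bn", "Aele", "Bele", "Ar", "Br"]]
labels = df["prototype"]
```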


Figure 1: Radviz visualization of the predicted Gaussian mixture model. Blue and red indicate the Gaussian mixture model prediction, which classifies the 4130 compound dataset into two groups. The circle and cross symbols represent "none"-structure materials and materials that have a prototype structure in the 4130 compound dataset, respectively.

Unsupervised machine learning is first applied in order to reveal the data structure of the materials data. The Gaussian mixture model is applied, and the following variables are used to train it: atomic number of A, the number of A atoms, atomic number of B, the number of B atoms, electronegativity of A, electronegativity of B, atomic radius of A, atomic radius of B, and prototype structure. Here, the number of mixture components and the covariance type are set to 2 and spherical, respectively. Radviz visualization is implemented to display the predicted Gaussian mixture model and the original data, as shown in Figure 1, in order to visualize the high-dimensional data. 29 For reference, the results of the Gaussian mixture prediction are classified into two groups, colored blue and red. Additionally, the original data are marked by circle and cross symbols, where a circle represents an unknown, or "none", structure while a cross represents a structure within the remaining 492 prototype structures. In particular, the number of A and B atoms and the atomic radii of A and B appear to affect whether or not a compound has a structure that belongs to one of the 492 prototype structures.


Thus, the two groups classified by the Gaussian mixture model match the unknown and known structure groups within the original 4130 compound dataset. Hence, one can consider that the following eight physical quantities could play an important role in understanding the material structure: atomic number of A, the number of A atoms, atomic number of B, the number of B atoms, electronegativity of A, electronegativity of B, atomic radius of A, and atomic radius of B.
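A minimal sketch of this unsupervised step is given below, assuming a hypothetical preprocessed table with the nine trained variables listed above; the two-component spherical Gaussian mixture and the Radviz plot mirror the settings described in the text, while file and variable names are illustrative.

```python
import pandas as pd
from pandas.plotting import radviz
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Hypothetical preprocessed table with the nine trained variables:
# A, Am, B, Bn, Aele, Bele, Ar, Br, prototype
data = pd.read_csv("preprocessed_compounds.csv")

# Two mixture components with spherical covariance, as stated in the text
gmm = GaussianMixture(n_components=2, covariance_type="spherical", random_state=0)
cluster = gmm.fit_predict(data)

# Radviz projection of the high-dimensional data, colored by the predicted cluster
radviz(data.assign(cluster=cluster), class_column="cluster", colormap="coolwarm")
plt.show()
```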

Figure 2: Visualization of (a) atomic numbers A and B and (b) the numbers of atoms m and n for the 2952 AmBn compounds in the database, where the corresponding data frequency is also shown as a histogram. (c) Data frequency of the structural data in the database.

The 1178 compounds out of the 4130 compounds in the database having the structure "none" are then removed from the database for further supervised machine learning. The remaining 2952 single and binary compounds, which fall into one of the 492 prototype structures, are then investigated. The data distribution of the 2952 single and binary compounds is visualized in Figure 2. Figure 2 (a) shows that the atomic elements A and B are dispersed, while the numbers of A and B atoms are concentrated at small values, as shown in Figure 2 (b). In addition, a frequency analysis of the 492 prototype structures in the database is shown in Figure 2 (c), where each of the 492 prototype structures is relatively dispersed within the 2952 compounds.

Figure 3: Importance of descriptors in random forest classification. Purple and green represent information from the materials database and the corresponding information from the periodic table, respectively. A: atomic number of A, Am: the number of A atoms, B: atomic number of B, Bm: the number of B atoms, Aele: electronegativity of A, Bele: electronegativity of B, Ar: atomic radius of A, Br: atomic radius of B.

Random forest classification is implemented in order to predict the crystal structures, where the number of trees is set to 100. The following eight variables serve as descriptors for determining crystal structures: atomic number of A, the number of A atoms, atomic number of B, the number of B atoms, electronegativity of A, electronegativity of B, atomic radius of A, and atomic radius of B. The objective variable is set to the 492 prototype structures, as shown in the Supporting Information. With the eight descriptors and the objective variable, the average cross validation score of the trained random forest classification is 79%, with a highest score of 84% and a standard deviation of 1.8%. Thus, the 492 prototype structures can be accurately predicted from the eight descriptors with random forest classification. Here, the importances of the eight descriptors are evaluated via random forest classification, as shown in Figure 3. Figure 3 shows that the corresponding information from the periodic table, electronegativity and atomic radius, has a large impact when determining crystal structure. This demonstrates that information from the periodic table provides important physical quantities beyond the atomic numbers.
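A minimal sketch of this supervised step, using the same hypothetical preprocessed table as above; the 100-tree forest, the 80/20 splits repeated 10 times, and the descriptor importances follow the settings quoted in the text, though the exact options used by the authors are not specified and the column names are illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

data = pd.read_csv("preprocessed_compounds.csv")  # hypothetical file name
X = data[["A", "Am", "B", "Bn", "Aele", "Bele", "Ar", "Br"]]
y = data["prototype"]  # integer label for one of the 492 prototype structures

# 100 trees, evaluated on 10 random 80%/20% training/test splits, as described in the text
clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# Fit on the full dataset to inspect descriptor importances (cf. Figure 3)
clf.fit(X, y)
for name, imp in sorted(zip(X.columns, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```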


Predicting combinations of elements from a given crystal structure is performed using the trained random forest classification with the eight descriptors. As a first step, database expansion is performed using the trained random forest classification. Here, the following 640,000 combinations of the eight descriptor variables are generated: A, 1-80; B, 1-80; Am, 1-10; and Bn, 1-10, where the corresponding electronegativities and atomic radii of A and B are also generated. Note that A and B refer to the elements' atomic numbers, while m and n represent the numbers of atoms of elements A and B. Additionally, atomic numbers up to 80 are chosen for prediction because this is the range that the grid-based projector augmented wave method is able to support for confirmation calculations. The 640,000 combinations of descriptor variables are then given to the trained random forest classification, which outputs one of the 492 prototype structures for each combination. In this way, the original 2952-entry materials database is expanded to a database containing 642,952 entries using machine learning.
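A sketch of how this enumeration and expansion could look is given below. It reuses the trained classifier (clf) from the previous sketch, and the periodic-table lookup file is a hypothetical stand-in; the ranges follow the text.

```python
import itertools
import pandas as pd

# Hypothetical periodic-table lookup indexed by atomic number (columns: ele, radius)
ptable = pd.read_csv("periodic_table.csv", index_col="Z")

# Enumerate A = 1-80, B = 1-80, m = 1-10, n = 1-10 (640,000 combinations), as in the text
rows = [
    {
        "A": A, "Am": m, "B": B, "Bn": n,
        # electronegativity and atomic radius scaled by the number of atoms, as before
        "Aele": ptable.loc[A, "ele"] * m, "Bele": ptable.loc[B, "ele"] * n,
        "Ar": ptable.loc[A, "radius"] * m, "Br": ptable.loc[B, "radius"] * n,
    }
    for A, B, m, n in itertools.product(range(1, 81), range(1, 81), range(1, 11), range(1, 11))
]
candidates = pd.DataFrame(rows)

# 'clf' is the random forest trained in the previous sketch; it assigns one of the
# 492 prototype structures to every combination, giving the expanded 642,952-entry database
candidates["prototype"] = clf.predict(candidates)
```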

Figure 4: Left heatmaps represent the frequency of the predicted atomic elements A and B for (a) NaCl, (b) ZnS(cF8), (c) NiAs, and (d) ReO3 from machine learning. Center heatmaps represent the frequency of elements A and B in commonly known materials having the (a) NaCl, (b) ZnS(cF8), (c) NiAs, and (d) ReO3 structures, as a reference. 30 Right heatmaps represent the frequency of elements A and B for (a) NaCl, (b) ZnS(cF8), (c) NiAs, and (d) ReO3 in the original 2952 material data.


Prediction of atomic elements A and B for the commonly known crystal structures NaCl, ZnS(cF8), NiAs, and ReO3 is performed. From the expanded 642,952 entries, the eight descriptor variables for all entries with the NaCl, ZnS(cF8), NiAs, and ReO3 structures are extracted. The frequencies of the predicted atomic elements A and B for each of the NaCl, ZnS(cF8), NiAs, and ReO3 structures are visualized as heatmaps in the left column of Figure 4. Heatmaps of experimentally synthesized materials having the NaCl, ZnS(cF8), NiAs, and ReO3 structures are also visualized and labeled as "Reference". 30 In addition, heatmaps of materials having the NaCl, ZnS(cF8), NiAs, and ReO3 structures that are found in the original 2952 data are visualized as "Original Data". It should be noted that element combinations predicted using machine learning produce areas of higher frequency when a combination of elements appears several times with differing numbers of atoms of elements A and B. This accounts for partial differences in element frequencies between the heatmaps of predicted element combinations and the reference materials in some cases. Additionally, note that there is a difference between the reference and original data in Figure 4: the reference and original data consist of experimentally reported materials and first-principles calculations, respectively. One can consider that the materials in the reference are experimentally synthesized, while the materials in the original data are theoretically stable structures that are not necessarily able to be synthesized in experiments. In the case of NaCl, machine learning predicts the NaCl structure when atomic elements A and B have similar atomic numbers up to 20, as can be seen in the square formations that appear in Figure 4 (a) under the column "Machine Learning Prediction". In particular, when atomic element A is around 1, 10, or 20, atomic element B ranges from 1 to 20; when atomic element B is around 1, 10, or 20, atomic element A ranges from 1 to 10. This leads to the square patterns shown in Figure 4 (a), which also shows that the NaCl structure appears when atomic element A is within the ranges of 35 and 50-70 and atomic element B is within the ranges of 35 and 50-60. In addition, cases where one element has a high atomic number and the other element has a low atomic number tend to form the NaCl structure.


Trends in the predicted A and B elements for NaCl also match the commonly known A and B combinations that form the NaCl structure, which are shown under the column "NaCl Reference". This therefore demonstrates good estimation of the element A and B combinations that produce the NaCl structure. Similar trends are also seen in the cases of the ZnS(cF8) and NiAs structures, which are shown in Figures 4 (b) and (c). As can be seen in the figures, the predictions made by the machine line up with the original data as well as with commonly understood cases reported in the reference. In particular, ZnS(cF8) and NiAs seem to form when elements A and B have atomic numbers around 15, 30, 50, and 80. The case of ReO3, shown in Figure 4 (d), is particularly interesting. In the original 2952 material data, ReO3 is the only material falling under the prototype structure "ReO3". However, the predicted combinations of atomic elements A and B that would produce the ReO3 structure generally contain element A with an atomic number of approximately 8, while element B falls within the ranges 20-35, 40-50, and 55-80, with a higher frequency around 70-80. This prediction matches experimentally synthesized ReO3-structure materials such as ReO3, MoF3, NbF3, and TaF3, as shown under the column "ReO3 Reference". 30 Interestingly, although the results do not predict the commonly known NCu3, they do predict CuN3, which is found to be theoretically unstable according to the first-principles calculations. Additionally, the prediction suggests the possibility of undiscovered materials if one were to investigate cases where element B falls within the atomic number ranges 5-15, 20-30, and 55-70. Given these results, the generalization ability of the trained random forest classification can be viewed as accurate, as it is able to predict element A and B combinations that are not included within the original data but are published elsewhere. Additionally, these results expand the possibilities for designing materials with particular crystal structures, as the machine suggests combinations of atomic elements that may not have been previously published or have yet to be considered.
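For illustration, frequency maps of the kind shown in Figure 4 could be assembled from the expanded predictions along the following lines; this is a sketch assuming the candidates table from the expansion sketch above and a hypothetical integer label for the NaCl prototype, not the authors' plotting code.

```python
import matplotlib.pyplot as plt
import pandas as pd

NACL_LABEL = 42  # hypothetical integer label assigned to the NaCl prototype structure

# Select all expanded entries predicted to adopt the NaCl structure
nacl = candidates[candidates["prototype"] == NACL_LABEL]

# Count how often each (A, B) element pair occurs, over all stoichiometries m and n
freq = pd.crosstab(nacl["A"], nacl["B"])

plt.imshow(freq.reindex(index=range(1, 81), columns=range(1, 81), fill_value=0),
           origin="lower", cmap="viridis")
plt.xlabel("atomic number B")
plt.ylabel("atomic number A")
plt.colorbar(label="frequency")
plt.show()
```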


Figure 5: Visualization of a decision tree for classifying NaCl, ZnS, NiAs, and ReO3. Gini: Gini impurity, samples: total number of samples, value: the number of samples per class [NaCl, ZnS, NiAs, ReO3], class: objective structure (NaCl, ZnS, NiAs, or ReO3).


How crystal structures are classified is visualized using one of the decision trees produced by random forest classification. Classification of NaCl, ZnS, NiAs, and ReO3, in particular, is investigated. Data for the NaCl, ZnS, NiAs, and ReO3 cases, 126 entries in total, are extracted from the original 2952 data, where the numbers of materials with the NaCl, ZnS, NiAs, and ReO3 structures are 72, 30, 22, and 2, respectively. The following eight variables are set as descriptors: atomic number of A, the number of A atoms, atomic number of B, the number of B atoms, electronegativity of A, electronegativity of B, atomic radius of A, and atomic radius of B. Additionally, the following four crystal structures are set as the objective variable for decision tree classification: NaCl, ZnS, NiAs, and ReO3. Details of how the four structures are classified are visualized in Figure 5. Figure 5 demonstrates that the following six variables closely classify the four chosen structures: atomic number of A, the number of B atoms, electronegativity of A, electronegativity of B, atomic radius of A, and atomic radius of B. In particular, information from the periodic table, electronegativity and atomic radius, makes a major contribution, as proposed in Figure 3. Hence, the crystal structures of single and binary compounds can be accurately classified using random forest classification and materials data.
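A minimal sketch of how such a tree could be fitted and drawn with scikit-learn, assuming the 126-entry subset has been extracted into X_sub and y_sub; plot_tree is used here for illustration and may differ from the tooling the authors used.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# X_sub: the eight descriptors for the 126 NaCl/ZnS/NiAs/ReO3 entries (hypothetical subset)
# y_sub: the corresponding structure labels, encoded as integers
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_sub, y_sub)

plt.figure(figsize=(16, 8))
plot_tree(tree,
          feature_names=["A", "Am", "B", "Bn", "Aele", "Bele", "Ar", "Br"],
          # class names assume the label encoding orders the four structures this way
          class_names=["NaCl", "ZnS", "NiAs", "ReO3"],
          filled=True)  # each node reports gini, samples, value, and class, as in Figure 5
plt.show()
```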

Figure 6: Structural model of the ReO3 (XY3) prototype structure. Color code: blue and red represent Re (X) and O3 (Y3), respectively.

Validation of the predicted materials is performed using density functional theory calculations. The grid-based projector augmented wave (GPAW) method is implemented, where the Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional with spin polarization is applied for all calculations. 31,32 Structural optimization is first performed in order to find the ground state lattice constant.


Figure 7: Annotated heatmap of the calculations for the predicted ReO3 (XY3) prototype structure materials in Figure 4 (d). The formation energy (eV) of each material is annotated within the heatmap; 0 represents non-predicted materials. Positive (red) and negative (blue) values represent endothermic and exothermic formation energies, respectively.


With the ground state lattice constant, structural relaxation is performed. The formation energy (Ef) is calculated according to Equation 1:

Ef = E[XY3] − E[X] − E[Y3].    (1)
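A minimal sketch of how such a calculation could be set up with ASE and GPAW for one XY3 candidate is given below, assuming a cubic ReO3-type cell and assuming the reference energies E[X] and E[Y3] are obtained from separate calculations; the grid spacing, k-point mesh, lattice-constant scan, and placeholder reference energies are illustrative choices, not the authors' reported settings.

```python
import numpy as np
from ase.spacegroup import crystal
from gpaw import GPAW

def total_energy(atoms):
    # Grid-based PAW with the PBE functional and spin polarization, as described in the text;
    # the numerical settings here are illustrative
    atoms.calc = GPAW(h=0.2, xc="PBE", spinpol=True, kpts=(4, 4, 4), txt=None)
    return atoms.get_potential_energy()

def reo3_cell(x, y, a):
    # Cubic ReO3-type (Pm-3m) cell: X at the cube corner, Y on the cube edges
    return crystal([x, y], basis=[(0, 0, 0), (0.5, 0, 0)],
                   spacegroup=221, cellpar=[a, a, a, 90, 90, 90])

# Coarse scan over lattice constants to locate the ground state lattice constant
energies = {a: total_energy(reo3_cell("Mo", "F", a)) for a in np.linspace(3.6, 4.2, 7)}
a0 = min(energies, key=energies.get)

# Formation energy per Equation 1; E_X and E_Y3 are reference energies assumed to be
# computed separately (placeholder values here)
E_X, E_Y3 = -10.0, -5.0
E_f = energies[a0] - E_X - E_Y3
print(f"a0 = {a0:.2f} Angstrom, E_f = {E_f:.2f} eV")
```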

Here, the predicted ReO3 prototype structure materials shown in Figures 4 (d) and 6 are investigated and calculated. This particular structure is of interest because, while there are only two ReO3 prototype structure materials within the original database, machine learning predicted 132 possible materials having the ReO3 prototype structure. The breakdown of the 132 predicted materials is as follows: 2 hydrides, 12 carbides, 31 nitrides, 45 oxides, 36 fluorides, 1 sulfide, 4 chlorides, and 1 krypton-based material. The 132 predicted materials are then calculated using density functional theory, where structural optimization is first performed in order to find the ground state lattice constant and the formation energy is then calculated. The formation energies of the 132 calculated materials within the ReO3 (XY3) prototype structure are visualized in Figure 7. The details of the compositions, lattice constants, and formation energies of the materials are collected in the Supporting Information. Figure 7 shows that 67 materials have exothermic formation energies, with the following breakdown: 27 oxides, 36 fluorides, 1 sulfide, and 3 chlorides. One can consider that oxides and fluorides have a tendency to form the ReO3 prototype structure shown in Figure 6. More importantly, although there are only two ReO3 prototype structure materials in the original dataset, the trained machine was able to predict 67 previously undiscovered, theoretically stable materials having the ReO3 structure. These results show that the materials predicted by random forest classification might not always be the most stable structures, but rather can be metastable structures. Figure 4 also illustrates that the machine learning predictions cover a broader range than the reference and original data cases. While the materials falling within the predicted range may not always be the most stable, the predictions provide a good indication of potentially synthesizable materials rather than a pinpoint prediction when searching for novel materials.


The physical origin behind the classification of ReO3 by machine learning can be understood from Figure 5. Note that atomic numbers A and B in Figure 5 correspond to Y and X in ReO3 (XY3), respectively. Figure 5 shows that the electronegativity of Y, the atomic radius of Y, the number of X atoms, and the atomic radius of X in ReO3 (XY3) play a crucial role in identifying the ReO3 structure. More specifically, ReO3 (XY3) structures are preferred under the following conditions: number of X atoms ≤ 1.5, electronegativity of Y ≥ 1.58, atomic radius of X ≥ 1.53, and atomic radius of Y ≥ 1.34. As a result, the ReO3 (XY3) structure is classified by sifting through the atomic radii, electronegativities, and numbers of atoms in a complex manner. Thus, the chosen eight descriptors combined with random forest classification are an effective approach for classifying crystal structures.

Determining the crystal structure in materials science is performed using materials data and machine learning. Unsupervised machine learning, in particular the Gaussian mixture model, reveals two data clusters: structures that fit within the 492 prototype structures and structures that do not fit any of the 492 prototype structures. Supervised machine learning, in particular random forest classification, is implemented in order to estimate the crystal structures, and eight descriptors for determining the crystal structures are identified. Prediction of atomic combinations via materials data and machine learning is also achieved, with an average cross validation score of 79%, and a visualized decision tree is investigated to reveal the physical meaning. In addition, the stability of the predicted materials is evaluated and confirmed through first-principles calculations. Thus, applying machine learning to a materials database can act as a guide towards determining the crystal structure, leading towards the advancement of material design.

This work is funded by the Japan Science and Technology Agency (JST) CREST Grant Number JPMJCR17P2, JSPS KAKENHI Grant-in-Aid for Young Scientists (B) Grant Number JP17K14803, and the Materials research by Information Integration (MI2I) Initiative project of the Support Program for Starting Up Innovation Hub from JST. Computational work is supported in part by the Hokkaido University academic cloud, Information Initiative Center, Hokkaido University, Sapporo, Japan.

References

(1) Woodley, S. M.; Catlow, R. Crystal Structure Prediction from First Principles. Nat. Mater. 2008, 7, 937.
(2) Oganov, A. R.; Lyakhov, A. O.; Valle, M. How Evolutionary Crystal Structure Prediction Works and Why. Acc. Chem. Res. 2011, 44, 227–237.
(3) Mooser, E.; Pearson, W. On the Crystal Chemistry of Normal Valence Compounds. Acta Crystallogr. 1959, 12, 1015–1022.
(4) Chelikowsky, J.; Phillips, J. Quantum-Defect Theory of Heats of Formation and Structural Transition Energies of Liquid and Solid Simple Metal Alloys and Compounds. Phys. Rev. B 1978, 17, 2453.
(5) St. John, J.; Bloch, A. N. Quantum-Defect Electronegativity Scale for Nontransition Elements. Phys. Rev. Lett. 1974, 33, 1095.
(6) Zunger, A. Systematization of the Stable Crystal Structure of All AB-Type Binary Compounds: A Pseudopotential Orbital-Radii Approach. Phys. Rev. B 1980, 22, 5839.
(7) Pettifor, D. A Chemical Scale for Crystal-Structure Maps. Solid State Commun. 1984, 51, 31–34.
(8) Oganov, A. R.; Glass, C. W. Crystal Structure Prediction Using Ab Initio Evolutionary Techniques: Principles and Applications. J. Chem. Phys. 2006, 124, 244704.
(9) Wang, Y.; Lv, J.; Zhu, L.; Ma, Y. CALYPSO: A Method for Crystal Structure Prediction. Comput. Phys. Commun. 2012, 183, 2063–2070.


(10) Sheldrick, G. M. SHELXT - Integrated Space-Group and Crystal-Structure Determination. Acta Crystallogr. A 2015, 71, 3–8.
(11) Case, D. H.; Campbell, J. E.; Bygrave, P. J.; Day, G. M. Convergence Properties of Crystal Structure Prediction by Quasi-Random Sampling. J. Chem. Theory Comput. 2016, 12, 910–924.
(12) Yamashita, T.; Sato, N.; Kino, H.; Miyake, T.; Tsuda, K.; Oguchi, T. Crystal Structure Prediction Accelerated by Bayesian Optimization. Phys. Rev. Mater. 2018, 2, 013803.
(13) Kirklin, S.; Saal, J. E.; Meredig, B.; Thompson, A.; Doak, J. W.; Aykol, M.; Rühl, S.; Wolverton, C. The Open Quantum Materials Database (OQMD): Assessing the Accuracy of DFT Formation Energies. npj Comput. Mater. 2015, 1, 15010.
(14) Castelli, I. E.; Olsen, T.; Datta, S.; Landis, D. D.; Dahl, S.; Thygesen, K. S.; Jacobsen, K. W. Computational Screening of Perovskite Metal Oxides for Optimal Solar Light Capture. Energy Environ. Sci. 2012, 5, 5814–5819.
(15) Curtarolo, S.; et al. AFLOW: An Automatic Framework for High-Throughput Materials Discovery. Comput. Mater. Sci. 2012, 58, 218–226.
(16) Curtarolo, S.; Hart, G. L.; Nardelli, M. B.; Mingo, N.; Sanvito, S.; Levy, O. The High-Throughput Highway to Computational Materials Design. Nat. Mater. 2013, 12, 191.
(17) Mounet, N.; et al. Two-Dimensional Materials from High-Throughput Computational Exfoliation of Experimentally Known Compounds. Nat. Nanotechnol. 2018, 13, 246.
(18) Hautier, G.; Fischer, C. C.; Jain, A.; Mueller, T.; Ceder, G. Finding Nature's Missing Ternary Oxide Compounds Using Machine Learning and Density Functional Theory. Chem. Mater. 2010, 22, 3762–3767.
(19) Seko, A.; Maekawa, T.; Tsuda, K.; Tanaka, I. Machine Learning with Systematic Density-Functional Theory Calculations: Application to Melting Temperatures of Single- and Binary-Component Solids. Phys. Rev. B 2014, 89, 054303.


(20) Raccuglia, P.; Elbert, K. C.; Adler, P. D.; Falk, C.; Wenny, M. B.; Mollo, A.; Zeller, M.; Friedler, S. A.; Schrier, J.; Norquist, A. J. Machine-Learning-Assisted Materials Discovery Using Failed Experiments. Nature 2016, 533, 73.
(21) Takahashi, K.; Takahashi, L.; Miyazato, I.; Tanaka, Y. Searching for Hidden Perovskite Materials for Photovoltaic Systems by Combining Data Science and First Principle Calculations. ACS Photonics 2018, 5, 771–775.
(22) Zhou, Q.; Tang, P.; Liu, S.; Pan, J.; Yan, Q.; Zhang, S.-C. Learning Atoms for Materials Discovery. Proc. Natl. Acad. Sci. U.S.A. 2018, 201801181.
(23) Jain, A.; et al. Commentary: The Materials Project: A Materials Genome Approach to Accelerating Materials Innovation. APL Mater. 2013, 1, 011002.
(24) Walsh, A. Inorganic Materials: The Quest for New Functionality. Nat. Chem. 2015, 7, 274.
(25) Ghiringhelli, L. M.; Vybiral, J.; Levchenko, S. V.; Draxl, C.; Scheffler, M. Big Data of Materials Science: Critical Role of the Descriptor. Phys. Rev. Lett. 2015, 114, 105503.
(26) Takahashi, K.; Tanaka, Y. Materials Informatics: A Journey Towards Material Design and Synthesis. Dalton Trans. 2016, 45, 10497–10499.
(27) Saal, J. E.; Kirklin, S.; Aykol, M.; Meredig, B.; Wolverton, C. Materials Design and Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials Database (OQMD). JOM 2013, 65, 1501–1509.
(28) Pedregosa, F.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.


ˇ ep´ankov´a, O. Visualization of Trends using RadViz. Journal of Intel(29) Nov´akov´a, L.; Stˇ ligent Information Systems 2011, 37, 355. (30) West, A. R. Solid State Chemistry and Its Applications; John Wiley & Sons, 2014. (31) Mortensen, J. J.; Hansen, L. B.; Jacobsen, K. W. Real-space Grid Implementation of the Projector Augmented Wave Method. Phys. Rev. B 2005, 71, 035109. (32) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation Made Simple. Phys. Rev. Lett. 1996, 77, 3865. (33) Graser, J.; Kauwe, S. K.; Sparks, T. D. Machine Learning and Energy Minimization Approaches for Crystal Structure Predictions: A Review and New Horizons. Chem. Mater. 2018, 30, 3601–3612.
