PubChemQC Project: A Large-Scale First-Principles Electronic


Article

PubChemQC Project: a Large-Scale First-Principles Electronic Structure Database for Data-driven Chemistry Maho Nakata, and Tomomi Shimazaki J. Chem. Inf. Model., Just Accepted Manuscript • Publication Date (Web): 08 May 2017 Downloaded from http://pubs.acs.org on May 13, 2017


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Chemical Information and Modeling

PubChemQC Project: a Large-Scale First-Principles Electronic Structure Database for Data-driven Chemistry

NAKATA Maho∗,† and SHIMAZAKI Tomomi‡

†Advanced Center for Computing and Communication, RIKEN, 2-1, Hirosawa, Wako-City, Saitama, 351-0198 JAPAN
‡Advanced Institute for Computational Science, RIKEN, 7-1-26, Minatojima-minami-machi, Chuo-ku, Kobe-City, Kobe, 650-0047 JAPAN
E-mail: [email protected]

Abstract

Large-scale molecular databases play an essential role in the investigation of various subjects such as the development of organic materials, in silico drug design, and data-driven studies with machine learning. We developed a large-scale quantum chemistry database based on first-principles methods. Our database currently contains the ground-state electronic structure of three million molecules based on the density functional theory (DFT) method at the B3LYP/6-31G* level, and we subsequently calculated ten low-lying excited states of over two million molecules via the time-dependent DFT method with the B3LYP functional and the 6-31+G* basis set. To select the molecules calculated in our project, we referred to the PubChem project, which we used as the source of molecular structures encoded as short strings in the InChI and SMILES representations. Accordingly, we named our quantum chemistry database project “PubChemQC” (http://pubchemqc.riken.jp/) and placed it in the public domain. In this paper, we show the fundamental features of the PubChemQC database and discuss the techniques used to construct the dataset for large-scale quantum chemistry calculations. We also present a machine learning approach to predict the electronic structure of molecules as an example to demonstrate the suitability of the large-scale quantum chemistry database.


1. Introduction

The design, discovery, and fabrication of new chemical compounds are essential to solve various issues such as environmental pollution, global warming, and CO2 to O2 conversion. Building basic knowledge on these chemical compounds has become an important task to tackle these issues. Database systems are constructed to store and reuse the collected data. For gathering molecular properties and constructing chemical databases, experimental measurements are often employed. However, experimental approaches are sometimes very expensive, and in some cases it is difficult or impossible to measure the molecular properties of a compound. Therefore, even though large-scale molecular databases are highly necessary, it is not easy to construct them experimentally. In contrast, first-principles quantum chemistry calculations have become very accurate and economical. Back in 1929, Dirac stated: “the fundamental laws necessary for the mathematical treatment of a large part of physics and the whole of chemistry are thus completely known...”.1 Since then, the quantum chemistry approach has enabled us to obtain and predict various molecular properties with (sometimes) better results than experimental approaches.2, 3 In addition, there are many good implementations for quantum chemistry calculations, such as Gaussian,4 QChem,5 MolPro,6 NWChem,7 Turbomole,8 and GAMESS.9 These suggest that first-principles quantum chemistry methods are applicable to the construction of large-scale molecular databases. Thus, we started constructing a database by storing quantum chemistry calculation results. To develop the database, we only used the IUPAC International Chemical Identifier (InChI)10, 11 and the Simplified Molecular Input Line Entry Specification (SMILES)12, 13 representations of molecules. We obtained the InChI and SMILES representations from the PubChem project,14 and


therefore, our project was named “PubChemQC”.15 Large-scale databases have many useful applications, such as virtual screening or expert systems to find molecules with desired properties, and fast estimation of molecular properties. In addition, large datasets play essential roles in machine- and deep-learning methods.16-18 In chemistry, data-driven studies and developments based on the machine learning approach have become imperative.19, 20 In such data-driven chemical studies, the size of the dataset is a critical factor. At present, the PubChemQC project provides ca. three million molecular structures optimized by the density functional theory (DFT) method. We have also calculated the excited states for over two million molecules based on the time-dependent DFT (TDDFT) method. The number of molecules stored in the PubChemQC database is still increasing. Thus, the PubChemQC database can potentially become one of the essential tools to deal with various chemical problems, especially in the data-driven chemistry area. The previous paper gave only a brief introduction to our project.15 It did not present details of the calculation methodologies used in this project, nor did it discuss the analysis of the database or the machine learning calculations. In contrast, this paper discusses the fundamental characteristics of the PubChemQC database. We also describe in detail the techniques used to construct the large-scale database based on the first-principles quantum chemistry approach, which will be helpful in preparing the datasets used for data-driven chemical studies. In Section 2, we briefly describe two molecular encoding systems, namely the InChI and SMILES representations. Then, we describe the PubChem database along with the molecules selected and calculated in our project. We also discuss the techniques used to calculate the electronic structure of these molecules based on the first-principles


quantum chemistry approach. In Section 3, we analyze the characteristics of our database and present a machine learning approach used to predict molecular electronic structures. In Sections 4 and 5, we present the discussion and concluding remarks, respectively.

2. The PubChemQC project

A. The PubChem project as a source of molecular InChI and SMILES representations

First, we briefly describe the molecular encoding systems. Since our project needs to handle a huge number of molecules, a machine-readable notation is essential to develop a large-scale database. The InChI representation is a non-proprietary identifier of chemical substances for electronic data sources.10, 11 The InChI representation can be standardized (i.e., canonicalized) while being unique (with some exceptions) and human-readable (with effort). Here, we give some examples of molecules represented by InChI. Each compound in the PubChem Compound database has its own unique non-zero integer number.14 Hereafter, we denote this number as CID. Thus, ethanol, whose CID is 702, is represented as follows:

InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3

whereas L-ascorbic acid (CID 54670067) is described as follows:

InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1

The InChI representation has three sub-layers: the main, charge, and stereochemical layers. The main layer contains the chemical formula, atom connections, and hydrogen atoms. The charge layer is used to describe the charges and protons of the molecule. The stereochemical layer contains information of double or multiple bonds and tetrahedral


stereochemistry, among other aspects. From the quantum chemical viewpoint, the main layer is the most important sub-layer since the charge and stereochemical layers may contain ambiguities for quantum chemistry calculations. The SMILES representation, which is used for almost the same purpose as InChI, was developed by Weininger12 and Daylight Chemical Information Systems13. It is easier for humans to read and more popular than the InChI representation. However, there are many variants, extensions, and canonicalization methods. O’Boyle proposed a canonicalization of SMILES and implemented it in Open Babel.21, 22 For example, ethanol is represented in canonical SMILES as follows:22

CCO

Note that “C(O)C” also represents ethanol, but this representation is not canonical. L-ascorbic acid in PubChem (CID 54670067) is represented as follows:

C(C(C1C(=C(C(=O)O1)O)O)O)O

It has an isomeric representation as follows:

C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O

When standardized by Open Babel, we have the following description:

OC[C@@H]([C@H]1OC(=O)C(=C1O)O)O

It should be noted that there is no consensus on the representation of organometallics. A clear example is the ferrocene molecules (e.g., CID 7611 and CID 10219726). These molecules can be represented by the following two canonical SMILES representations, although it is difficult to distinguish them by the current algorithm:

[CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]
[CH]1[CH][CH][CH][CH]1.[CH]1[CH][CH][CH][CH]1.[Fe]

We note that these problems are found in the InChI representation as well.
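The layered structure of the InChI strings quoted above can be made concrete with a short sketch. The Python code below splits an InChI string on its `/` separators and extracts the main layer (formula, connections, hydrogens). It is a simplified illustration only: the function names are hypothetical, the parsing ignores many cases handled by the official InChI software, and it is not the procedure used in the PubChemQC project. The stereo-free ascorbic acid string is obtained here simply by truncating the quoted InChI.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into its '/'-separated layers.

    Simplified sketch: each layer after the formula is keyed by its
    one-letter prefix (c, h, q, p, b, t, m, s, ...). If a prefix were
    repeated, the later layer would overwrite the earlier one.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("expected a string starting with 'InChI='")
    version, *layers = inchi[len(prefix):].split("/")
    parsed = {"version": version, "formula": ""}
    for index, layer in enumerate(layers):
        if not layer:
            continue
        if index == 0 and not layer[0].islower():
            # The first layer carries the chemical formula (no prefix letter).
            parsed["formula"] = layer
        else:
            parsed[layer[0]] = layer[1:]
    return parsed


def main_layer(inchi: str) -> tuple:
    """Return (formula, connections, hydrogens), i.e. the main layer."""
    layers = split_inchi_layers(inchi)
    return (layers["formula"], layers.get("c", ""), layers.get("h", ""))


def same_main_layer(inchi_a: str, inchi_b: str) -> bool:
    """Compare two InChI strings by their main layers only."""
    return main_layer(inchi_a) == main_layer(inchi_b)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ASCORBIC = ("InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5"
            "/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1")
# Same string with the stereochemical layers (/t, /m, /s) cut off.
ASCORBIC_NO_STEREO = ASCORBIC.split("/t")[0]

print(split_inchi_layers(ETHANOL))
print(same_main_layer(ASCORBIC, ASCORBIC_NO_STEREO))
```

Comparing only the main layer, as `same_main_layer` does, treats stereoisomers of the same connectivity as identical, which is the kind of ambiguity the text refers to.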


Quantum chemistry programs require the three-dimensional (3D) coordinates of a molecule in the input file. Therefore, we generated the 3D coordinates of the molecules from the InChI or SMILES representations. In addition, the reverse transformation (i.e., from the 3D coordinates to the InChI and SMILES representations) was used to analyze and validate the first-principles calculations. Thus, our project required a one-to-one mapping between the machine-readable representation and the 3D coordinates of each molecule. Both the InChI and SMILES representations are suitable for this purpose. The two representations behaved very similarly for this mapping, although, in our experience, the SMILES representation provides more stable results in the reverse transformation.

To build the first-principles electronic structure database, we employed the PubChem project as the source of the InChI and SMILES molecular representations.14 The PubChem database is the largest public database for InChI and SMILES representations, and it is maintained by the National Institutes of Health (NIH) through the NIH Molecular Libraries Roadmap initiative. There are three subprojects in the PubChem database, namely PubChem Substance, PubChem BioAssay, and PubChem Compound. PubChem Substance collects all the submitted data from the science and engineering communities, whereas PubChem Compound contains pure, standardized, and non-duplicated compounds obtained by organizing the data of the PubChem Substance project. PubChem Compound provides approximately 92,000,000 molecules,23 and the number is updated daily. Importantly, the PubChem project is in the public domain (i.e., the PubChem-generated information and the participant-provided information are available without cost and without restriction to the public24). The license terms are critical for the advancement of our project. Thus, we


have employed PubChem Compound as the molecular source in our project. Other databases could also be used to create large-scale quantum chemistry databases. For example, the Chemical Abstracts Service database contains more than 124 million organic and inorganic substances collected from the published scientific literature.25 However, it is a proprietary database, and therefore its secondary use is difficult. In the next section, we explain the workflow followed to create our database.

B. Workflow to create calculation data for the PubChemQC database

The workflow followed to construct the PubChemQC database is shown in Figure 1. We acquired the molecular information as structure-data files (SDFs) from the FTP site of the PubChem project26 in mid-July 2014. Approximately 3,000 SDFs were obtained, each one containing approximately 25,000 molecules. It is worth noting that some CIDs were not available because of deprecated molecular data in the PubChem project. Each SDF from the PubChem Compound project contained the 3D structures without hydrogen atoms generated by the PubChem3D project,27 the IUPAC names, the InChI and SMILES representations,12, 13 and the molecular weights, among other parameters. However, our project only used the CID, the InChI, and the isomeric SMILES representations. The molecular weight was used only to sort molecules. No other information was employed to create our database. We created a file containing the CIDs, the InChI representations, and the molecular weights from the SDFs. Then, we sorted the molecules in the file by molecular weight in ascending order. In our project, the first-principles molecular calculations were executed in this sorted order. Thus, our calculations started with a hydrogen atom and will finish with a very heavy molecule. Roughly speaking, we can calculate lighter


molecules much faster and more easily than heavier molecules. The molecular weight distribution in the PubChem database is shown in Figure 2, where the horizontal axis represents the molecular weight and the vertical axis represents the accumulated number of molecules. Interestingly, the sorted data did not show a simple log-scale distribution. In the low-molecular-weight region, the accumulated number of molecules increased exponentially. In contrast, a linear behavior was obtained in the ca. 200–600 molecular weight region. A saturated behavior was observed at larger molecular weights. The molecules stored in the PubChem database were gathered from the scientific and engineering communities. In other words, the database was created by humans and not by computer algorithms. Databases constructed by humans can show such a linear-scale distribution when molecules are sorted by molecular weight. In contrast, the GDB-17 database, which was created by a combinatorial algorithm, shows an astronomical growth in the number of molecules.28 Some compounds were not suitable for our database, such as mixed substances including ionic salts (e.g., Ni²⁺ SO₄²⁻, CID 24586), HCl salts (e.g., CID 56825941), water mixtures (e.g., CID 21932805), and other mixtures (e.g., CID 67855675). Unfortunately, molecules containing an η⁵ bond (e.g., ferrocene CID 7611, 504306, and 11985121, among others) were not calculated because of issues with the InChI representation. We removed the molecules whose SMILES expressions contained. In addition, we only calculated molecules containing H, He, Li, Be, B, C, N, O, F, Ne, Na, Mg, Al, Si, P, S, Cl, Ar, K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, and Zn. This restriction is due to the limitations of the 6-31G* basis set. Isotopes were also ignored (CID 167583 and CID 783 gave the same results). In contrast, isomers were registered separately, as CID 5426 (thalidomide), CID 75792 ((R)-thalidomide), and CID 92142 ((S)-thalidomide) in


the PubChem database. CID 5426 is the racemic modification of CID 75792 and CID 92142, and we had to randomly specify one of the enantiomers when calculating CID 5426. We assumed that all the molecules were neutral (i.e., we did not distinguish CID 28179 and CID 5360525), and the number of electrons was determined from the nuclear charges. From the InChI representation, we generated the initial (guessed) 3D molecular structures used for the quantum chemistry calculations by using Open Babel with the “-gen3d -addH” options as follows:21, 22, 29 (i) the initial 3D structure was generated using rules and ring templates, (ii) 250 steps of steepest descent geometry optimization were carried out with the MMFF94 forcefield, (iii) 200 iterations of the Weighted Rotor conformational search were carried out (optimizing each conformer with 25 steps of steepest descent), and (iv) 250 steps of conjugate gradient geometry optimization were carried out. Thus, the initial 3D molecular geometry obtained was a fairly good starting point for the subsequent quantum chemistry calculations. The first geometry optimization based on a quantum chemistry method was performed with the PM3 method.30, 31 The resultant geometry was further optimized by the Hartree–Fock method using the STO-6G basis set. Next, we optimized the geometry with the B3LYP (VWN3) functional32, 33 using the 6-31G* basis set. The B3LYP optimization process actually comprised three steps. In the first step, we roughly optimized the molecular geometries using FireFly34 or SMASH35. In the second step, more accurate geometry optimizations were executed by GAMESS. Splitting the optimization into these two steps reduced the calculation time: the FireFly and SMASH calculations are substantially faster but slightly less accurate, so the first step quickly brings the geometry close to the optimum and the second step then requires fewer iterations. Finally, in the third step, we executed the optimization calculation again to validate that the molecule


was really optimized in the process by checking whether the same structures were obtained in the input and output. As this step was used for validation only, it had no effect on the final molecular geometry. The B3LYP geometry optimization processes usually provide fairly good geometries (typically bond angles within a few degrees and bond lengths within 0.02 Å).36-39 Subsequent excited-state calculations were executed by the TDDFT method with the B3LYP functional and the 6-31+G* basis set, employing the same geometry obtained by the B3LYP optimization step. The input files and the final results were uploaded weekly at http://pubchemqc.riken.jp/. All calculations were executed on the RICC supercomputer (Intel Xeon 5570 2.93 GHz, 1024 nodes) and the QUEST supercomputer (Intel Core2 L7400 1.50 GHz, 700 nodes) at the RIKEN Advanced Center for Computing and Communication. We also employed the HOKUSAI supercomputer (Fujitsu PRIMEHPC FX100) at RIKEN, whose peak performance was approximately 1 PFLOPS. In addition, the Oakleaf-FX supercomputer (Fujitsu PRIMEHPC FX10, SPARC64 IX 1.848 GHz) at the University of Tokyo was used. With the above computational resources, calculations for several thousand molecules per day were possible.
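The selection and ordering rules described in this section (restricting molecules to the elements H through Zn supported in this work, and running calculations in ascending order of molecular weight) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(cid, formula, molecular_weight)`, the function names, and the placeholder CIDs other than 702 (ethanol) are hypothetical, and the project's actual scripts are not described in the text.

```python
import re

# Elements calculated in the project, as listed in the text (H through Zn).
ALLOWED_ELEMENTS = set(
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar "
    "K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn".split()
)


def elements_in_formula(formula: str) -> set:
    """Extract element symbols from a formula such as 'C2H6O'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def is_supported(formula: str) -> bool:
    """True if every element in the formula is in the allowed set."""
    return elements_in_formula(formula) <= ALLOWED_ELEMENTS


def calculation_order(records):
    """Drop unsupported molecules, then sort by ascending molecular weight.

    `records` is an iterable of (cid, formula, molecular_weight) tuples;
    this layout is an assumption made for the example.
    """
    kept = [r for r in records if is_supported(r[1])]
    return sorted(kept, key=lambda r: r[2])


records = [
    (702, "C2H6O", 46.07),       # ethanol (CID from the text)
    (900001, "C6H5Br", 157.01),  # placeholder CID; Br is not in the allowed set
    (900002, "H2O", 18.02),      # placeholder CID
]
ordered = calculation_order(records)
print(ordered)
```

Sorting before dispatching jobs means the cheap, light molecules are finished first, which matches the order described in the text.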

3. Analysis of the PubChemQC Database and Machine Learning Results

In this section, we aggregate the calculation results to analyze the features of the PubChemQC database. As shown in Figure 1, several steps were needed to create the calculation data from the InChI representations. Therefore, the data may contain some errors and failures, even when the first-principles calculations finished successfully. In some cases, these errors were produced by bugs in our process, whereas in other cases, they might have originated from errors in the PubChem database. To


exclude such inappropriate calculations, we generated the InChI representation from the optimized 3D molecular structure of each molecule. Then, we compared this InChI representation with the original one stored in the PubChem database. Only those molecules providing the same InChI representations for the main layers were subsequently employed to analyze the PubChemQC database in this section. This InChI-based validation technique is similar to that used in the literature.40 To organize the calculated data, we employed several relational database servers, adopting PostgreSQL as the relational database management system. We parsed the output files generated by GAMESS and registered extracted values such as molecular weights, HOMO–LUMO gaps, and excitation energies in the servers. Here, the CID was used as the primary key in the relations. The molecular weight, HOMO–LUMO gap, or excitation energy can be treated as a (float8-type) scalar value in ordinary relations. We can obtain and search molecular properties by issuing SQL queries to the relational database servers. PostgreSQL can also handle array-type data. We inserted molecular geometries and orbital energies as array-type values into the servers, although those are not considered in the analysis of this paper. We will report an analysis of the array-type properties elsewhere. First, we present the dipole moment histogram of 2,819,910 molecules validated by the InChI-based technique (Figure 3). Here, the horizontal axis refers to the dipole moment, whereas the vertical axis represents the number of molecules. The molecular dipole moment is an important indicator to predict solubility in water or organic solvents. In the PubChemQC database, we confirmed that ca. 10,000 molecules were not polarized. Conversely, several molecules showed dipole moments of 1-3


Debye. The number of molecules with higher dipole moments decreased exponentially, and few molecules in the PubChemQC database showed dipole moments higher than 10.0 Debye. Next, we discuss the energy difference (gap) between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) and the excitation energy calculation results. The HOMO–LUMO gap is a fundamental parameter in the design of organic electronic devices such as electroluminescent displays, image sensors, and photocells. The excitation energy is especially important for photon absorption phenomena.41, 42 Figure 4 summarizes the HOMO–LUMO gap histogram of 2,248,895 molecules (blue series). The red histogram refers to the excitation energy. The HOMO–LUMO gaps were more widely distributed than the excitation energies, and the center of the HOMO–LUMO gap distribution lay at a higher energy than that of the excitation energy. The HOMO–LUMO gap is considered as the energy difference between free quasi-particles (i.e., electron and hole).43, 44 Conversely, the excitation energy corresponds to the optical gap, which is typically observed by photon absorption experiments. Thus, the optical gap becomes lower than the HOMO–LUMO gap owing to the strong binding energy between holes and electrons. In other words, we can evaluate the binding energy by subtracting the excitation energy from the HOMO–LUMO gap.43 Figure 5 shows the relation between the HOMO–LUMO gap and the binding energy. The exciton binding energy is an important parameter when developing the materials used in organic photocell devices.43, 44 The binding energy of different organic materials can easily be estimated by a database search. In Figure 5, the horizontal and vertical axes indicate the HOMO–LUMO gap and the excitation energy, respectively.
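To illustrate how such a database search might estimate the binding energy as the HOMO–LUMO gap minus the excitation energy, the following Python sketch runs an SQL query against a small in-memory table. Several things here are assumptions made for the example: SQLite (via the standard `sqlite3` module) stands in for the PostgreSQL servers mentioned in the text, the table and column names are hypothetical, and the three rows are made-up placeholder values rather than PubChemQC data. Only the use of the CID as the primary key follows the text.

```python
import sqlite3

# In-memory stand-in for a relational table keyed by CID.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE molecule (
        cid                  INTEGER PRIMARY KEY,
        homo_lumo_gap_ev     REAL,
        excitation_energy_ev REAL
    )
    """
)

# Placeholder rows (cid, gap, excitation energy); not real database values.
rows = [(1, 6.0, 4.5), (2, 5.0, 4.0), (3, 4.0, 3.5)]
conn.executemany("INSERT INTO molecule VALUES (?, ?, ?)", rows)

# Binding energy estimated as gap minus excitation energy, restricted to
# molecules whose gap lies in a requested window.
query = """
    SELECT cid,
           homo_lumo_gap_ev - excitation_energy_ev AS binding_energy_ev
    FROM molecule
    WHERE homo_lumo_gap_ev BETWEEN ? AND ?
    ORDER BY binding_energy_ev DESC
"""
result = conn.execute(query, (4.5, 6.5)).fetchall()
print(result)
conn.close()
```

Because the subtraction is done inside the query, the same statement can be combined with other filters (for example on molecular weight) without post-processing on the client side.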


The small black points represent the calculated results, and the huge number of data points forms the thick band observed in Figure 5. The difference between the HOMO–LUMO gap and the excitation energy increased with the HOMO–LUMO gap. A molecule with a large HOMO–LUMO gap tends to show a strong electron–hole binding interaction. We can instantly obtain molecular information such as the dipole moment and excitation energy by searching the PubChemQC database, provided that the first-principles calculation for the target molecule is already finished and stored. However, if the target molecule does not exist in the database, we cannot obtain any information from the query. To overcome this problem, we considered a machine learning approach to predict molecular electronic structures from already calculated data without time- and resource-consuming quantum chemistry calculations. To train machines (predictors), we employed “Topological” fingerprints with 1024 bits as molecular feature vectors, using the RDKit library.45 A molecular fingerprint is a series of binary bits representing the characteristics of a molecule, in a form that is suitable for processing by computers. Similarities and differences among molecules can be calculated by comparing these fingerprints. A number of algorithms have been proposed for generating molecular fingerprints. The Topological fingerprint is generated in a similar way to the Daylight algorithm,13 in which the substructures of the molecule are identified and hashed. These substructures are by default up to seven bonds long and are represented in a bit stream of user-specified length, which is folded down to obtain a particular density of set bits. The “Topological” fingerprint is therefore created from information on the bond connections in the molecule. The machine learning approaches used in this study were executed using the scikit-learn library.46 Here, the fingerprints (feature vectors) can be generated


only from the SMILES representation. Therefore, we can predict molecular electronic structures exclusively from the SMILES representation. The HOMO–LUMO gap was employed as the target property. The support vector machine (SVM) and ridge regression algorithms were employed with the kernel technique for the machine learning-based prediction.47 Here, the Gaussian radial basis function (RBF) was used for the kernel.47 We also examined second- and third-order polynomials for the kernel technique.47 We selected a dataset containing one million molecules from the PubChemQC database. The molecules were chosen randomly, but in such a way that the HOMO–LUMO gaps were uniformly distributed between 4.5 and 6.5 eV in the dataset. We trained the machines (predictors) using 20,000 molecules, which were randomly chosen from the dataset. Then, we predicted the HOMO–LUMO gaps of the remaining molecules using the trained machines (predictors). The root-mean-square errors (RMSEs) between the predictions and the exact HOMO–LUMO gaps are summarized in Table 1. When the RBF kernel was employed, the SVM and the ridge regression provided RMSE values of 0.36 and 0.37 eV, respectively. In the SVM method, the polynomial kernels provided slightly worse predictions than the RBF kernel: the second- and third-order polynomial kernels gave RMSEs of 0.39 and 0.43 eV, respectively. Conversely, the second-order polynomial kernel yielded an RMSE of 0.36 eV when the ridge regression was employed, which was similar to the results obtained using the RBF kernel. As a baseline, because the HOMO–LUMO gaps in the dataset are uniformly distributed, we also estimated the RMSE obtained when the average value (5.5 eV) is used as the prediction for every molecule. The use of the average value gave an RMSE of 0.58 eV. From these calculations, we can confirm that the machine learning approach can roughly predict the DFT results, even if only the SMILES representations are provided. The SMILES representation

contains the bonding information between the atoms in the molecule, which may give some hints for the estimations. The predictive ability could be enhanced by improving the algorithms and feature vectors. We are currently investigating the prediction of the HOMO, LUMO, and excitation energies. More advanced studies based on the machine learning approach are in progress, and the results will be reported in the near future.
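
The 0.58 eV baseline quoted above can be checked analytically. The following is a short worked derivation under the idealizing assumption that the gaps $g$ are exactly uniformly distributed on $[a, b] = [4.5, 6.5]$ eV and that the predictor always returns the mean $\mu$ (the symbols $g$, $a$, $b$, and $\mu$ are introduced here for illustration only):

```latex
\mu = \frac{a + b}{2} = 5.5\ \text{eV}, \qquad
\mathrm{RMSE}_{\text{mean}}
  = \sqrt{\frac{1}{b - a}\int_{a}^{b} (g - \mu)^{2}\, dg}
  = \frac{b - a}{\sqrt{12}}
  = \frac{2.0\ \text{eV}}{\sqrt{12}}
  \approx 0.58\ \text{eV}.
```

Measured against this constant-mean baseline, the best reported RMSE of 0.36 eV corresponds to an error reduction of roughly 38%.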

4. Discussions

Here, we discuss other database projects that use computer simulation techniques. The NIST Chemistry WebBook contains the IR spectra of over 16,000 molecules and the UV spectra and quantum chemical calculations of 1,600 molecules.48 PubChem3D27 lists molecular geometries optimized by the MMFF94s empirical method49 for molecules containing H, C, N, O, F, Si, P, S, Cl, Br, and I with less than 50 hydrogens. In our project, the molecular structures were optimized using the DFT method, which usually provides more reliable results than empirical classical calculations. The Harvard Clean Energy Project was intended to store molecules using the combinatorial approach and quantum chemical calculations.50, 51 This project provided 2.3 million candidates for organic electronic materials. The largest database using the combinatorial approach is GDB-17.28 It enumerates, in SMILES notation, all possible molecules (containing C, N, O, S, Cl, and H) of up to 17 atoms; this database contains 166 billion molecules. Ramakrishnan et al. carefully chose 134,000 molecules from the GDB-17 database and calculated their molecular electronic structures using first-principles methods.40 At nearly the same time, we launched the PubChemQC project (http://pubchemqc.riken.jp/).15 At
that time, our project already provided ca. 1.5 million molecules. At present, our database provides molecular excited states as well. However, the most important difference between the two projects lies in the source of the molecules. The molecules stored in the GDB-17 database are artificially generated by a combinatorial algorithm. The GDB-17 database therefore includes a huge number of molecules, although many of them might be less important from the chemical viewpoint. For example, in the work of Ramakrishnan et al., molecules containing up to nine heavy atoms and excluding S, Br, Cl, and I were selected.40 Even such a loose filter removed almost all molecules from the GDB-17 database, and only 134,000 molecules remained. Conversely, in our project, we preferred to choose the molecules that should be calculated. The molecules stored in the PubChem database are molecules that were actually synthesized by the science and engineering communities and chemical vendors, among others.52 In addition, PubChem obtained molecules from other important databases such as ChEMBL53 (imported 1,686,695 live substances54), ZINC55 (imported 25,758,525 live substances56), ChemSpider57 (imported 14,642,781 live substances58), KEGG59 (39,051 live substances), and Aurora Fine Chemicals LLC (34,304,433 live substances60). Thus, the PubChem database contains a very wide variety of molecules. Therefore, we believe that our project includes essential molecules for chemistry and material science. We started this project in 2013 and presented 13,000 molecules on 1/15/2014, and this number increased to 25,000 molecules on 7/2/2014. At that time, we calculated and presented only ground-state optimized geometries and uploaded them at http://pubchemqc.sourceforge.net/. However, that site was shut down on 2/24/2014 because it exceeded its bandwidth limit. We moved our site to http://pubchemqc.riken.jp/ on 3/30/2014. We started the excited-state calculations by TDDFT on 4/24/2014. On 5/20/2014, we
sorted the SDFs by molecular weight and restarted the whole set of calculations with the 116,869 previously calculated molecules. On 7/29/2014, we presented 155,792 molecules with ground-state geometry optimizations and 55,456 molecules with excited states. On 11/11/2014, we presented over 1,000,000 molecules (1,001,704 ground-state and 1,001,133 excited-state calculations). On 5/27/2015, we presented over 2,000,000 molecules (2,016,173 ground-state and 2,016,173 excited-state calculations). Thus, the number of molecules in the PubChemQC database has been steadily increasing, although we faced some problems. These results show that the techniques discussed in this paper are effective in constructing an electronic structure database of several million organic molecules. However, we noticed that our present approach may not be applicable to several tens of millions of molecules owing to the time-consuming first-principles calculations and the handling of huge amounts of data, among other aspects. To overcome these obstacles, we intend to adopt large-scale parallel supercomputers such as the K-computer and parallel database systems. Such studies will be discussed in subsequent papers.
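
The scaling concern above can be made concrete with a rough back-of-the-envelope estimate. The sketch below uses only the ground-state milestones quoted in the text, reads the dates as month/day/year, and assumes a constant production rate; the 30 million target is a hypothetical number chosen only to represent "several tens of millions", and the extrapolation ignores effects such as the growing cost of larger molecules.

```python
from datetime import date

# Ground-state milestones quoted in the text (dates read as month/day/year).
milestones = [
    (date(2014, 7, 29), 155_792),
    (date(2014, 11, 11), 1_001_704),
    (date(2015, 5, 27), 2_016_173),
]


def rates_per_day(points):
    """Average number of molecules added per day between consecutive milestones."""
    rates = []
    for (d0, n0), (d1, n1) in zip(points, points[1:]):
        rates.append((n1 - n0) / (d1 - d0).days)
    return rates


rates = rates_per_day(milestones)

# Hypothetical target size, used only to illustrate "several tens of millions".
target = 30_000_000
last_count = milestones[-1][1]
years_needed = (target - last_count) / rates[-1] / 365.25

for (d0, _), (d1, _), rate in zip(milestones, milestones[1:], rates):
    print(f"{d0} to {d1}: about {rate:,.0f} molecules/day")
print(f"Years to reach {target:,} at the latest rate: about {years_needed:.0f}")
```

Under these assumptions, the most recent rate is on the order of five thousand molecules per day, so reaching tens of millions of molecules would take well over a decade without additional computing resources.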

5. Concluding Remarks

We have been developing a large-scale molecular electronic structure database based on first-principles quantum chemistry methods. Unlike other projects, our project did not employ machine- and algorithm-generated molecular chemical structures. In this project, we targeted the molecules listed in the PubChem Compound database and, therefore, named our database PubChemQC; it provides a wide variety of molecular frameworks. The 3D molecular structures obtained by the first-principles
methods were prepared only from the InChI and the SMILES representations. The B3LYP functional was used with the 6-31G* basis set for the ground-state optimizations and with the 6-31+G* basis set for the excited-state TDDFT calculations. Thus, we did not use any experimental data to construct the database. We analyzed the molecular dipole moment and the relation between the HOMO–LUMO gap and the excitation energy. This paper also demonstrated that a machine learning approach is useful for roughly predicting the electronic structure of molecules, even when only the SMILES representation is provided. Currently, ca. three million optimized molecular structures and ground-state wave functions are stored in the PubChemQC database. We also provide the excited-state electronic structures of over two million molecules. The number of molecules stored in the database continues to increase, and new data are added weekly at our site (http://pubchemqc.riken.jp/). These large-scale molecular databases will be useful for various chemical studies such as search and data mining for materials and drugs, training machines on large datasets, and data-driven chemical studies. Indeed, there is strong demand for large-scale databases in those applications.

This project employed the B3LYP method because it can produce reasonable calculation results for a wide range of molecules using limited computer resources. It also allows a balance to be struck between calculation time and accuracy.36, 38, 39 B3LYP can be applied to a wide range of problems.38 This allows the calculation results stored in our database to be used to identify molecular features, even by researchers who are not specialists in quantum chemistry. However, some molecules demand more accurate methods. For example, it is well known that B3LYP performs poorly when describing Rydberg and charge-transfer states, and a long-range term is often required to
correct the results.61 More accurate and sophisticated methods have been proposed in quantum chemistry, such as Møller–Plesset perturbation theories and the coupled cluster methods. Other databases, based on more accurate quantum chemistry methods, may be required in the future, although those are beyond the scope of this study.

We did not attempt to provide the globally optimized conformation of every molecule registered in the database because of the enormous computational costs. Instead, the project provides reasonable (local) molecular geometries. The goal was to roughly capture the electronic structures of as many molecules as possible; therefore, we did not consider all aspects of the molecules. A more exact treatment of the changes in molecular properties with molecular conformation may be supplied by other database projects. In practice, it is not possible for a single database to satisfy all requests across the full range of scientific and engineering fields. More specific databases may be required to address more detailed problems. However, the calculation data stored in the PubChemQC project can provide a substructure for the construction of such problem-specific databases. The large-scale database construction techniques discussed in this paper can also support the development of such databases.

The PubChemQC database is one of the largest existing databases of first-principles quantum chemical calculations. However, the number of molecules listed in our database may be insufficient. The data size is one of the critical factors determining the usefulness of a database. We are, therefore, considering employing massively parallel supercomputing to accelerate the implementation of our project. The focus of this paper was on neutral molecules. In the near future, we will include charged molecules, which are especially important in biological applications. In addition, we have been trying to add other molecular properties such as the vibrational structure, the nuclear magnetic
resonance chemical shift, the optimized structure in the excited state, and the solvent effects. Although this project employs relational database servers to organize and analyze the calculated data, the website does not offer a query service. Instead, it provides mainly input and output files for the calculated results. The provision of more useful tools will require substantial upgrading, including the use of high-performance servers and parallel database systems, with the associated server maintenance burden, the development of web applications, improved site design, enhanced security, and the development of a licensing policy. Despite these difficulties, the website must be upgraded to allow our calculated data to be utilized more effectively. We are, therefore, developing a more useful and user-friendly website. The format (structure) of the database is crucial to this. However, it is nontrivial to identify the most suitable format for our project. The format must have a number of features, including flexibility, extensibility, and usability. We are currently researching appropriate database structures, including NoSQL-type formats, working alongside experts in information science. These trials are reported elsewhere.
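
To illustrate the kind of query service discussed above, the following is a minimal sketch of a relational layout using Python's built-in sqlite3 module. The table names, column names, and output file names are hypothetical and are not the schema used by the project; the two molecules (water and methane, with their PubChem CIDs) serve only as placeholders.

```python
import sqlite3

# In-memory database; the schema below is an illustrative assumption only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE molecule (
        cid    INTEGER PRIMARY KEY,  -- PubChem Compound ID
        smiles TEXT NOT NULL,
        inchi  TEXT NOT NULL
    );
    CREATE TABLE calculation (
        id          INTEGER PRIMARY KEY,
        cid         INTEGER NOT NULL REFERENCES molecule(cid),
        method      TEXT NOT NULL,   -- e.g. 'B3LYP/6-31G*'
        kind        TEXT NOT NULL,   -- 'ground_state' or 'excited_state'
        output_file TEXT NOT NULL    -- hypothetical file name
    );
    CREATE INDEX idx_calculation_cid ON calculation(cid);
    """
)

conn.executemany(
    "INSERT INTO molecule (cid, smiles, inchi) VALUES (?, ?, ?)",
    [
        (962, "O", "InChI=1S/H2O/h1H2"),
        (297, "C", "InChI=1S/CH4/h1H4"),
    ],
)
conn.executemany(
    "INSERT INTO calculation (cid, method, kind, output_file) VALUES (?, ?, ?, ?)",
    [
        (962, "B3LYP/6-31G*", "ground_state", "000000962.opt.out"),
        (962, "TD-B3LYP/6-31+G*", "excited_state", "000000962.td.out"),
        (297, "B3LYP/6-31G*", "ground_state", "000000297.opt.out"),
    ],
)

# Example query: molecules with a ground-state result but no excited-state result yet.
pending = conn.execute(
    """
    SELECT m.cid
    FROM molecule AS m
    WHERE EXISTS (SELECT 1 FROM calculation AS c
                  WHERE c.cid = m.cid AND c.kind = 'ground_state')
      AND NOT EXISTS (SELECT 1 FROM calculation AS c
                      WHERE c.cid = m.cid AND c.kind = 'excited_state')
    ORDER BY m.cid
    """
).fetchall()
print(pending)
```

A fixed schema of this kind makes such queries simple, but every new property (for example, NMR shifts or solvent effects) would require a schema change, which is one reason more flexible formats may be worth considering.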

Acknowledgments

The calculations were performed using the RIKEN Integrated Cluster of Clusters (RICC) and the HOKUSAI facility, and the research was partially supported by the Initiative on Promotion of Supercomputing for Young or Women Researchers, Supercomputing Division, Information Technology Center, The University of Tokyo. This work was partially supported by the JSPS KAKENHI, Grant Number 15K05403.

Contributions

M.N. and T.S. contributed equally to this manuscript.

Table 1. HOMO–LUMO gap predictions based on the machine learning approach.

Method              Kernel                      RMSE [eV]
SVM regression      RBF                         0.36
SVM regression      second-order polynomial     0.39
SVM regression      third-order polynomial      0.43
Ridge regression    RBF                         0.37
Ridge regression    second-order polynomial     0.38
Ridge regression    third-order polynomial      0.36
Ridge regression    fourth-order polynomial     0.48
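
The comparison of kernels in Table 1 can be illustrated with a small, self-contained sketch of kernel ridge regression. This is not the authors' code: it is written in plain Python, uses synthetic one-dimensional data (a sine curve) instead of SMILES-derived feature vectors and DFT gaps, and the hyperparameters `gamma`, `coef0`, and `lam` are arbitrary illustrative choices. In the setting of the paper, the inputs would be feature vectors built from SMILES and the targets would be the calculated HOMO–LUMO gaps.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def poly2_kernel(a, b, coef0=1.0):
    """Second-order polynomial kernel: (a . b + coef0)^2."""
    return (sum(x * y for x, y in zip(a, b)) + coef0) ** 2


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_kernel_ridge(X, y, kernel, lam=1e-3):
    """Return dual coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(K, y)


def predict(X_train, alpha, kernel, x):
    """Prediction f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))


def rmse(pred, true):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


# Synthetic toy data (not chemical data): y = sin(x) on a regular grid.
X_train = [[0.3 * i] for i in range(20)]
y_train = [math.sin(x[0]) for x in X_train]
X_test = [[0.3 * i + 0.15] for i in range(19)]
y_test = [math.sin(x[0]) for x in X_test]

results = {}
for name, kernel in [("rbf", rbf_kernel), ("poly2", poly2_kernel)]:
    alpha = fit_kernel_ridge(X_train, y_train, kernel)
    preds = [predict(X_train, alpha, kernel, x) for x in X_test]
    results[name] = rmse(preds, y_test)
    print(f"{name}: RMSE = {results[name]:.4f}")
```

On this toy problem the RBF kernel fits the curve far better than the second-order polynomial kernel, simply because a quadratic cannot follow a sine wave; how the kernels rank on real molecular data depends on the features and targets, as the values in Table 1 show.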

Figure captions

Figure 1. Workflow to create calculation data stored in the PubChemQC database.

Figure 2. Molecular weight distribution of the PubChem Compounds.

Figure 3. Histogram of molecular dipole moments stored in the PubChemQC database.

Figure 4. Histogram of the HOMO–LUMO gaps (blue) and the excitation energies (red) stored in the PubChemQC database.

Figure 5. The relation between the HOMO–LUMO gap and the excitation energy.

References

1. Dirac, P. A. M., Quantum mechanics of many-electron systems. Proceedings of the Royal Society of London A 1929, 123, 714-733.
2. Pople, J. A., Energy, structure, and reactivity. In Proceedings of the 1972 Boulder Summer Research Conference on Theoretical Chemistry, Smith, D. and McRae W., Eds.; John Wiley & Sons Ltd.: New York, 1973; pp 51-61.
3. Helgaker, T.; Jorgensen, P.; Olsen, J., Molecular electronic-structure theory. John Wiley & Sons: New York, 2000.
4. Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Petersson, G. A.; Nakatsuji, H.; X. Li, M. C.; Marenich, A.; Bloino, J.; Janesko, B. G.; Gomperts, R.; Mennucci, B.; Hratchian, H. P.; Ortiz, J. V.; Izmaylov, A. F.; Sonnenberg, J. L.; Williams-Young, D.; Ding, F.; Lipparini, F.; Egidi, F.; Goings, J.; Peng, B.; Petrone, A.; Henderson, T.; Ranasinghe, D.; Zakrzewski, V. G.; Gao, J.; Rega, N.; Zheng, G.; Liang, W.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Throssell, K.; Montgomery, J. A.; Jr., J. E. P.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Keith, T.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Millam, J. M.; Klene, M.; Adamo, C.; Cammi, R.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Farkas, O.; Foresman, J. B.; Fox, D. J., Gaussian 09. Gaussian Inc.: Wallingford CT, 2009.
5. Shao, Y. H.; Gan, Z. T.; Epifanovsky, E.; Gilbert, A. T. B.; Wormit, M.; Kussmann, J.; Lange, A. W.; Behn, A.; Deng, J.; Feng, X. T.; Ghosh, D.; Goldey, M.; Horn, P. R.; Jacobson, L. D.; Kaliman, I.; Khaliullin, R. Z.; Kus, T.; Landau, A.; Liu, J.; Proynov, E. I.; Rhee, Y. M.; Richard, R. M.; Rohrdanz, M. A.; Steele, R. P.; Sundstrom, E. J.; Woodcock, H. L.; Zimmerman, P. M.; Zuev, D.; Albrecht, B.; Alguire, E.; Austin, B.; Beran, G. J. O.; Bernard, Y. A.; Berquist, E.; Brandhorst, K.; Bravaya, K. B.; Brown, S. T.; Casanova, D.; Chang, C. M.; Chen, Y. Q.; Chien, S. H.; Closser, K. D.; Crittenden, D. L.; Diedenhofen, M.; DiStasio, R. A.; Do, H.; Dutoi, A. D.; Edgar, R. G.; Fatehi, S.; Fusti-Molnar, L.; Ghysels, A.; Golubeva-Zadorozhnaya, A.; Gomes, J.; Hanson-Heine, M. W. D.; Harbach, P. H. P.; Hauser, A. W.; Hohenstein, E. G.; Holden, Z. C.; Jagau, T. C.; Ji, H. J.; Kaduk, B.; Khistyaev, K.; Kim, J.; Kim, J.; King, R. A.; Klunzinger, P.; Kosenkov, D.; Kowalczyk, T.; Krauter, C. M.; Lao, K. U.; Laurent, A. D.; Lawler, K. V.; Levchenko, S. V.; Lin, C. Y.; Liu, F.; Livshits, E.; Lochan, R. C.; Luenser, A.; Manohar, P.; Manzer, S. F.; Mao, S. P.; Mardirossian, N.; Marenich, A. V.; Maurer, S. A.; Mayhall, N. J.; Neuscamman, E.; Oana, C. M.; Olivares-Amaya, R.; O'Neill, D. P.; Parkhill, J. A.; Perrine, T. M.; Peverati, R.; Prociuk, A.; Rehn, D. R.; Rosta, E.; Russ, N. J.; Sharada, S. M.; Sharma, S.; Small, D. W.; Sodt, A.; Stein, T.; Stuck, D.; Su, Y. C.; Thom, A. J. W.; Tsuchimochi, T.; Vanovschi, V.; Vogt, L.; Vydrov, O.; Wang, T.; Watson, M. A.; Wenzel, J.; White, A.; Williams, C. F.; Yang, J.; Yeganeh, S.; Yost, S. R.; You, Z. Q.; Zhang, I. Y.; Zhang, X.; Zhao, Y.; Brooks, B. R.; Chan, G. K. L.; Chipman, D. M.; Cramer, C. J.; Goddard, W. A.; Gordon, M. S.; Hehre, W. J.; Klamt, A.; Schaefer, H. F.; Schmidt, M. W.; Sherrill, C. D.; Truhlar, D. G.; Warshel, A.; Xu, X.; Aspuru-Guzik, A.; Baer, R.; Bell, A. T.; Besley, N. A.; Chai, J. D.; Dreuw, A.; Dunietz, B. D.; Furlani, T. R.; Gwaltney, S. R.; Hsu, C. P.; Jung, Y. S.; Kong, J.; Lambrecht, D. S.; Liang, W. Z.; Ochsenfeld, C.; Rassolov, V. A.; Slipchenko, L. V.; Subotnik, J. E.; Van Voorhis, T.; Herbert, J. M.; Krylov, A. I.; Gill, P. M. W.; Head-Gordon, M., Advances in molecular quantum chemistry contained in the Q-Chem 4 program package. Molecular Physics 2015, 113, 184.
6. Werner, H. J.; Knowles, P. J.; Knizia, G.; Manby, F. R.; Schutz, M., Molpro: a general-purpose quantum chemistry program package. Wiley Interdisciplinary Reviews-Computational Molecular Science 2012, 2, 242-253.
7. Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W., NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications 2010, 181, 1477-1489.
8. Furche, F.; Ahlrichs, R.; Hattig, C.; Klopper, W.; Sierka, M.; Weigend, F., Turbomole. Wiley Interdisciplinary Reviews-Computational Molecular Science 2014, 4, 91-100.
9. Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S. J.; Windus, T. L.; Dupuis, M.; Montgomery, J. A., General atomic and molecular electronic-structure system. Journal of Computational Chemistry 1993, 14, 1347-1363.
10. The IUPAC International Chemical Identifier (InChI). http://www.iupac.org/home/publications/e-resources/inchi.html (accessed April 5, 2017).
11. The InChI Trust, InChI and InChIKeys for chemical structures. http://www.inchi-trust.org/ (accessed April 5, 2017).
12. Weininger, D., SMILES, a chemical language and information-system .1. introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 1988, 28, 31-36.
13. Daylight Chemical Information Systems. http://daylight.com (accessed April 5, 2017).
14. Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L. Y.; He, J. E.; He, S. Q.; Shoemaker, B. A.; Wang, J. Y.; Yu, B.; Zhang, J.; Bryant, S. H., PubChem Substance and Compound databases. Nucleic Acids Research 2016, 44, D1202-D1213.
15. Nakata, M., A large chemical database from the first principle calculations. AIP Conference Proceedings 2015, 1702, 090058.
16. Le, Q. V.; Ranzato, M.; Monga, R.; Devin, M.; Chen, K.; Corrado, G. S.; Dean, J.; Ng, A. Y., Building high-level features using large scale unsupervised learning. Proceedings of the 29th International Conference on Machine Learning. 2012, 507.
17. Krizhevsky, A.; Sutskever, I.; Hinton, G. E., Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 2012, 1097.
18. Seide, F.; Li, G.; Yu, D., Conversational speech transcription using context-dependent deep neural networks. Interspeech 2011, 437.
19. Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A., Big data meets quantum chemistry approximations: the delta-machine learning approach. Journal of Chemical Theory and Computation 2015, 11, 2087-2096.
20. Ramakrishnan, R.; Hartmann, M.; Tapavicza, E.; von Lilienfeld, O. A., Electronic spectra from TDDFT and machine learning in chemical space. Journal of Chemical Physics 2015, 143.
21. O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R., Open Babel: An open chemical toolbox. Journal of Cheminformatics 2011, 3, 33.
22. O'Boyle, N. M., Towards a universal SMILES representation - a standard method to generate canonical SMILES based on the InChI. Journal of Cheminformatics 2012, 4, 22.
23. PubChem project, PubChem Compound all structures. https://www.ncbi.nlm.nih.gov/pccompound (accessed April 5, 2017).
24. PubChem project, NLM PUBCHEM PROJECT DATA SUBMISSION POLICY (DSP). https://pubchem.ncbi.nlm.nih.gov/upload/html/dsp.html (accessed April 5, 2017).
25. Chemical Abstracts Service, CAS REGISTRY and CAS RN FAQs. http://www.cas.org/content/chemical-substances/faqs (accessed April 5, 2017).
26. PubChem project, PubChem FTP site. ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/ (accessed April 5, 2017).
27. Bolton, E. E.; Kim, S.; Bryant, S. H., PubChem3D: conformer generation. Journal of Cheminformatics 2011, 3, 1.
28. Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J. L., Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling 2012, 52, 2864-2875.
29. Open Babel online document. https://open-babel.readthedocs.io/en/latest/3DStructureGen/SingleConformer.html (accessed April 12).
30. Stewart, J. J. P., Optimization of parameters for semiempirical methods. I. method. Journal of Computational Chemistry 1989, 10, 209-220.
31. Stewart, J. J. P., Optimization of parameters for semiempirical methods. III. Extension of PM3 to Be, Mg, Zn, Ga, Ge, As, Se, Cd, In, Sn, Sb, Te, Hg, Tl, Pb, and Bi. Journal of Computational Chemistry 1991, 12, 320-341.
32. Becke, A. D., Density-functional thermochemistry. III. The role of exact exchange. Journal of Chemical Physics 1993, 98, 5648-5652.
33. Vosko, S. H.; Wilk, L.; Nusair, M., Accurate spin-dependent electron liquid correlation energies for local spin-density calculations - a critical analysis. Canadian Journal of Physics 1980, 58, 1200-1211.
34. Granovsky, A. A., Firefly. http://classic.chem.msu.su/gran/firefly/index.html (accessed April 5, 2017).
35. Ishimura, K., Scalable molecular analysis solver for high-performance computing systems (SMASH). http://smash-qc.sourceforge.net/ (accessed April 5, 2017).
36. Baker, J., Molecular structure and vibrational spectra. In Handbook of computational chemistry, Leszczynski, J., Kaczmarek-Kedziera, A., Puzyn, T., G. Papadopoulos, M., Reis, H., K. Shukla, M., Eds.; Springer: New York, 2012; pp 923-359.
37. Johnson, B. G.; Gill, P. M. W.; Pople, J. A., The performance of a family of density functional methods. Journal of Chemical Physics 1993, 98, 5612-5626.
38. Sousa, S. F.; Fernandes, P. A.; Ramos, M. J., General performance of density functionals. Journal of Physical Chemistry A 2007, 111, 10439-10452.
39. Riley, K. E.; Op't Holt, B. T.; Merz, K. M., Critical assessment of the performance of density functional methods for several atomic and molecular properties. Journal of Chemical Theory and Computation 2007, 3, 407-433.
40. Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A., Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 2014, 1, 140022.
41. Shimazaki, T.; Nakajima, T., Theoretical study of exciton dissociation through hot states at donor-acceptor interface in organic photocell. Physical Chemistry Chemical Physics 2015, 17, 12538-12544.
42. Shimazaki, T.; Nakajima, T., Theoretical study on the cooperative exciton dissociation process based on dimensional and hot charge-transfer state effects in an organic photocell. Journal of Chemical Physics 2016, 144, 234906.
43. Vanossi, D.; Cigarini, L.; Giaccherini, A.; da Como, E.; Fontanesi, C., An Integrated experimental/theoretical study of structurally related poly-thiophenes used in photovoltaic systems. Molecules 2016, 21, 110.
44. Shimazaki, T.; Nakajima, T., Application of the dielectric-dependent screened exchange potential approach to organic photocell materials. Physical Chemistry Chemical Physics 2016, 18, 27554-27563.
45. Landrum, G., RDkit: Open-source cheminformatics. http://www.rdkit.org/ (accessed April 5, 2017).
46. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E., Scikit-learn: machine learning in Python. Journal of Machine Learning Research 2011, 12, 2825-2830.
47. Bishop, C. M., Pattern recognition and machine learning. Springer-Verlag: New York, 2006.
48. NIST Chemistry WebBook, NIST Standard Reference Database Number 69. http://webbook.nist.gov/chemistry/ (accessed April 7, 2017).
49. Halgren, T. A., MMFF VI. MMFF94s option for energy minimization studies. Journal of Computational Chemistry 1999, 20, 720.
50. Hachmann, J.; Olivares-Amaya, R.; Atahan-Evrenk, S.; Amador-Bedolla, C.; Sanchez-Carrera, R. S.; Gold-Parker, A.; Vogt, L.; Brockway, A. M.; Aspuru-Guzik, A., The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the World community grid. Journal of Physical Chemistry Letters 2011, 2, 2241-2251.
51. Hachmann, J.; Olivares-Amaya, R.; Jinich, A.; Appleton, A. L.; Blood-Forsythe, M. A.; Seress, L. R.; Roman-Salgado, C.; Trepte, K.; Atahan-Evrenk, S.; Er, S.; Shrestha, S.; Mondal, R.; Sokolov, A.; Bao, Z. A.; Aspuru-Guzik, A., Lead candidates for high-performance organic photovoltaics from high-throughput quantum chemistry - the Harvard Clean Energy Project. Energy & Environmental Science 2014, 7, 698-704.
52. PubChem project, Data Sources. https://pubchem.ncbi.nlm.nih.gov/source/ (accessed April 5, 2017).
53. Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Kruger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. P., The ChEMBL bioactivity database: an update. Nucleic Acids Res. 2014, 42, D1083-D1090.
54. PubChem project, Data Sources (ChEMBL). https://pubchem.ncbi.nlm.nih.gov/source/ChEMBL (accessed April 5, 2017).
55. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G., ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling 2012, 52, 1757-1768.
56. PubChem project, Data Sources (ZINC). https://pubchem.ncbi.nlm.nih.gov/source/ZINC (accessed April 5, 2017).
57. Pence, H. E.; Williams, A., ChemSpider: an online chemical information resource. Journal of Chemical Education 2010, 87, 1123-1124.
58. PubChem project, Data Sources (ChemSpider). https://pubchem.ncbi.nlm.nih.gov/source/ChemSpider (accessed April 5, 2017).
59. Kanehisa, M.; Goto, S., KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 2000, 28, 27-30.
60. PubChem project, Data Sources (Aurora Fine Chemicals LLC). https://pubchem.ncbi.nlm.nih.gov/source/11831 (accessed April 5, 2017).
61. Iikura, H.; Tsuneda, T.; Yanai, T.; Hirao, K., A long-range correction scheme for generalized-gradient-approximation exchange functionals. Journal of Chemical Physics 2001, 115, 3540-3544.
