Article pubs.acs.org/jcim

PubChemQC Project: A Large-Scale First-Principles Electronic Structure Database for Data-Driven Chemistry

Maho Nakata*,† and Tomomi Shimazaki‡

† Advanced Center for Computing and Communication, RIKEN, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
‡ Advanced Institute for Computational Science, RIKEN, 7-1-26 Minatojima-minami-machi, Chuo-ku, Kobe, Hyogo 650-0047, Japan



ABSTRACT: Large-scale molecular databases play an essential role in the investigation of various subjects such as the development of organic materials, in silico drug design, and data-driven studies with machine learning. We have developed a large-scale quantum chemistry database based on first-principles methods. Our database currently contains the ground-state electronic structures of 3 million molecules based on density functional theory (DFT) at the B3LYP/6-31G* level, and we subsequently calculated 10 low-lying excited states of over 2 million molecules via time-dependent DFT with the B3LYP functional and the 6-31+G* basis set. To select the molecules calculated in our project, we referred to the PubChem Project, which was used as the source of the molecular structures in short strings using the InChI and SMILES representations. Accordingly, we have named our quantum chemistry database project “PubChemQC” (http://pubchemqc.riken.jp/) and placed it in the public domain. In this paper, we show the fundamental features of the PubChemQC database and discuss the techniques used to construct the data set for large-scale quantum chemistry calculations. We also present a machine learning approach to predict the electronic structure of molecules as an example to demonstrate the suitability of the large-scale quantum chemistry database.

1. INTRODUCTION

The design, discovery, and fabrication of new chemical compounds are essential to solve various issues such as environmental pollution, global warming, and CO2 to O2 conversion. Building basic knowledge on these chemical compounds has become an important task to tackle these issues. Database systems are constructed to store and reuse the collected data. To gather molecular properties and construct chemical databases, experimental measurements are often employed. However, the experimental approaches are sometimes very expensive. In some cases, it is difficult or impossible to measure the molecular properties of some compounds. Therefore, although large-scale molecular databases are badly needed, they are not easy to construct experimentally. In contrast, first-principles quantum chemistry calculations have become very accurate and economical. Back in 1929, Dirac stated that “the fundamental laws necessary for the mathematical treatment of a large part of physics and the whole of chemistry are thus completely known...”.1 Subsequently, the quantum chemistry approach has enabled us to obtain and predict various molecular properties with (sometimes) better results than experimental approaches.2,3 In addition, there are many good implementations for quantum chemistry calculations, such as Gaussian,4 QChem,5 MolPro,6 NWChem,7 Turbomole,8 and GAMESS.9 These developments suggest that first-principles quantum chemistry methods can be applied to the construction of large-scale molecular databases. Thus, we started constructing a database by storing the results of quantum chemistry calculations. To develop the database, we only used the IUPAC International Chemical Identifier (InChI)10,11 and Simplified Molecular Input Line Entry Specification (SMILES)12,13 representations of molecules. We obtained the InChI and SMILES representations from the PubChem project,14 and therefore, we have named our project “PubChemQC”.15 Large-scale databases have many useful applications, such as in virtual screening or as expert systems to find molecules with desired properties or perform fast molecular property estimations. In addition, large data sets play essential roles in machine- and deep-learning methods.16−18 In chemistry, data-driven studies and developments based on the machine learning approach have become imperative.19,20 In such data-driven chemical studies, the size of the data set is a critical factor. At present, the PubChemQC project provides ca. 3 million molecular structures optimized by density functional theory (DFT). We have also calculated the excited states for over 2 million molecules using time-dependent DFT (TDDFT). The number of molecules stored in the PubChemQC database is currently increasing. Thus, the PubChemQC database can potentially become one of the essential tools to deal with various chemical problems, especially in the data-driven chemistry area. A previous paper gave only a brief introduction to our project.15 We did not

Received: February 16, 2017
Published: May 8, 2017
© 2017 American Chemical Society

DOI: 10.1021/acs.jcim.7b00083 J. Chem. Inf. Model. 2017, 57, 1300−1308


Journal of Chemical Information and Modeling

It should be noted that “C(O)C” also represents ethanol, but this representation is not canonical. L-Ascorbic acid in PubChem (CID 54670067) is represented as follows:

present details of the calculation methodologies used, and the analysis of the database and the machine learning calculations were not discussed. In contrast, this paper discusses the fundamental characteristics of the PubChemQC database. We also describe in detail the techniques used to construct the large-scale database using the first-principles quantum chemistry approach, which will be helpful in preparing the data sets used for data-driven chemical studies. In section 2, we briefly describe two molecular encoding systems, namely, the InChI and SMILES representations, and then we describe the PubChem database along with the molecules selected and calculated in our project. We also discuss the techniques used to calculate the electronic structures of these molecules using the first-principles quantum chemistry approach. In section 3, we analyze the database to demonstrate its characteristics and present a machine learning approach used to predict molecular electronic structures. In sections 4 and 5, we discuss the results and offer concluding remarks, respectively.

C(C(C1C(=C(C(=O)O1)O)O)O)O

It has an isomeric representation as follows:

C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O

Upon standardization by Open Babel, we have the following description:

OC[C@@H]([C@H]1OC(=O)C(=C1O)O)O

It should be noted that there is no consensus on the representation of organometallics. A clear example is provided by ferrocene molecules (e.g., CID 7611 and CID 10219726). These molecules can be represented by the following two canonical SMILES representations, although it is difficult to distinguish them by the current algorithm:

[CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]

2. THE PUBCHEMQC PROJECT

A. The PubChem Project as a Source of Molecular InChI and SMILES Representations. First, we briefly describe the molecular encoding systems. Since our project needs to handle a huge number of molecules, the use of machine-readable notation is essential to develop a large-scale database. The InChI representation is a nonproprietary identifier of chemical substances for electronic data sources.10,11 The InChI representation can be standardized (i.e., canonicalized) while being unique (with some exceptions) and human-readable (with effort). Here we demonstrate some examples of molecules represented by InChI. Each compound in the PubChem Compound database has a unique nonzero integer number.14 Hereafter, we denote this number as the CID. Thus, ethanol, whose CID is 702, is represented as

[CH]1[CH][CH][CH][CH]1.[CH]1[CH][CH][CH][CH]1.[Fe]

We note that these problems are found in the InChI representation as well. Quantum chemistry programs require the three-dimensional (3D) coordinates of a molecule in the input file. Therefore, we generated the 3D coordinates of each molecule from the InChI or SMILES representation. In addition, the reverse transformation (i.e., from the 3D coordinates to the InChI or SMILES representation) was used to analyze and validate the first-principles calculations. Thus, our project required a one-to-one mapping between the machine-readable representation and the 3D coordinates for each molecule. Both the InChI and SMILES representations are suitable for this purpose. The two representations behave very similarly in this mapping, although in our experience the SMILES representation provides more stable results in the reverse transformation. To build the first-principles electronic structure database, we employed the PubChem project as the source of the InChI and SMILES molecular representations.14 The PubChem database is the largest public database for InChI and SMILES representations, and it is maintained by the National Institutes of Health (NIH) through the NIH Molecular Libraries Roadmap initiative. There are three subprojects in the PubChem database, namely, PubChem Substance, PubChem BioAssay, and PubChem Compound. PubChem Substance collects all of the submitted data from the science and engineering communities, whereas PubChem Compound contains pure, standardized, and nonduplicated compounds obtained by organizing the data of the PubChem Substance project. PubChem Compound contains approximately 92 000 000 molecules,23 and the number is updated daily. Importantly, the PubChem project is in the public domain (i.e., the PubChem-generated information and the participant-provided information are available to the public without cost and without restriction24). The license terms are critical for the advancement of our project.
Thus, we have employed the PubChem Compound database as the source of molecules in our project. Other databases may be used to create large-scale quantum chemistry databases. For example, the Chemical

InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3

whereas L-ascorbic acid (CID 54670067) is described as

InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1

The InChI representation has three sublayers: main, charge, and stereochemical layers. The main layer contains the chemical formula, atom connections, and hydrogen atoms. The charge layer is used to describe the charges and protons of the molecule. The stereochemical layer contains information on double or multiple bonds and tetrahedral stereochemistry, among other aspects. From the quantum chemistry viewpoint, the main layer is the most important sublayer since the charge and stereochemical layers may contain ambiguities for quantum chemistry calculations. The SMILES representation, which is used for almost the same purpose as InChI, was developed by Weininger12 and Daylight Chemical Information Systems.13 It is easier for humans to read and more popular than the InChI representation. However, there are many variants, extensions, and canonicalization methods. O’Boyle proposed a canonicalization of SMILES and implemented it in Open Babel.21,22 For example, ethanol is represented in the canonical SMILES representation as follows:22

CCO
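As a rough illustration of the InChI layer structure described above, the following sketch splits a standard InChI string on "/" and collects the formula, connection, and hydrogen parts as the main layer. The helper names `split_inchi` and `main_layer` are hypothetical and are not part of any InChI library; the sketch also assumes that each layer prefix letter occurs only once, which holds for the two examples in the text but not for every InChI.

```python
def split_inchi(inchi: str) -> dict:
    """Split a standard InChI string into its version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0], "formula": parts[1] if len(parts) > 1 else ""}
    for part in parts[2:]:
        # The first character names the layer: 'c' connections, 'h' hydrogens,
        # 't', 'm', 's' stereochemistry, and so on.
        layers[part[0]] = part[1:]
    return layers


def main_layer(inchi: str) -> tuple:
    """Return (formula, connections, hydrogens), ignoring charge and stereo layers."""
    layers = split_inchi(inchi)
    return (layers["formula"], layers.get("c", ""), layers.get("h", ""))


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ascorbic = ("InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5"
            "/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1")
# The same string with the stereochemical layers removed (for comparison only).
ascorbic_no_stereo = ascorbic.split("/t")[0]

print(main_layer(ethanol))
print(main_layer(ascorbic) == main_layer(ascorbic_no_stereo))
```

Comparing only the main layer in this way treats stereoisomers as equivalent, which is consistent with the statement that the main layer is the most important one from the quantum chemistry viewpoint.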


Abstracts Service database contains more than 124 million organic and inorganic substances collected from the published scientific literature.25 However, it is a proprietary database, and therefore, its secondary use is difficult. In the next section, we explain the workflow followed to create our database.

B. Workflow To Create Calculation Data for the PubChemQC Database. The workflow followed to construct the PubChemQC database is shown in Figure 1. We acquired

Figure 2. Molecular weight distribution of the PubChem Compounds.

database was created by humans and not by computer algorithms. Databases constructed by humans can provide such a linear-scale distribution when molecules are sorted by the molecular weight. In contrast, the GDB-17 database, which was created by a combinatorial algorithm, has an astronomical distribution behavior.28 Some compounds were not suitable for our database, such as mixed substances like ionic salts (e.g., Ni²⁺SO₄²⁻, CID 24586), HCl salts (e.g., CID 56825941), water mixtures (e.g., CID 21932805), and other mixtures (e.g., CID 67855675). Unfortunately, molecules containing η5 bonds (e.g., ferrocene, CIDs 7611, 504306, and 11985121) were not calculated because of issues with the InChI representation. We removed the molecules whose SMILES expressions contained a period (“.”). In addition, we performed calculations only on molecules containing H, He, Li, Be, B, C, N, O, F, Ne, Na, Mg, Al, Si, P, S, Cl, Ar, K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, and Zn, because of the limitations of the 6-31G* basis set. Isotopes were also ignored (e.g., CID 167583 and CID 783 gave the same results). In contrast, isomers are separately registered in the PubChem database. For example, CID 5426 for thalidomide, CID 75792 for (R)-thalidomide, and CID 92142 for (S)-thalidomide appear in the database; CID 5426 is the racemic modification of CID 75792 and CID 92142, and we had to randomly specify one of the enantiomers when calculating CID 5426. We assumed that all of the molecules were neutral (i.e., we did not distinguish between CID 28179 and CID 5360525), and the number of electrons was estimated from the nuclear charges.
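The selection rules in this paragraph (no period in the SMILES string, only elements from H to Zn, neutral molecules whose electron count follows from the nuclear charges) can be sketched as below. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, and the sketch assumes that a molecular formula string (such as the formula part of the InChI) is available for each molecule.

```python
import re

# H through Zn are exactly the elements with atomic numbers 1..30, in this order.
ALLOWED = ("H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar "
           "K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn").split()
ATOMIC_NUMBER = {symbol: z for z, symbol in enumerate(ALLOWED, start=1)}


def parse_formula(formula: str) -> dict:
    """Count atoms per element in a simple formula such as 'C6H8O6'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def is_candidate(smiles: str, formula: str) -> bool:
    """Reject multi-component SMILES and elements outside H..Zn."""
    if "." in smiles:  # salts, mixtures, and the ferrocene-style encodings
        return False
    return all(symbol in ATOMIC_NUMBER for symbol in parse_formula(formula))


def electron_count(formula: str) -> int:
    """Electrons of the neutral molecule: sum of nuclear charges."""
    return sum(ATOMIC_NUMBER[s] * n for s, n in parse_formula(formula).items())


print(is_candidate("CCO", "C2H6O"), electron_count("C2H6O"))
print(is_candidate("[CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]", "C10H10Fe"))
```

Because the molecule is assumed neutral, the electron count depends only on the formula, so differently charged PubChem entries with the same formula map to the same calculation.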
From the InChI representation, we generated the initial (guessed) 3D molecular structures that were used for quantum chemistry calculations by using Open Babel with the “-gen3d -addH” options as follows:21,22,29 (i) the initial 3D structure was generated using rules and ring templates; (ii) 250 steps of steepest-descent geometry optimization with the MMFF94 force field were carried out; (iii) 200 iterations of weighted rotor conformational search were carried out (optimizing each conformer with 25 steps of steepest-descent); and (iv) 250 steps of conjugate-gradient geometry optimization were carried out. Thus, the initial 3D molecular geometry obtained was a fairly good starting point for the subsequent quantum chemistry calculations. The first geometry optimization based on the quantum chemistry method was performed using the PM3 method.30,31 The resultant geometry was further optimized by the Hartree−Fock method using the STO-6G basis set. Next, we optimized the geometry using the B3LYP

Figure 1. Workflow to create calculation data stored in the PubChemQC database.

the molecular information as structure data files (SDFs) from the FTP site of the PubChem project26 in mid-July 2014. Approximately 3000 SDFs were obtained, each one containing approximately 25 000 molecules. It is worth noting that some CIDs were not available because of deprecated molecular data in the PubChem project. Each SDF from the PubChem Compound project contained the 3D structures without hydrogen atoms generated by the PubChem3D project,27 the IUPAC names, the InChI and SMILES representations,12,13 and the molecular weights, among other parameters. However, our project used only the CID and the InChI and isomeric SMILES representations. The molecular weight was used only to sort molecules. No other information was employed to create our database. We created a file containing the CIDs, the InChI representations, and the molecular weights from the SDFs, and then we sorted the molecules in the file by molecular weight in ascending order. In our project, the first-principles molecular calculations were executed in this sorted order. Thus, our calculations started with a hydrogen atom and finished with a very heavy molecule. In general, lighter molecules can be calculated much faster and more easily than heavier molecules. The molecular weight distribution in the PubChem database is shown in Figure 2, where the horizontal axis represents the molecular weight and the vertical axis represents the accumulated number of molecules. Interestingly, the sorted data did not show a simple log-scale distribution. In the low-molecular-weight region, the accumulated number of molecules increased exponentially. In contrast, linear behavior was obtained in the ca. 200−600 Da region, and saturation behavior was observed at higher molecular weights. The molecules stored in the PubChem database were gathered from the scientific and engineering communities. In other words, the


(VWN3) functional32,33 with the 6-31G* basis set. The B3LYP optimization process actually comprised three steps. In the first step, we roughly optimized the molecular geometry using Firefly34 or SMASH.35 Next, more accurate geometry optimizations were executed by GAMESS. This two-step optimization was used to reduce the calculation time: the Firefly and SMASH calculations are substantially faster but slightly less accurate, so the rough first step reduces the number of iterations needed in the more accurate second step. Finally, we executed the optimization calculation again to validate that the molecule was really optimized, by checking whether the same structure was obtained in the input and output. As this was used for validation only, it had no effect on the final molecular geometry. The B3LYP geometry optimization processes usually provide fairly good geometries (typically bond angles within a few degrees and bond lengths within 0.02 Å).36−39 Subsequent excited-state calculations were executed by the TDDFT method with the B3LYP functional and the 6-31+G* basis set, employing the geometry obtained from the B3LYP optimization step. The input files and the final results were uploaded weekly at http://pubchemqc.riken.jp/. All of the calculations were executed on the RICC supercomputer (Intel Xeon 5570 2.93 GHz, 1024 nodes) and the QUEST supercomputer (Intel Core2 L7400 1.50 GHz, 700 nodes) at the RIKEN Advanced Center for Computing and Communication. We also employed the HOKUSAI supercomputer (Fujitsu PRIMEHPC FX100) at RIKEN, whose peak performance was approximately 1 PFLOPS. In addition, the Oakleaf-FX supercomputer (Fujitsu PRIMEHPC FX10, SPARC64 IX 1.848 GHz) at the University of Tokyo was used. With the above computational resources, calculations on several thousand molecules per day were possible.
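The staged geometry refinement described in this section (PM3, then HF/STO-6G, then a rough and an accurate B3LYP/6-31G* optimization, followed by a validation rerun whose output is only compared, not kept) might be organized roughly as follows. This is a sketch under stated assumptions: `optimize` is a hypothetical callable standing in for a call to an external quantum chemistry program, `toy_optimizer` merely relaxes coordinates toward a fixed point so that the example runs, and the tolerance `tol` is an assumed value that the paper does not report.

```python
from typing import Callable, List, Sequence, Tuple

Geometry = List[Tuple[float, float, float]]
Optimizer = Callable[[str, Geometry], Geometry]

# Levels of theory in the order given in the text.
STAGES = ["PM3", "HF/STO-6G", "B3LYP/6-31G* (rough)", "B3LYP/6-31G* (accurate)"]


def max_displacement(a: Geometry, b: Geometry) -> float:
    """Largest absolute change in any Cartesian coordinate between two geometries."""
    return max(abs(p - q) for atom_a, atom_b in zip(a, b) for p, q in zip(atom_a, atom_b))


def optimize_staged(geometry: Geometry, optimize: Optimizer,
                    stages: Sequence[str] = STAGES, tol: float = 1e-3):
    """Run each stage in turn, then rerun the last stage purely as a check."""
    for level in stages:
        geometry = optimize(level, geometry)
    check = optimize(stages[-1], geometry)  # validation only; result is discarded
    return geometry, max_displacement(geometry, check) < tol


def toy_optimizer(level: str, geometry: Geometry) -> Geometry:
    """Stand-in optimizer: moves every coordinate 90% of the way toward 1.0."""
    return [tuple(c + 0.9 * (1.0 - c) for c in atom) for atom in geometry]


final_geometry, validated = optimize_staged([(0.0, 0.0, 0.0)], toy_optimizer)
print(final_geometry, validated)
```

Starting each stage from the previous stage's output is what lets the cheap methods absorb most of the geometry change before the expensive B3LYP steps run.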

molecular geometries and orbital energies as array-type values into the servers, although those are not considered in the analysis of this paper. We will report an analysis of the array-type properties elsewhere. First, we present a histogram of the dipole moments of 2 819 910 molecules validated by the InChI-based technique (Figure 3). Here the horizontal axis refers to the dipole

Figure 3. Histogram of molecular dipole moments stored in the PubChemQC database.

moment and the vertical axis to the number of molecules. The molecular dipole moment is an important indicator to predict solubility in water or organic solvents. In the PubChemQC database, we confirmed that ca. 10 000 molecules were not polarized. In contrast, several molecules showed dipole moments of 1−3 D. The number of molecules with higher dipole moments decreased exponentially, and few molecules in the PubChemQC database showed dipole moments higher than 10.0 D. Next, we examined the energy difference (gap) between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) and the excitation energy calculation results. The HOMO−LUMO gap is a fundamental parameter for the design of organic electronic devices such as electroluminescent displays, image sensors, and photocells. The excitation energy is especially important for photon absorption phenomena.41,42 Figure 4 summarizes the HOMO−LUMO gap histogram for 2 248 895 molecules (blue series). The red histogram refers to the excitation energy. We

3. ANALYSIS OF THE PUBCHEMQC DATABASE AND MACHINE LEARNING RESULTS

In this section, we aggregate the calculation results to analyze the features of the PubChemQC database. As shown in Figure 1, we needed several steps to create the calculation data from the InChI representations. Therefore, the data may contain some errors and failures, even in cases where the first-principles calculations were successfully finished. In some cases, these errors were produced by bugs in our process, whereas in other cases, they might have originated from errors in the PubChem database. To get rid of such inappropriate calculations, we generated the InChI representation of each molecule from its optimized 3D molecular structure and compared it with the original InChI representation stored in the PubChem database. Only those molecules providing the same InChI representation for the main layer were subsequently employed to analyze the PubChemQC database. This InChI-based validation technique was similar to that used in the literature.40 In this work, we employed several relational database servers to organize the calculated data; the PostgreSQL program package was adopted for the relational database management system. We parsed the output files generated from GAMESS and registered extracted values such as molecular weights, HOMO−LUMO gaps, and excitation energies into the servers. Here the CID was used as the primary key in the relations. The molecular weight, HOMO−LUMO gap, or excitation energy can be treated as a (float8-type) scalar value in the ordinary relations. We can obtain and search molecular properties by issuing SQL queries to the relational database servers. PostgreSQL can also handle array-type data. We inserted

Figure 4. Histograms of the HOMO−LUMO gaps (blue) and the excitation energies (red) stored in the PubChemQC database.


easily confirmed that the HOMO−LUMO gaps were more widely distributed than the excitation energies and that the center of gravity for the HOMO−LUMO gap was higher than that of the excitation energy. The HOMO−LUMO gap was obtained as the energy difference between free quasi-particles (i.e., electron and hole).43,44 Conversely, the excitation energy corresponds to the optical gap, which is typically observed by photon absorption experiments. Thus, the optical gap becomes lower than the HOMO−LUMO gap because of the strong binding energy between holes and electrons. In other words, we can evaluate the binding energy by subtracting the excitation energy from the HOMO−LUMO gap.43 Figure 5 shows the

of user-specified length, folding down to obtain a particular density of set bits. The “Topological” fingerprint is therefore created from information on the bond connections in the molecule. The machine learning approaches used in this study were executed using the scikit-learn library.46 Here the fingerprints (feature vectors) can be generated only from the SMILES representation. Therefore, we can predict molecular electronic structures exclusively from the SMILES representation. The HOMO−LUMO gap was employed as the target property. The support vector machine (SVM) and ridge regression algorithms were employed using the kernel technique for the machine-learning-based prediction.47 The Gaussian radial basis function (RBF) was used for the kernel.47 We also examined the use of second- and third-order polynomials for the kernel technique.47 We selected a data set containing 1 million molecules from the PubChemQC database. Those molecules were randomly chosen, although the HOMO−LUMO gaps were uniformly distributed between 4.5 and 6.5 eV in the data set. We trained the machines (predictors) using 20 000 molecules that were randomly chosen from the data set. Then we calculated the HOMO− LUMO gaps for the rest of the molecules using the trained machines (predictors). The root-mean-square errors (RMSEs) between the predictions and the exact HOMO−LUMO gaps are summarized in Table 1. When the RBF kernel was

Figure 5. Relation between the HOMO−LUMO gap and the excitation energy.
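The binding-energy estimate described in the text, in which the excitation energy is subtracted from the HOMO−LUMO gap, can be written compactly as follows. The symbols $E_{\mathrm{b}}$, $E_{\mathrm{gap}}$, and $E_{\mathrm{ex}}$ are notation introduced here for illustration and do not appear in the original.

```latex
E_{\mathrm{b}} \approx E_{\mathrm{gap}} - E_{\mathrm{ex}},
\qquad
E_{\mathrm{gap}} = \varepsilon_{\mathrm{LUMO}} - \varepsilon_{\mathrm{HOMO}}
```

For a hypothetical molecule with $E_{\mathrm{gap}} = 5.5$ eV and $E_{\mathrm{ex}} = 4.2$ eV (values chosen only for illustration, not taken from the database), this gives $E_{\mathrm{b}} \approx 1.3$ eV.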

Table 1. HOMO−LUMO Gap Predictions Based on the Machine Learning Approach

relation between the HOMO−LUMO gap and the binding energy. The exciton binding energy is an important parameter for the development of materials used in organic photocell devices.43,44 The binding energies of different organic materials can easily be estimated by a database search. Here, the horizontal and vertical axes indicate the HOMO−LUMO gap and the excitation energy, respectively. The small black points represent the calculated results, and a huge number of data points form the thick band observed in Figure 5. The difference between the HOMO−LUMO gap and the excitation energy increases with the HOMO−LUMO gap. Molecules with large HOMO−LUMO gaps tend to present strong electron−hole binding interactions. We can instantly obtain molecular information such as the dipole moment and excitation energy by searching the PubChemQC database provided that the first-principles calculation for a target molecule is already finished and stored. However, if a target molecule does not exist in the database, we cannot obtain any information from the query. To overcome this problem, we considered a machine learning approach to predict molecular electronic structures from already calculated data without time- and resource-consuming quantum chemistry calculations. To train machines (predictors), we employed “Topological” fingerprints with 1024 bits as feature molecular vectors using the RDKit library.45 A molecular fingerprint is a series of binary bits representing the characteristics of a molecule in a form that is suitable for processing by computers. Similarities and differences among molecules can be calculated by comparing these fingerprints. A number of algorithms have been proposed for generating molecular fingerprints. The Topological fingerprint is generated in a similar way to the Daylight algorithm,13 in which the substructures of the molecule are identified and hashed. These substructures are by default seven bonds long and are represented in a bit stream

method              kernel                     RMSE [eV]
SVM regression      RBF                        0.36
                    second-order polynomial    0.39
                    third-order polynomial     0.43
ridge regression    RBF                        0.37
                    second-order polynomial    0.38
                    third-order polynomial     0.36
                    fourth-order polynomial    0.48

employed, the SVM and ridge regression methods provided RMSE values of 0.36 and 0.37 eV, respectively. In the SVM method, some polynomial kernels provided slightly worse predictions compared with the RBF kernel. For example, the RMSEs for the second- and third-order polynomials were 0.39 and 0.43 eV, respectively. Conversely, when ridge regression was employed, the second-order polynomial kernel yielded an RMSE of 0.36 eV, which was similar to the result obtained using the RBF kernel. As a baseline, because the HOMO−LUMO gaps in the data set were uniformly distributed, we also used the average value (5.5 eV) as the prediction for every molecule. The use of the average value gave an RMSE of 0.58 eV. From these calculations, we can confirm that the machine learning approach can roughly predict the DFT results, even if only the SMILES representation is provided. The SMILES representation contains the bonding information between the atoms in the molecule, and this information may give some hints for the estimations. The predictive ability will be enhanced by improving the algorithms and feature vectors. We are currently investigating prediction of the HOMO, LUMO, and excitation energies. More advanced studies based on the machine learning approach are in progress, and the results will be reported in the near future.
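The 0.58 eV baseline quoted above is what one would expect when the mean is used as the prediction for values spread uniformly over a 2 eV window, since the standard deviation of a uniform distribution on [a, b] is (b − a)/√12. The short check below assumes an idealized, evenly spaced set of gaps between 4.5 and 6.5 eV as a deterministic stand-in for the real data set; the function names are illustrative.

```python
import math


def uniform_baseline_rmse(low: float, high: float) -> float:
    """RMSE of predicting the mean for values uniform on [low, high]."""
    return (high - low) / math.sqrt(12.0)


def rmse(predictions, targets) -> float:
    """Root-mean-square error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Evenly spaced gaps between 4.5 and 6.5 eV (a stand-in, not PubChemQC data).
n = 10001
gaps = [4.5 + 2.0 * i / (n - 1) for i in range(n)]
mean_gap = sum(gaps) / n
baseline = rmse([mean_gap] * n, gaps)

print(round(mean_gap, 2), round(baseline, 2), round(uniform_baseline_rmse(4.5, 6.5), 2))
```

Against this baseline, the reported RMSE values of 0.36 to 0.48 eV in Table 1 indicate how much of the spread the fingerprint-based models capture.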


4. DISCUSSION

Here, we discuss other database projects using computer simulation techniques. The NIST Chemistry WebBook contains the IR spectra of over 16 000 molecules and the UV spectra and quantum-chemical calculations for 1600 molecules.48 PubChem3D27 lists optimized geometries of molecules containing H, C, N, O, F, Si, P, S, Cl, Br, and I with fewer than 50 hydrogens obtained using the MMFF94s empirical method.49 In our project, the molecular structures were optimized using DFT, which usually provides more reliable results compared with empirical classical calculations. The Harvard Clean Energy Project was intended to store molecules using the combinatorial approach and quantum-chemical calculations.50,51 This project provided 2.3 million candidates for organic electronic materials. The largest database using the combinatorial approach is the GDB-17 database.28 It enumerates all possible molecules containing C, N, O, S, Cl, and H with up to 17 atoms by the SMILES notation. Actually, there are 166 billion molecules in this database! Ramakrishnan et al.40 performed calculations on 134 000 molecules carefully chosen from the GDB-17 database, and the molecular electronic structures of these molecules were calculated using first-principles methods. At nearly the same time, we launched the PubChemQC project (http://pubchemqc.riken.jp/).15 At that time, our project already provided data on ca. 1.5 million molecules. At present, our database provides molecular excited states as well. However, the most important difference between the two projects lies in the source of the molecules. The molecules stored in the GDB-17 database are artificially created by the combinatorial algorithm. In the GDB-17 database, a huge number of molecules are included, although many of them might be less important from the chemical viewpoint.
For example, in Ramakrishnan’s work, molecules containing up to nine heavy atoms excluding S, Br, Cl, or I were selected.40 Such a loose filter removed almost all of the molecules from the GDB-17 database, and only 134 000 molecules remained. In contrast, we chose the molecules to be calculated from the PubChem project. The molecules stored in the PubChem database were actually synthesized by the science and engineering communities and chemical vendors, among others.52 In addition, PubChem obtained molecules from other important databases such as ChEMBL53 (imported 1 686 695 live substances54), ZINC55 (imported 25 758 525 live substances56), ChemSpider57 (imported 14 642 781 live substances58), KEGG59 (39 051 live substances), and Aurora Fine Chemicals LLC (34 304 433 live substances60). Thus, the PubChem database contains a very wide variety of molecules. Therefore, we believe that our project includes essential molecules for chemistry and material science. We started this project in 2013 and presented 13 000 molecules on 1/15/2014, and this number increased to 25 000 molecules on 7/2/2014. At that time, we calculated and presented only ground-state optimized geometries and uploaded them at http://pubchemqc.sourceforge.net/. However, that site was shut down for exceeding the bandwidth on 2/24/2014. We moved our site to http://pubchemqc.riken.jp/ on 3/30/2014. We started the excited-state calculations using TDDFT on 4/24/2014. On 5/20/2014, we sorted the SDFs by molecular weight and restarted the whole set of calculations with the previously calculated 116 869 molecules. On 7/29/2014, we presented 155 792 molecules containing ground-state geometry optimizations and 55 456 molecules with excited states. On 11/11/2014, we presented over 1 million molecules (1 001 704 ground-state and 1 001 133 excited-state calculations). On 5/27/2015, we presented over 2 million molecules (2 016 173 ground-state and 2 016 173 excited-state calculations). Thus, the number of molecules in the PubChemQC database has been steadily increasing, although we have faced some problems. These results show that the techniques discussed in this paper are effective in constructing the electronic structure database of several million organic molecules. However, we noticed that our present approach may not be applicable to several tens of millions of molecules because of time-consuming first-principles calculations and huge data set handling, among other aspects. To overcome these obstacles, we intend to adopt large-scale parallel supercomputers such as the K-computer and parallel database systems. Such studies will be discussed in subsequent papers.

5. CONCLUDING REMARKS

We have been developing a large-scale molecular electronic structure database based on first-principles quantum chemistry methods. Unlike other projects, our project did not employ machine- and algorithm-generated molecular chemical structures. Instead, in this project we targeted molecules listed in the PubChem Compound database. Therefore, we have named our database PubChemQC, and it covers a wide variety of molecular frameworks. The 3D molecular structures obtained by the first-principles methods were prepared only from the InChI and SMILES representations. The B3LYP functional was used with the 6-31G* and 6-31+G* basis sets for the ground-state optimizations and the excited-state TDDFT calculations, respectively. Thus, we did not use any experimental data to construct the database. We have analyzed the molecular dipole moments and the relation between the HOMO−LUMO gaps and the excitation energies. This paper has also demonstrated that a machine learning approach is useful to roughly predict the electronic structures of molecules, even if only the SMILES representation is provided.

Currently, ca. 3 million optimized molecular structures and ground-state wave functions are stored in the PubChemQC database. We also provide the excited-state electronic structures of over 2 million molecules. The number of molecules stored in the database is still increasing, and new data are added weekly at our site (http://pubchemqc.riken.jp/). These large-scale molecular databases will be useful for various chemical studies such as search and data mining for materials and drugs, training machine learning models with large data sets, and data-driven chemical studies. Indeed, there is strong demand for large-scale databases in those applications.

This project employed the B3LYP method because it can produce reasonable calculation results for a wide range of molecules using limited computer resources. It also allows a balance to be struck between calculation time and accuracy.36,38,39 B3LYP can be applied to a wide range of problems,38 which makes it easier for researchers who are not specialists in quantum chemistry to identify features of the calculation results stored in our database. However, some molecules demand more accurate methods. For example, it is well-known that B3LYP performs poorly when describing Rydberg and charge-transfer states, and a long-range term is often required to correct the results.61 More accurate and sophisticated methods have been proposed in quantum chemistry, such as Møller−Plesset perturbation theory and the coupled cluster methods. Other databases based on more accurate quantum chemistry methods may be required in the future, although those are beyond the scope of this study.

We did not attempt to provide globally optimized conformations of all molecules registered in the database because of the enormous computational costs. Instead, the project provides reasonable (local) molecular geometries. The goal was to roughly capture the electronic structures of as many molecules as possible; therefore, we did not consider all aspects of the molecules. More exact treatment of the changes in molecular properties based on molecular conformations may be supplied by other database projects. In practice, it is not possible for a single database to satisfy all requests across the full range of scientific and engineering fields. More specific databases may be required to address more detailed problems. However, the calculation data stored in the PubChemQC project can provide a foundation for the construction of such problem-specific databases. The large-scale database construction techniques discussed in this paper can also support the development of such databases.

The PubChemQC database is one of the largest existing databases for first-principles quantum chemistry calculations. However, the number of molecules listed in our database may be insufficient. The data size is one of the critical factors determining the usefulness of a database. We are therefore considering employing massively parallel supercomputing to accelerate the implementation of our project. The focus of this paper is on neutral molecules. In the near future, we will include charged molecules, which are especially important in biological applications. In addition, we have been trying to add other molecular properties such as the vibrational structure, NMR chemical shifts, optimized excited-state structures, and solvent effects.
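The HOMO−LUMO gap referred to in these remarks can be obtained from a list of orbital energies with a few lines of code. The following Python sketch assumes a closed-shell (restricted) calculation in which each of the lowest n/2 orbitals holds two electrons, and orbital energies given in hartree. The function name `homo_lumo_gap_ev` is hypothetical and is not part of any PubChemQC tool.

```python
# Conversion factor from hartree to electronvolt (CODATA value, rounded).
HARTREE_TO_EV = 27.211386


def homo_lumo_gap_ev(orbital_energies_hartree, n_electrons):
    """Return the HOMO-LUMO gap in eV for a closed-shell molecule.

    Assumptions: restricted closed-shell occupation (two electrons per
    occupied orbital) and at least one virtual orbital in the list.
    """
    if n_electrons <= 0 or n_electrons % 2 != 0:
        raise ValueError("closed-shell sketch needs a positive, even electron count")
    n_occupied = n_electrons // 2
    energies = sorted(orbital_energies_hartree)
    if n_occupied >= len(energies):
        raise ValueError("no virtual orbital available to define the LUMO")
    homo = energies[n_occupied - 1]  # highest occupied orbital
    lumo = energies[n_occupied]      # lowest unoccupied orbital
    return (lumo - homo) * HARTREE_TO_EV


if __name__ == "__main__":
    # Placeholder orbital energies chosen for illustration only.
    print(homo_lumo_gap_ev([-0.5, -0.3, 0.1, 0.4], n_electrons=4))
```

The orbital gap computed this way is only a rough proxy for the lowest excitation energy, which is why the two quantities are compared rather than equated.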
Although this project employs relational database servers to organize and analyze the calculated data, the Web site does not offer a query service. Instead, it provides mainly input and output files for the calculated results. The provision of more useful tools will require substantial upgrading, including the use of high-performance servers and parallel database systems, with the associated server maintenance burden, the development of web applications, improved site design, enhanced security, and the development of a licensing policy. Despite these difficulties, the Web site must be upgraded to allow our calculated data to be utilized more effectively. We are therefore developing a more useful and user-friendly Web site. The format (structure) of the database is crucial to this. However, it is nontrivial to identify the most suitable format for our project. The format must have a number of features, including flexibility, extensibility, and usability. We are currently researching appropriate database structures, including NoSQL-type formats, working alongside experts in information science. These trials will be reported elsewhere.
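To make concrete what a query service over a relational store could look like, the following Python sketch builds a small in-memory SQLite table and selects molecules by HOMO−LUMO gap. The table name `molecule`, its columns, and the inserted rows are hypothetical placeholders chosen for illustration; they are not the schema or the data of the PubChemQC servers.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'molecule' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE molecule ("
        " cid INTEGER PRIMARY KEY,"   # compound identifier
        " smiles TEXT NOT NULL,"      # structure string
        " homo_ev REAL,"              # HOMO energy in eV
        " lumo_ev REAL)"              # LUMO energy in eV
    )
    # Placeholder rows: the energies are made up, not calculated values.
    rows = [
        (1, "C", -10.0, 2.0),
        (2, "CC", -9.0, 1.5),
        (3, "C=C", -7.0, 0.5),
    ]
    conn.executemany("INSERT INTO molecule VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_by_gap(conn, min_gap_ev, max_gap_ev):
    """Return (cid, smiles, gap_ev) rows whose gap lies in the given range."""
    cursor = conn.execute(
        "SELECT cid, smiles, lumo_ev - homo_ev AS gap_ev"
        " FROM molecule"
        " WHERE lumo_ev - homo_ev BETWEEN ? AND ?"
        " ORDER BY gap_ev",
        (min_gap_ev, max_gap_ev),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for row in find_by_gap(connection, 7.0, 11.0):
        print(row)
```

Parameterized queries of this kind are the simplest building block of a web-facing search; whether a relational or NoSQL-type store suits the full data set better remains the open question described above.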



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Maho Nakata: 0000-0002-5430-841X
Tomomi Shimazaki: 0000-0001-8707-6056

Author Contributions

M.N. and T.S. contributed equally.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The calculations were performed using the RIKEN Integrated Cluster of Clusters (RICC) and the HOKUSAI facility, and the research was partially supported by the Initiative on Promotion of Supercomputing for Young or Women Researchers, Supercomputing Division, Information Technology Center, The University of Tokyo. This work was partially supported by the JSPS KAKENHI (Grant 15K05403).

REFERENCES

(1) Dirac, P. A. M. Quantum Mechanics Of Many-Electron Systems. Proc. R. Soc. London, Ser. A 1929, 123, 714−733.
(2) Pople, J. A. Energy, Structure, and Reactivity. In Proceedings of the 1972 Boulder Summer Research Conference on Theoretical Chemistry; Smith, D., McRae, W., Eds.; John Wiley & Sons: New York, 1973; pp 51−61.
(3) Helgaker, T.; Jorgensen, P.; Olsen, J. Molecular Electronic Structure Theory; John Wiley & Sons: New York, 2000.
(4) Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Mennucci, B.; Petersson, G. A.; Nakatsuji, H.; Caricato, M.; Li, X.; Hratchian, H. P.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, J. A., Jr.; Peralta, J. E.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; Millam, J. M.; Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.; Farkas, Ö.; Foresman, J. B.; Ortiz, J. V.; Cioslowski, J.; Fox, D. J. Gaussian 09; Gaussian, Inc.: Wallingford, CT, 2009.
(5) Shao, Y. H.; Gan, Z. T.; Epifanovsky, E.; Gilbert, A. T. B.; Wormit, M.; Kussmann, J.; Lange, A. W.; Behn, A.; Deng, J.; Feng, X. T.; Ghosh, D.; Goldey, M.; Horn, P. R.; Jacobson, L. D.; Kaliman, I.; Khaliullin, R. Z.; Kus, T.; Landau, A.; Liu, J.; Proynov, E. I.; Rhee, Y. M.; Richard, R. M.; Rohrdanz, M. A.; Steele, R. P.; Sundstrom, E. J.; Woodcock, H. L.; Zimmerman, P. M.; Zuev, D.; Albrecht, B.; Alguire, E.; Austin, B.; Beran, G. J. O.; Bernard, Y. A.; Berquist, E.; Brandhorst, K.; Bravaya, K. B.; Brown, S. T.; Casanova, D.; Chang, C. M.; Chen, Y. Q.; Chien, S. H.; Closser, K. D.; Crittenden, D. L.; Diedenhofen, M.; DiStasio, R. A.; Do, H.; Dutoi, A. D.; Edgar, R. G.; Fatehi, S.; Fusti-Molnar, L.; Ghysels, A.; Golubeva-Zadorozhnaya, A.; Gomes, J.; Hanson-Heine, M. W. D.; Harbach, P. H. P.; Hauser, A. W.; Hohenstein, E. G.; Holden, Z. C.; Jagau, T. C.; Ji, H. J.; Kaduk, B.; Khistyaev, K.; Kim, J.; Kim, J.; King, R. A.; Klunzinger, P.; Kosenkov, D.; Kowalczyk, T.; Krauter, C. M.; Lao, K. U.; Laurent, A. D.; Lawler, K. V.; Levchenko, S. V.; Lin, C. Y.; Liu, F.; Livshits, E.; Lochan, R. C.; Luenser, A.; Manohar, P.; Manzer, S. F.; Mao, S. P.; Mardirossian, N.; Marenich, A. V.; Maurer, S. A.; Mayhall, N. J.; Neuscamman, E.; Oana, C. M.; Olivares-Amaya, R.; O’Neill, D. P.; Parkhill, J. A.; Perrine, T. M.; Peverati, R.; Prociuk, A.; Rehn, D. R.; Rosta, E.; Russ, N. J.; Sharada, S. M.; Sharma, S.; Small, D. W.; Sodt, A.; Stein, T.; Stuck, D.; Su, Y. C.; Thom, A. J. W.; Tsuchimochi, T.; Vanovschi, V.; Vogt, L.; Vydrov, O.; Wang, T.; Watson, M. A.; Wenzel, J.; White, A.; Williams, C. F.; Yang, J.; Yeganeh, S.; Yost, S. R.; You, Z. Q.; Zhang, I. Y.; Zhang, X.; Zhao, Y.; Brooks, B. R.; Chan, G. K. L.; Chipman, D. M.; Cramer, C. J.; Goddard, W. A.; Gordon, M. S.; Hehre, W. J.; Klamt, A.; Schaefer, H. F.; Schmidt, M. W.; Sherrill, C. D.; Truhlar, D. G.; Warshel, A.; Xu, X.; Aspuru-Guzik, A.; Baer, R.; Bell, A. T.; Besley, N. A.; Chai, J. D.; Dreuw, A.; Dunietz, B. D.; Furlani, T. R.; Gwaltney, S. R.; Hsu, C. P.; Jung, Y. S.; Kong, J.; Lambrecht, D. S.; Liang, W. Z.; Ochsenfeld, C.; Rassolov, V. A.; Slipchenko, L. V.; Subotnik, J. E.; Van Voorhis, T.; Herbert, J. M.; Krylov, A. I.; Gill, P. M. W.; Head-Gordon, M. Advances In Molecular Quantum Chemistry Contained In The Q-Chem 4 Program Package. Mol. Phys. 2015, 113, 184.
(6) Werner, H. J.; Knowles, P. J.; Knizia, G.; Manby, F. R.; Schutz, M. MolPro: A General-Purpose Quantum Chemistry Program Package. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 242−253.
(7) Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. NWChem: A Comprehensive And Scalable Open-Source Solution For Large Scale Molecular Simulations. Comput. Phys. Commun. 2010, 181, 1477−1489.
(8) Furche, F.; Ahlrichs, R.; Hattig, C.; Klopper, W.; Sierka, M.; Weigend, F. Turbomole. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2014, 4, 91−100.
(9) Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S. J.; Windus, T. L.; Dupuis, M.; Montgomery, J. A. General Atomic and Molecular Electronic-Structure System. J. Comput. Chem. 1993, 14, 1347−1363.
(10) The IUPAC International Chemical Identifier (InChI). http://www.iupac.org/home/publications/e-resources/inchi.html (accessed April 5, 2017).
(11) The InChI Trust. InChI and InChIKeys for chemical structures. http://www.inchi-trust.org/ (accessed April 5, 2017).
(12) Weininger, D. SMILES: A Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Model. 1988, 28, 31−36.
(13) Daylight Chemical Information Systems. http://daylight.com (accessed April 5, 2017).
(14) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L. Y.; He, J. E.; He, S. Q.; Shoemaker, B. A.; Wang, J. Y.; Yu, B.; Zhang, J.; Bryant, S. H. PubChem Substance and Compound databases. Nucleic Acids Res. 2016, 44, D1202−D1213.
(15) Nakata, M. A Large Chemical Database From The First Principle Calculations. AIP Conf. Proc. 2015, 1702, 090058.
(16) Le, Q. V.; Ranzato, M.; Monga, R.; Devin, M.; Chen, K.; Corrado, G. S.; Dean, J.; Ng, A. Y. Building High-Level Features Using Large Scale Unsupervised Learning. In Proceedings of the 29th International Conference on Machine Learning; Langford, J., Pineau, J., Eds.; Omnipress: Madison, WI, 2012; pp 81−88.
(17) Krizhevsky, A.; Sutskever, I.; Hinton, G. E. Imagenet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Processing Syst. 2012, 25, 1097−1105.
(18) Seide, F.; Li, G.; Yu, D. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. Interspeech 2011, 437.
(19) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Big Data Meets Quantum Chemistry Approximations: The Delta-Machine Learning Approach. J. Chem. Theory Comput. 2015, 11, 2087−2096.
(20) Ramakrishnan, R.; Hartmann, M.; Tapavicza, E.; von Lilienfeld, O. A. Electronic Spectra from TDDFT and Machine Learning in Chemical Space. J. Chem. Phys. 2015, 143, 084111.
(21) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An Open Chemical Toolbox. J. Cheminf. 2011, 3, 33.
(22) O’Boyle, N. M. Towards a Universal SMILES Representation: A Standard Method To Generate Canonical SMILES Based on the InChI. J. Cheminf. 2012, 4, 22.
(23) PubChem Project. PubChem Compound. https://www.ncbi.nlm.nih.gov/pccompound (accessed April 5, 2017).
(24) PubChem Project. NLM PUBCHEM PROJECT DATA SUBMISSION POLICY (DSP). https://pubchem.ncbi.nlm.nih.gov/upload/html/dsp.html (accessed April 5, 2017).
(25) Chemical Abstracts Service. CAS REGISTRY and CAS Registry Number FAQs. http://www.cas.org/content/chemical-substances/faqs (accessed April 5, 2017).

(26) PubChem Project. PubChem FTP site. ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/ (accessed April 5, 2017).
(27) Bolton, E. E.; Kim, S.; Bryant, S. H. PubChem3d: Conformer Generation. J. Cheminf. 2011, 3, 4.
(28) Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J. L. Enumeration Of 166 Billion Organic Small Molecules In The Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864−2875.
(29) Open Babel online documentation: Generate a single conformer. https://open-babel.readthedocs.io/en/latest/3DStructureGen/SingleConformer.html (accessed April 12, 2017).
(30) Stewart, J. J. P. Optimization Of Parameters For Semiempirical Methods. I. Method. J. Comput. Chem. 1989, 10, 209−220.
(31) Stewart, J. J. P. Optimization Of Parameters For Semiempirical Methods. III. Extension Of PM3 To Be, Mg, Zn, Ga, Ge, As, Se, Cd, In, Sn, Sb, Te, Hg, Tl, Pb, and Bi. J. Comput. Chem. 1991, 12, 320−341.
(32) Becke, A. D. Density-Functional Thermochemistry. III. The Role Of Exact Exchange. J. Chem. Phys. 1993, 98, 5648−5652.
(33) Vosko, S. H.; Wilk, L.; Nusair, M. Accurate Spin-Dependent Electron Liquid Correlation Energies For Local Spin-Density Calculations - A Critical Analysis. Can. J. Phys. 1980, 58, 1200−1211.
(34) Granovsky, A. A. Firefly. http://classic.chem.msu.su/gran/firefly/index.html (accessed April 5, 2017).
(35) Ishimura, K. Scalable molecular analysis solver for high-performance computing systems (SMASH). http://smash-qc.sourceforge.net/ (accessed April 5, 2017).
(36) Baker, J. Molecular Structure and Vibrational Spectra. In Handbook of Computational Chemistry; Leszczynski, J., Kaczmarek-Kedziera, A., Puzyn, T. G., Papadopoulos, M., Reis, H., Shukla, M. K., Eds.; Springer: New York, 2012; pp 923−359.
(37) Johnson, B. G.; Gill, P. M. W.; Pople, J. A. The Performance Of A Family Of Density Functional Methods. J. Chem. Phys. 1993, 98, 5612−5626.
(38) Sousa, S. F.; Fernandes, P. A.; Ramos, M. J. General Performance Of Density Functionals. J. Phys. Chem. A 2007, 111, 10439−10452.
(39) Riley, K. E.; Op’t Holt, B. T.; Merz, K. M. Critical Assessment Of The Performance Of Density Functional Methods For Several Atomic And Molecular Properties. J. Chem. Theory Comput. 2007, 3, 407−433.
(40) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Quantum Chemistry Structures and Properties of 134 Kilo Molecules. Sci. Data 2014, 1, 140022.
(41) Shimazaki, T.; Nakajima, T. Theoretical Study Of Exciton Dissociation Through Hot States At Donor-Acceptor Interface In Organic Photocell. Phys. Chem. Chem. Phys. 2015, 17, 12538−12544.
(42) Shimazaki, T.; Nakajima, T. Theoretical Study On The Cooperative Exciton Dissociation Process Based On Dimensional And Hot Charge-Transfer State Effects In An Organic Photocell. J. Chem. Phys. 2016, 144, 234906.
(43) Vanossi, D.; Cigarini, L.; Giaccherini, A.; da Como, E.; Fontanesi, C. An Integrated Experimental/Theoretical Study Of Structurally Related Poly-Thiophenes Used In Photovoltaic Systems. Molecules 2016, 21, 110.
(44) Shimazaki, T.; Nakajima, T. Application Of The Dielectric-Dependent Screened Exchange Potential Approach To Organic Photocell Materials. Phys. Chem. Chem. Phys. 2016, 18, 27554−27563.
(45) Landrum, G. RDKit: Open-Source Cheminformatics. http://www.rdkit.org/ (accessed April 5, 2017).
(46) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-Learn: Machine Learning In Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
(47) Bishop, C. M. Pattern Recognition and Machine Learning; Springer: New York, 2006.
(48) NIST Chemistry WebBook; NIST Standard Reference Database Number 69; National Institute of Standards and Technology: Gaithersburg, MD; http://webbook.nist.gov/chemistry/ (accessed April 7, 2017).
(49) Halgren, T. A. MMFF VI. MMFF94s Option for Energy Minimization Studies. J. Comput. Chem. 1999, 20, 720.
(50) Hachmann, J.; Olivares-Amaya, R.; Atahan-Evrenk, S.; Amador-Bedolla, C.; Sanchez-Carrera, R. S.; Gold-Parker, A.; Vogt, L.; Brockway, A. M.; Aspuru-Guzik, A. The Harvard Clean Energy Project: Large-Scale Computational Screening And Design Of Organic Photovoltaics On The World Community Grid. J. Phys. Chem. Lett. 2011, 2, 2241−2251.
(51) Hachmann, J.; Olivares-Amaya, R.; Jinich, A.; Appleton, A. L.; Blood-Forsythe, M. A.; Seress, L. R.; Roman-Salgado, C.; Trepte, K.; Atahan-Evrenk, S.; Er, S.; Shrestha, S.; Mondal, R.; Sokolov, A.; Bao, Z. A.; Aspuru-Guzik, A. Lead Candidates For High-Performance Organic Photovoltaics From High-Throughput Quantum Chemistry - The Harvard Clean Energy Project. Energy Environ. Sci. 2014, 7, 698−704.
(52) PubChem Project. Data Sources. https://pubchem.ncbi.nlm.nih.gov/source/ (accessed April 5, 2017).
(53) Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Kruger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. P. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, D1083−D1090.
(54) PubChem Project. Data Sources: ChEMBL. https://pubchem.ncbi.nlm.nih.gov/source/ChEMBL (accessed April 5, 2017).
(55) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free Tool To Discover Chemistry For Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768.
(56) PubChem Project. Data Sources: ZINC. https://pubchem.ncbi.nlm.nih.gov/source/ZINC (accessed April 5, 2017).
(57) Pence, H. E.; Williams, A. ChemSpider: An Online Chemical Information Resource. J. Chem. Educ. 2010, 87, 1123−1124.
(58) PubChem Project. Data Sources: ChemSpider. https://pubchem.ncbi.nlm.nih.gov/source/ChemSpider (accessed April 5, 2017).
(59) Kanehisa, M.; Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28, 27−30.
(60) PubChem Project. Data Sources: Aurora Fine Chemicals LLC. https://pubchem.ncbi.nlm.nih.gov/source/11831 (accessed April 5, 2017).
(61) Iikura, H.; Tsuneda, T.; Yanai, T.; Hirao, K. A Long-Range Correction Scheme For Generalized-Gradient-Approximation Exchange Functionals. J. Chem. Phys. 2001, 115, 3540−3544.


DOI: 10.1021/acs.jcim.7b00083 J. Chem. Inf. Model. 2017, 57, 1300−1308