
Cite This: J. Phys. Chem. A 2018, 122, 9128−9134. DOI: 10.1021/acs.jpca.8b09376. Received: September 25, 2018; Published: October 4, 2018.

Comparison Study on the Prediction of Multiple Molecular Properties by Various Neural Networks

Fang Hou,† Zhenyao Wu,§ Zheng Hu,† Zhourong Xiao,† Li Wang,†,‡ Xiangwen Zhang,†,‡ and Guozhu Li*,†,‡




† Key Laboratory for Green Chemical Technology of Ministry of Education, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
‡ Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin 300072, China
§ School of Computer Software, Tianjin University, Tianjin 300072, China

ABSTRACT: Various neural networks, including a single layer neural network (SLNN), a deep neural network (DNN) with multiple layers, and a convolution neural network (CNN), have been developed and investigated to predict multiple molecular properties simultaneously. The data set of this work contains ∼134k molecules and 15 of their properties (rotational constants A, B, and C, dipole moment, isotropic polarizability, energy of HOMO, energy of LUMO, HOMO−LUMO gap energy, electronic spatial extent, zero point vibrational energy, internal energy at 0 K, internal energy at 298.15 K, enthalpy at 298.15 K, free energy at 298.15 K, and heat capacity at 298.15 K), computed at the hybrid density functional theory (DFT) level and taken from the QM9 database. The Coulomb matrix (CM), which represents every molecule uniquely, and its eigenvalues are used as two alternative inputs for machine learning. The prediction accuracies of SLNN, DNN, and CNN have been compared through their mean absolute errors (MAEs). Using eigenvalues as input, both SLNN and DNN give higher accuracy for the prediction of specific energy properties (U0, U, H, and G). For the prediction of all 15 molecular properties at a time, a 3-layer DNN using the full CM as input exhibits the best results. The number of layers in a DNN plays a key role in the simultaneous prediction of multiple molecular properties. This work may provide guidance for the selection of neural networks and input data forms for the prediction and validation of multiple parameters according to different needs.

1. INTRODUCTION

Physical and chemical properties of some compounds cannot be obtained directly or immediately from experiments because of safety and efficiency constraints.1−5 With the rapid development of computers, high-throughput calculation of molecular properties has progressed enormously. For several decades, quantum chemistry, molecular dynamics, Monte Carlo, and empirical correlation methods (e.g., the group contribution method) have been developed for the prediction of molecular properties.4−6 For example, the QM7 database7 and the QM9 database8,9 with plenty of molecular properties have been built by DFT calculations. Although these methods are more efficient than experiments, their workloads are huge and the computation is time-consuming. Moreover, the accuracy of each method and its scope of application still need to be improved. On the basis of these databases, data analysis and processing will be helpful for more efficient prediction of molecular properties. Machine learning (ML), in the field of artificial intelligence (AI), has proven to be a fast and effective approach.10−13

In machine learning, neural networks have developed into rich and varied models with modern general-purpose graphics processing unit (GP-GPU) computing, and they can accurately model high-dimensional functions. Neural network models have been applied to both single-objective and multiobjective tasks.7,12,14−22 Single-objective properties, including atomization energy,7−9 the octane number of organic molecules,15 and the band gap of inorganic solids,21 have been predicted with high accuracy by neural networks. One neural network can also predict multiple properties: for instance, the density and viscosity of biofuel compounds have been predicted simultaneously by one network,20 and a total of 13 electronic ground-state properties of organic molecules were predicted via machine learning and compared with DFT calculations.22 These applications all show the powerful learning and prediction ability of neural networks, which can mine information directly from the data, train on it, and learn. Finally, new results can be predicted from new inputs.7,12,15−22

DNN approaches, which contain several hidden layers and more neurons, have been found to be more suitable for complex quantum-chemical systems than traditional neural networks, with successful applications.23−25 A DNN is promising for predicting multiple molecular properties simultaneously in a more accurate and convenient way, while quick, rough, property-specific predictions can easily be achieved with a simple SLNN.

The focus of this work is the comparison of different neural networks for predicting multiple molecular properties simultaneously. On the basis of the QM9 database, various neural networks are developed for efficient machine learning and accurate prediction of molecular properties. (i) QM9, containing ∼134k molecules, is chosen as the training and learning database, and the Coulomb matrix and its eigenvalues are used as two kinds of neural network input. (ii) A single layer neural network is first established in MATLAB to predict multiple molecular properties, and the feasibility of predicting multiple properties at a time with one network is evaluated during this process. (iii) Suitable multilayer deep neural networks and a convolution neural network are developed using the Python programming language26,27 and TensorFlow28−30 to gain deeper insight into the database and increase the prediction accuracy. The as-developed neural networks are assessed and compared to gain deep insights into the machine learning process for the prediction of multiple properties.


2. METHOD

2.1. Data Set. In this work we used the QM98 data set of 133850 organic molecules (http://quantum-machine.org). These molecules are all drawn from the GDB-179 chemical universe of 166 billion organic molecules. Molecules in the data set consist of the elements H, C, O, N, and F and contain up to 9 heavy atoms. All 15 molecular properties were calculated at the DFT level of theory (B3LYP/6-31G(2df,p)): rotational constants A, B, and C (GHz), dipole moment μ (D), isotropic polarizability α (a03), energy of HOMO εHOMO (Ha), energy of LUMO εLUMO (Ha), HOMO−LUMO gap energy εgap (Ha), electronic spatial extent ⟨R2⟩ (a02), zero point vibrational energy zpe (Ha), internal energy at 0 K U0 (Ha), internal energy at 298.15 K U (Ha), enthalpy at 298.15 K H (Ha), free energy at 298.15 K G (Ha), and heat capacity at 298.15 K CV (cal/mol K). The molecular identification for QM9 is given in the form of nuclear charges Zi and atomic coordinates in XYZ format (a widespread plain-text format for encoding Cartesian coordinates of molecules8).

We designed code to convert the nuclear charges Zi and atomic coordinates of each molecule into a Coulomb matrix, which is used as the input for predicting molecular properties. The 1150 molecules flagged in the QM9 database were excluded from this work because we failed to generate Coulomb matrices from their atomic coordinates; their XYZ files may be incorrect.

2.2. Data Representation. Data representation is a key step in starting a machine learning process. Bag of bonds,31,32 the Coulomb matrix,22,31−33 and extended connectivity fingerprints22,34 are typical methods used to identify various molecules. In this work, the Coulomb matrix, which contains the molecular structure information, has been selected as the input; it was initially shown to be an effective representation for predicting molecular atomization energies.7,31 The Coulomb matrix was calculated using eq 1:


$$
C_{ij} =
\begin{cases}
0.5\,Z_i^{2.4} & \text{for } i = j \\[4pt]
\dfrac{Z_i Z_j}{|\mathbf{R}_i - \mathbf{R}_j|} & \text{for } i \neq j
\end{cases}
\qquad (1)
$$

Here Ri are the Cartesian coordinates and Zi the nuclear charges; off-diagonal elements correspond to the Coulomb repulsion between atoms i and j, while diagonal elements encode a polynomial fit (0.5Zi2.4) of the atomic energies to the nuclear charge. This method was applied to all molecules in the QM9 database. Because the number of atoms differs from molecule to molecule, the matrices were zero-padded to a common dimension, converting all molecules from QM9 into a new data set that contains a molecular information tensor (132700 × 30 × 30) and a properties tensor (132700 × 15), as shown in Figure 1.

Figure 1. Molecular Coulomb matrix tensor and eigenvalues of the Coulomb matrix.

The eigenvalues of the Coulomb matrix were also calculated, using eq 2:


$$
\mathbf{C}\,\mathbf{v}_i = \lambda_i\,\mathbf{v}_i
\qquad (2)
$$

Here C is the Coulomb matrix of a molecule, vi is an eigenvector, and λi is the corresponding eigenvalue; the calculated eigenvalues can also represent every molecule.7 The obtained data set, comprising the Coulomb matrices and their eigenvalues, is used in the subsequent machine learning.
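As an illustration of eqs 1 and 2, the following minimal Python/NumPy sketch (our reconstruction, not the authors' released code; the water geometry is just a toy input) builds a zero-padded 30 × 30 Coulomb matrix from nuclear charges and Cartesian coordinates and extracts its eigenvalues:

```python
import numpy as np

MAX_ATOMS = 30  # common padded dimension used for the QM9 molecules here

def coulomb_matrix(Z, R, size=MAX_ATOMS):
    """Zero-padded Coulomb matrix per eq 1.

    Z: (n,) nuclear charges; R: (n, 3) Cartesian coordinates.
    """
    n = len(Z)
    C = np.zeros((size, size))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4  # diagonal: atomic-energy fit
            else:
                # off-diagonal: Coulomb repulsion between atoms i and j
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

def cm_eigenvalues(C):
    """Eigenvalues of the symmetric Coulomb matrix (eq 2), sorted by magnitude."""
    eig = np.linalg.eigvalsh(C)
    return eig[np.argsort(-np.abs(eig))]

# Toy example: a water molecule (O at the origin, two H atoms)
Z = np.array([8.0, 1.0, 1.0])
R = np.array([[0.0, 0.0, 0.0], [0.757, 0.586, 0.0], [-0.757, 0.586, 0.0]])
print(cm_eigenvalues(coulomb_matrix(Z, R))[:3])
```

Because the matrices are zero-padded, every molecule with fewer than 30 atoms contributes trailing zero eigenvalues, which keeps the input vectors at a fixed length.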


2.3. Neural Networks. Neural networks35−37 have complex and varied models for different calculation and identification tasks. Here, the single layer neural network (SLNN) and the multilayer deep neural network (DNN) are the focus.

2.3.1. Single Layer Neural Network (SLNN). We used MATLAB to construct the single layer neural network. First, we performed a dimensionality reduction by converting the Coulomb matrix to its eigenvalue array, which eases the follow-up regression and reduces the computational complexity. To find the nonlinear relationship between the two variables, the neural network toolbox (Nntool) was used. Nntool provides a variety of parameter-optimization methods; considering its outstanding stability and reliability among single layer neural network models, we chose Nntool to construct and train the SLNN. The network with 10 hidden neurons and one output layer was adopted (Figure 2a). Nntool randomly divides the input into three parts, a training group, a validation group, and a test group, with a ratio of 0.70:0.15:0.15. We trained with the Levenberg−Marquardt algorithm and obtained the parameters that minimize the loss function (L2 norm). In other fields, because of the training difficulty that comes with network depth, some works choose other loss functions, such as cross entropy or the L1 norm; however, the MATLAB toolbox has a mature system for finding suitable hyperparameters and gives a preferable result. After training, we obtained the regression function, plotted the performance of each group as mentioned earlier, and found no obvious underfitting or overfitting.

Figure 2. Schematic diagram of the models of (a) the single layer neural network and (b) the deep neural network with three layers.
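The SLNN itself was built with MATLAB's Nntool and is not reproduced here. As a rough, hypothetical Python analogue of the same setup (scikit-learn's MLPRegressor; scikit-learn offers no Levenberg−Marquardt solver, so L-BFGS stands in, and the random arrays are placeholders for the eigenvalue features and one target property):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Placeholder data standing in for (CM eigenvalues, one molecular property).
X = np.random.rand(2000, 30)
y = X @ np.random.rand(30)

# Mirror Nntool's 0.70:0.15:0.15 training/validation/test division.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

slnn = MLPRegressor(
    hidden_layer_sizes=(10,),  # one hidden layer with 10 neurons, as in Figure 2a
    solver="lbfgs",            # stand-in: no Levenberg-Marquardt in scikit-learn
    max_iter=2000,
)
slnn.fit(X_train, y_train)
print("validation R^2:", slnn.score(X_val, y_val))
print("test R^2:", slnn.score(X_test, y_test))
```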


2.3.2. Multilayer Deep Neural Networks (DNN). TensorFlow28−30 is used to construct multitype, multiscale deep neural networks. The training set is sent to a well-designed model, which is then evaluated on the test data. The concrete deep neural network model flow is as follows.

2.4. Data Processing, Data Normalization, Data Enhancement, and Data Segmentation. First, the obtained Coulomb matrix was transformed to satisfy the input dimension and format requirements of the neural network model. Various ways of converting the Coulomb matrix into the input were tested; finally, each 30 × 30 Coulomb matrix was flattened into a one-dimensional vector of length 900. In addition, we removed redundant data and calculated the eigenvalues of the Coulomb matrix. In the normalization step, we designed multiple schemes to process the data: (i) normalization, scaling the data so that the converted values fall into (0, 1); (ii) logarithmic normalization, applying a logarithm so that the new data are approximately evenly distributed over (0, 1); and (iii) hierarchical normalization, cutting the data hierarchically with a variety of segmentation plans so that the various types of data are distributed evenly over (0, 1). In the data enhancement step, we expanded the data set in several ways, including row and column scrambling of the Coulomb matrix. Finally, 90% of the data were randomly selected as the training set, and the remaining 10% were used as the test set.
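The three normalization schemes can be sketched as follows (a minimal Python/NumPy interpretation; the paper does not give its exact segmentation plans, so the hierarchical variant below uses quantile bins, which is an assumption):

```python
import numpy as np

def minmax_norm(x):
    """(i) Scale values linearly into (0, 1)."""
    span = x.max() - x.min()
    return (x - x.min()) / (span if span > 0 else 1.0)

def log_norm(x):
    """(ii) Logarithmic normalization for heavy-tailed columns
    (the shift keeps the logarithm defined)."""
    return minmax_norm(np.log(x - x.min() + 1.0))

def hierarchical_norm(x, n_bins=10):
    """(iii) Piecewise normalization: each quantile bin is mapped to an equal
    share of (0, 1), so the data end up evenly distributed. The bin choice is
    our assumption, not the paper's stated segmentation plan."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    out = np.zeros_like(x, dtype=float)
    for k in range(n_bins):
        lo, hi = edges[k], edges[k + 1]
        mask = (x >= lo) & ((x <= hi) if k == n_bins - 1 else (x < hi))
        width = hi - lo if hi > lo else 1.0
        out[mask] = (k + (x[mask] - lo) / width) / n_bins
    return out
```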



2.5. The Model To Train and Predict. Multiform and multisize neural networks were built. TensorFlow is used to construct an inverted-pyramid neural network: the first layer takes the flattened Coulomb matrix as input, the second layer has 140 neurons, and the third layer outputs one neuron as the output characteristic (Figure 2b). After validation and testing, we chose the mean absolute error (MAE) as the loss function and the sigmoid of eq 3 as the activation function:

$$
f(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^{\mathrm{T}}\mathbf{x}}}
\qquad (3)
$$

Here x is the input vector of a neuron, w is its weight vector, and f(x) is the output. The optimizer is Adam,38 which combines the strengths of two recently popular optimization methods, RMSProp39 and AdaGrad.40 Adam is suitable for problems that are large in terms of data and parameters and for nonstationary objectives with very noisy or sparse gradients; it was therefore employed to process the QM9 data in this work. A training/verification ratio of 0.9:0.1 with random shuffling between iterations was used. Various neural networks were tested, including deep neural networks (DNN) with variable numbers of layers. Given the labeled QM9 database, the training of the DNN models in this work is supervised. The training set was input into the model and trained on a GPU computer; learning is the process of adjusting the weights so that the training data are reproduced as accurately as possible. The MAE between the predictions and the real values was then calculated on the test set. Finally, the model that gives the best fitting result was selected by comparison. A convolution neural network (CNN) with 4 convolution layers, 4 pooling layers, and a 3 × 3 convolution kernel was also constructed, in which the other settings were the same as those of the DNN.
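A minimal TensorFlow/Keras sketch of the inverted-pyramid model described here (the authors' original code is not published; the batch size, epoch count, and the 15-neuron output layer for simultaneous multiproperty prediction are our assumptions, while the 900-dimensional input, 140-neuron second layer, sigmoid activation, MAE loss, Adam, and the 0.9:0.1 split come from the text):

```python
import numpy as np
import tensorflow as tf

# Random stand-ins shaped like the paper's tensors: flattened 30x30 CMs
# (900 features) and 15 normalized target properties per molecule.
X = np.random.rand(1024, 900).astype("float32")
y = np.random.rand(1024, 15).astype("float32")

model = tf.keras.Sequential([
    # 140-neuron hidden layer with the sigmoid activation of eq 3
    tf.keras.layers.Dense(140, activation="sigmoid", input_shape=(900,)),
    # Output layer: 15 neurons here, one per property; the paper's wording
    # ("the third layer outputs one neuron") suggests one output
    # characteristic per neuron.
    tf.keras.layers.Dense(15),
])

# MAE loss and the Adam optimizer, as selected in Section 2.5.
model.compile(optimizer="adam", loss="mae")

# 0.9:0.1 training/verification split with shuffling each iteration.
model.fit(X, y, validation_split=0.1, shuffle=True, epochs=5, batch_size=64)
print("MAE on held-out data:", model.evaluate(X, y, verbose=0))
```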


3. RESULTS AND DISCUSSION

3.1. Prediction of Multiple Properties by SLNN. Figure 3 shows the fitting degree of the target rotational constant (B) and the internal energy at 0 K (U0) in the test set using SLNN. The other 13 properties (A, C, μ, α, εHOMO, εLUMO, εgap, ⟨R2⟩, zpe, U, H, G, and CV) were also fitted, as shown in Figures S1−S13. Data points distributed near the diagonal (y = x) indicate good prediction. Specifically, the energy properties zpe, U0, U, H, G, and CV are predicted with high accuracy by SLNN. On the basis of learning, SLNN can realize quick, preliminary prediction of multiple molecular properties and give an overview of the process. However, it is still hard for SLNN to predict the other properties (A, B, C, μ, α, εHOMO, εLUMO, εgap, ⟨R2⟩) accurately at the same time; their prediction MAEs are very high. For instance, the MAEs for the prediction of A and ⟨R2⟩ are as high as 41.47 GHz and 65.17 a02, respectively. Therefore, SLNN can predict energy properties quickly and accurately, but the errors are high when all 15 properties are predicted simultaneously by one SLNN.

Figure 3. Prediction of B and U0 by SLNN via training, validation, and test.
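The fitting-degree figures referenced here are parity plots of predicted versus DFT values. A generic matplotlib sketch (not from the paper; y_true and y_pred stand for any one of the 15 properties) of how such a panel and its MAE can be produced:

```python
import numpy as np
import matplotlib.pyplot as plt

def parity_plot(y_true, y_pred, label):
    """Scatter predicted vs. reference values with the y = x diagonal;
    points near the diagonal indicate good prediction."""
    mae = np.mean(np.abs(y_true - y_pred))
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred, s=4, alpha=0.3)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    ax.plot(lims, lims, "k--", label="y = x")
    ax.set_xlabel(f"DFT {label}")
    ax.set_ylabel(f"predicted {label}")
    ax.set_title(f"MAE = {mae:.4g}")
    ax.legend()
    return fig
```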

3.2. Comparison among Various DNNs for Predicting Multiple Properties Simultaneously. To achieve higher prediction accuracy and gain deeper insight into the ability of neural networks, more complex networks, including multilayer deep neural networks (DNN) and a convolutional neural network (CNN), were used for machine learning on the QM9 database. DNNs with 2, 3, and 5 layers were built and investigated, and a CNN with 4 convolution layers, 4 pooling layers, and a 3 × 3 convolution kernel was also developed. The goal in constructing these neural networks is to transfer the digital information in the QM9 database as faithfully as possible and obtain accurate results. In this part, the full Coulomb matrix is used as input to compare the different DNNs and the CNN. All the networks were tested, and the corresponding MAEs are compared in Figure 5; the MAE values for these neural networks are summarized in Table S1. For the prediction of multiple molecular properties, the accuracy is roughly in the order 3-layer DNN > 5-layer DNN > 2-layer DNN > CNN, independent of the particular molecular property. The DNN with 3 layers exhibits the lowest MAEs for the prediction of all 15 properties; among our tested models, the 3-layer DNN is the most suitable for machine learning of the QM9 database using the full Coulomb matrix as input. The accuracy of prediction by the 3-layer DNN using the full CM as input was also examined: Figure 4 shows the fitting degree of the target rotational constant (B) and internal energy at 0 K (U0), and the other 13 properties (A, C, μ, α, εHOMO, εLUMO, εgap, ⟨R2⟩, zpe, U, H, G, and CV) were also fitted, with the results shown in Figures S14−S26. A neural network with a suitable structure can predict properties more accurately and reduce training costs. As shown in Figure 5, the multilayer DNNs have lower MAEs than the CNN. All 15 properties are calculated using one DNN model, which shows that the DNN model has a strong advantage in predicting multiple molecular properties simultaneously. Furthermore, the number of layers in the DNN plays an important role in the prediction accuracy: both more and fewer layers increase the MAEs. Therefore, constructing one suitable DNN to achieve accurate and convenient prediction of the full information at a single time is a promising approach for other applications.

Figure 4. Prediction of (a) B and (b) U0 by 3-layer DNN using the full CM as input.

Figure 5. Comparison of MAEs for the prediction of multiple molecular properties simultaneously using different neural networks.

3.3. Comparison of SLNN and DNN. Table 1 compares the MAEs for the prediction of multiple properties by SLNN and DNN (3 layers).

Table 1. MAE Values for the Prediction of 15 Molecular Properties Based on QM9 Data (∼132K Molecules) Using SLNN and DNN(a)

property | unit      | SLNN (eigenvalue input) | 3-layer DNN (eigenvalue input) | 3-layer DNN (full CM input)
A        | GHz       | 41.47                   | 0.6032                         | 0.09990
B        | GHz       | 0.1484                  | 0.1684                         | 0.01612
C        | GHz       | 0.08220                 | 0.1061                         | 0.009461
μ        | D         | 0.8832                  | 0.9672                         | 0.3043
α        | a03       | 1.325                   | 1.932                          | 0.5716
εHOMO    | Ha        | 0.01100                 | 0.01390                        | 0.002582
εLUMO    | Ha        | 0.01680                 | 0.02330                        | 0.001806
εgap     | Ha        | 0.02060                 | 0.02660                        | 0.003210
⟨R2⟩     | a02       | 65.17                   | 80.92                          | 2.056
zpe      | Ha        | 0.002600                | 0.005400                       | 0.0003177
U0       | Ha        | 0.09480                 | 0.07019                        | 0.3626
U        | Ha        | 0.07450                 | 0.07007                        | 0.2759
H        | Ha        | 0.1019                  | 0.07038                        | 0.3200
G        | Ha        | 0.08950                 | 0.07063                        | 0.4794
CV       | cal/mol K | 0.5901                  | 0.8606                         | 0.1243

(a) The best predicted result for each property is the lowest MAE in its row (highlighted in bold in the original).



When the eigenvalue of the Coulomb matrix is used as input, both SLNN and DNN perform well for the prediction of U0, U, H, and G; both models yield very small errors, about 0.07000 Ha. This is consistent with the Universal Approximation Theorem, which states that deep and shallow NNs should perform similarly given the same training data.41,42 Computing the eigenvalues of the CM for each molecule is thus an effective method of dimensionality reduction, which can sometimes positively influence the prediction accuracy by providing some regularization in machine learning.14,32 In comparison, when the full Coulomb matrix is used as input, the MAEs for the other 11 molecular properties predicted by DNN are much lower. Such a drastic dimensionality reduction in using the eigenvalues may cause loss of information and introduce unfavorable noise for the prediction of multiple properties simultaneously, whereas the DNN can generate suitable high-dimensional descriptors from the full Coulomb matrix input, leading to good models for the prediction of multiple molecular properties. The input form should match the neural network to realize high-accuracy prediction. Overall, the MAEs have been reduced by 5.95%−99.75% for most properties using the 3-layer DNN, which shows that the DNN model performs well in predicting multiple properties at a single time.

3.4. Prediction of Multiple Properties at One Time by One DNN. We also pay attention to the specific prediction of different properties. The prediction accuracies of the constants A, B, and C, μ, and α (physical properties of molecules) have been increased greatly by using DNN; more accurate prediction of these molecular properties is achieved by the 3-layer DNN than by SLNN or CNN. The lowest unoccupied molecular orbital (LUMO) and highest occupied molecular orbital (HOMO) are crucial in molecular reactions. More than 80% of the error has been eliminated by DNN for the prediction of εHOMO, εLUMO, and εgap; these significantly improved accuracies imply that DNN is a powerful approach for predicting them. The MAEs of U0, U, H, and G are all about 0.07000 Ha for the DNN using the eigenvalues of the Coulomb matrix as input. The means of U0, U, H, and G are about −412.5 Ha (Table S2), so their relative error (MAE/mean) is about 0.0001697. Such a model can be used to predict molecular energy parameters and assess the energy properties of a target molecule.20,43
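These headline numbers can be checked directly from the rounded values in Table 1 (identifying the 5.95% endpoint of the reduction range with the U row, eigenvalue input, is our inference from the table):

```python
# Rounded MAE values copied from Table 1.
mae = {"A": {"slnn": 41.47,   "dnn_cm": 0.09990},    # GHz
       "U": {"slnn": 0.07450, "dnn_eig": 0.07007}}   # Ha

print(f"A: {1 - mae['A']['dnn_cm'] / mae['A']['slnn']:.2%}")   # ~99.76% (quoted as 99.75%)
print(f"U: {1 - mae['U']['dnn_eig'] / mae['U']['slnn']:.2%}")  # ~5.95%

mae_U0, mean_U0 = 0.07000, 412.5  # Ha; mean magnitude of U0, U, H, G from Table S2
print(f"relative error: {mae_U0 / mean_U0:.4e}")               # ~1.6970e-04
```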



4. CONCLUSIONS

In summary, we have conducted a comparison study on the prediction of multiple molecular properties at a single time using a single neural network via machine learning. We established and compared various neural networks (SLNN, DNN, CNN) to gain new insights into the method and the process. Conclusively, a neural network implements powerful and efficient prediction of multiple molecular properties. SLNN can predict energy properties (zpe, U0, U, H, G, and CV) quickly and accurately. When the eigenvalues of the CM are used as input, both SLNN and DNN can accurately predict the specific energy properties U0, U, H, and G, even better than the DNN using the full Coulomb matrix as input, which indicates the importance of the input form for some specific predictions. However, the errors are high when all 15 properties are predicted simultaneously by one SLNN. The predictions of all 15 molecular properties have been improved by the DNN using the full CM as input, and the DNN with 3 layers exhibits the lowest MAEs, indicating its good performance for the prediction of multiple molecular properties; the MAE of rotational constant A by the 3-layer DNN is 99.75% lower than that by SLNN. Therefore, both the number of layers in the DNN and the input data form play key roles in achieving accurate prediction. The selection of a suitable neural network model and input data form is important when applying machine learning to various applications, e.g., catalyst design, materials development, chemical synthesis, and process optimization.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jpca.8b09376.



The means of the 15 properties for the QM9 database; the prediction and fitting degree of the other 13 properties (A, C, μ, α, εHOMO, εLUMO, εgap, ⟨R2⟩, zpe, U, H, G, and CV) via training, validation, and test using SLNN; the prediction and fitting degree of the same 13 properties by 3-layer DNN using the full CM as input; and the MAE values for the prediction of the 15 properties using various deep neural networks (PDF)

AUTHOR INFORMATION

Corresponding Author

*(G.L.) Telephone/fax: +86 22 27892340. E-mail: gzli@tju.edu.cn.

ORCID

Guozhu Li: 0000-0003-1329-0548

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The authors gratefully acknowledge financial support from the National Key Research and Development Program of China (2016YFB0600305) and the National Natural Science Foundation of China (21306132).



REFERENCES

(1) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Research Studies Press: Letchworth, U.K., 1986. (2) Richard, L. Calculation of the Standard Molal Thermodynamic Properties as a Function of Temperature and Pressure of Some Geochemically Important Organic Sulfur Compounds. Geochim. Cosmochim. Acta 2001, 65, 3827−3877. (3) van Speybroeck, V.; Gani, R.; Meier, R. J. The Calculation of Thermodynamic Properties of Molecules. Chem. Soc. Rev. 2010, 39, 1764−1779. (4) van Speybroeck, V.; Gani, R.; Meier, R. J. The Calculation of Thermodynamic Properties of Molecules. Chem. Soc. Rev. 2010, 39, 1764. (5) Nieto-Draghi, C.; Fayet, G.; Creton, B.; Rozanska, X.; Rotureau, P.; de Hemptinne, J.-C.; Ungerer, P.; Rousseau, B.; Adamo, C. A General Guidebook for the Theoretical Prediction of Physicochemical Properties of Chemicals for Regulatory Purposes. Chem. Rev. 2015, 115, 13093−13164. (6) Bicerano, J. Prediction of Polymer Properties. Russ. J. Gen. Chem. 2002, 81, 268−276. (7) Rupp, M.; Tkatchenko, A.; Müller, K. R.; von Lilienfeld, O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 109, No. 058301, DOI: 10.1103/PhysRevLett.109.059802. (8) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Quantum Chemistry Structures and Properties of 134 Kilo Molecules. Sci. Data 2014, 1, 140022. (9) Ruddigkeit, L.; Van Deursen, R.; Blum, L. C.; Reymond, J. L. Enumeration of 166 Billion Organic Small Molecules in the Chemical

Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864−2875. (10) Snyder, J. C.; Rupp, M.; Hansen, K.; Blooston, L.; Müller, K. R.; Burke, K. Orbital-Free Bond Breaking Via Machine Learning. J. Chem. Phys. 2013, 139, 224104. (11) Kolb, B.; Lentz, L. C.; Kolpak, A. M. Discovering Charge Density Functionals and Structure-Property Relationships with PROPhet: A General Framework for Coupling Machine Learning and First-Principles Methods. Sci. Rep. 2017, 7, 1192. (12) Chmiela, S.; Tkatchenko, A.; Sauceda, H. E.; Poltavsky, I.; Schutt, K. T.; Müller, K. R. Machine Learning of Accurate Energy-Conserving Molecular Force Fields. Sci. Adv. 2017, 3, e1603015. (13) Schneider, W. F.; Guo, H. Machine Learning. J. Phys. Chem. Lett. 2018, 9, 569. (14) Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, A. Machine Learning for Molecular and Materials Science. Nature 2018, 559, 547−555. (15) Yao, K.; Parkhill, J. Kinetic Energy of Hydrocarbons as a Function of Electron Density and Convolutional Neural Networks. J. Chem. Theory Comput. 2016, 12, 1139−1147. (16) Abdul Jameel, A. G.; Van Oudenhoven, V.; Emwas, A. H.; Sarathy, S. M. Predicting Octane Number Using Nuclear Magnetic Resonance Spectroscopy and Artificial Neural Networks. Energy Fuels 2018, 32, 6309−6329. (17) Behler, J. Atom-Centered Symmetry Functions for Constructing High-Dimensional Neural Network Potentials. J. Chem. Phys. 2011, 134, 074106. (18) Gastegger, M.; Kauffmann, C.; Behler, J.; Marquetand, P. Comparing the Accuracy of High-Dimensional Neural Network Potentials and the Systematic Molecular Fragmentation Method: A Benchmark Study for All-Trans Alkanes. J. Chem. Phys. 2016, 144, 194110. (19) Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: An Extensible Neural Network Potential with DFT Accuracy at Force Field Computational Cost. Chem. Sci. 2017, 8, 3192−3203. (20) Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Ferrando, N.; Creton, B. Prediction of Density and Viscosity of Biofuel Compounds Using Machine Learning Methods. Energy Fuels 2012, 26, 2416−2426. (21) Zhuo, Y.; Mansouri Tehrani, A.; Brgoch, J. Predicting the Band Gaps of Inorganic Solids by Machine Learning. J. Phys. Chem. Lett. 2018, 9, 1668−1673. (22) Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S.; Dahl, G. E.; Vinyals, O.; Kearnes, S.; Riley, P. F.; von Lilienfeld, O. A. Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error. J. Chem. Theory Comput. 2017, 13, 5255−5264. (23) Indiveri, G.; Liu, S. C. Memory and Information Processing in Neuromorphic Systems. Proc. IEEE 2015, 103, 1379−1397. (24) Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F. E. A Survey of Deep Neural Network Architectures and Their Applications. Neurocomputing 2017, 234, 11−26. (25) Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantum-Chemical Insights from Deep Tensor Neural Networks. Nat. Commun. 2017, 8, 13890. (26) Oliphant, T. E. Python for Scientific Computing. Comput. Sci. Eng. 2007, 9, 10−20. (27) Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; 2009; Vol. 14, pp 581−592. (28) Abadi, M. TensorFlow: Learning Functions at Scale. ACM SIGPLAN Notices 2016, 51, 1. (29) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. Proc. 12th USENIX Conf. Operating Syst. Des. Implement. 2016, 265−283. (30) Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. https://arxiv.org/pdf/1603.04467.pdf (2016).

(31) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K. R.; Tkatchenko, A. Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space. J. Phys. Chem. Lett. 2015, 6, 2326−2331. (32) Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; von Lilienfeld, O. A.; Tkatchenko, A.; Müller, K. R. Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. J. Chem. Theory Comput. 2013, 9, 3404−3419. (33) Huang, B.; von Lilienfeld, O. A. Communication: Understanding Molecular Representations in Machine Learning: The Role of Uniqueness and Target Similarity. J. Chem. Phys. 2016, 145, 161102. (34) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754. (35) Ziegel, E. R. Neural Networks in Computer Intelligence. Technometrics 1995, 37, 470. (36) Bishop, J. M.; Bushnell, M. J.; Westland, S. Application of Neural Networks to Computer Recipe Prediction. Color Res. Appl. 1991, 16, 3−9. (37) Roth, H. R.; Lu, L.; Liu, J.; Yao, J.; Seff, A.; Cherry, K.; Kim, L.; Summers, R. M. Improving Computer-aided Detection using Convolutional Neural Networks and Random View Aggregation. IEEE Trans. Med. Imag. 2016, 35, 1170−1181. (38) Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, https://arxiv.org/pdf/1412.6980.pdf. (39) Graves, A. Generating Sequences With Recurrent Neural Networks. arXiv 2013, https://arxiv.org/pdf/1308.0850.pdf. (40) Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Machine Learn. Res. 2011, 12, 257−269. (41) Gao, B.; Xu, Y. Univariant Approximation by Superpositions of a Sigmoidal Function. J. Math. Anal. Appl. 1993, 178, 221−226. (42) Winkler, D. A.; Le, T. C. Performance of Deep and Shallow Neural Networks, the Universal Approximation Theorem, Activity Cliffs, and QSAR. Mol. Inf. 2017, 36, 1600118. (43) Xin, J. F.; He, F. F.; Ding, Y. H. Bottom-up Design of High-energy-density Molecules (N2CO)n (n = 2−8). RSC Adv. 2017, 7, 8533−8541.
