Machine Learning Methods to Predict Density ... - ACS Publications

Article pubs.acs.org/jcim

Machine Learning Methods to Predict Density Functional Theory B3LYP Energies of HOMO and LUMO Orbitals Florbela Pereira,† Kaixia Xiao,‡ Diogo A. R. S. Latino,† Chengcheng Wu,‡ Qingyou Zhang,*,‡ and Joao Aires-de-Sousa*,† †

LAQV and REQUIMTE, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal ‡ Henan Engineering Research Center of Industrial Circulating Water Treatment, College of Chemistry and Chemical Engineering, Henan University, Kaifeng, 475004, PR China S Supporting Information *

ABSTRACT: Machine learning algorithms were explored for the fast estimation of HOMO and LUMO orbital energies calculated by DFT B3LYP, on the basis of molecular descriptors exclusively based on connectivity. The whole project involved the retrieval and generation of molecular structures, quantum chemical calculations for a database with >111 000 structures, development of new molecular descriptors, and training/validation of machine learning models. Several machine learning algorithms were screened, and an applicability domain was defined based on Euclidean distances to the training set. Random forest models predicted an external test set of 9989 compounds with mean absolute errors (MAE) as low as 0.15 and 0.16 eV for the HOMO and LUMO orbitals, respectively. The impact of the quantum chemical calculation protocol was assessed with a subset of compounds. Inclusion of the orbital energy calculated by PM7 as an additional descriptor significantly improved the quality of estimations (reducing the MAE by >30%).



INTRODUCTION The energies of the highest occupied and lowest unoccupied molecular orbitals (HOMO and LUMO) calculated by quantum chemistry methods are currently of high importance for the discovery of new materials, namely for estimating optoelectronic properties and filtering databases of candidate organic molecules.1 The demand for ultrathin, lightweight, and flexible electronic devices has led to the exploration of organic materials with potentially unique combinations of electronic, chemical, and mechanical properties. For example, organic materials have vital applications in organic light-emitting diodes (OLEDs),2 organic photovoltaic devices (OPVs),3 and organic thin-film transistors (OTFTs).4 In OLEDs, a current of electrons flows through the device as electrons are injected into the LUMO of the layer at the cathode and withdrawn from the HOMO at the anode; radiation emission occurs with the electron relaxation from the LUMO to the HOMO, and the frequency of the radiation depends on the HOMO−LUMO gap.5 In organic solar cells, light absorption is interpreted in terms of electron excitations from the HOMO to the LUMO orbitals, and charge transport is achieved by electron transfers between the frontier orbitals of donors and acceptors.6 The efficiency of OPVs depends on the HOMO−LUMO gap of the polymer donor, and optimization of the energy difference ΔE between the LUMO of the donor and acceptor polymers is required. Some minimal ΔE is

required to separate the energies of the excited state of the donor and the acceptor, and was suggested to be 0.3 eV.7,8 The effects of the electric field on the HOMO, LUMO, and HOMO−LUMO gap were suggested as determining parameters for the suitability of organic materials as a conducting channel in OTFTs.9 Analogous to the well-established protocols for virtual screening in drug discovery projects, usually based on simulations of biomolecular docking, chemical similarities, pharmacophore searching, or QSAR models, various approaches have recently emerged for the virtual screening of new materials based on high-throughput quantum chemistry calculations. The Harvard Clean Energy Project has screened 2 million organic compounds using DFT calculations, including the energies of frontier orbitals, for the discovery of high-efficiency organic photovoltaic materials (OPVs).10 Ramprasad and co-workers generated and screened a virtual database of polymers with DFT calculations to efficiently identify advanced polymer dielectrics for capacitive energy storage applications11 and trained kernel ridge regressions for on-demand prediction of the bandgap (Egap) and dielectric constants.12 Using genetic algorithms and semiempirical calculations, O’Boyle et al. searched a space of synthetically accessible conjugated organic

Received: June 10, 2016 Published: December 29, 2016

DOI: 10.1021/acs.jcim.6b00340 J. Chem. Inf. Model. 2017, 57, 11−21

Article

Journal of Chemical Information and Modeling

Natural Bond Orbital atomic charges by feed-forward neural networks trained with >35 000 atomic charges.30 More recently we explored the estimation of atom-condensed Fukui functions with random forests and Bradley−Terry machine learning algorithms.31 Here we report the building of a database with the HOMO and LUMO energies calculated by B3LYP/6-31G* for 111 725 organic molecular structures, the training of machine learning methods with these data, the assessment of the models, and the impact of the theory level. After the publication of our preliminary results with a small data set,32 more extensive studies have appeared concerning the estimation of the energy of frontier orbitals. von Lilienfeld and co-workers33 described the training of four-layer neural networks (with 2000, 800, 800, and 1000 nodes at each layer) for the prediction of 14 properties, including frontier orbital energies, restricted to small structures with up to seven atoms consisting of C, N, O, S, or Cl, saturated with hydrogens. The training set was built with 5000 molecules represented by “Coulomb matrices” (encoding 3D interatomic distances and nuclear charges) and their HOMO and LUMO eigenvalues calculated by hybrid density functional theory. The models were validated with an external test set of 2211 molecules; MAE of 0.15−0.16 eV (HOMO) and 0.11−0.13 eV (LUMO) were obtained. In 2015, the Aspuru-Guzik lab reported the training of neural networks with a data set of 200 000 molecules sampled from the chemical space of the Harvard Clean Energy Project Database (CEPDB), which contains calculated data for 2.3 million OPV candidate compounds. The CEPDB virtual screening library was generated from a set of 26 experimentally motivated building blocks using two combination rules,10 and their HOMO and LUMO energies range from −7.51 to −3.63 eV and −6.09 to −1.15 eV, respectively.
Molecular fingerprints were used as descriptors and mean absolute errors of 0.028 eV (HOMO) and 0.032 eV (LUMO) were reported for a test set of 50 000 molecules from the same chemical space.1 We now report the application of a more diverse database, new fast molecular descriptors, and the assessment of training machine learning methods with DFT calculations obtained at a less expensive level of theory, namely with single-point B3LYP/6-31G* calculations after geometry optimization with semiempirical methods. The whole project involved the retrieval and generation of molecular structures, quantum chemical calculations, curation of the resulting database with >111 000 structures and properties, development of new molecular descriptors, development of machine learning models, and assessment of results.

copolymers of six or eight monomer units (from a list of 132) for the identification of polymers with optimized HOMO and LUMO energies for photovoltaic materials.7 Related initiatives, but in the field of inorganic materials (aiming at new batteries, conducting oxides, or thermoelectric materials), include the MIT and LBNL Materials Project (www.materialsproject.org),13 the Open Quantum Materials Database from the Wolverton lab,14 and the AFLOWLIB.org consortium.15 Beyond materials, the calculation of HOMO and LUMO energies finds relevant application in the assessment of chemical reactivity and the derivation of molecular descriptors for QSAR and QSPR models. Several aspects of chemical reactivity are governed by the energy of frontier orbitals. Following Koopmans’ theorem, qualitative concepts such as electronegativity, chemical potential, and hardness have been provided with rigorous definitions by DFT calculations, based on the HOMO and LUMO energies.16,17 Hardness was defined as η = (εLUMO − εHOMO), chemical potential as μ = −(εHOMO + εLUMO)/2, and the Parr electrophilicity index as ω = μ²/(2η).
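As a small worked illustration of the three definitions above, the following sketch evaluates them for one pair of orbital energies. The function name `reactivity_indices` and the example energies are hypothetical, and the formulas (including the sign convention for the chemical potential) are copied as written in the text rather than taken from any particular software package.

```python
def reactivity_indices(e_homo, e_lumo):
    """Return (hardness, chemical potential, electrophilicity index) in eV.

    The three expressions follow the definitions quoted in the text:
    eta = e_lumo - e_homo, mu = -(e_homo + e_lumo) / 2, omega = mu**2 / (2 * eta).
    """
    hardness = e_lumo - e_homo                     # eta
    chemical_potential = -(e_homo + e_lumo) / 2.0  # mu, sign as written in the text
    electrophilicity = chemical_potential ** 2 / (2.0 * hardness)  # omega
    return hardness, chemical_potential, electrophilicity


# Hypothetical orbital energies in eV, chosen only to make the arithmetic easy.
eta, mu, omega = reactivity_indices(e_homo=-6.0, e_lumo=-2.0)
print(eta, mu, omega)  # 4.0 4.0 2.0
```

Because all three indices depend only on the two orbital energies, any error in a predicted HOMO or LUMO value propagates directly into them.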
The electrophilicity index and derivatives could be successfully correlated with experimental chemical reactivity, spectroscopic data, toxicological end points, and biological activities.18 High correlations (R > 0.94) were observed between the Mayr electrophilicity parameter and the electrophilicity index within series of compounds such as benzene diazonium ions19 and benzhydryl cations.20 The application of descriptors derived from the energy of frontier orbitals has a long history in QSAR and QSPR involving small training sets.21 More recent examples are models for the prediction of bioconcentration factors,22 rate constants of oxidations of organic contaminants,23 thiol reactivity and toxicity of water disinfection byproducts,24,25 and aromatase inhibition.26 Alternatives to the computationally demanding DFT calculations can be envisaged to obtain energies of frontier orbitals and other valuable properties for whole molecules, bonds, or atoms. Machine learning from data precalculated by DFT or ab initio methods can provide ultrafast estimations. This requires computationally inexpensive molecular descriptors, adequate machine learning algorithms, and well-designed data sets. The exact role of the resulting models in virtual screening programs largely depends on their accuracies, and is still to be explored. Machine learning models are expected to provide early stage filters that can select sets of promising molecules for further screening by other, computationally more demanding methods. Studies have been reported in which machine learning algorithms were trained with thousands of data points to predict ab initio- or DFT-calculated properties of organic molecules. Lilienfeld et al. 
proposed nonlinear regression methods to estimate atomization energies of compounds computed with hybrid density-functional theory, from nuclear charges and atomic positions.27 Rai and Bakken reported fast and accurate models (random forests) to generate ab initio quality electrostatic potential (ESP) atomic charges, in which atomic descriptors were calculated from empirical 3D models of molecules, and separate models were derived for the elements H, C, N, O, F, S, and Cl.28 Our lab trained machine learning algorithms such as neural networks, support-vector machines, and random forests with >12 000 bond energies calculated by DFT;29 extremely fast predictions were enabled by bond descriptors exclusively based on the connectivity table. A similar QSPR approach was successful in the prediction of DFT



METHODS Data Sets/Selection of Training and Test Sets. The database of molecules was designed to sample the chemical space of neutral organic molecules related to applications for which the energies of frontier orbitals are expected to be useful. Examples and structural motifs were retrieved from organic electronics studies,7,34 and collections of dyes, metabolites, and electrophiles/nucleophiles.35,36 The database was populated by retrieval of similar examples from the ZINC database,37 the PubChem database,38 and by computationally combining motifs and lists of substituents with the ChemAxon Reactor software (JChem 15.4.6, 2015, ChemAxon, http://www.chemaxon.com). Computationally generated structures correspond to 65% of the database; 30% of these were based on motifs from metabolites and dyes, while the other 70% were


spheres. The molecular descriptors finally consist of the counts of pairs of atoms within specific intervals of modified distances. For example, if the resolution of the code is 0.01, and a pair of atoms is at a distance of 2.86, the 286th descriptor is increased by 1. Md descriptors were calculated for 1010 intervals, using a resolution of 0.017, interatomic distances up to 4 bonds, and a distance factor of 4. Machine Learning Methods. Model Regression Trees. Model regression trees (M5P) were grown with the Quinlan M5 algorithm53 implemented in Weka 3.7.12 using the default parameters and unpruned trees. A tree is sequentially constructed by partitioning compounds from a parent node into two child nodes. Each node is produced by a logical rule, defined for a single descriptor, where compounds below a certain value of the descriptor fall into one of the two child nodes, and compounds above fall into the other child node. M5 trees have multivariate linear models at their leaves; they are analogous to piecewise linear functions. Random Forests (RF).54,55 A random forest (RF) is an ensemble of unpruned regression or classification trees which are created using bootstrap samples of the training set. In this process, for each individual tree the best split at each node is defined using a randomly selected subset of descriptors. Each individual tree is created using a different training and validation set. The final prediction for an object from a random forest is obtained as an average of the predictions of the individual regression trees in the forest. The predictions obtained for the objects left out of the training are compared to the target values, and deviations are averaged in the out-of-bag (OOB) error estimation. In the experiments presented here, RFs were used for the development of regression models to estimate HOMO, LUMO, and GAP energies.
RFs were grown with the R program,56 version 3.0.2, using the randomForest library.57 The number of trees in the forest was set to 200 and the other parameters were used with default values. Support Vector Machines (SVMs).58 SVMs map multidimensional data into a hyperspace through a nonlinear transformation (kernel function) and then apply a linear regression (a boundary or hyperplane) in this space. The boundary is positioned using examples in the training set which are known as the support vectors. In this study, SVM models were explored with the Weka59 (version 3.7.12) implementation of the LIBSVM software.60 The epsilon-SVM-regression type was chosen, the kernel function was the radial basis function with the default gamma parameter, and the parameter C was optimized in the range of 10−10 000 through cross-validation with the training set. Multilayer Perceptron (MLP). An MLP is a feed-forward neural network (NN) and was used in this work as specifically implemented in Weka (version 3.7.12). The perceptron computes a single output from multiple real-valued inputs; it forms a linear combination with the input values and then predicts the output through a nonlinear decision surface. The MLP can be optimized with the back-propagation algorithm. In this work, the Weka MLPerceptron options were set as default, except for the number of hidden units, learning rate, and momentum, which were optimized in cross-validation experiments with the training set.

based on motifs from organic electronics studies. The structures were standardized with ChemAxon Standardizer (JChem 15.4.6, 2015, ChemAxon, http://www.chemaxon.com) and OpenBabel (Open Babel Package, version 2.3.1, http://openbabel.org) for neutralization and inclusion of all hydrogen atoms. Duplicated molecules were discarded, based on canonical SMILES and InChI codes. The final database consists of 111 725 molecules, which were randomly divided into a training set of 88 537 molecules, a test set of 9989 molecules, and a final prediction set of 13 199 molecules. The molecular structures include atomic elements C, H, B, N, O, F, Si, P, S, Cl, Se, and Br. The data concerning molecular structures and orbital energies were deposited in a public repository.39 Geometry Optimization and DFT Calculations. The calculation of the HOMO and LUMO parameters by DFT methods was performed in a semiautomatic way. Starting from SMILES strings or SDF files, the workflow consisted of the generation of the most stable conformer with JChem CXCALC (JChem 15.4.6, 2015, ChemAxon Ltd., Budapest, Hungary), optimization of the 3D structure with MOPAC40 using the PM6 or PM7 semiempirical method,41 calculation of the harmonic vibrational frequencies to confirm that the optimized geometry is a minimum on the potential energy surface (all real frequencies) at the same theory level, and single-point energy calculations using the hybrid B3LYP method42,43 with the 6-31G* basis set44,45 using the GAMESS package.46,47 The HOMO (energy of the highest occupied molecular orbital, εHOMO) and LUMO (energy of the lowest unoccupied molecular orbital, εLUMO) parameters were extracted directly from the GAMESS output. Calculation of Molecular Descriptors. PaDEL and CDK Descriptors/Fingerprints.
Fingerprints and molecular descriptors were calculated by PaDEL-Descriptor version 2.11.48 Different types of fingerprints with different sizes were calculated and explored: 79 Estate (E-State fragments), 166 MACCS (MACCS keys), 307 Substructure (presence and count of SMARTS patterns for Laggner functional group classification, Sub and SubC, respectively), 881 PubChem fingerprints,49 and a total of 151 1D and 2D molecular descriptors (including electronic, topological, and constitutional descriptors). Morgan circular fingerprints50 were calculated with size 1024 by CDK Descriptor Calculator version 1.4.6.51 Modified Distance Descriptors. Descriptors were designed exclusively based on the molecular connectivity, in order to avoid the generation of a 3D conformer, and making no use of bond orders and atomic formal charges in order to avoid the application of an aromaticity definition and the standardization of mesomers. Particularly with large aromatic and conjugated systems, such standardization procedures are prone to create inconsistencies in bond orders and valence errors. Modified distances (Md) descriptors were implemented that count the pairs of atoms in a molecule at specific “modified distances”. Modified distances were defined in terms of the van der Waals radius of the atoms and Sanderson electronegativity of neighbors.52 Values of these properties for atomic elements were used. First, for every atom, the van der Waals radius is added to the sum of the electronegativities of its neighbors divided by a distance factor. Then, for each atom of the first sphere, the value for the kernel atom is added to the radius of the new atom and to the electronegativity of the new atom’s neighbors divided by the distance factor +1. These are the modified distances to atoms of the first sphere. The procedure is repeated for successive spheres until the defined number of



RESULTS AND DISCUSSION Prediction of HOMO and LUMO Energies with ML Algorithms. Random Forests were used to build predictive


Figure 1. Prediction of the HOMO (a), LUMO (b), and HOMO−LUMO gap (c) energies in the test set by Random Forests trained on the basis of different molecular descriptors.

eV. For the LUMO, the best results yielded a MAE of 0.16 eV, RMSE of 0.23 eV, 27% of the structures predicted with a deviation less than 0.05 eV and 95.4% of the structures predicted with a deviation less than 0.5 eV. The HOMO− LUMO gap could be predicted with a MAE of 0.21 eV and RMSE of 0.30 eV. Prediction of the HOMO−LUMO gap as the difference of predicted LUMO and HOMO values achieved essentially the same accuracy as the prediction by models trained directly with the gap values (in experiments with Md descriptors). Md descriptors avoid several molecular standardization steps and were faster in our implementations (calculation of Md descriptors for 71 009 molecules took only 194 s on a PC using an Intel Core i7-3770 CPU 3.40 GHz and the 64 bit

models of the HOMO and LUMO energies, and the corresponding gap, exploring the new Md descriptors and other well-established fingerprints and descriptors. The training and test sets consisted of 88 537 molecules and 9989 molecules, respectively (Figure 1 and Table S1). Md descriptors, SubstructureCount, and PubChem fingerprints achieved the best results. MACCS fingerprints also performed well with a smaller number of descriptors. Averaged predictions (consensus) obtained by Md and PubChem descriptors (CM1), or Md, SubC, and PubChem (CM2) further improved the results. In the best case, the HOMO energy was predicted for the test set with a MAE of 0.15 eV, RMSE of 0.21 eV, 27% of the structures predicted with a deviation less than 0.05 eV and 96.5% of the structures predicted with a deviation less than 0.5


Ubuntu 12.04 operating system, while calculation of PubChem and SubC fingerprints showed an average speed 50 and 20 times lower, respectively). Apart from calculation speed, other features of descriptors, such as length and binary nature, contribute to the time requirements of the whole method, both in training and prediction modes. This also depends on the ML algorithm and its specific implementation. Even so, from an application point of view, fast predictions are more relevant than fast training, and descriptor calculation has a higher impact on the prediction times than on the training times. Several further experiments were restricted to Md descriptors. Figures 2 and 3 represent the plot of predicted vs DFT-calculated HOMO and LUMO energies.

D_{AB} = \sqrt{\sum_{i=1}^{N_C} \left( C_{A,i} - C_{B,i} \right)^2} \qquad (1)

where CA,i and CB,i are the Md descriptors of A and B, respectively; NC is the number of Md descriptors. The real significance of such a definition, i.e., the ability to reduce average deviations in inner regions of the domain, was explored for the 9989 structures of the test set with several thresholds of Euclidean distances (Figure 4 and Table S2). The deviations both for the HOMO and the LUMO predictions are clearly lower within the defined applicability domain than outside, and they increase with the threshold. For a threshold of 0.1, the RMSE for the HOMO of compounds within the domain (0.080 eV) is approximately one-third of the RMSE for the whole test set. A threshold of 1.5 puts 111 structures outside the defined applicability domain (1.1% of the 9989-structure test set) that were predicted with RMSE = 0.374 eV. For the predictions of the LUMO energy, with a threshold of 1.5, the RMSE outside the domain (0.534 eV) is almost twice the RMSE within it (0.269 eV). This also illustrates how the errors can be reduced by using training sets exhaustively sampling the same chemical space as the test set. For practical applications, reducing the occurrence of large deviations is particularly relevant, more than reducing global average deviations to values below experimental uncertainty. Large deviations of predictions often result from unbalanced training sets. In order to overcome such a possible situation, our training set was recalibrated by removing a fraction of compounds yielding very small errors and repeating compounds with large errors (the errors obtained in the RF OOB predictions for the training set were used). Md descriptors were used. A random set comprising 2/3 of the compounds with errors < RMSE/2 was removed (30 415 and 33 295 compounds for HOMO and LUMO, respectively), and compounds with errors > 2RMSE were repeated four times (5100 and 5067 compounds for HOMO and LUMO, respectively).
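The recalibration rule just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `recalibrate` and its parameters are hypothetical, the input is assumed to be a list of absolute OOB errors indexed by training compound, and "repeated four times" is assumed to mean that each large-error compound appears four times in the new training set.

```python
import random


def recalibrate(abs_errors, rmse, copies=4, seed=0):
    """Return training-set indices after recalibration.

    abs_errors: absolute OOB error of each training compound (assumed input).
    rmse:       RMSE of the OOB predictions for the whole training set.
    copies:     how many times a large-error compound appears (assumption).
    """
    rng = random.Random(seed)
    # Compounds predicted with very small errors (< RMSE/2).
    small = [i for i, e in enumerate(abs_errors) if e < rmse / 2]
    # Randomly drop two-thirds of them.
    dropped = set(rng.sample(small, round(2 * len(small) / 3)))
    resampled = []
    for i, e in enumerate(abs_errors):
        if i in dropped:
            continue
        # Compounds with errors > 2*RMSE are repeated; all others appear once.
        resampled.extend([i] * (copies if e > 2 * rmse else 1))
    return resampled


# Toy example: six well-predicted, three average, one badly predicted compound.
errors = [0.01] * 6 + [0.2] * 3 + [1.0]
new_indices = recalibrate(errors, rmse=0.3)
print(len(new_indices), new_indices.count(9))
```

The returned indices can then be used to assemble the descriptor matrix and target vector of the recalibrated training set before refitting the model.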
RF models trained with the recalibrated data sets yielded predictions for the test set with R2 = 0.887, MAE = 0.161 eV, and RMSE = 0.222 eV (for HOMO) and R2 = 0.921, MAE = 0.179 eV, and RMSE = 0.256 eV (for LUMO). The percentage of compounds with errors larger than 0.5 eV decreased from 4.5% to 3.6% (for HOMO) and from 7.0% to 6.0% (for LUMO), although the percentage of compounds predicted with errors 100 000 compounds). In order to assess the impact of reducing the training set size for this particular data set, experiments were performed with random training sets consisting of ca. 1/4 of the original set (Table 2); they indicate an increase in average deviations of ca. 30%. Unlike other published approaches for the estimation of orbital energies,1,33 we have used random forest as the machine learning algorithm. In our experience, this has been consistently one of the best performing algorithms, with several advantages such as the efficient processing of large numbers of descriptors and minimum requirements concerning optimization of parameters. However, we also trained support

Figure 2. Predicted vs DFT-calculated energy of the HOMO orbital for the 9989 molecular structures of the external test set (results obtained with the RF model trained with the whole training set encoded with the Md descriptors).

Figure 3. Predicted vs DFT-calculated energy of the LUMO orbital for the 9989 molecular structures of the external test set (results obtained with the RF model trained with the whole training set using the Md descriptors).
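The statistics used throughout this section to compare predicted and DFT-calculated energies (MAE, RMSE, and the percentages of structures with deviations below 0.05 eV or above 0.5 eV) can be computed from paired values as sketched below. The function name `error_summary` and the dictionary keys are hypothetical; only the two thresholds come from the text.

```python
import math


def error_summary(predicted, reference, tight=0.05, loose=0.5):
    """Summarize absolute deviations between predicted and reference energies (eV)."""
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    n = len(errors)
    return {
        "mae": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        # Percentage of structures predicted within the tight threshold.
        "pct_below_tight": 100.0 * sum(e < tight for e in errors) / n,
        # Percentage of structures with deviations above the loose threshold.
        "pct_above_loose": 100.0 * sum(e > loose for e in errors) / n,
    }


# Hypothetical HOMO energies in eV for three structures.
summary = error_summary(predicted=[-5.0, -5.1, -6.0], reference=[-5.0, -5.0, -5.0])
print(summary)
```

Note that the text quotes the percentage of structures with deviations less than 0.5 eV, whereas the tables list the percentage above 0.5 eV; the two are complementary.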

A definition of applicability domain based on the similarity between a molecule and its most similar compound in the training set was explored. The similarity was calculated as the Euclidean distance, DAB, between the Md descriptors of two molecules A and B:


Figure 4. Efficiency of a definition of applicability domain based on the maximum Md descriptors similarity to the closest molecule in the training set for HOMO (a) and LUMO (b) energies (experiments for the test set).
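The applicability-domain test evaluated in Figure 4 amounts to a nearest-neighbor search in descriptor space followed by a threshold comparison. The sketch below assumes that descriptor vectors are plain sequences of numbers used as given (any scaling of the Md descriptors is not specified here), and that a distance equal to the threshold counts as inside the domain; the function names are hypothetical.

```python
import math


def nearest_training_distance(query, training_set):
    """Euclidean distance from a query descriptor vector to its closest training vector."""
    return min(math.dist(query, t) for t in training_set)


def in_domain(query, training_set, threshold):
    """True if the closest training compound lies within the distance threshold."""
    return nearest_training_distance(query, training_set) <= threshold


# Toy three-dimensional descriptor vectors (real Md vectors have 1010 components).
training = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
query = (0.0, 0.0, 1.0)
print(nearest_training_distance(query, training))  # 1.0
print(in_domain(query, training, threshold=1.5))   # True
print(in_domain(query, training, threshold=0.5))   # False
```

A linear scan such as this costs one distance evaluation per training compound for every query; for a training set of this size a spatial index or vectorized implementation would likely be preferred in practice.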

training of much smaller RF models with even better prediction accuracies than the models trained with the whole set of descriptors. Finally, the RF models trained with all Md descriptors, as well as those trained with PubChem fingerprints, were validated with a final prediction set consisting of 13 199 molecules not used for any task before (Table 4). Impact of Quantum Calculation Methods. A main aspect of this study is the exploration of a relatively inexpensive quantum chemistry calculation protocol as a source of data for machine learning of frontier orbital energies. Using a random subset of 84 molecules from the validation data set, two additional computational protocols for the calculation of the energies by DFT were followed to assess the impact of each procedure, and the ability of the developed ML models to predict orbital energies calculated at a higher level of theory (Table 5). The ML models correspond to the RF trained with all training set molecules encoded with all Md descriptors. Method A consists of generating a single conformer with an empirical method (in the JChem software), optimizing the geometry with semiempirical PM7, and calculating a single point by DFT B3LYP with the 6-31G* basis (this is the original procedure followed for the whole database of 111 725 compounds). In method B, 10 conformers were generated with JChem and optimized with PM7; the most stable conformer was submitted

Table 1. Impact of Training Set Recalibration on the Predictions of Test Set Compounds in Outer Regions of the Applicability Domain

distance to training set      >0.5      >0.8      >1.2      >1.5
HOMO
RMSE (recalibrated), eV       0.2715    0.3209    0.3399    0.3517
RMSE (no calibration), eV     0.2856    0.3453    0.3671    0.3743
LUMO
RMSE (recalibrated), eV       0.3228    0.3981    0.4677    0.4801
RMSE (no calibration), eV     0.3534    0.4504    0.5267    0.5349

vector machine, multilayer perceptron, and M5P model trees with the reduced version of the training set, using different molecular descriptors and variable selection based on RF variable importance (Table 3). Variation of the ML algorithm could not achieve any consistent improvement of the results obtained with random forests. With the Md descriptors, a slight improvement with MLPerceptrons was detected, but an experiment combining training set recalibration, selection of 150 descriptors by RF, and training a MLPerceptron did not yield superior results compared to the RF trained with the whole training set. However, selection of 100 Md descriptors by RF enabled the


Table 2. Prediction of the HOMO and LUMO Energies by Random Forests Trained with a Reduced Number of Training Examples

               training set (OOB estimation)     test set
descriptors    MAE (eV)    R2/RMSE (eV)          MAE (eV)    R2/RMSE (eV)     % error > 0.5 eV/% error < 0.05 eV (a)

HOMO
Md             0.2085      0.8019/0.2847         0.2082      0.8103/0.2883    8.03/19.11
MACCS          0.2053      0.8015/0.2852         0.2024      0.8024/0.2858    8.04/20.02
SubC           0.2040      0.8019/0.2847         0.2040      0.8004/0.2873    7.98/19.99
PubChem        0.1928      0.8290/0.2657         0.1909      0.8298/0.2669    6.72/21.33

LUMO
Md             0.2456      0.8403/0.3544         0.2401      0.8519/0.3483    12.15/19.06
MACCS          0.2392      0.8555/0.3399         0.2361      0.8582/0.3345    11.96/18.12
SubC           0.2094      0.8858/0.3009         0.2080      0.8864/0.2978    8.66/20.03
PubChem        0.2186      0.8763/0.3141         0.2155      0.8785/0.3096    9.91/20.53

(a) Percent of molecules predicted with absolute error above 0.5 eV or below 0.05 eV.

Table 3. Exploration of Variable Selection and Different ML Algorithms Using Reduced Training Sets (a)

descriptor/ML            selection of descriptors (no.)    MAE (eV)    R2/RMSE (eV)     % error > 0.5 eV/% error < 0.05 eV (c)

HOMO
Md/RF                    yes (100)b                        0.2060      0.8113/0.2843    7.78/19.03
Md/SVM                   yes (100)b                        0.1930      0.8237/0.2676    6.24/19.38
Md/MLPerceptron          yes (75)b                         0.2186      0.7973/0.2874    7.97/16.02
Md/MLPerceptron          yes (100)b                        0.2087      0.8138/0.2756    6.93/16.12
Md/MLPerceptron          yes (150)b                        0.1970      0.8321/0.2617    5.62/17.78
Md/M5P                   yes (100)b                        0.2335      0.7677/0.3076    9.59/14.72
MACCS/RF                 no (166)                          0.2024      0.8024/0.2858    8.04/20.02
MACCS/SVM                no (166)                          0.2067      0.8049/0.2820    7.66/18.53
MACCS/MLPerceptron       no (166)                          0.2427      0.7605/0.3189    10.52/13.87
MACCS/M5P                no (166)                          0.2459      0.7315/0.3308    11.79/14.56
SubC/RF                  no (307)                          0.2040      0.8004/0.2873    7.98/19.99
SubC/SVM                 no (307)                          0.1988      0.8183/0.2723    6.65/18.83
SubC/MLPerceptron        no (307)                          0.2677      0.7646/0.3425    13.32/11.82
SubC/M5P                 no (307)                          0.2469      0.7360/0.3279    11.48/13.91
PubChem/RF               yes (150)b                        0.2002      0.8145/0.2764    7.44/19.72
PubChem/SVM              yes (150)b                        0.1976      0.8217/0.2695    6.78/19.50
PubChem/SVM              yes (286)c                        0.2188      0.7785/0.3001    9.15/18.31
PubChem/MLPerceptron     yes (150)b                        0.2313      0.7732/0.3037    9.59/15.27
PubChem/M5P              yes (150)b                        0.2499      0.7328/0.3294    12.12/14.30

LUMO
Md/RF                    yes (100)b                        0.2353      0.8554/0.3406    12.13/19.03
Md/SVM                   yes (100)b                        0.2432      0.8477/0.3439    11.28/16.25
Md/MLPerceptron          yes (75)b                         0.2566      0.8471/0.3447    12.56/14.14
Md/MLPerceptron          yes (100)b                        0.2436      0.8614/0.3284    11.02/14.25
Md/MLPerceptron          yes (150)b                        0.2231      0.8842/0.3001    9.01/16.28
Md/M5P                   yes (100)b                        0.2641      0.8347/0.3590    14.14/14.24
MACCS/RF                 no (166)                          0.2361      0.8582/0.3345    11.96/18.12
MACCS/SVM                no (166)                          0.2443      0.8527/0.3385    11.83/16.40
MACCS/MLPerceptron       no (166)                          0.2962      0.8127/0.3912    17.70/11.61
MACCS/M5P                no (166)                          0.2953      0.7856/0.4089    17.60/13.49
SubC/RF                  no (307)                          0.2080      0.8864/0.2978    8.66/20.03
SubC/SVM                 no (307)                          0.2625      0.8333/0.3600    12.95/13.63
SubC/MLPerceptron        no (307)                          0.2944      0.8429/0.3911    16.98/11.88
SubC/M5P                 no (307)                          0.2546      0.8471/0.3448    12.27/14.32
PubChem/RF               yes (150)b                        0.2216      0.8737/0.3145    10.56/19.68
PubChem/SVM              yes (150)b                        0.2264      0.8722/0.3155    10.26/17.89
PubChem/MLPerceptron     yes (150)b                        0.2670      0.8364/0.3565    14.30/13.50
PubChem/M5P              yes (150)b                        0.2903      0.7986/0.3957    17.25/13.87

(a) Predictions obtained for the test set. (b) Descriptors selected based on the importance assigned by RF models. (c) Percent of molecules predicted with absolute error above 0.5 eV or below 0.05 eV.

DOI: 10.1021/acs.jcim.6b00340 J. Chem. Inf. Model. 2017, 57, 11−21


are several orders of magnitude slower than machine learning methods, they are considerably faster than DFT calculations. Table 5 shows a high correlation between the energies calculated by PM7 and DFT. Considering all structures of the test set calculated by PM7, the R2 values between PM7 and DFT (method A) are 0.78 and 0.87 for the HOMO and LUMO, respectively. We retrained RF models using the same descriptors as before, but including the HOMO or the LUMO PM7 energy as an additional descriptor, and observed a significant improvement in the quality of the ML estimations (Table 6). Using Md descriptors, the HOMO and LUMO energies could be estimated with R2 = 0.95 and 0.98, respectively; more than 99.3% of the compounds were predicted with absolute deviations below 0.5 eV, and almost half were predicted with absolute deviations below 0.05 eV. The results demonstrate that ML predictions can be improved (if more computational time is allowed) by including PM7 calculations, and also that ML techniques can bring PM7 results closer to DFT calculations. The considerable decrease in the number of compounds predicted with large deviations shows how the inclusion of PM7 calculations makes the models less restricted to the specificities of the training set.61

Application of the ML Models to the Selection of Molecular Structures. The deviations between ML predictions and energy values in our database are of approximately the same magnitude as the reported uncertainties in the measurement of experimental values and as the deviations between calibrated DFT calculations and experimental values.62 A main application of our ML models can be the filtering of large databases to select small data sets of structures for further computationally expensive DFT calculations. In order to illustrate this concept, the RF predictions of the test set using the Md descriptors were inspected for their ability to identify the 50 structures with the lowest DFT HOMO−LUMO gap among the 100 structures with the lowest LUMO values (“positive structures”), a possible criterion for OPV candidate structures in the range of values in our database.63 In the group of 50 structures selected with the same criteria on the basis of predicted values, we could identify 60% of the positive structures. This percentage rises to 80% if 250 structures are selected from the 1000 predicted with the lowest LUMO, i.e., the model would suggest 2.5% of the original data set for further calculations, and these include 80% of all positive structures.
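The screening test described above can be expressed as a nested selection (lowest LUMO first, then lowest gap) applied once to the DFT values and once to the predicted values, followed by counting how many DFT-selected "positive structures" are recovered. The sketch below is only an illustration under stated assumptions: the helper names `select_candidates` and `recall_of_positives` are hypothetical, and the energies are synthetic random numbers rather than data from the paper, so the resulting fractions carry no meaning beyond showing the procedure.

```python
import random


def select_candidates(homo, lumo, n_lowest_lumo, n_lowest_gap):
    """Indices of the n_lowest_gap structures with the smallest HOMO-LUMO gap
    among the n_lowest_lumo structures with the lowest LUMO energy."""
    by_lumo = sorted(range(len(lumo)), key=lambda i: lumo[i])[:n_lowest_lumo]
    by_gap = sorted(by_lumo, key=lambda i: lumo[i] - homo[i])[:n_lowest_gap]
    return set(by_gap)


def recall_of_positives(dft, pred, pos=(100, 50), sel=(100, 50)):
    """Fraction of DFT-defined positives found in the prediction-based selection."""
    positives = select_candidates(dft["homo"], dft["lumo"], *pos)
    selected = select_candidates(pred["homo"], pred["lumo"], *sel)
    return len(positives & selected) / len(positives)


# Synthetic stand-in for a test set (assumption: Gaussian energies in eV,
# with predictions modeled as reference values plus Gaussian noise).
random.seed(0)
n = 5000
dft = {
    "homo": [random.gauss(-6.0, 0.6) for _ in range(n)],
    "lumo": [random.gauss(-1.5, 0.8) for _ in range(n)],
}
pred = {
    "homo": [e + random.gauss(0.0, 0.2) for e in dft["homo"]],
    "lumo": [e + random.gauss(0.0, 0.2) for e in dft["lumo"]],
}

strict = recall_of_positives(dft, pred, sel=(100, 50))    # same criteria as the positives
wide = recall_of_positives(dft, pred, sel=(1000, 250))    # wider selection, as in the text
perfect = recall_of_positives(dft, dft)                   # sanity check with exact predictions
print(f"strict: {strict:.2f}  wide: {wide:.2f}  perfect: {perfect:.2f}")
```

Widening the prediction-based selection trades additional DFT calculations for a higher chance of recovering the positives, which is the trade-off the paragraph quantifies with its 60% and 80% figures.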

Table 4. Validation of RF Models with a Final Prediction Set Consisting of 13 199 Molecules

orbital   descriptors   MAE (eV)   RMSE (eV)
HOMO      Md            0.175      0.246
HOMO      PubChem       0.185      0.255
LUMO      Md            0.204      0.305
LUMO      PubChem       0.199      0.289

to a single-point calculation with DFT B3LYP and the 6-31G* basis set. In method C, the geometry of the most stable conformer of method B was further optimized with DFT B3LYP and the same basis set. In addition to the results in Table 5, the MAEs of the ML predictions for this data set against energies calculated by method A are 0.111, 0.110, and 0.155 eV for the HOMO, LUMO, and gap energies, respectively. The MADs between methods A and B are 0.059, 0.065, and 0.089 eV for the HOMO, LUMO, and gap energies, respectively. These last values give some indication of how much of the ML prediction deviations can be due to noise introduced in the data by generating only one conformer. The deviations between methods A and C (only slightly lower than the ML errors for energies obtained by method A) do not exclude the possibility that training ML models with data obtained at a higher level of theory may significantly reduce the error of ML predictions for our data set. Using calculations at a higher level of theory, the Aspuru-Guzik lab trained an ML model with a data set twice the size of ours and significantly less diverse (as indicated by the reported range of energy values and the procedure used to generate structures), and reported an MAE of ∼0.03 eV.1 These lower deviations certainly result from three factors: (a) a larger training set, (b) a significantly narrower (and exhaustively sampled) chemical space, and (c) more consistent data due to a higher level of theory and conformer search.

Our results can also be compared to those obtained by von Lilienfeld and co-workers33 (MAE of 0.15 and 0.12 eV for HOMO and LUMO, respectively) with quite a different training set and molecular descriptors: orbital energies obtained after DFT PBE0 structure optimization of only one conformer, a molecular descriptor based on the 3D structure of the conformer, and a much smaller training set of very small molecules (up to seven non-hydrogen atoms, saturated with hydrogen atoms) encompassing a large diversity of structures but restricted to six elements.

Improvement of Estimations with Supplementary PM7 Calculations. Even if PM7 semiempirical computations



CONCLUSION

Training machine learning models with >88 000 molecules and the respective frontier orbital energies at this specific level of

Table 5. Mean Absolute Deviations (MADs) and Correlations (R2) between Orbital Energies Calculated with Method C and with Other Methods for a Data Set of 84 Molecules

             HOMO                LUMO                GAP
method       MAD (eV)   R2       MAD (eV)   R2       MAD (eV)   R2
method A     0.0995     0.936    0.0964     0.983    0.1344     0.940
method B     0.0813     0.947    0.0655     0.989    0.1041     0.959
PM7 (A)a     -          0.799    -          0.895    -          0.652
PM7 (B)b     -          0.803    -          0.896    -          0.656
MLc          0.1539     0.850    0.1396     0.936    0.2090     0.846

a Calculated for the optimized structure of method A. b Calculated for the optimized structure of method B. c RF trained with all training set molecules encoded with all Md molecular descriptors.

Table 6. Machine Learning Prediction of Frontier Orbital Energies Making Use of PM7 Descriptors

HOMO

               training set (63777 compounds)   test set (7232 compounds)
descriptors    MAE      R2/RMSE (eV)            MAE      R2/RMSE (eV)     % error > 0.5/% error < 0.05 eV
Md             0.1332   0.8805/0.1882           0.1316   0.8879/0.1881    2.13/30.45
Md+PM7         0.0887   0.9492/0.1245           0.0869   0.9505/0.1230    0.33/42.08
SubC           0.1408   0.8684/0.1981           0.1421   0.8620/0.2019    2.75/27.49
SubC+PM7       0.0911   0.9449/0.1298           0.0894   0.9480/0.1259    0.40/41.63
PubChem        0.1306   0.8887/0.1827           0.1327   0.8826/0.1871    2.27/28.86
PubChem+PM7    0.0832   0.9514/0.1211           0.0823   0.9541/0.1175    0.32/45.04

LUMO

               training set (63777 compounds)   test set (7232 compounds)
descriptors    MAE      R2/RMSE (eV)            MAE      R2/RMSE (eV)     % error > 0.5/% error < 0.05 eV
Md             0.1395   0.9408/0.2052           0.1352   0.9459/0.2010    3.01/33.23
Md+PM7         0.0879   0.9765/0.1304           0.0851   0.9776/0.1269    0.64/46.52
SubC           0.1439   0.9400/0.2069           0.1428   0.9419/0.2037    3.14/27.67
SubC+PM7       0.0871   0.9765/0.1302           0.0843   0.9790/0.1229    0.46/45.35
PubChem        0.1393   0.9436/0.2010           0.1363   0.9447/0.1990    2.75/30.45
PubChem+PM7    0.0865   0.9760/0.1312           0.0830   0.9792/0.1221    0.51/47.05

GAP

               training set (63777 compounds)   test set (7232 compounds)
descriptors    MAE      R2/RMSE (eV)            MAE      R2/RMSE (eV)     % error > 0.5/% error < 0.05 eV
Md             0.1904   0.8935/0.2716           0.1874   0.8977/0.2695    6.96/24.83
Md+PM7         0.1254   0.9532/0.1801           0.1222   0.9555/0.1758    1.87/35.55
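The improvement brought by the PM7 descriptor in Table 6 can be checked with simple arithmetic on the MAE values. The sketch below assumes that the second MAE column of each block of Table 6 holds the test-set values; the dictionary layout and the helper name `relative_reduction` are hypothetical and serve only to organize the numbers copied from the table.

```python
# Test-set MAE values (eV) copied from Table 6 as (without PM7, with PM7).
# Assumption: the second MAE column of each block is the test-set column.
TEST_MAE = {
    ("HOMO", "Md"): (0.1316, 0.0869),
    ("HOMO", "SubC"): (0.1421, 0.0894),
    ("HOMO", "PubChem"): (0.1327, 0.0823),
    ("LUMO", "Md"): (0.1352, 0.0851),
    ("LUMO", "SubC"): (0.1428, 0.0843),
    ("LUMO", "PubChem"): (0.1363, 0.0830),
    ("GAP", "Md"): (0.1874, 0.1222),
}


def relative_reduction(before, after):
    """Relative decrease of an error measure, in percent."""
    return 100.0 * (before - after) / before


reductions = {key: relative_reduction(*pair) for key, pair in TEST_MAE.items()}

for (target, descriptors), pct in reductions.items():
    print(f"{target:5s} {descriptors:8s} MAE reduced by {pct:.1f}%")
```

Under this reading of the table, every descriptor set shows a test-set MAE reduction above 30% when the PM7 orbital energy is added.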

theory enabled the estimation of HOMO and LUMO energies with mean absolute errors as low as 0.15 and 0.16 eV, respectively, for an external test set. Random forests proved to be appropriate machine learning algorithms, and the Md molecular descriptors achieved essentially the same results as PubChem, SubC, or MACCS fingerprints, with the advantage of not requiring the assignment of bond orders or formal charges. Inclusion of the orbital energy calculated by PM7 as an additional descriptor significantly improved the quality of the estimations (reducing the MAE by >30%).

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00340.

Tables S1 and S2 with the numerical details of Figures 1 and 4 (PDF)

AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected] (Q.Z.).
*E-mail: [email protected] (J.A.S.).

Funding

Financial support from Fundação para a Ciência e a Tecnologia (FCT/MEC) Portugal, under Project PEst-OE/QUI/UI0612/2013, and grants SFRH/BPD/63192/2009 (D.A.R.S.L.) and SFRH/BPD/108237/2015 (F.P.) is greatly appreciated. This work was also supported by the Associated Laboratory for Sustainable Chemistry−Clean Processes and Technologies−LAQV, which is financed by national funds from FCT/MEC (UID/QUI/50006/2013) and cofinanced by the ERDF under the PT2020 Partnership Agreement (POCI-01-0145-FEDER-007265). The authors thank the National Natural Science Foundation of China (No. 20875022) for financial support. The authors acknowledge the International Science and Technology Cooperation of Henan Province, China (No. 162102410012). This work was also sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China (No. 20091001).

Notes

The authors declare no competing financial interest.

ABBREVIATIONS

HOMO, highest occupied molecular orbital; LUMO, lowest unoccupied molecular orbital; OLED, organic light-emitting diode; OTFT, organic thin-film transistor; OPV, organic photovoltaic device; Md, modified distances; RF, random forest; OOB, out-of-bag; SVM, support vector machine; MLP, multilayer perceptron; NN, neural network

REFERENCES

(1) Pyzer-Knapp, E. O.; Li, K.; Aspuru-Guzik, A. Learning from the Harvard Clean Energy Project: The Use of Neural Networks to Accelerate Materials Discovery. Adv. Funct. Mater. 2015, 25, 6495−6502.
(2) Service, R. F. Organic LEDs Look Forward to a Bright, White Future. Science 2005, 310, 1762−1763.
(3) Bessette, A.; Hanan, G. S. Design, Synthesis and Photophysical Studies of Dipyrromethene-Based Materials: Insights into their Applications in Organic Photovoltaic Devices. Chem. Soc. Rev. 2014, 43, 3342−3405.
(4) Roberts, M. E.; Sokolov, A. N.; Bao, Z. Material and Device Considerations for Organic Thin-Film Transistor Sensors. J. Mater. Chem. 2009, 19, 3351−3363.
(5) Organic Photovoltaics: Mechanisms, Materials, and Devices (Optical Engineering); CRC Press: Boca Raton, FL, 2005.
(6) Nelson, J. Organic Photovoltaic Films. Curr. Opin. Solid State Mater. Sci. 2002, 6, 87−95.
(7) O’Boyle, N. M.; Campbell, C. M.; Hutchison, G. R. Computational Design and Selection of Optimal Organic Photovoltaic Materials. J. Phys. Chem. C 2011, 115, 16200−16210.
(8) Scharber, M. C.; Muhlbacher, D.; Koppe, M.; Denk, P.; Waldauf, C.; Heeger, A. J.; Brabec, C. L. Design Rules for Donors in Bulk-Heterojunction Solar Cells - Towards 10% Energy-Conversion Efficiency. Adv. Mater. 2006, 18, 789−794.
(9) Siddiqui, S. A.; Al-Hajry, A.; Al-Assiri, M. S. ab Initio Investigation of 2,2′-Bis(4-trifluoromethylphenyl)-5,5′-Bithiazole for

the Design of Efficient Organic Field-Effect Transistors. Int. J. Quantum Chem. 2016, 116, 339−345.
(10) Hachmann, J.; Olivares-Amaya, R.; Atahan-Evrenk, S.; Amador-Bedolla, C.; Sanchez-Carrera, R. S.; Gold-Parker, A.; Vogt, L.; Brockway, A. M.; Aspuru-Guzik, A. The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid. J. Phys. Chem. Lett. 2011, 2, 2241−2251.
(11) Sharma, V.; Wang, C.; Lorenzini, R. G.; Ma, R.; Zhu, Q.; Sinkovits, D. W.; Pilania, G.; Oganov, A. R.; Kumar, S.; Sotzing, G. A.; Boggs, S. A.; Ramprasad, R. Rational Design of all Organic Polymer Dielectrics. Nat. Commun. 2014, 5, 4845.
(12) Mannodi-Kanakkithodi, A.; Pilania, G.; Huan, T. D.; Lookman, T.; Ramprasad, R. Machine Learning Strategy for Accelerated Design of Polymer Dielectrics. Sci. Rep. 2016, 6, 20952.
(13) Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G. Commentary: The Materials Project: A Materials Genome Approach to Accelerating Materials Innovation. APL Mater. 2013, 1, 011002.
(14) Saal, J. E.; Kirklin, S.; Aykol, M.; Meredig, B.; Wolverton, C. Materials Design and Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials Database (OQMD). JOM 2013, 65, 1501−1509.
(15) Taylor, R. H.; Rose, F.; Toher, C.; Levy, O.; Yang, K.; Nardelli, M. B.; Curtarolo, S. A RESTful API for Exchanging Materials Data in the AFLOWLIB.org Consortium. Comput. Mater. Sci. 2014, 93, 178−192.
(16) Chattaraj, P. K.; Giri, S.; Duley, S. Update 2 of: Electrophilicity Index. Chem. Rev. 2011, 111, PR43−PR75.
(17) Chamorro, E.; Duque-Norena, M.; Perez, P. A Comparison Between Theoretical and Experimental Models of Electrophilicity and Nucleophilicity. J. Mol. Struct.: THEOCHEM 2009, 896, 73−79.
(18) Deuri, S.; Phukan, P. A DFT Study on Nucleophilicity and Site Selectivity of Nitrogen Nucleophiles. Comput. Theor. Chem. 2012, 980, 49−55.
(19) Chamorro, E.; Duque-Norena, M.; Perez, P. Further Relationships Between Theoretical and Experimental Models of Electrophilicity and Nucleophilicity. J. Mol. Struct.: THEOCHEM 2009, 901, 145−152.
(20) Chamorro, E.; Duque-Norena, M.; Notario, R.; Perez, P. Intrinsic Relative Scales of Electrophilicity and Nucleophilicity. J. Phys. Chem. A 2013, 117, 2636−2643.
(21) Karelson, M.; Lobanov, V. S.; Katritzky, A. R. Quantum-Chemical Descriptors in QSAR/QSPR Studies. Chem. Rev. 1996, 96, 1027−1043.
(22) Peng, S.; Jian-Wei, Z.; Peng, Z.; Lin, X. QSPR Modeling of Bioconcentration Factor of Nonionic Compounds using Gaussian Processes and Theoretical Descriptors Derived from Electrostatic Potentials on Molecular Surface. Chemosphere 2011, 83, 1045−1052.
(23) Xiao, R.; Ye, T.; Wei, Z.; Luo, S.; Yang, Z.; Spinney, R. Quantitative Structure-Activity Relationship (QSAR) for the Oxidation of Trace Organic Contaminants by Sulfate Radical. Environ. Sci. Technol. 2015, 49, 13394−13402.
(24) Pals, J. A.; Wagner, E. D.; Plewa, M. J. Energy of the Lowest Unoccupied Molecular Orbital, Thiol Reactivity, and Toxicity of Three Monobrominated Water Disinfection Byproducts. Environ. Sci. Technol. 2016, 50, 3215−3221.
(25) Li, J.; Moe, B.; Vemula, S.; Wang, W.; Li, X.-F. Emerging Disinfection Byproducts, Halobenzoquinones: Effects of Isomeric Structure and Halogen Substitution on Cytotoxicity, Formation of Reactive Oxygen Species, and Genotoxicity. Environ. Sci. Technol. 2016, 50, 6744−6752.
(26) Nantasenamat, C.; Worachartcheewan, A.; Prachayasittikul, S.; Isarankura-Na-Ayudhya, C.; Prachayasittikul, V. QSAR Modeling of Aromatase Inhibitory Activity of 1-Substituted 1,2,3-Triazole Analogs of Letrozole. Eur. J. Med. Chem. 2013, 69, 99−114.
(27) Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; von Lilienfeld, O. A.; Tkatchenko, A.; Mueller, K.-R. Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. J. Chem. Theory Comput. 2013, 9, 3404−3419.
(28) Rai, B. K.; Bakken, G. A. Fast and Accurate Generation of ab Initio Quality Atomic Charges using Nonparametric Statistical Regression. J. Comput. Chem. 2013, 34, 1661−1671.
(29) Qu, X.; Latino, D. A. R. S.; Aires-de-Sousa, J. A Big Data Approach to the Ultra-Fast Prediction of DFT-Calculated Bond Energies. J. Cheminf. 2013, 5, 34.
(30) Zhang, Q.; Zheng, F.; Fartaria, R.; Latino, D. A. R. S.; Qu, X.; Campos, T.; Zhao, T.; Aires-de-Sousa, J. A QSPR Approach for the Fast Estimation of DFT/NBO Partial Atomic Charges. Chemom. Intell. Lab. Syst. 2014, 134, 158−163.
(31) Zhang, Q.; Zheng, F.; Zhao, T.; Qu, X.; Aires-de-Sousa, J. Machine Learning Estimation of Atom Condensed Fukui Functions. Mol. Inf. 2016, 35, 62−69.
(32) Pereira, F.; Latino, D. A. R. S.; Aires-de-Sousa, J. Estimation of Mayr Electrophilicity with a Quantitative Structure-Property Relationship Approach Using Empirical and DFT Descriptors. J. Org. Chem. 2011, 76, 9312−9319.
(33) Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Mueller, K.-R.; von Lilienfeld, O. A. Machine Learning of Molecular Electronic Properties in Chemical Compound Space. New J. Phys. 2013, 15, 095003.
(34) Po, R.; Bianchi, G.; Carbonera, C.; Pellegrino, A. “All That Glisters Is Not Gold”: An Analysis of the Synthetic Complexity of Efficient Polymer Donors for Polymer Solar Cells. Macromolecules 2015, 48, 453−461.
(35) Kanehisa, M.; Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28, 27−30.
(36) Mayr, H.; Ofial, A. R. Kinetics of Electrophile-Nucleophile Combinations: A General Approach to Polar Organic Reactivity. Pure Appl. Chem. 2005, 77, 1807−1821.
(37) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768.
(38) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; et al. PubChem Substance and Compound databases. Nucleic Acids Res. 2016, 44, D1202−D1213.
(39) Aires-de-Sousa, J.; Diogo, A. R. S. Energies of the HOMO and LUMO Orbitals for 111725 Organic Molecules Calculated by DFT B3LYP/6-31G*. figshare. https://dx.doi.org/10.6084/m9.figshare.3384184.v1.
(40) Stewart, J. J. P. MOPAC2009 and MOPAC2012. http://OpenMOPAC.net (accessed October 14, 2016).
(41) Stewart, J. J. P. Optimization of Parameters for Semiempirical Methods VI: More Modifications to the NDDO Approximations and Re-Optimization of Parameters. J. Mol. Model. 2013, 19, 1−32.
(42) Becke, A. D. Density-Functional Thermochemistry. III. The Role of Exact Exchange. J. Chem. Phys. 1993, 98, 5648−5652.
(43) Becke, A. D. A New Mixing of Hartree−Fock and Local Density-Functional Theories. J. Chem. Phys. 1993, 98, 1372−1377.
(44) Hehre, W. J.; Ditchfield, R.; Pople, J. A. Self-Consistent Molecular-Orbital Methods. XII. Further Extensions of Gaussian-Type Basis Sets for Use in Molecular-Orbital Studies of Organic Molecules. J. Chem. Phys. 1972, 56, 2257.
(45) Hariharan, P. C.; Pople, J. A. Influence of Polarization Functions on Molecular-Orbital Hydrogenation Energies. Theor. Chim. Acta 1973, 28, 213−222.
(46) Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S. J.; et al. General Atomic and Molecular Electronic-Structure System. J. Comput. Chem. 1993, 14, 1347−1363.
(47) Gordon, M. S.; Schmidt, M. W. Advances in Electronic Structure Theory: GAMESS a Decade Later. In Theory And Applications of Computational Chemistry: The First Forty Years; Dykstra, C. E., Frenking, G., Kim, K. S., Scuseria, G. E., Eds.; Elsevier: Amsterdam, 2005; pp 1167−1189.
(48) Yap, C. W. PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466−1474.
(49) Updated introduction to describe how to identify the PubChem Substructure Fingerprint property in a PubChem Compound record. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt (accessed October 14, 2016).
(50) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(51) CDK Descriptor Calculator Web Page. http://www.rguha.net/code/java/cdkdesc.html (accessed October 14, 2016).
(52) Molecular Descriptors for Chemoinformatics; WILEY-VCH Verlag GmbH & Co. KGaA: Weinheim, 2009; Vol. I.
(53) Quinlan, J. R. Learning With Continuous Classes. In Proceedings of the 5th Australian Joint Conference on Artificial Intelligence AI′92, Singapore; Sterling, A. A., Ed.; World Scientific: Singapore, 1992; pp 343−348.
(54) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5−32.
(55) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958.
(56) R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing; Vienna, Austria, 2011; http://www.R-project.org (accessed October 14, 2016).
(57) Liaw, A.; Wiener, M. Classification and Regression by RandomForest. R News 2002, 2, 18−22.
(58) Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273−297.
(59) Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H. The WEKA data mining software: an update. SIGKDD Explor. Newsl. 2009, 11, 10−18.
(60) Chang, C.-C.; Lin, C.-J. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1.
(61) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Big Data Meets Quantum Chemistry Approximations: The Delta-Machine Learning Approach. J. Chem. Theory Comput. 2015, 11, 2087−2096.
(62) Djurovich, P. I.; Mayo, E. I.; Forrest, S. R.; Thompson, M. E. Measurement of the Lowest Unoccupied Molecular Orbital Energies of Molecular Organic Semiconductors. Org. Electron. 2009, 10, 515−520.
(63) Hachmann, J.; Olivares-Amaya, R.; Jinich, A.; Appleton, A. L.; Blood-Forsythe, M. A.; Seress, L. R.; Roman-Salgado, C.; Trepte, K.; Atahan-Evrenk, S.; Er, S.; Shrestha, S.; Mondal, R.; Sokolov, A.; Bao, Z.; Aspuru-Guzik, A. Lead Candidates for High-Performance Organic Photovoltaics from High-Throughput Quantum Chemistry - the Harvard Clean Energy Project. Energy Environ. Sci. 2014, 7, 698−704.