
Prediction of Intramolecular Reorganization Energy Using Machine Learning

Sule Atahan-Evrenk∗,† and F. Betul Atalay‡

†TOBB University of Economics and Technology, Faculty of Medicine
‡TOBB University of Economics and Technology, Department of Computer Engineering
43 Sogutozu Cad., Sogutozu, Ankara, Turkey
E-mail: [email protected]

J. Phys. Chem. A (Just Accepted Manuscript). DOI: 10.1021/acs.jpca.9b02733. Publication Date (Web): June 3, 2019.


May 29, 2019

Abstract

Facile charge transport is desired for many applications of organic semiconductors (OSCs). To take advantage of high-throughput screening methodologies for the discovery of novel OSCs, parameters relevant to charge transport are of high interest. The intramolecular reorganization energy (RE) is one of the important charge transport parameters suitable for molecular-level screening. Because the calculation of the RE with quantum chemical methods is expensive for large-scale screening, we investigated the possibility of predicting the RE from the molecular structure by means of machine learning methods. We combinatorially generated a molecular library of 5631 molecules with extended conjugated backbones, using benzene, thiophene, furan, pyrrole, pyridine, pyridazine, and cyclopentadiene as building blocks, and obtained the target electronic data at the B3LYP level of theory with the 6-31G* basis. We compared ridge, kernel ridge, and deep neural net (DNN) regression models based on graph- and geometry-based descriptors. We found that the DNNs outperform the other methods and can predict the RE with a coefficient of determination of 0.92 and a root-mean-squared error of ≈ 12 meV. This study shows that the RE of organic semiconductor molecules can be predicted from the molecular structure with high accuracy.

Introduction

Organic semiconductors (OSCs) are an important class of (opto)electronic materials which are chemically tunable, solution processable, and naturally compatible with plastics. Their potential applications span a wide range from wide-area photovoltaics,1 sensors,2 and OLEDs3 to artificial nerves4 and electronic skin.5 From the new materials design perspective, carbon chemistry provides a vast chemical space for the discovery of new materials, and the computational exploration of this chemical space is a valuable alternative to experimental trial and error. Accordingly, in data-driven materials research, which has been gaining


impetus through large-scale materials projects,6–10 a significant effort has been dedicated to the development of OSCs through computational strategies.11–18 From the design perspective, an advantage of molecular OSCs is that the discovery process usually starts with molecular-level tuning of the important (opto)electronic properties. A similar process can also be applied to the design of polymers, where the relevant parameters are first tuned at the oligomer level.19 Therefore, a computational funnel, usually envisioned for drug molecules, is suitable for organizing the screening process for OSCs.20 Such a funnel provides a manageable preliminary screening strategy for large libraries: computational resources are used wisely on the molecules with the highest potential, and the more expensive solid-state modeling is saved for a smaller number of more promising molecules.20 For accurate progression through such a computational hierarchy, screening parameters that are reliable, easy to calculate, and relevant to the desired target property are of high interest. Since facile charge transport is desired for many applications of OSCs and the reorganization energy (RE) is one of the important parameters for charge mobility, here we propose the use of the RE as a preliminary screening parameter for high-throughput approaches. Previously, we used the RE to identify the best candidates in a small set of molecules for crystal structure prediction and mobility calculations; this approach led to the synthesis of a compound surpassing pentacene in organic field-effect transistor (OFET) performance.21 Two previous studies showed that the RE can be calculated from the molecular structure by nonlinear regression methods such as partial least squares22 and principal component regression,23 with a prediction coefficient of determination, R2, of around 0.7. Here we present a new and much larger data set to investigate the possibility of predicting the RE from the molecular structure with higher accuracy by means of machine learning (ML) methods.

The RE is a measure of the coupling of the electronic motion to the nuclei as the charge carriers move. Its strength relative to the magnitude of the transfer integral determines the transport regime.24,25 Although the calculation of the RE at the molecular level


unduly restricts the charge transport regime to charge-carrier hopping, this approach is widely utilized because it provides a first-order approximation to the RE.26 There are also materials classes for which hopping transport is important, such as discotic liquid crystals27 or disordered films.28 Therefore, for large-scale screening at the molecular structure level, the RE calculated within the site-localized charge picture is useful. However, we should note that, in general, coherent transport is important for many high-performance OSCs,29 and a full theoretical evaluation of a candidate compound requires prediction of the solid-state morphology, the intramolecular and external contributions to the RE,30 the electronic coupling strength, and the charge mobility.31–33

The potential to achieve chemical accuracy at a reduced cost is one of the driving forces behind the application of machine learning (ML) methods in the chemical sciences.34,35 The progress in the prediction of atomization energies36,37 and potential energy surfaces38 has been remarkable. The possibility of achieving out-of-sample errors lower than those of hybrid DFT has been demonstrated.39 Models which can handle molecules and materials on an equal footing have been developed.40,41 General-purpose platforms such as MoleculeNet,42 as well as specific ones, for example TensorMol43 for molecular dynamics or deep learning architectures such as SchNet,44 have been deployed. Methods which seamlessly map from molecular representations to chemical feature spaces and back facilitate smart exploration of the chemical spaces of interest.45,46 Thanks to these recent developments, the use of ML methodologies in quantum chemical problems has become practical. Further details can be found in the recent collections and reviews dedicated to these developments.34,35

In ML property prediction for quantum chemistry, two methods stand out in capturing the nonlinear relationship between molecular structure and target property: kernel-based methods39,47–49 and deep neural networks (DNNs).37,39,43,50,51 Kernel ridge regression (KRR) is ridge regression with the kernel trick, which transforms a regularized linear regression into a nonlinear fit in which the similarity of the data points is harnessed by a proximity-measuring kernel such as a Gaussian.47 Thus KRR


provides an effective and inexpensive avenue for learning nonlinear relationships between molecular features and target values, and it is therefore often employed in quantum chemical ML. DNNs can also address the nonlinearity of the relationship between molecular features and target values. Neural networks (NNs), a successful machine learning technique in pattern recognition,52 have found a wide range of applications in traditional materials quantitative structure-property relationship (QSPR) studies.53,54 In particular, with the advent of in silico screening, NNs have attracted interest for accelerating the prediction of the electronic structure of molecules with quantum chemical accuracy.37,55–58 DNNs have tremendous advantages over the earlier shallow and narrow NNs. The use of graphics processing units (GPUs) allows the training of DNNs with large numbers of parameters; for example, it is possible to train networks with thousands of descriptors over hidden layers with thousands of neurons without overfitting, with the help of regularization algorithms such as dropout.59

Here, we report the prediction of the RE for p-type OSCs with machine learning methods, in particular KRR and DNN, and we compare these results with a linear regression method, ridge regression. For the application of ML methods in data-driven materials design, a large training data set in which the molecular structures and the desired molecular/material properties are known is necessary. Our previous study involving 171 known OSCs23 and the study by Misra et al. involving 211 polyaromatic hydrocarbons22 showed similar performance; this agreement, we believe, indicated that a larger library is crucial for better model development. Albeit small on the scale of big-data projects, the data set we report here, comprising 5631 potential OSC compounds for the prediction of the RE, is much larger than the previous ones.


Figure 1: (a) The building blocks and library enumeration scheme (green box); (b) the generation of the target RE values: optimization of the neutral and cation geometries, frequency analysis, and single-point calculations for Enn, Enc, Ecn, and Ecc, giving λ = Enc − Enn + Ecn − Ecc; (c) the 9 models (ridge, kernel ridge, and DNN regression, each with the molecular transform, molecular signature, and circular fingerprint descriptors) to predict the reorganization energy from inexpensive (MMFF94) 3D geometry or SMILES.


Computational Methods

The data set generation

We combinatorially built a library of potential OSCs, following the scheme designed by Olivares-Amaya et al.11,60 (Figure 1a). We started with benzene, thiophene, furan, pyrrole, pyridine, pyridazine, and cyclopentadiene as building blocks. All of the building blocks are aromatic, to ensure a conjugated backbone, except cyclopentadiene, which was added as a variational element. In the building blocks, the points of fusion were marked with the green handles. We fused the building blocks, represented as SMILES strings, together to obtain extended conjugated backbones with two to five rings, using the SMARTS reaction rule shown in Figure 2.

Figure 2: SMARTS reaction schema for library enumeration. The numbers 5 and 11 are the handles, A is any atom, and the rest are aromatic carbons.

In each fusion reaction cycle, two of the handles were removed. The remaining handles on the products ensured that the product of a combinatorial reaction could be used as a reagent in the next reaction to extend the backbone. The placement of the handles on opposite sides of the rings, together with the SMARTS reaction scheme, in which the handles on each reactant avoid each other during fusion, ensured the generation of an elongated structure. At the end of the enumeration, all duplicates were deleted and the handles were cleaned. With this combinatorial scheme we initially obtained 26560 molecules, of which 6560 were used to generate the data set. This set included all of the 2- to 4-ring molecules as well as some of the 5-ring ones. The mechanics of such an enumeration loop are sketched below.
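The exact fusion rule is given only graphically in Figure 2, so the SMARTS pattern below is a hypothetical stand-in (a simple aryl–aryl coupling rather than a ring fusion); only the enumerate–sanitize–deduplicate loop reflects the procedure described above.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Placeholder pattern only: substitute the actual handle-based fusion rule
# from Figure 2 to reproduce the library.
FUSION_SMARTS = "[cH:1].[cH:2]>>[c:1]-[c:2]"

def enumerate_library(building_blocks, max_rings=5):
    """Iteratively combine building blocks, deduplicating via canonical SMILES."""
    rxn = AllChem.ReactionFromSmarts(FUSION_SMARTS)
    pool = {Chem.CanonSmiles(s) for s in building_blocks}
    frontier = set(pool)
    for _ in range(max_rings - 1):  # each cycle extends the backbone by one ring
        products = set()
        for a in frontier:
            for b in building_blocks:
                for outcome in rxn.RunReactants((Chem.MolFromSmiles(a),
                                                 Chem.MolFromSmiles(b))):
                    try:
                        Chem.SanitizeMol(outcome[0])
                        products.add(Chem.MolToSmiles(outcome[0]))
                    except Exception:  # discard chemically invalid combinations
                        continue
        frontier = products - pool  # only newly seen backbones are extended further
        pool |= products
    return sorted(pool)
```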


For the calculation of the RE, we used the standard 4-point scheme for the gas phase (Figure 1b).26 Starting with the SMILES strings, we obtained the 3D geometries with the MMFF94 force field61 and optimized them at the B3LYP62,63 level with the 6-31G(d)64–66 basis set. We checked the geometry optimizations with frequency analysis and included only the geometries with all positive frequencies (for both the neutral and cation states), which amounted to 5631 molecules. The spin contamination in the cation calculations was no more than 7% for any of the molecules. Given the four single-point energies of Figure 1b, the RE evaluation itself is a one-liner, as sketched below.
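A minimal sketch of the 4-point evaluation, assuming the four single-point energies (in hartree) have already been parsed from the Q-Chem outputs; the subscript convention follows Figure 1b (first index: charge state, second index: geometry).

```python
HARTREE_TO_MEV = 27.211386 * 1000.0  # 1 hartree in meV

def reorganization_energy_mev(e_nn: float, e_nc: float,
                              e_cn: float, e_cc: float) -> float:
    """Standard 4-point intramolecular RE (Figure 1b).

    e_nn: neutral state at the neutral geometry
    e_nc: neutral state at the cation geometry
    e_cn: cation state at the neutral geometry
    e_cc: cation state at the cation geometry
    """
    # lambda = (E_nc - E_nn) + (E_cn - E_cc): strain of the neutral state at
    # the cation geometry plus strain of the cation at the neutral geometry.
    return ((e_nc - e_nn) + (e_cn - e_cc)) * HARTREE_TO_MEV
```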

Descriptors

We used three basic descriptors for molecular representation: i) the Morgan circular fingerprints,67 ii) the molecular signatures,68 and iii) the molecular transforms.69 Two of these representations, namely the fingerprints and the signatures, are graph-based and directly generated from the SMILES strings. We generated the circular fingerprints (radius = 2) with the RDKit implementation as 1024-bit strings without the counts of the substructures. The advantage of the molecular signatures over the fingerprints here is that they preserve more of the connectivity information of the substructures70 and also include the counts of the substructures in a molecule. The molecular transforms, on the other hand, encode the three-dimensional geometries of molecules as a fixed-length descriptor vector in the form of the summation

I(s) = \sum_{i=2}^{N} \sum_{j=1}^{i-1} Z_i Z_j \, \frac{\sin(s r_{ij})}{s r_{ij}}.    (1)

The summation runs over the N atoms, with interatomic distances r_ij and atomic numbers Z_i and Z_j. The length of the descriptor is determined by the number of increments of the parameter s, defined on the interval [1, 31]. I(s) is a damped oscillatory function converging to 0 as s gets larger; the interatomic distance r_ij in the denominator ensures a natural decay, without an optimized cutoff, for atoms which are far apart. A direct implementation is sketched below.
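A minimal NumPy sketch of eq 1; the number of s increments (and hence the descriptor length) is not given in the text, so n_points below is an assumption.

```python
import numpy as np

def molecular_transform(atomic_numbers, coords, n_points=150,
                        s_min=1.0, s_max=31.0):
    """Fixed-length molecular transform descriptor I(s) of eq 1.

    atomic_numbers: length-N sequence of Z_i
    coords: (N, 3) array of Cartesian coordinates (MMFF94 geometry)
    n_points: number of s increments on [s_min, s_max] (assumed value)
    """
    Z = np.asarray(atomic_numbers, dtype=float)
    xyz = np.asarray(coords, dtype=float)
    # all unique atom pairs: charge products Z_i * Z_j and distances r_ij
    i, j = np.triu_indices(len(Z), k=1)
    zz = Z[i] * Z[j]
    rij = np.linalg.norm(xyz[i] - xyz[j], axis=1)
    s = np.linspace(s_min, s_max, n_points)
    sr = np.outer(s, rij)                      # shape (n_points, n_pairs)
    return (zz * np.sin(sr) / sr).sum(axis=1)  # damped oscillatory sum I(s)
```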


An important advantage of the circular fingerprints and the molecular transforms is their fixed length: for such descriptors, the descriptor space usually need not be redefined when new molecules are added to the compound set.49 In contrast, the signature descriptor length increases as compounds with new atom types or substructures are added to the training set, resulting in a large number of signature descriptors, reaching 6025 at height 3 for the present set (height 3 indicates that up to 3 consecutive covalent bonds from each atom are considered when identifying the substructures in a molecule). To make the predictions completely independent of the quantum chemical calculations, we used only SMILES strings or force-field (MMFF94) optimized geometries for the descriptor generation, as in the sketch below.
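Both descriptor inputs can be produced with a few lines of RDKit; a minimal sketch (the function names are ours, not the authors'):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def circular_fingerprint_bits(smiles: str):
    """1024-bit Morgan fingerprint, radius 2, without substructure counts."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    return list(fp)

def mmff94_geometry(smiles: str):
    """Embed a 3D conformer and relax it with MMFF94, as input for eq 1."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=42)  # deterministic embedding
    AllChem.MMFFOptimizeMolecule(mol, mmffVariant="MMFF94")
    atomic_numbers = [a.GetAtomicNum() for a in mol.GetAtoms()]
    coords = mol.GetConformer().GetPositions()  # (N, 3) numpy array
    return atomic_numbers, coords
```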

Model training

We compared the performance of three regression methods, ridge regression (RR), kernel ridge regression (KRR), and deep neural nets (DNN), in conjunction with the three descriptors mentioned above, resulting in a total of 9 models. We used RR as a reference to measure the improvement gained when the nonlinearity of the relationship between the molecular structure and the RE is taken into account, as in KRR and DNN. Hyperparameter optimization for RR and KRR is much more straightforward than for a DNN. Ridge regression uses a single parameter, α, to scale the regularization of the weights in linear regression; this makes it possible to regularize the fit and choose, among many models, the one with the best validation set performance. For KRR, we have two hyperparameters, namely the regularization parameter α and the Gaussian kernel parameter λ. These parameters were chosen so as to minimize the validation set errors with a log-scale search; for KRR, the parallel grid search (GridSearchCV) implemented in scikit-learn was useful in speeding up the search, as sketched below.
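A minimal scikit-learn sketch of this log-scale search; the grid bounds are illustrative, not the values used in the paper, and scikit-learn names the Gaussian (RBF) kernel parameter gamma rather than λ.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": np.logspace(-8, 0, 9),   # regularization strength
    "gamma": np.logspace(-8, 0, 9),   # Gaussian (RBF) kernel parameter
}
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,        # 5-fold cross-validation on the 80% training split
    n_jobs=-1,   # run the grid points in parallel
)
# search.fit(X_train, y_train); the tuned model is search.best_estimator_
```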


For the DNN hyperparameter optimization, we studied different architectures with hidden layers consisting of 50, 100, 200, 500, 1000, or 2000 neurons in two or three hidden layers. We denote a 2-hidden-layer NN with 500 neurons in each layer as [500, 500] and a 3-hidden-layer one as [500, 500, 500]. The number of neurons in the input layer was given by the descriptor size. The output layer was a single (linear) neuron predicting the RE. The activation function for the input and hidden layers was the rectified linear unit (ReLU), whereas for the output layer we used the linear transfer function. We initialized the DNN with random weights, and the parameters were optimized through error backpropagation using the Adam optimizer. To combat overfitting, we adopted the dropout algorithm,59 which was shown to improve validation set results by preventing overfitting through randomly dropping some of the units in the input and hidden layers. We treated the unit dropout probability, p, as an additional hyperparameter of the system. We investigated batch sizes of 32, 64, and 128 and chose 64 as the optimum value. We used the default parameters of the Adam optimizer (β1 = 0.9, β2 = 0.99) except for epsilon, which was chosen as 1.0 or 0.1. We performed random searches to determine the learning rates and used 0.001 for the signature descriptor and 0.0003 for the molecular transforms and circular fingerprints.

We performed a 5-fold cross-validation (CV) on 80% of the data to determine the hyperparameters for the DNN and the kernel and regularization parameters for the KRR and RR models. The test set performance reflects the errors on the remaining 20% of the data, which was not used in the model building. During training, all of the models were scored with the mean squared error; we therefore report the root mean squared error (RMSE) as the primary error metric in this study. Since the mean absolute error (MAE) is usually reported in the literature, we also include the MAE and its standard deviation for the test set results. Both the RMSE and MAE are in units of meV. We also report the unitless squared Pearson correlation coefficient, R2, for the test set results. A Keras sketch of the resulting network is shown below.
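Putting the above together, a minimal TensorFlow/Keras 2-style sketch of the final setup; the layer sizes, dropout placement, Adam settings, batch size, and learning rates follow the text, while everything else is a simplification.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(input_dim, dropout=0.1, learning_rate=0.001):
    """[1000, 1000, 1000] ReLU network with dropout and a linear output."""
    model = keras.Sequential()
    # dropout on the input layer, then on each hidden layer (not the output)
    model.add(layers.Dropout(dropout, input_shape=(input_dim,)))
    for _ in range(3):
        model.add(layers.Dense(1000, activation="relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(1, activation="linear"))  # single neuron for the RE
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate,
                                        beta_1=0.9, beta_2=0.99, epsilon=0.1),
        loss="mse",
    )
    return model

# e.g. circular fingerprints:
# model = build_dnn(1024, dropout=0.2, learning_rate=0.0003)
# model.fit(X_train, y_train, batch_size=64, epochs=2500, validation_split=0.2)
```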

Hyperparameter optimization

Once we established the data set, we focused on the model hyperparameter optimization. First we investigated the impact of the DNN architecture on the performance for two and


three hidden layers (see Figure 3 and Figures S1 and S2). We see that the 3-hidden-layer configuration usually performs better than the 2-hidden-layer one. The signatures and circular fingerprints show convergence with 500 neurons in the hidden layers; the models with the molecular transform, on the other hand, show convergence only after 1000 neurons. Therefore, for all of the models, we chose the [1000, 1000, 1000] architecture to ensure convergence in R2. We applied dropout to all layers of the network except the output. When there is no dropout, the network memorizes: the training error is close to zero but the test error is high (high variance; see the inset of Figure 3). As the dropout probability gets larger, both the training and test errors grow; a dropout value in the range [0.1-0.3] was an effective choice. Based on these observations, and on the relationship between dropout and network architecture (see Figure 3 and Figures S1 and S2), we used the [1000, 1000, 1000] network with dropout values of 0.1 for the signature and molecular transform descriptors and 0.2 for the circular fingerprint descriptors.

Figure 3: The change in the test score (correlation coefficient) for double and triple hidden layer DNN architectures with dropout values 0.1, 0.2, and 0.3 and the signature descriptor. The test score is obtained when 80% of the data is used with 5-fold cross-validation. Inset: test and training errors as a function of the dropout parameter for the [1000, 1000, 1000] DNN configuration.


When the dropout algorithm is used for regularization, the higher the number of training steps, the better the training.51 We used 2500 epochs as a good compromise between time and accuracy. The change in the MSE for the validation and training sets as a function of the number of epochs is shown in Figure S3. Due to the stochastic nature of the dropout, there are fluctuations in the data; however, after about 1000 epochs the fluctuations usually get smaller.

The SMILES strings and initial geometries were obtained with the RDKit software.71 All of the density functional theory (DFT) calculations were performed using the Q-Chem software package,72 automated with the Python language.73 For generating the molecular signatures, we used scripts developed by Faulon et al.68 The circular fingerprints with radius 2 (roughly equal to the ECFP4 descriptor70) and the 3D geometries for the molecular transform descriptors at the MMFF94 force-field level were generated with the RDKit software.71 For visualization and some of the data analysis we used Marvin by ChemAxon.74 The KRR and RR models were optimized using the scikit-learn machine learning library.75 For the DNN regression we used the Keras neural network Python library,76 running on top of TensorFlow77 and taking advantage of the GPU. The DNN code is publicly available at https://github.com/osc-dnn/osc-dnn. The data sets (in .csv file format) for the training and out-of-sample testing are included as part of the supplementary material.

Results and Discussion

The electronic data

The computed RE values are distributed between 84 and 380 meV, with an average of µ = 259 meV (Figure 4). There is a maximum in the lower-RE region, with values accumulated mostly around 160 meV. When we compare the present RE values with those of the molecule set we calculated earlier for p-type molecular OSCs, we see a similar range.23 The ground-state highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital


(LUMO) energies are also shown in Figure 5. The average of the HOMO energies is around −5.24 eV, close to that of the earlier data.23 However, the standard deviation is comparatively large because of the increased molecular diversity here: the present set contains both electron-donating and electron-withdrawing (hetero)cyclic building blocks, in contrast to the earlier set, in which the molecules were comprised mostly of benzene and thiophene rings. Overall, the present data cover a chemically relevant space for OSCs as far as the electronic data are concerned. A set of molecules with low RE and the right HOMO placement for the injection of holes from a gold electrode is shown in SI Table 1. We do not expect the gas-phase molecular orbital energies to translate directly into hard cutoffs for deciding on the polarity of the transport; however, these calculations qualitatively help in predicting the type of transport regime.78,79 With these restrictions in mind, we listed the molecules with RE values smaller than a cutoff of 200 meV (the lower end of the RE histogram in Figure 4) and HOMO energies falling into the region [−5.6, −5.1] eV. We note that most of the molecules satisfying these criteria have four or five rings, as expected. Such a filter can be applied directly to the released data, as sketched below.
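A two-line pandas sketch of the same screening filter; the file and column names are hypothetical, since the headers of the released .csv files may differ.

```python
import pandas as pd

df = pd.read_csv("training_set.csv")  # hypothetical file and column names
candidates = df[(df["re_mev"] < 200.0) & df["homo_ev"].between(-5.6, -5.1)]
print(candidates.sort_values("re_mev").head(10))
```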

Figure 4: Histogram of the distribution of the DFT reorganization energies. The dashed line marks 200 meV.


Figure 5: Histogram of the distribution of the HOMO and LUMO energies. Dashed lines mark the region [−5.6, −5.1] eV.

Figure 6: The learning curves for the models, obtained with 5-fold cross-validation on 80% of the data, for the [1000, 1000, 1000] network with dropout values of 0.1 for the signature (Sigs) and molecular transform (MT) descriptors and 0.2 for the circular fingerprint (CF) descriptors.


Reorganization energy prediction

Among the learning curves in Figure 6, the steepest slope belongs to the DNN regression, followed by KRR. The RR curves, on the other hand, show little improvement (i.e., learning) as the training set gets larger. The cost of hyperparameter optimization becomes worthwhile once a training size of 2000 for the DNN models and 3000 for the KRR is reached. The slopes of the curves indicate that further reduction of the out-of-sample errors might be possible with larger data sizes. The best performing model is the DNN with the circular fingerprints. Table 1 summarizes the test set results for the best models, which were trained on 80% of the data. Compared to the earlier QSPR studies on the RE,22,23 there is a significant improvement in the prediction statistics in all of our models. This is a remarkable result, showing that the RE can be predicted to high accuracy from inexpensive force-field geometries or graph-based descriptors. For the RR models, the coefficient of determination, R2, is 0.84, 0.85, and 0.81 for the signature, CF, and MT descriptors, respectively. The KRR results are slightly better, with an R2 of 0.85, 0.88, and 0.88 for the signature, CF, and MT descriptors. In comparison, all of the DNN models have an R2 above 0.9, an MAE of 6.5-7.5 meV, and an RMSE of around 12 meV. Although the standard deviation of the MAE is large in all of the models, it systematically improves as more sophisticated regression methods are applied. This large standard deviation in the MAE is reflected in the larger RMSE values, since the molecules with the larger errors are weighted more in the averaging due to the squaring of the errors. This is also the case in other machine learning studies, which showed that these differences between the MAE and RMSE can be reduced with the introduction of more elaborate molecular representations.80 Moreover, methods in which the descriptors themselves are learned might reduce the out-of-sample errors as well as the standard deviations of the errors.37,50,80 The impact of the training set expansion is clearly seen in the present models; even for the simple RR, an MAE of approximately 10 meV is obtained. For KRR, the large descriptor size of the signatures has an adverse effect on the model performance. Compared to RR and KRR, the DNN models are more robust against the change of descriptor.


Table 1: Coefficients of determination (R2, test) and errors of prediction.

Method                     Descriptor^a       R2 (test)   MAE^b         RMSE^b
Ridge Regression           Signature          0.84        10.2 (13.2)   16.7
                           CF                 0.85         9.7 (13.0)   16.3
                           MT                 0.81        11.2 (14.7)   18.5
Kernel Ridge Regression    Signature          0.85         9.8 (12.1)   16.3
                           CF                 0.88         8.0 (13.2)   15.5
                           MT                 0.88         8.1 (12.2)   14.6
DNN                        Signature (0.1)    0.91         6.5 (10.2)   12.0
                           CF (0.2)           0.92         6.5 (9.6)    11.4
                           MT (0.1)           0.90         7.5 (10.1)   12.6

^a MT: molecular transform; CF: circular fingerprint. For the DNN rows, the dropout values are given in parentheses.
^b Test set error metrics: mean absolute error (MAE) in meV, with the standard deviation in parentheses, and root mean squared error (RMSE) in meV.

Although the DNN model with the circular fingerprint performs slightly better than the others, given the stochastic nature of the regularization by dropout we can conclude that the DNN models studied here perform more or less similarly. One of the major limitations of ML methods is the fact that they are nonlinear interpolations; the prediction capacity of the algorithms is therefore limited to the chemical space of the training data. In order to measure the extrapolation capacity of the DNN models in terms of molecular size, we generated training sets by excluding the molecules with the highest numbers of heavy atoms, as indicated in Table 2. For example, for the training set denoted "< 19", the molecules with 19 or more heavy atoms were kept out of the training and used as the test set. The test set performance of the models showed that the molecular transform and circular fingerprints perform slightly better than the signature descriptor, and as more of the larger molecules are moved to the test set, the performance drop is largest for the signature descriptor. These results indicate that the fixed-length descriptors are better at learning from the small molecules and predicting the RE of the larger ones. Overall, the performance of the models was only slightly worse than the case where the training set was chosen randomly. A minimal sketch of such a size-based split is shown below.
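A minimal sketch of the size-based hold-out, assuming the data are available as SMILES strings and using RDKit's heavy atom count.

```python
from rdkit import Chem

def size_based_split(smiles_list, cutoff=19):
    """Train on molecules with < cutoff heavy atoms; test on the rest (e.g. "< 19")."""
    train, test = [], []
    for smi in smiles_list:
        n_heavy = Chem.MolFromSmiles(smi).GetNumHeavyAtoms()
        (train if n_heavy < cutoff else test).append(smi)
    return train, test
```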


Table 2: RMSE (meV) for the signature (with the correlation coefficient in parentheses), molecular transform (MT), and circular fingerprint (CF) descriptors for training sets with the large molecules excluded.

Heavy-atom count    Train