Empirical Relationship between Chemical Structure and Redox


C: Energy Conversion and Storage; Energy and Charge Transport

Empirical Relationship between Chemical Structure and Redox Properties: Mathematical Expressions Connecting Structural Features to Energies of Frontier Orbitals and Redox Potentials for Organic Molecules Piyush Tagade, Shashishekar P Adiga, Min Sik Park, Shanthi Pandian, Krishnan S Hariharan, and Subramanya Mayya Kolake J. Phys. Chem. C, Just Accepted Manuscript • DOI: 10.1021/acs.jpcc.8b03577 • Publication Date (Web): 30 Apr 2018 Downloaded from http://pubs.acs.org on April 30, 2018



The Journal of Physical Chemistry

Empirical Relationship between Chemical Structure and Redox Properties: Mathematical Expressions Connecting Structural Features to Energies of Frontier Orbitals and Redox Potentials for Organic Molecules

Piyush M. Tagade,∗,† Shashishekar P. Adiga, Krishnan S Hariharan,∗,† Min Sik Park, Shanthi Pandian, and Subramanya Mayya Kolake

†Next Gen Research, Samsung Advanced Institute of Technology, Samsung R&D Institute, Bangalore 560037, India
‡Computer-Aided Engineering Group, Samsung Advanced Institute of Technology, Samsung Electronics, 130 Samsung-ro, Suwon, Gyeonggi-do 443-803, Republic of Korea

E-mail: [email protected]; [email protected]


ACS Paragon Plus Environment


Abstract

A set of mathematical expressions is proposed to predict redox potentials and frontier orbital energy levels of organic molecules as a function of structural features. The expressions are obtained by applying the principal component regression method to reduction potential (Ered), oxidation potential (Eox), HOMO, and LUMO values calculated using density functional theory (DFT) on a training set consisting of 77547 molecules from the PubChem database. The first set of expressions allows prediction of Ered, Eox, HOMO and LUMO values using molecular fingerprints alone, with R² of ca. 0.74, 0.82, 0.92, and 0.85, respectively, and can be used for preliminary screening of molecules before performing DFT calculations. In the second set of expressions, which includes DFT calculated HOMO and LUMO values as additional descriptors, the R² of the Eox and Ered predictions increases to 0.91 and 0.90. This more accurate approach to redox potential prediction is still significantly more computationally efficient than DFT calculation of redox potentials. The potential of these approaches is demonstrated using the examples of polyacenes and the quinoxaline family of molecules. These empirical relations are well suited to high throughput screening for a variety of optoelectronic applications. The resultant tool, QSROAR, is made available at https://github.com/piyushtagade/qsroar_version2.

1 Introduction

Chemical reactions in which an electron is transferred from one molecule to another, resulting in oxidation or reduction, are important in many applications, including organic solar cells, 1 organic thin film transistors, 2 artificial photosynthesis, 3 electrochemical systems 4 and organic photodetectors. 5 Knowledge of the reduction and oxidation (redox) potentials of a given molecule is important because the individual materials in a device must have appropriate energy levels to work in tandem in the desired way. For example, in the case of organic semiconductors, the energy required to inject or remove an electron determines their suitability as n-type or p-type semiconductors 6,7 with applications in organic thin film transistors, organic light emitting diodes and organic solar cells. Similarly, the relative ease with which an electrolyte additive can be reduced or oxidized determines its voltage stability window when used in an electrochemical cell. 8 The redox potentials essentially represent the tendency of a molecule to accept or lose an electron, and they have become important properties for the discovery of new materials and for screening databases of candidate organic molecules.

Redox potentials are determined experimentally using cyclic voltammetry measurements. 9 However, in the interest of in-silico design and high throughput screening of molecules, and because of experimental difficulties associated with cyclic voltammetry of certain chemicals, computational chemistry calculations are often employed to determine redox potentials. 10-12 These first-principles estimates of redox potentials involve calculation of the Born-Haber thermodynamic cycle, in which free energies are directly calculated and solvation energies are accounted for. 13 Since these calculations can be expensive for large molecules, a simplified approach to the Born-Haber thermodynamic cycle has been adopted. 14 In this method, geometry optimization of the molecule in both the ground state and the reduced or oxidized state (radical anion or radical cation, respectively) is carried out with a continuum based solvent model in order to account for reorganization energies. This approach removes the gas-phase optimization step that is part of the Born-Haber thermodynamic cycle, thus decreasing the number of calculations from six to three (to obtain both oxidation and reduction potentials). Even this simplified approach carries a significant computational cost when screening thousands of molecules, so there is great interest in further speeding up the estimation of redox potentials.

For example, there is a good correlation between one-electron HOMO and LUMO energies and experimental oxidation and reduction potentials, respectively. 15 If such a correlation is used to establish a relationship between HOMO-LUMO energies and redox potentials, it would further reduce the number of geometry optimization calculations needed to predict redox potentials from three (two open shell and one closed shell) to one (closed shell geometry optimization), with significant savings in computational time. However, owing to the difficulty of fitting a simple linear relationship between the two, previous attempts at correlating HOMO-LUMO values with redox potentials have been limited to specific classes or types of molecules. 16 Thus, there has not been an attempt in the computational chemistry literature to develop a general expression that correlates redox potentials with HOMO/LUMO energies for a large variety of molecules irrespective of constituent functional groups. The most desirable approach would be to develop mathematical emulators that predict redox potentials and frontier orbital energy levels from the chemical structural features of the molecules alone, completely eliminating the need to perform any DFT calculations.

Previously, machine learning methodologies have enabled development of mathematical emulators for chemical structure-property relationships with high accuracies. 12,17,18 For example, Mannodi-Kanakkithodi et al. 19 use kernel ridge regression for prediction of polymer energy bandgap and dielectric constant. They train their algorithm on a training set of 256 polymers and predict the properties with reasonable accuracy. Montavon et al. 20 use an artificial neural network (ANN) to develop an emulator of molecular electronic properties, including HOMO and LUMO orbital energies. They train their model on a database of 5000 molecules and obtain mean absolute errors (MAE) of 0.15 eV and 0.12 eV for HOMO and LUMO energies, respectively. Pyzer-Knapp et al. 21 use a more advanced deep neural network (DNN) for HOMO-LUMO energy prediction. This DNN is trained on a dataset of 2,000,000 compounds and yields predictions with high accuracies. Pareira et al. 22 compare various machine learning algorithms on a dataset of 111,000 molecules and report predictions with minimum MAE of 0.15 eV and 0.16 eV for HOMO and LUMO orbital energies, respectively. Faber et al. 23 investigate different combinations of molecular descriptors and machine learning algorithms for prediction of different electronic ground state properties of organic compounds. An exhaustive review of machine learning methods for computational chemistry, with a particular focus on modern deep learning algorithms, is provided by Goh et al. 24 Machine learning methods are primarily used in the literature for prediction of electronic ground state properties such as HOMO-LUMO energies. However, redox potential prediction is more important for many applications of interest. 14 Barring a few studies such as Park et al., 8 who use an ANN for redox potential predictions, structure-property correlations for redox potential are not explored in the literature.

Notwithstanding the high accuracies achieved by machine learning algorithms, we take an alternate approach to developing empirical relationships for chemical structure-property correlations. Despite their accuracy, the use of machine learning enabled chemical structure-property relationships is impeded by two critical limitations. First, a trained machine learning model can be used by the wider community only if the algorithm and its optimized parameters are accessible. Second, if an individual research group plans to train its own machine learning algorithm, access to a highly accurate property database is required. It is therefore of interest to establish an alternative theoretical protocol that couples speed and accuracy, wherein DFT calculated properties are predicted using empirical expressions based on molecular fingerprints, either partially or completely eliminating the need to perform expensive DFT calculations. The key contribution of this paper is a set of empirical mathematical expressions that predict redox potentials and frontier orbital energies as a function of molecular fingerprints. Such expressions are enabled by the statistical correlation between the molecular properties and the descriptors.

Taking an analogy from the well-established protocols for virtual screening in the pharmaceutical world, where biomolecular docking simulations, chemical similarities, pharmacophore searching, or QSAR models are used, and aiming at efficient protocols for high-throughput virtual screening of new molecules without performing any quantum chemistry calculations, we derive here mathematical expressions that accurately capture the correlation between redox potentials, frontier orbital energies and molecular fingerprints that include key chemical functionalities. These expressions, together with the parameters derived in this work, can be used to predict the redox potentials and HOMO/LUMO values of an organic molecule from its SMILES structure as input, and the calculation takes fractions of a second. Further, we show that the redox prediction accuracy can be significantly improved by using DFT calculated HOMO/LUMO values, which require one closed shell geometry optimization calculation, as opposed to the one closed shell and two open shell geometry optimization calculations needed to calculate redox potentials using a DFT only protocol. This new method can potentially be used for automated fast screening of molecules based on prediction of HOMO/LUMO energies and redox potentials. Together, these two capabilities represent a significant reduction in the computation time of redox potentials for organic molecules. We have made the resultant Python tool, Quantifying influence of Structure on Redox potentials and Orbital energies through Algebraic Relationships (QSROAR), publicly available at https://github.com/piyushtagade/qsroar_version2.

2 Methodology

In this paper, we use principal component regression (PCR) to develop the desired empirical relations. For a given functional relationship, the PCR algorithm takes a training dataset of molecular fingerprints and corresponding properties as input and provides a set of correlation coefficients as output. The training dataset is created from a subset of the PubChem materials database, 25 in which the properties of each molecule are stored under a unique identifier (in PubChem, compound identifiers, or CIDs, are used). This database is first digitized, i.e., represented using real numbers.


In this digitization step, a molecule is first represented using a molecular descriptor, and fingerprints are subsequently extracted from this descriptor. Various molecular descriptors have been proposed in the literature, and their relative advantages are expounded in earlier works. 23,26,27 The molecular descriptors used in the literature can be classified into three levels of granularity. 27 Gross level descriptors, such as general attributes of the atoms in the molecule and properties such as band gap, are used when moderate prediction accuracies are desired. When the application demands very high prediction accuracies, sub-Angstrom level fingerprints such as the Coulomb matrix 28 are used. In contrast, when reasonably high accuracy across a wide chemical space is desired, molecular fragment level descriptors, such as the presence of a particular functional group in the molecular structure, are used. 8,19 In keeping with the key aim of this paper, which is to provide simple mathematical expressions for reasonably accurate property predictions for a wide range of organic molecules, molecular fragment level descriptors are used for fingerprinting. In particular, we use the canonical SMILES (Simplified Molecular-Input Line-Entry System) representation as the molecular descriptor. 29 The molecular fingerprints are then extracted from the SMILES representation. The molecular fingerprints are defined by a number of pre-determined chemical structures present in the molecule. We consider these chemical structures at three levels of hierarchy. The first level of structures includes the types of atoms, bonds, number of side chains, etc. At the second level, the presence of primary functional groups such as alkene and benzene ring, and of other basic chemical structures such as double bonded oxygen and carbon atoms, is identified. At the third level of hierarchy, the presence of more complex functional groups, such as anhydride, and of conjugated systems of different lengths is considered. When considered together, these fingerprints encode complete geometric and Hamiltonian information of the organic molecules. 8 These chemical structures are identified using substrings of the SMILES representation. A comprehensive list of all the molecular fingerprints used in this paper is given in Table 1. The table also lists the respective SMILES substrings used to identify the fingerprints. In the rest of the paper, the fingerprints are identified by their respective serial numbers in Table 1.
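As an illustration of substring-based fingerprint extraction, the following minimal Python sketch counts occurrences of a few SMILES substrings. The pattern dictionary `ILLUSTRATIVE_PATTERNS` and the function name `count_fingerprints` are hypothetical and introduced here only for illustration; the actual patterns are those listed in Table 1, and the sketch assumes that each fingerprint is a simple non-overlapping substring count.

```python
# Hypothetical substring patterns for illustration only; the patterns
# actually used in the paper are those listed in its Table 1.
ILLUSTRATIVE_PATTERNS = {
    "double_bonded_oxygen": "=O",
    "benzene_ring": "c1ccccc1",
    "nitrile": "C#N",
    "fluorine": "F",
}


def count_fingerprints(smiles, patterns=ILLUSTRATIVE_PATTERNS):
    """Return {fingerprint name: number of non-overlapping substring matches}."""
    return {name: smiles.count(sub) for name, sub in patterns.items()}


# Acetophenone written as a SMILES string.
print(count_fingerprints("CC(=O)c1ccccc1"))
```

Plain substring matching depends on how the SMILES string is written, which is one reason a canonical SMILES representation is needed before the fingerprints are extracted.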

2.1 Principal Component Regression

Although various machine learning algorithms can be used to obtain the requisite correlation (see, e.g., 8,22,24), our choice of algorithm is motivated by the key aim of this paper: to provide a simple functional form for the structure-property correlation. Thus, linear least squares regression is used in this paper to obtain the empirical relation for the structure-property correlation. The linear least squares regression estimates the parameters of the functional form

$$ y = \sum_{i=1}^{m} \beta_i f_i(x), \tag{1} $$

where $x$ is a set of $n$ input features, often derived from the fingerprints of molecular descriptors, $y$ is a set of $d$ output properties such as redox potentials and orbital energies, and $f_i(x)$ are basis functions. The unknown parameters $\beta_i$ are estimated by minimizing the squared error between the data and the predictions. For a given dataset $D \in \mathbb{R}^{N \times d}$ of $N$ output property values and a corresponding matrix of basis functions $F \in \mathbb{R}^{N \times m}$, the least squares estimate of $\beta$ is given by

$$ \beta = \left( F^{T} F \right)^{-1} F^{T} D. \tag{2} $$
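The least squares estimate of Eq. 2 can be checked numerically with a short NumPy sketch. The data below are synthetic and the sizes `N`, `m` and `d` are arbitrary choices made for illustration; none of them come from the paper. The sketch evaluates Eq. 2 through a linear solve and compares it with NumPy's least squares solver, which gives the same result for a full-rank matrix while avoiding the explicit normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, d = 200, 5, 2                      # samples, basis functions, output properties
F = rng.normal(size=(N, m))              # synthetic matrix of basis function values
beta_true = rng.normal(size=(m, d))      # synthetic "true" coefficients
D = F @ beta_true + 0.01 * rng.normal(size=(N, d))   # synthetic property data

# Eq. 2 written literally, using a linear solve instead of an explicit inverse.
beta_normal = np.linalg.solve(F.T @ F, F.T @ D)

# Equivalent estimate from NumPy's least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(F, D, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```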

The accuracy of the resultant correlation is further improved by combining the least squares regression with principal component analysis. The resultant algorithm is known as principal component regression (PCR). The PCR is implemented by first using the singular value decomposition (SVD) of $F$ to obtain

$$ F = U S V^{T}, \tag{3} $$

where $S$ is a rectangular diagonal matrix of singular values, while $U$ and $V$ are orthogonal matrices. The SVD is used to obtain the eigenvalue decomposition of $F^{T} F$ as

$$ F^{T} F = V S^{2} V^{T}, \tag{4} $$

where $S^{2}$ is a diagonal matrix of eigenvalues, while the columns of $V$ are the corresponding eigenvectors. As the eigenvalues decay rapidly, only the eigenmodes corresponding to the first $k$ eigenvalues, also known as the principal component directions, participate in the further computations. Let $V_k$ be the matrix consisting of the first $k$ columns of $V$. The corresponding matrix of the first $k$ principal components is then given by

$$ \Phi_k = F V_k. \tag{5} $$

We use a least squares estimate to obtain the regression coefficients of a functional mapping between the principal components $\Phi_k$ and the dataset $D$ as

$$ \gamma = \left( \Phi_k^{T} \Phi_k \right)^{-1} \Phi_k^{T} D. \tag{6} $$

Finally, the PCR estimate of $\beta$ is given by

$$ \beta = V_k \gamma. \tag{7} $$

We use the estimated $\beta$ in Eq. 1 to obtain the desired empirical relation for the structure-property correlation.
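The PCR steps of Eqs. 5 to 7 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the QSROAR implementation: the function name `pcr_fit` is hypothetical, the data are synthetic, and no column centering is applied because the text does not mention it. NumPy's `svd` returns the factors as `U, s, Vt` with `F = U @ diag(s) @ Vt`, so the leading principal directions are the first rows of `Vt`.

```python
import numpy as np


def pcr_fit(F, D, k):
    """Principal component regression keeping the first k components."""
    # Thin SVD; singular values are returned in decreasing order.
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    Vk = Vt[:k].T                          # first k principal directions as columns
    Phi_k = F @ Vk                         # Eq. 5: principal components
    gamma = np.linalg.solve(Phi_k.T @ Phi_k, Phi_k.T @ D)   # Eq. 6
    return Vk @ gamma                      # Eq. 7: coefficients in the original basis


rng = np.random.default_rng(1)
F = rng.normal(size=(100, 4))              # synthetic basis function matrix
D = rng.normal(size=(100, 2))              # synthetic property data
beta_k2 = pcr_fit(F, D, k=2)               # truncated model
beta_full = pcr_fit(F, D, k=4)             # all components retained
```

When all components are retained (k equal to the number of columns of F), the PCR estimate coincides with the ordinary least squares estimate; truncating to a smaller k discards the directions with the smallest singular values.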


3 Results and Discussion

3.1 Database Creation

To generate the database of DFT computed properties, quantum chemical calculations were performed using the Gaussian 09 software package 30 on 77,547 molecules randomly selected from the PubChem database 25 with compound identifier (CID) numbers between 1 and 136200. The B3LYP functional and the 6-311+G(d,p) basis set were employed with the polarized continuum model 31 to describe the solvation environment. The solution environment was constructed using an effective dielectric constant of 28.86, which corresponds to the average value for ethylene carbonate, diethyl carbonate, and ethyl methyl carbonate at a ratio of 3:5:2. The geometries of all charged and neutral molecules were optimized in the solvation environment without any constraints. Frequency calculations were carried out to check for minima on the potential energy surface. The oxidation potential (Eox) was calculated from the difference in the total free energies of the oxidized and neutral molecules, where the oxidation reaction involves removing an electron from the neutral molecule. The reduction potential (Ered) was calculated from the difference in the total free energies of the reduced and neutral molecules, where the reduction reaction involves adding an electron to the neutral molecule. All oxidation and reduction potentials were referenced to the Standard Hydrogen Electrode (SHE) potential by subtracting 4.44 V, allowing us to omit 'vs. SHE' hereafter. In addition to the redox potentials, the HOMO and LUMO energies of the molecules are noted to understand their correlation with the redox potentials.
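The conversion from free energies to SHE-referenced potentials can be written out as a small Python sketch. The function names are hypothetical, the free energies are assumed to be given in Hartree, and the sign convention (Eox from G(oxidized) minus G(neutral), Ered from G(neutral) minus G(reduced), both for one-electron processes) is an assumption made for illustration, since the text only states that free energy differences are used.

```python
HARTREE_TO_EV = 27.211386   # energy conversion factor, eV per Hartree
SHE_ABSOLUTE_V = 4.44       # absolute SHE potential subtracted in the text, in V


def oxidation_potential(g_neutral, g_oxidized):
    """Eox in V vs. SHE for the one-electron oxidation M -> M+ + e-."""
    delta_g_ev = (g_oxidized - g_neutral) * HARTREE_TO_EV
    # For a one-electron process, eV per molecule and V are numerically equal.
    return delta_g_ev - SHE_ABSOLUTE_V


def reduction_potential(g_neutral, g_reduced):
    """Ered in V vs. SHE for the one-electron reduction M + e- -> M-."""
    delta_g_ev = (g_reduced - g_neutral) * HARTREE_TO_EV
    return -delta_g_ev - SHE_ABSOLUTE_V


# Made-up free energies in Hartree, used only to exercise the functions.
print(oxidation_potential(-100.0, -99.75))
print(reduction_potential(-100.0, -100.05))
```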

3.1.1 Statistical Data Analysis

In Figure 1, the calculated Eox values are plotted against the corresponding HOMO values, and the Ered values are plotted against the corresponding LUMO values, for all the molecules in the database. While correlations between Eox and HOMO and between Ered and LUMO are readily seen, there is a wider spread in the Ered vs LUMO plot. To illustrate the diversity of the data, the HOMO, LUMO, Eox, and Ered values are plotted using a violin plot 32 (bottom plot of Fig. 1). The violin plot shows the probability distribution of the data obtained using kernel density estimation. 33 Similar to a box plot, the inverted probability distribution is symmetrically overlaid on the violin plot (magenta shaded region in Fig. 1). The extreme values and the mean of the sample are also shown using blue lines in the figure. The distribution of reduction potential values in the database is symmetric and close to Gaussian. The LUMO values are also symmetric, though the distribution shows multiple peaks. The HOMO values show a heavy-tailed distribution with a single peak and a long tail toward the minimum. However, this heavy tail is not reflected in the oxidation potential sample: even though the HOMO sample has a heavy-tailed distribution, the distribution of the oxidation potential is still close to Gaussian. If the violin plot for HOMO is inverted, it resembles the violin plot for oxidation potentials in many of its features. Similarly, if the violin plot for LUMO is inverted, many of its features closely resemble those of the reduction potential, in terms of asymmetry around the mean and the number of peaks. This qualitative similarity stems from the fact that HOMO and LUMO are tightly connected to Eox and Ered, respectively. However, the plots are neither exact mirror images nor linearly scaled mirror images of each other. If there were a linear correlation with minimal scatter, the violin plots of the variables under consideration would match both qualitatively and quantitatively. This observation necessitates the use of non-linear basis functions for the desired empirical relations.

3.1.2 Outliers Detection

The violin plot in Fig. 1 shows the presence of statistical outliers, which may arise from molecules with extreme properties and from inherent model uncertainties in the quantum calculations. 34 Errors in these statistical outliers can propagate to the desired structure-property relation, impacting the overall accuracy of the model. To identify these statistical outliers, the dataset is divided into four equal quartiles. Subsequently, the interquartile range (IQR) 35 for each property is obtained as $\mathrm{IQR} = Q_3 - Q_1$, where $Q_1$ and $Q_3$ are the upper limit of the first quartile and the lower limit of the fourth quartile, respectively. We use the IQR values to define the statistical outliers as molecules with property values outside the limits given by $Q_1 - 1.5\,\mathrm{IQR}$ and $Q_3 + 1.5\,\mathrm{IQR}$. Using this methodology, 1674 molecules were identified as statistical outliers. These molecules are not considered in further analysis.
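The IQR criterion can be illustrated with a short NumPy sketch for a single property. The function name `iqr_outlier_mask` and the sample values are hypothetical, and the quartiles are taken from NumPy's default (linearly interpolated) percentiles, which is an assumption since the text does not specify the quartile estimator.

```python
import numpy as np


def iqr_outlier_mask(values, whisker=1.5):
    """Boolean mask marking values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return (values < lower) | (values > upper)


# Made-up property values with one extreme entry.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
mask = iqr_outlier_mask(sample)
cleaned = sample[~mask]
```

With several properties per molecule, one possible choice is to discard a molecule if any of its properties is flagged; this combination rule is likewise an assumption of the sketch.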

3.1.3 Data Normalization

The PCR method is sensitive to the range of the database values, and its performance degrades significantly if the input and output data ranges differ greatly. To ensure a robust correlation, we normalize the dataset using the respective minimum and maximum values of the fingerprints and the output properties. The fingerprint normalization is done by

$$ x = -2 + \frac{4 \left( \hat{x} - x_{\min} \right)}{x_{\max} - x_{\min}}, \tag{8} $$

where $\hat{x}$ is an unnormalized value, while $x_{\min}$ and $x_{\max}$ are the corresponding minimum and maximum values. The output properties are similarly normalized as

$$ y = \frac{\hat{y} - y_{\min}}{y_{\max} - y_{\min}}. \tag{9} $$

Table 2 summarizes the minimum and maximum values of the HOMO-LUMO energies and the redox potentials.
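Equations 8 and 9 map the fingerprints onto the interval [-2, 2] and the output properties onto [0, 1]. A minimal Python sketch of both mappings, together with the inverse of Eq. 9 needed to convert a normalized prediction back to physical units, is given below. The function names are hypothetical, and the numerical values in the usage lines are made up rather than taken from Table 2.

```python
def normalize_fingerprint(x_raw, x_min, x_max):
    """Eq. 8: map [x_min, x_max] linearly onto [-2, 2]."""
    return -2.0 + 4.0 * (x_raw - x_min) / (x_max - x_min)


def normalize_property(y_raw, y_min, y_max):
    """Eq. 9: map [y_min, y_max] linearly onto [0, 1]."""
    return (y_raw - y_min) / (y_max - y_min)


def denormalize_property(y, y_min, y_max):
    """Inverse of Eq. 9: recover the property in its original units."""
    return y_min + y * (y_max - y_min)


# Made-up ranges: a fingerprint count between 0 and 4, a property between -10 and 0.
x = normalize_fingerprint(2, 0, 4)
y = normalize_property(-5.0, -10.0, 0.0)
```

A fingerprint whose minimum and maximum coincide would lead to a division by zero and has to be handled separately.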


3.2 Basis Function Selection

Implementation of the PCR method is completed by choosing appropriate basis functions. Our choice of basis functions is guided by Occam's razor, 36 which prefers simpler basis functions over complex ones. Moreover, the use of simpler basis functions helps avoid model over-fitting. This paper uses linear, hyperbolic tangent and sigmoid basis functions. For an input $x$, the hyperbolic tangent function is given by

$$ \tau(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \tag{10} $$

where $e^{x}$ denotes the exponential function. The hyperbolic tangent function is bounded between $-1$ and $1$. Similarly, the sigmoid function is given by

$$ \sigma(x) = \frac{1}{1 + e^{-x}}, \tag{11} $$

which is bounded between 0 and 1. The basis functions are shown in the top plot of Figure S2 of the supplementary material. Using these basis functions, the desired structure-property relationship is given by

$$ y = \beta_0 + \sum_{i=1}^{N} \beta_{L,i}\, x_i + \sum_{i=1}^{N} \beta_{\tau,i}\, \tau(x_i) + \sum_{i=1}^{N} \beta_{\sigma,i}\, \sigma(x_i), \tag{12} $$

where $N$ denotes the total number of molecular fingerprints, $x_i$ is the $i$th normalized fingerprint, and $\beta_{L,i}$, $\beta_{\tau,i}$ and $\beta_{\sigma,i}$ are the regression coefficients corresponding to the linear, hyperbolic tangent and sigmoid basis functions, respectively. Note that the basis functions non-linearly transform the scaled input values, which are further transformed using principal component analysis. The resultant transformation of the probability density function is analyzed in the violin plot of Figure S1 of the supplementary material.
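The functional form of Eq. 12 amounts to a linear model on an expanded feature matrix. The following NumPy sketch, with the hypothetical helper names `design_matrix` and `predict`, builds one constant column followed by the linear, hyperbolic tangent and sigmoid transforms of each normalized fingerprint. The column ordering is an arbitrary choice made for illustration and need not match the ordering used in QSROAR.

```python
import numpy as np


def sigmoid(x):
    """Eq. 11."""
    return 1.0 / (1.0 + np.exp(-x))


def design_matrix(X):
    """Columns: constant, x_i, tanh(x_i), sigmoid(x_i) for each fingerprint i."""
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X, np.tanh(X), sigmoid(X)])


def predict(X, beta):
    """Eq. 12 evaluated for every row of X, with beta ordered like the columns."""
    return design_matrix(X) @ beta


# Three molecules with two normalized fingerprints each, all equal to zero.
X = np.zeros((3, 2))
beta = np.ones(7)            # made-up coefficients: 1 + 3 * 2 columns
print(predict(X, beta))
```

With N fingerprints the expanded matrix has 1 + 3N columns, and the coefficients of Eq. 12 are the PCR estimates obtained on this matrix.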

The selected basis functions emphasize different regions of the input data range, as can be observed from the violin plot. After data transformation using the hyperbolic tangent function, the transformed probability distribution closely resembles a uniform distribution, although with a peak near the maximum. In contrast, the sigmoid transformation approximately retains the shape and features of the original probability distribution, while enhancing the features near the extrema. The violin plot also shows the probability distributions of the first eight principal components (bottom plot, Figure S1 of SI). All of these probability distributions closely approximate the Gaussian distribution. This observation is consistent with the well known equivalence between PCA and the Karhunen-Loève expansion. 37

3.3 Feature Selection

To pursue a quantitative relationship between molecular fingerprints, redox potentials and frontier orbital energy levels, we develop predictive models in the form of mathematical relations that use chemical information about the molecules as descriptors. Many numerical descriptors can be computed for a given molecule. Initially, a set of 82 descriptors, given in Table 1 and consisting of types of atoms, bonds, or functional groups, was considered. Prediction models of Eox, Ered, HOMO, and LUMO are then constructed using the functional form given in Eq. 12. To estimate the prediction error of the models, the whole dataset is randomly divided into training data composed of 80% of the dataset (60,698 molecules in total) and test data comprising the rest. Since some of the 82 descriptors are possibly very closely related to each other, and at times even capture the same information with respect to a property of interest, the selection of relevant descriptors is not necessarily intuitive. Therefore, we first progressively added one descriptor at a time from the list to see how it affected the RMSE and R² value of the fit (see Figure S2 of the supplementary information). In general, adding more descriptors to the model improved the fit. However, the aim was to use as few fingerprints as possible while still obtaining a reasonably accurate prediction for the property under consideration. To achieve this, we removed the least sensitive feature at every step and tracked the RMSE and R² values of the fit (as shown in Figure S3 of SI). Following this procedure, we identified individual sets of the most relevant features for the mathematical expressions of the HOMO-LUMO energies and the redox potentials.
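The size of the training set follows from the numbers quoted above: 77547 molecules minus 1674 outliers leaves 75873 molecules, and 80% of these, rounded down, gives 60698. A minimal NumPy sketch of a random split and of the two fit metrics is given below; the function names, the random seed and the use of a simple permutation are assumptions made for illustration.

```python
import numpy as np


def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Randomly split range(n_samples) into training and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(train_fraction * n_samples)
    return order[:n_train], order[n_train:]


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


n_molecules = 77547 - 1674                 # dataset size after outlier removal
train_idx, test_idx = train_test_split_indices(n_molecules)
print(len(train_idx), len(test_idx))       # 60698 15175
```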

We first obtain an empirical relationship for the LUMO energy as a function of molecular fingerprints. Subsequently, the predicted LUMO values are used together with other fingerprints to obtain a mathematical expression for the HOMO-LUMO gap as a function of the LUMO energy and the chemical fingerprints. Finally, the HOMO energy is obtained by adding the LUMO energy to the HOMO-LUMO gap. Using the sensitivity analysis, a total of 25 fingerprints are selected for these mathematical expressions. For the LUMO energy, the secondary aldimine and sulfinyl groups were removed in the initial iterations. Similarly, for the HOMO-LUMO gap, the imine and sulfinyl groups were identified as the least significant groups and removed in the initial iterations. The selected features for the LUMO energy are summarized in Table 3, while the features for the HOMO-LUMO gap are summarized in Table 4. Note that a total of 12 fingerprints are common to the LUMO energy and the HOMO-LUMO gap, signifying that the electronic Hamiltonians of these two properties are similar. However, the HOMO-LUMO energy also depends on the type of molecule and the functional group. Such differences are captured by the presence of different features in the LUMO energy and HOMO-LUMO gap correlations. We also explored a mathematical expression for the HOMO energy as a function of molecular fingerprints and, subsequently, for the HOMO-LUMO gap as a function of the HOMO energy and the chemical fingerprints. However, because the resultant R² and RMSE values were comparatively poor, these expressions are not reported in this paper.

Following a similar procedure, the relevant fingerprints for the redox potentials are identified. In addition to the molecular fingerprints, we have also included the HOMO-LUMO energies as molecular features for the redox potentials. With 20 features, R² > 0.9 is obtained for both the oxidation and reduction potential predictions. For the oxidation potential, the acyl halide groups were removed in the initial iterations, with the bromine-containing acyl halide removed in the first iteration. For the reduction potential, the sulfinyl group was removed in the first iteration, and the bromine-containing acyl halide was removed in the third iteration. Table 5 summarizes the features selected by this method for the oxidation potential; the corresponding features for the reduction potential are summarized in Table 6.

In Figure 2, we investigate the statistical correlation between the molecular properties and the selected fingerprints using the Pearson correlation coefficient (only the heatmap relevant to the redox potentials is shown). For better visibility, only the pairwise correlations with absolute value greater than 0.3 are shown in the figure. Many important insights can be drawn from this correlation plot. The oxidation potential is found to be negatively correlated with the HOMO value, consistent with the fact that the more negative the HOMO level, the harder it is to remove an electron from the molecule and hence the higher the oxidation potential. A high negative correlation is also observed between the oxidation potential and both the number of carbon atoms and the number of alkene groups in the molecular structure, which again can be explained by the fact that the HOMO has a positive correlation with both. The reduction potential has a strong negative correlation with the LUMO, since a more negative LUMO value corresponds to a higher electron affinity and, by definition, a higher reduction potential. The LUMO has a strong negative correlation with the number of double bonds, the number of oxygen anions, and the number of double-bonded oxygen (=O) groups, since all of these have the effect of increasing the electron affinity.
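The thresholded correlation analysis can be illustrated with a short sketch. The helper names `pearson` and `strong_pairs` are hypothetical, and the column values below are invented purely to exercise the code; they are not entries from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        # Correlation is undefined for a constant column; reported as 0.0 here.
        return 0.0
    return cov / (sx * sy)


def strong_pairs(columns, threshold=0.3):
    """Keep only pairs whose |r| exceeds the threshold (0.3 in Figure 2)."""
    names = list(columns)
    kept = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                kept[(a, b)] = r
    return kept


# Invented toy columns: a deeper HOMO paired with a higher oxidation potential,
# plus a fingerprint count that is uncorrelated with both.
columns = {
    "HOMO": [-6.0, -6.5, -7.0, -7.5],
    "Eox": [1.0, 1.5, 2.0, 2.5],
    "n_F": [1, 0, 0, 1],
}
pairs = strong_pairs(columns)
print(pairs)
```

In this toy input only the HOMO/Eox pair survives the filter, with a coefficient of -1, which is the sign of the relationship discussed in the text.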


3.4 Mathematical Expression for HOMO-LUMO Energies

After identifying the relevant features, the PCR algorithm is used with the randomly selected training dataset to obtain the regression coefficients. In Tables 3 and 4, the regression coefficients corresponding to the basis function of each feature are summarized for the LUMO energy and the HOMO-LUMO gap, respectively. The corresponding minimum and maximum values of the fingerprints are also reported in the tables. The regression coefficients are used in the following steps to predict the LUMO energy:

1. Obtain the SMILES representation of the molecular structure.
2. Count the number of fingerprints using the SMILES substrings. Normalize each count using the minimum and maximum values reported in Table 3 for that fingerprint.
3. Use the regression coefficients in Eq. 12 to predict the normalized LUMO energy.
4. Use the minimum and maximum values of the properties (reported in Table 2) to obtain the unnormalized LUMO energy.

The same steps are followed to predict the HOMO-LUMO gap. Subsequently, the LUMO energy is added to the HOMO-LUMO gap to obtain the HOMO energy.
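A minimal sketch of steps 2 and 4 is given below, with step 3 left as a caller-supplied function because the functional form of Eq. 12 is not restated here. Several points are assumptions rather than the authors' implementation: the normalization is taken to be plain min-max scaling to [0, 1], fingerprints are counted with non-overlapping substring matches, and all function names are hypothetical. The minimum and maximum values used in the example are those of Tables 2 and 3.

```python
def count_fingerprint(smiles, substring):
    # Assumption: non-overlapping substring matches. Note that a naive count
    # of "C" would also match the "C" in "Cl"; the paper's rule may differ.
    return smiles.count(substring)


def min_max_normalize(value, lo, hi):
    # Assumed normalization: linear scaling of [lo, hi] onto [0, 1].
    return (value - lo) / (hi - lo)


def min_max_denormalize(value_norm, lo, hi):
    # Inverse of min_max_normalize.
    return value_norm * (hi - lo) + lo


def predict_property(smiles, fingerprint_ranges, model, prop_min, prop_max):
    """fingerprint_ranges: {substring: (min, max)} as listed in Table 3.
    model: hypothetical callable standing in for Eq. 12; it maps the dict of
           normalized counts to a normalized property value.
    """
    x = {
        s: min_max_normalize(count_fingerprint(smiles, s), lo, hi)
        for s, (lo, hi) in fingerprint_ranges.items()
    }
    return min_max_denormalize(model(x), prop_min, prop_max)


# Two fingerprints with their Table 3 ranges, and the LUMO range of Table 2.
ranges = {"=O": (0.0, 9.0), "#": (0.0, 4.0)}
LUMO_MIN, LUMO_MAX = -3.91, 0.70


def placeholder_eq12(x):
    # NOT Eq. 12: returns a fixed normalized value purely for illustration.
    return 0.5


lumo = predict_property("CC(=O)C#N", ranges, placeholder_eq12, LUMO_MIN, LUMO_MAX)
print(round(lumo, 3))
```

Replacing `placeholder_eq12` with an implementation of Eq. 12 and the tabulated coefficients would complete the pipeline; the HOMO energy then follows by adding the predicted LUMO energy to the predicted gap.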

In Figure 3, Q-Q plots of the predicted HOMO-LUMO gap and LUMO energy obtained using the proposed mathematical expressions are given. The Q-Q plots are given separately for the training and test datasets. A 45° line (shown as a solid black line in the figures) indicates 100% accuracy in the prediction. As can be observed from the figure, the predictions are close to the 45° line, indicating high accuracy. For the HOMO-LUMO gap, R² = 0.921 is obtained for the training dataset and R² = 0.923 for the test dataset. Similarly, for the LUMO energy, R² = 0.857 is obtained for the training dataset and R² = 0.856 for the test dataset.
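For reference, the R² values quoted for such predicted-versus-true comparisons can be computed as in the sketch below. It assumes the usual coefficient of determination, 1 - SS_res/SS_tot; the function name and the sample values are illustrative and not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented LUMO-like values in eV, used only to exercise the function.
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.1, -1.9, -3.2, -3.8]
r2 = r_squared(y_true, y_pred)
print(round(r2, 3))
```

Points lying exactly on the 45° line give R² = 1; scatter about the line lowers the value.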


3.5 Mathematical Expression for Redox Potentials

Regression coefficients for the oxidation and reduction potentials are provided in Tables 5 and 6, respectively. The minimum and maximum values of the respective functional groups are also provided in the tables. Using the regression coefficients, the redox potential is predicted by two approaches. In the first approach (denoted method 1 in the following), we predict the HOMO-LUMO energies as outlined in subsection 3.4. Subsequently, these HOMO-LUMO energies are used to predict the redox potential. Using this approach, the oxidation potential is predicted with R² = 0.731 for the training dataset and R² = 0.737 for the test dataset. Similarly, the R² values for the reduction potential are 0.813 and 0.814 for the training and test datasets, respectively. The resultant Q-Q plot is given in Figure S4.

In the second approach (denoted method 2), we use the HOMO-LUMO energies obtained from the DFT calculations to predict the redox potential. Figure 4 shows the Q-Q plot for the redox potential prediction. For the oxidation potential, the correlation gives R² = 0.909 for the training dataset and R² = 0.906 for the test dataset. Similarly, for the reduction potential, the correlation gives R² = 0.901 for the training dataset and R² = 0.894 for the test dataset. In Table 7, the RMSE and MAE for the prediction of all four properties are summarized. As the R² values are similar for the training and test datasets, the RMSE and MAE values are reported for the complete dataset. The percentages of molecules with absolute prediction error greater than 0.5 (e)V and less than 0.1 (e)V are also reported in the table. The resultant error distribution is summarized in Fig. S5. Pereira et al. 22 have compared the accuracy of different machine learning algorithms for HOMO-LUMO energy predictions. For comparison, we consider the average of these prediction accuracies. Using the machine learning algorithms, Pereira et al. 22 predict the HOMO energy with a mean RMSE of 0.30 eV and a mean MAE of 0.22 eV. Similarly, they predict the LUMO energy with a mean RMSE of 0.35 eV and a mean MAE of 0.25 eV. On average, the HOMO and LUMO energies are predicted with a deviation of less than 0.5 eV for 91.28% and 87.09% of the molecules, respectively. Using the structure-property relationships proposed in this paper, the HOMO and LUMO energies are predicted with MAE/RMSE of 0.32/0.41 eV and 0.30/0.37 eV, while the HOMO and LUMO energies are predicted with a deviation of less than 0.5 eV for 78.15% and 80.59% of the molecules, respectively. As is evident from the table, the prediction accuracy is significantly higher for the redox potentials, where the oxidation potential is predicted with an RMSE of 0.22 V and an MAE of 0.18 V, while the reduction potential is predicted with an RMSE of 0.28 V and an MAE of 0.23 V.
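The error statistics discussed here (RMSE, MAE, and the percentages of molecules below 0.1 or above 0.5 (e)V) can be computed as in the following sketch. The function name and the sample values are illustrative assumptions, not data from Table 7.

```python
import math


def error_summary(y_true, y_pred, small=0.1, large=0.5):
    """RMSE, MAE, and the share of samples with small or large absolute error."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_err)
    return {
        "rmse": math.sqrt(sum(e * e for e in abs_err) / n),
        "mae": sum(abs_err) / n,
        "pct_below_small": 100.0 * sum(e < small for e in abs_err) / n,
        "pct_above_large": 100.0 * sum(e > large for e in abs_err) / n,
    }


# Invented potentials in V, used only to exercise the function.
true_v = [1.0, 2.0, 3.0, 4.0]
pred_v = [1.05, 2.3, 3.0, 4.6]
summary = error_summary(true_v, pred_v)
print(summary)
```

Because the RMSE weights large deviations more heavily, it is never smaller than the MAE for the same set of predictions, which is a quick consistency check when reading such tables.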

One application of our predictors is the down-selection of candidate structures from large databases for further computationally expensive DFT calculations or experimental evaluation. Further, our method can be extended to include more solvents and other properties such as polarizability, Lewis acidity/basicity, pKa, etc. We emphasize that, using method 1, the prediction of all four properties for an out-of-sample molecule requires only milliseconds, as opposed to the several CPU hours required by the DFT method used for training. We further emphasize that our predictions are strictly limited to out-of-sample molecules that interpolate, that is, molecules that resemble those in the training set in terms of the molecular fingerprints they contain. For molecules that bear no resemblance to the training set, our model is not expected to yield accurate predictions. To illustrate this further, we considered the relative occurrence of different fingerprints and how it affected the prediction accuracy. In Figure 5, we have plotted a histogram of the fraction of molecules containing a given molecular fingerprint for which the absolute prediction error was > 0.5 eV for frontier orbitals and > 0.5 V for redox potentials. Also indicated are the mean error and the number of molecules containing a given molecular fingerprint. This analysis suggests that the prediction error depends on whether a specific molecular fingerprint was well represented in the database. This is particularly true for fingerprints that contain functional groups with heteroatoms. The prediction error is generally high for molecules with functional groups that are not well represented in the training dataset. To avoid over-fitting, further improvement of the prediction accuracy for such molecules is not pursued in this paper.
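The per-fingerprint analysis behind Figure 5 can be sketched as below. This is a hypothetical reconstruction: membership is tested with a plain substring match on the SMILES string, the function name is invented, and the molecules and errors are toy values rather than entries from the database.

```python
def high_error_stats(molecules, fingerprint, threshold=0.5):
    """molecules: list of (smiles, absolute_prediction_error) pairs.

    Returns, for molecules containing the fingerprint, the count above the
    threshold, the total count, their ratio, and the mean absolute error.
    Returns None if no molecule contains the fingerprint.
    """
    errors = [err for smiles, err in molecules if fingerprint in smiles]
    if not errors:
        return None
    n_high = sum(err > threshold for err in errors)
    return {
        "n_high": n_high,
        "n_total": len(errors),
        "fraction": n_high / len(errors),
        "mean_error": sum(errors) / len(errors),
    }


# Toy molecules with invented absolute errors (eV or V).
molecules = [
    ("CC(=O)O", 0.2),
    ("CS(=O)C", 0.7),
    ("CS(=O)CC", 0.9),
    ("CCO", 0.1),
]
rare = high_error_stats(molecules, "S(=O)")
common = high_error_stats(molecules, "C")
print(rare, common)
```

A fingerprint with a small `n_total` and a large `fraction` corresponds to the poorly represented functional groups that the text identifies as the main source of large errors.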

Our predictors are defined using simple mathematical expressions with well-defined parameters. As such, these predictors are readily amenable to chemical insights and intuitions that can significantly aid high-throughput screening for materials discovery. For instance, the effect of adding a particular functional group on the molecular properties can be investigated simply by adding the corresponding coefficients, multiplied by the basis function values, to the known property values. Such insight can be used to expand the existing screening search space. In contrast, although more advanced machine learning approaches provide higher accuracies, those predictors are essentially black-box models. Thus, extracting chemical insights from such machine learning predictions is often difficult.

3.6 Validation for Quinoxaline based Compounds

We now consider some special classes of molecules to demonstrate that, within the same class of molecules, there is a linear correlation between HOMO and Eox as well as between LUMO and Ered. First, we consider quinoxaline based compounds, as they have attracted the attention of the research community due to their potential applications as organic additives for Li-ion batteries. 8 Earlier studies 15,39 have established a linear relationship between the HOMO-LUMO energies and the redox potentials. In Figure 6, this linear relationship is explored for quinoxaline based organic compounds. For reference, the molecular structure of quinoxaline is also shown in the figure. A total of 84 quinoxaline based compounds are identified in the available dataset. The HOMO-LUMO and redox potential values for these compounds are shown in the figure using green dots. Using least-squares error minimization, a linear correlation between the HOMO-LUMO energies and the redox potentials is obtained. The solid black line in the figure shows the resultant linear correlation. The redox potential values predicted using the correlation developed in this paper are also shown in the figure using blue triangles. The root-mean-squared error of the linear relation is 0.14 V for the oxidation potential and 0.25 V for the reduction potential. The corresponding root-mean-squared errors of the proposed correlation are 0.2 V and 0.3 V, respectively.
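The least-squares line used for this comparison can be obtained as in the sketch below. The function name is hypothetical, and the HOMO values are invented; the oxidation potentials are generated from the line y = -0.825x - 3.63 quoted in the legend of Figure 6, so the fit simply recovers that slope and intercept.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented HOMO energies (eV); potentials (V) placed exactly on the line
# reported in the Figure 6 legend for oxidation potential versus HOMO.
homo = [-7.0, -6.5, -6.0, -5.5]
eox = [-0.825 * h - 3.63 for h in homo]

slope, intercept = fit_line(homo, eox)
print(round(slope, 3), round(intercept, 2))
```

With real DFT data the points scatter about the line, and the residuals of this fit give the root-mean-squared error of the linear relation reported in the text.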

3.7 Validation for Polyacenes

The proposed correlation is also tested for polyacenes, which have potential applications in optoelectronics. Figure 7 compares the true and predicted values of the HOMO-LUMO energies and the redox potentials for the polyacenes (see also Table S1 in the SI). For comparison, we performed DFT calculations for six polyacenes (benzene to hexacene) to obtain the HOMO-LUMO energies and the redox potentials. When method 1 is used for the redox potential prediction, the MAE for the LUMO prediction is 0.2 eV, while the corresponding MAE for the HOMO prediction is 0.67 eV. When these predicted HOMO-LUMO energies are used, the MAE for the reduction potential prediction is 0.27 V, while the corresponding MAE for the oxidation potential prediction is 0.54 V. When method 2 is used, the MAE for the oxidation potential prediction improves to 0.24 V, while the MAE for the reduction potential prediction remains unchanged at 0.27 V.

4 Conclusions

In this study, we have developed prediction models for the redox potentials and the HOMO and LUMO values of a large set of organic molecules, using fundamental information about constituent elements and functional groups as fingerprints. Principal component regression was used to develop a mathematical model that captures the correlation between the redox potentials, the HOMO-LUMO values, and a minimal set of molecular fingerprints chosen on the basis of their relative influence on the predicted property. This model enabled the estimation of DFT HOMO and LUMO energies with accuracies up to 0.33 eV and 0.35 eV, respectively. Using the estimated HOMO-LUMO energies, the oxidation and reduction potentials can be predicted with accuracies up to 0.27 V and 0.29 V, respectively. The redox potential prediction accuracy was further improved by using DFT-calculated HOMO-LUMO energies, with which the oxidation and reduction potentials were predicted with MAE up to 0.18 V and 0.23 V, respectively. These quantitative empirical relations between structure and property are useful as predictors for high-throughput screening of organic molecules for a variety of applications, at a fraction of the computational cost of full-fledged quantum chemistry calculations.

Supporting Information Available

Figure S1 describes the feature transformations used for PCR. Figure S2 shows the change in the RMSE and R² values when the fingerprints are added sequentially. Figure S3 shows the change in the RMSE and R² values when the least sensitive fingerprints are removed sequentially. The Q-Q plot for redox potential prediction using method 1 is shown in Figure S4. Figure S5 shows the probability density function of the error in the property predictions. DFT-calculated HOMO-LUMO energies and redox potential values are compared against the predictions in Table S1.

References

(1) Zhao, J.; Li, Y.; Yang, G.; Jiang, K.; Lin, H.; Ade, H.; Ma, W.; Yan, H. Efficient organic solar cells processed from hydrocarbon solvents. Nature Energy 2016, 1, 15027.

(2) Klauk, H. Organic thin-film transistors. Chemical Society Reviews 2010, 39, 2643–2666.

(3) Su, J.; Vayssieres, L. A Place in the Sun for Artificial Photosynthesis? ACS Energy Letters 2016, 1, 121–135.

(4) Park, M. S.; Kang, Y.-S.; Im, D.; Doo, S.-G.; Chang, H. Design of novel additives and nonaqueous solvents for lithium-ion batteries through screening of cyclic organic molecules: an ab initio study of redox potentials. Physical Chemistry Chemical Physics 2014, 16, 22391–22398.

(5) Baeg, K.-J.; Binda, M.; Natali, D.; Caironi, M.; Noh, Y.-Y. Organic light detectors: photodiodes and phototransistors. Advanced Materials 2013, 25, 4267–4295.

(6) Anthony, J. E.; Facchetti, A.; Heeney, M.; Marder, S. R.; Zhan, X. n-Type Organic Semiconductors in Organic Electronics. Advanced Materials 2010, 22, 3876–3892.

(7) Odobel, F.; Le Pleux, L.; Pellegrin, Y.; Blart, E. New photovoltaic devices based on the sensitization of p-type semiconductors: challenges and opportunities. Accounts of Chemical Research 2010, 43, 1063–1071.

(8) Park, M. S.; Park, I.; Kang, Y.-S.; Im, D.; Doo, S.-G. A search map for organic additives and solvents applicable in high-voltage rechargeable batteries. Physical Chemistry Chemical Physics 2016, 18, 26807–26815.

(9) Nicholson, R. S. Theory and Application of Cyclic Voltammetry for Measurement of Electrode Reaction Kinetics. Analytical Chemistry 1965, 37, 1351–1355.

(10) Hafner, J.; Wolverton, C.; Ceder, G. Toward computational materials design: the impact of density functional theory on materials research. MRS Bulletin 2006, 31, 659–668.

(11) Ceder, G. Opportunities and challenges for first-principles materials design and applications to Li battery materials. MRS Bulletin 2010, 35, 693–701.

(12) Curtarolo, S.; Hart, G. L.; Nardelli, M. B.; Mingo, N.; Sanvito, S.; Levy, O. The high-throughput highway to computational materials design. Nature Materials 2013, 12, 191–201.

(13) Tian, Y.-H.; Goff, G. S.; Runde, W. H.; Batista, E. R. Exploring Electrochemical Windows of Room-Temperature Ionic Liquids: A Computational Study. The Journal of Physical Chemistry B 2012, 116, 11943–11952.

(14) Dathar, G. K. P.; Pandian, S.; Raju, S. G.; Park, D.-H.; Kang, H.-R.; Hariharan, K. S. Electrochemical Stability of Functionalized Cyclic Phosphonium (CylP+ nA-) Ionic Liquid Based Battery Electrolytes. Journal of The Electrochemical Society 2016, 163, A1057–A1063.

(15) Assary, R. S.; Brushett, F. R.; Curtiss, L. A. Reduction potential predictions of some aromatic nitrogen-containing molecules. RSC Advances 2014, 4, 57442–57451.

(16) Méndez-Hernández, D. D.; Tarakeshwar, P.; Gust, D.; Moore, T. A.; Moore, A. L.; Mujica, V. Simple and accurate correlation of experimental redox potentials and DFT-calculated HOMO/LUMO energies of polycyclic aromatic hydrocarbons. Journal of Molecular Modeling 2013, 19, 2845–2848.

(17) Mueller, T.; Kusne, A. G.; Ramprasad, R. Machine learning in materials science: Recent progress and emerging applications. Rev. Comput. Chem 2015,

(18) De Luna, P.; Wei, J.; Bengio, Y.; Aspuru-Guzik, A.; Sargent, E. Use machine learning to find energy materials. Nature 2017, 552.

(19) Mannodi-Kanakkithodi, A.; Pilania, G.; Huan, T. D.; Lookman, T.; Ramprasad, R. Machine learning strategy for accelerated design of polymer dielectrics. Scientific Reports 2016, 6, 20952.

(20) Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R.; Von Lilienfeld, O. A. Machine learning of molecular electronic properties in chemical compound space. New Journal of Physics 2013, 15, 095003.

(21) Pyzer-Knapp, E. O.; Li, K.; Aspuru-Guzik, A. Learning from the Harvard Clean Energy Project: The use of neural networks to accelerate materials discovery. Advanced Functional Materials 2015, 25, 6495–6502.

(22) Pereira, F.; Xiao, K.; Latino, D. A.; Wu, C.; Zhang, Q.; Aires-de Sousa, J. Machine Learning Methods to Predict Density Functional Theory B3LYP Energies of HOMO and LUMO Orbitals. Journal of Chemical Information and Modeling 2016, 57, 11–21.

(23) Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S.; Dahl, G. E.; Vinyals, O.; Kearnes, S.; Riley, P. F.; von Lilienfeld, O. A. Prediction errors of molecular machine learning models lower than hybrid DFT error. Journal of Chemical Theory and Computation 2017, 13, 5255–5264.

(24) Goh, G. B.; Hodas, N. O.; Vishnu, A. Deep learning for computational chemistry. Journal of Computational Chemistry 2017,

(25) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A. et al. PubChem substance and compound databases. Nucleic Acids Research 2015, 44, D1202–D1213.

(26) Ghiringhelli, L. M.; Vybiral, J.; Levchenko, S. V.; Draxl, C.; Scheffler, M. Big data of materials science: Critical role of the descriptor. Physical Review Letters 2015, 114, 105503.

(27) Ramprasad, R.; Batra, R.; Pilania, G.; Mannodi-Kanakkithodi, A.; Kim, C. Machine learning in materials informatics: recent applications and prospects. npj Computational Materials 2017, 3, 54.

(28) Faber, F.; Lindmaa, A.; von Lilienfeld, O. A.; Armiento, R. Crystal structure representations for machine learning models of formation energies. International Journal of Quantum Chemistry 2015, 115, 1094–1101.

(29) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 1988, 28, 31–36.

(30) Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Montgomery, J. A., Jr.; Vreven, T.; Kudin, K. N.; Burant, J. C. et al. Gaussian 03, Revision D.01. Gaussian, Inc., Wallingford, CT, 2013.

(31) Tomasi, J.; Mennucci, B.; Cammi, R. Quantum mechanical continuum solvation models. Chemical Reviews 2005, 105, 2999–3094.

(32) Hintze, J. L.; Nelson, R. D. Violin plots: a box plot-density trace synergism. The American Statistician 1998, 52, 181–184.

(33) Parzen, E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics 1962, 33, 1065–1076.

(34) Zhang, G.; Musgrave, C. B. Comparison of DFT methods for molecular orbital eigenvalue calculations. The Journal of Physical Chemistry A 2007, 111, 1554–1561.

(35) Wan, X.; Wang, W.; Liu, J.; Tong, T. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Medical Research Methodology 2014, 14, 135.

(36) MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation 1992, 4, 448–472.

(37) Adler, R. J.; Taylor, J. E. Random fields and geometry; Springer Science & Business Media, 2009.

(38) Chen, D.; Su, S.-J.; Cao, Y. Nitrogen heterocycle-containing materials for highly efficient phosphorescent OLEDs with low operating voltage. Journal of Materials Chemistry C 2014, 2, 9565–9578.

(39) Cheng, L.; Assary, R. S.; Qu, X.; Jain, A.; Ong, S. P.; Rajput, N. N.; Persson, K.; Curtiss, L. A. Accelerating electrolyte discovery for energy storage with high-throughput screening. The Journal of Physical Chemistry Letters 2015, 6, 283–291.

Figure 1: Analysis of database. The top figure shows the variation of Eox against HOMO (top left) and Ered against LUMO (top right). The bottom figure shows the diversity of the HOMO-LUMO and redox potential values using a violin plot. Axes: Oxidation Potential (V) versus HOMO (eV); Reduction Potential (V) versus LUMO (eV); Energy (eV) and Potentials (V) for HOMO, LUMO, Oxidation, and Reduction.

Figure 2: The pairwise correlation heatmap of the redox potentials, HOMO, LUMO, and the molecular structural fingerprints. For better visibility, only the pairwise correlations with absolute value greater than 0.3 are considered. In the figure, O, R, H, and L represent oxidation, reduction, HOMO, and LUMO, respectively. Features are identified by their serial numbers in Table 1. Features shown: 3, 4, 5, 6, 7, 8, 15, 17, 18, 20, 23, 24, 25, 26, 27, 28, 34, 36, 37, 40, 45, 48, 50, 53, 54, 63, 70, 76, 77, 82.

Figure 3: Q-Q plots for the prediction of the energy band gap and the LUMO energy, shown separately for the training and test datasets. For the energy band gap, training R² = 0.921 and test R² = 0.923 are obtained. For the LUMO, training R² = 0.857 and test R² = 0.856 are obtained. Axes: Predicted versus True HOMO-LUMO Gap (eV); Predicted versus True LUMO (eV).

Figure 4: Q-Q plots for redox potential prediction, shown separately for the training and test datasets. For the oxidation potential, training R² = 0.909 and test R² = 0.906 are obtained. For the reduction potential, training R² = 0.901 and test R² = 0.894 are obtained. Axes: Predicted versus True Oxid. Pot. (V); Predicted versus True Red. Pot. (V).

Figure 5: Top five functional groups with absolute prediction error > 0.5 (e)V. Panels: HOMO, LUMO, Oxid. Pot., Red. Pot.

Figure 6: Redox potential prediction for quinoxaline based compounds. The molecular structure of the base quinoxaline is also shown in the inset of the top figure. Top panel: oxidation potential (V) versus HOMO energy (eV), with linear fit y = -0.825x - 3.63; RMSE 0.14 V (Linear) and 0.2 V (Correlation). Bottom panel: reduction potential (V) versus LUMO energy (eV), with linear fit y = -0.844x - 3.70; RMSE 0.25 V (Linear) and 0.3 V (Correlation). Legend: DFT, Linear, Using Correlation.

Figure 7: HOMO-LUMO energies and redox potential predictions for polyacenes. Digits in the brackets denote the number of benzene rings in the polyacenes. Panels: Predicted versus True Oxid. Pot. (V) and Red. Pot. (V), using Method 1 and using Method 2; Predicted versus True HOMO (eV); Predicted versus True LUMO (eV).

Table 1: List of All the Fingerprints Used. SMILES Substrings Used to Identify the Fingerprints Are Also Listed.

| Sr. No. | Feature / SMILES substring | Sr. No. | Feature / SMILES substring |
|---|---|---|---|
| 1 | HOMO (no SMILES substring) | 42 | CC(=O)F |
| 2 | LUMO (no SMILES substring) | 43 | CC(=O)Cl |
| 3 | C | 44 | CC(=O)Br |
| 4 | O | 45 | Br |
| 5 | N | 46 | (=O) |
| 6 | = | 47 | (=C |
| 7 | # | 48 | [N+] |
| 8 | No. of atoms identified using '[' | 49 | [N-] |
| 9 | No. of side chains identified using '(' | 50 | [O-] |
| 10 | No. of rings identified using integers | 51 | OC(=O)OC |
| 11 | C= | 52 | CNC |
| 12 | C=CC | 53 | CN(C |
| 13 | C=CC=CC | 54 | C(=N |
| 14 | CO | 55 | C(=NC |
| 15 | CN | 56 | C(=N) |
| 16 | CNO | 57 | C(=NC) |
| 17 | C1=CC=CC=C1 | 58 | C(=O)NC(=O) |
| 18 | C1CCCC1 | 59 | N=[N+]=[N-] |
| 19 | CC | 60 | OC#N |
| 20 | COO | 61 | O[N+](=O)[O-] |
| 21 | CCC | 62 | ON=O |
| 22 | CCCC | 63 | SS |
| 23 | =O | 64 | SO |
| 24 | N=O | 65 | S(=O)(=O)O |
| 25 | F | 66 | S(=O)(=O) |
| 26 | S | 67 | SC#N |
| 27 | Cl | 68 | N=C=S |
| 28 | C=C | 69 | P |
| 29 | C#C | 70 | P(=O)(O) |
| 30 | COC | 71 | OP(=O)(O) |
| 31 | C1CO1 | 72 | C(=S |
| 32 | CC=O | 73 | N1CCCC1 |
| 33 | C(=O) | 74 | No. of chains ending with O |
| 34 | C(=O)O | 75 | No. of chains ending with C |
| 35 | CC(=O)OC(=O)C | 76 | No. of chains ending with N |
| 36 | C(=O)N | 77 | C1=CC=CN=C1 |
| 37 | C#N | 78 | COX |
| 38 | CC(=NC)C | 79 | NC1=CC=CC=C1 |
| 39 | N=C=O | 80 | ClC1=CC=CC=C1 |
| 40 | N=N | 81 | Length of longest ring |
| 41 | CS | 82 | C1=CN=CN=C1 |
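As an illustration of how the substring fingerprints in Table 1 could be counted, the following is a minimal Python sketch. It is not the authors' implementation: the function name `count_fingerprints`, the short `SUBSTRINGS` list, and the use of non-overlapping `str.count` matching are all assumptions made for the example, and the manuscript's own counting rules (for instance, whether the `C` in `Cl` counts towards feature 3) may differ.

```python
# Hypothetical subset of the Table 1 substrings; the full table has 82 entries.
SUBSTRINGS = ["C", "O", "N", "=", "#", "C=C", "C(=O)", "[O-]"]


def count_fingerprints(smiles, substrings=SUBSTRINGS):
    """Count plain substring occurrences in a SMILES string.

    Assumption: non-overlapping matches via str.count. Note that this
    naive count also matches "C" inside "Cl".
    """
    counts = {s: smiles.count(s) for s in substrings}
    # Features 8 and 9 in Table 1: bracket atoms and side chains.
    counts["bracket_atoms"] = smiles.count("[")
    counts["side_chains"] = smiles.count("(")
    # Feature 10 is based on integers; here we only count digit characters
    # (each simple ring closure contributes two of them).
    counts["ring_closure_digits"] = sum(ch.isdigit() for ch in smiles)
    return counts


if __name__ == "__main__":
    print(count_fingerprints("CC(=O)O"))      # acetic acid
    print(count_fingerprints("C1=CC=CC=C1"))  # benzene
```

Plain string matching keeps the features cheap to compute and easy to audit, at the cost of depending on how the SMILES string happens to be written.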

Table 2: Minimum and Maximum Values of Redox Potentials and HOMO-LUMO Energies.

| Property | Min | Max |
|---|---|---|
| Oxidation Potential | -0.49 V | 3.68 V |
| Reduction Potential | -4.89 V | 0.20 V |
| HOMO Energy | -8.50 eV | -4.63 eV |
| LUMO Energy | -3.91 eV | 0.70 eV |


Table 3: Regression Coefficients for the LUMO Energy. Coefficients β are used with Eq. 12 for LUMO Energy Prediction.

LUMO Energy Bias: -1.455

| Sr. No. | Feature / SMILES substring | Min | Max | β_{L,i} | β_{τ,i} | β_{σ,i} |
|---|---|---|---|---|---|---|
| 3 | C | 0 | 46 | 0.3232 | 0.1021 | -1.495 |
| 4 | O | 0 | 21.0 | 0.1857 | 0.1686 | -1.119 |
| 5 | N | 0 | 15.0 | 0.2476 | 0.3437 | -1.634 |
| 6 | = | 0 | 24.0 | -0.667 | 0.2113 | 0.4195 |
| 7 | # | 0 | 4.0 | 0.0258 | -0.009 | -0.726 |
| 13 | C=CC=CC | 0 | 2.0 | 0.0718 | 0.0346 | -0.714 |
| 17 | C1=CC=CC=C1 | 0 | 7.0 | -0.032 | -0.150 | 0.9644 |
| 23 | =O | 0 | 9.0 | -0.229 | 0.2543 | -0.146 |
| 24 | N=O | 0 | 3.0 | 0.0362 | 0.0175 | -0.721 |
| 25 | F | 0 | 23.0 | -0.042 | 0.1938 | -0.640 |
| 27 | Cl | 0 | 12.0 | -0.0763 | 0.0463 | -0.097 |
| 28 | C=C | 0 | 14.0 | -0.072 | -0.232 | 1.9756 |
| 33 | C(=O) | 0 | 8.0 | -0.164 | 0.1045 | 0.0761 |
| 34 | C(=O)O | 0 | 6.0 | 0.2335 | 0.1057 | -1.007 |
| 36 | C(=O)N | 0 | 6.0 | 0.5918 | 0.5421 | -4.036 |
| 39 | N=C=O | 0 | 3.0 | 0.2348 | 0.1026 | -0.685 |
| 40 | N=N | 0 | 3.0 | -0.100 | 0.3247 | -0.699 |
| 45 | Br | 0 | 15.0 | -0.454 | -0.072 | 1.9949 |
| 46 | (=O) | 0 | 8.0 | -0.114 | -0.492 | 2.9639 |
| 47 | (=C | 0 | 8.0 | -0.135 | 0.0481 | 0.6290 |
| 50 | [O-] | 0 | 6.0 | -1.842 | -1.005 | 10.995 |
| 63 | SS | 0 | 4.0 | -0.098 | 0.1653 | -0.7156 |
| 72 | C(=S | 0 | 2.0 | 0.0593 | 0.0286 | -0.717 |
| 74 | No. of chains ending with O | 0 | 15.0 | 0.2759 | 0.3586 | -2.622 |
| 82 | C1=CN=CN=C1 | 0 | 4.0 | 0.0798 | 0.0993 | -0.704 |


Table 4: Regression Coefficients for the HOMO-LUMO Gap. Coefficients β are used with Eq. 12 for HOMO-LUMO Gap Prediction.

HOMO-LUMO Gap Bias: 1.7381

| Sr. No. | Feature / SMILES substring | Min | Max | β_{L,i} | β_{τ,i} | β_{σ,i} |
|---|---|---|---|---|---|---|
| 2 | LUMO | -3.911 | 0.702 | -0.140 | -0.007 | 0.0438 |
| 3 | C | 0 | 46.0 | 0.1569 | 0.0154 | -0.681 |
| 5 | N | 0 | 15.0 | 0.3757 | 0.2829 | -2.363 |
| 6 | = | 0 | 24.0 | 0.2724 | 0.3040 | -2.034 |
| 7 | # | 0 | 4.0 | -0.183 | -0.050 | 0.8398 |
| 9 | ( | 0 | 23.0 | -0.193 | -0.174 | 1.4969 |
| 12 | C=CC | 0 | 6.0 | 0.0332 | 0.0348 | -0.259 |
| 17 | C1=CC=CC=C1 | 0 | 7.0 | -0.117 | -0.141 | 0.8746 |
| 23 | =O | 0 | 9.0 | 0.0328 | 0.1535 | -0.875 |
| 25 | F | 0 | 23.0 | -0.189 | -0.048 | 0.8195 |
| 27 | Cl | 0 | 12.0 | -0.149 | -0.120 | 0.8710 |
| 28 | C=C | 0 | 14.0 | 0.1978 | 0.0437 | -1.001 |
| 29 | C#C | 0 | 4.0 | -0.039 | -0.139 | 0.8441 |
| 41 | CS | 0 | 8.0 | 0.0231 | -0.053 | 0.1644 |
| 47 | (=C | 0 | 8.0 | 0.0806 | -0.0001 | -0.391 |
| 48 | [N+] | 0 | 6.0 | -0.444 | -0.599 | 4.1776 |
| 50 | [O-] | 0 | 6.0 | 0.3143 | 0.4623 | -3.385 |
| 53 | CN(C | 0 | 5.0 | 0.1140 | 0.0269 | -0.613 |
| 54 | C(=N | 0 | 6.0 | 0.2579 | 0.4634 | -2.715 |
| 59 | N=[N+]=[N-] | 0 | 3.0 | -0.148 | -0.177 | 0.8272 |
| 75 | No. of chains ending with C | 0 | 20.0 | 0.0834 | 0.0321 | -0.575 |
| 76 | No. of chains ending with N | 0 | 6.0 | 0.0825 | 0.0110 | -0.424 |
| 77 | C1=CC=CN=C1 | 0 | 4.0 | -0.131 | -0.141 | 0.8329 |
| 81 | Length of longest ring | 0 | 91.0 | 0.1454 | 0.0073 | -0.615 |
| 82 | C1=CN=CN=C1 | 0 | 4.0 | -0.119 | -0.124 | 0.8367 |


Table 5: Regression Coefficients for the Oxidation Potential Prediction. Coefficients β are used with Eq. 12 for Oxidation Potential Prediction.

Oxidation Potential Bias: -2.443

| Sr. No. | Feature / SMILES substring | Min | Max | β_{L,i} | β_{τ,i} | β_{σ,i} |
|---|---|---|---|---|---|---|
| 1 | HOMO | -8.495 | -4.625 | -0.661 | -0.413 | 3.5691 |
| 2 | LUMO | -3.911 | 0.702 | -0.199 | -0.132 | 1.2517 |
| 3 | C | 0 | 46 | 0.1061 | 0.2129 | -1.228 |
| 4 | O | 0 | 21 | 0.0755 | 0.1375 | -0.854 |
| 5 | N | 0 | 15 | -0.499 | -0.422 | 3.535 |
| 15 | CN | 0 | 8 | -0.238 | -0.203 | 1.6482 |
| 18 | C1CCCC1 | 0 | 4 | 0.1401 | 0.1465 | -1.184 |
| 23 | =O | 0 | 9 | -0.061 | -0.199 | 1.003 |
| 25 | F | 0 | 23 | -0.069 | -0.180 | 1.0141 |
| 27 | Cl | 0 | 12 | 0.0584 | 0.0280 | -0.277 |
| 28 | C=C | 0 | 12 | -0.234 | -0.157 | 1.5064 |
| 37 | C#N | 0 | 4 | 0.2184 | 0.1199 | -1.178 |
| 48 | [N+] | 0 | 6 | 0.3800 | 0.2975 | -2.408 |
| 50 | [O-] | 0 | 6 | -0.429 | -0.396 | 3.0935 |
| 53 | CN(C | 0 | 5 | -0.114 | -0.165 | 0.8356 |
| 54 | C(=N | 0 | 6 | 0.1344 | 0.2369 | -1.171 |
| 63 | SS | 0 | 4 | 0.0596 | 0.2061 | -1.184 |
| 76 | No. of chains ending with N | 0 | 6 | -0.078 | -0.045 | 0.3761 |
| 77 | C1=CC=CN=C1 | 0 | 4 | 0.1666 | 0.1879 | -1.174 |
| 82 | C1=CN=CN=C1 | 0 | 4 | 0.1708 | 0.1348 | -1.181 |

Table 6: Regression Coefficients for the Reduction Potential Prediction. Coefficients β are used with Eq. 12 for Reduction Potential Prediction.

Reduction Potential Bias: 0.4974

| Sr. No. | Feature / SMILES substring | Min | Max | β_{L,i} | β_{τ,i} | β_{σ,i} |
|---|---|---|---|---|---|---|
| 1 | HOMO | -8.495 | -4.625 | -0.328 | -0.194 | 1.9638 |
| 2 | LUMO | -3.911 | 0.702 | -0.324 | 0.0879 | 0.6196 |
| 3 | C | 0 | 46 | -0.571 | -0.328 | 3.3755 |
| 4 | O | 0 | 21 | -0.171 | -0.314 | 1.8489 |
| 6 | = | 0 | 24 | 0.1781 | -0.079 | -0.212 |
| 7 | # | 0 | 4 | 0.0515 | -0.103 | 0.2399 |
| 8 | [ | 0 | 15 | -0.103 | -0.023 | 0.5916 |
| 17 | C1=CC=CC=C1 | 0 | 7 | -0.132 | -0.007 | 0.5998 |
| 20 | COO | 0 | 5 | -0.055 | 0.0147 | 0.2435 |
| 23 | =O | 0 | 9 | 0.2697 | 0.0630 | -1.492 |
| 24 | N=O | 0 | 3 | 0.0724 | -0.175 | 0.2354 |
| 25 | F | 0 | 23 | 0.1377 | -0.023 | -0.436 |
| 26 | S | 0 | 9 | 0.1987 | 0.0453 | -0.698 |
| 27 | Cl | 0 | 12 | -0.156 | -0.748 | 3.2355 |
| 34 | C(=O)O | 0 | 6 | 0.0691 | 0.1227 | -0.813 |
| 36 | C(=O)N | 0 | 6 | -0.104 | 0.0012 | 0.4746 |
| 40 | N=N | 0 | 3 | 0.0433 | -0.141 | 0.2360 |
| 45 | Br | 0 | 15 | 1.3404 | 0.6410 | -8.325 |
| 48 | [N+] | 0 | 6 | 0.4817 | 0.3038 | -3.200 |
| 70 | P(=O)(O) | 0 | 5 | -0.478 | -0.251 | 2.9196 |

Table 7: Error in Property Predictions.

| Metric | HOMO (eV), Method 1 | HOMO (eV), Method 2 | LUMO (eV), Method 1 | LUMO (eV), Method 2 | Oxidation Potential (V) | Reduction Potential (V) |
|---|---|---|---|---|---|---|
| MAE | 0.326 | 0.349 | 0.271 | 0.177 | 0.288 | 0.228 |
| RMSE | 0.413 | 0.452 | 0.344 | 0.216 | 0.371 | 0.284 |
| % of molecules with absolute error > 0.5 (e)V | 21.85 | 19.41 | 9.17 | 0.84 | 14.66 | 7.16 |
| % of molecules with absolute error < 0.1 (e)V | 19.21 | 20.54 | 29.86 | 31.86 | 21.76 | 25.69 |
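The four error measures reported in Table 7 (MAE, RMSE, and the percentages of molecules with absolute error above 0.5 or below 0.1 (e)V) can be computed as in the following minimal Python sketch. The function name `error_summary` and its argument names are hypothetical, and the inputs in the usage lines are made-up numbers, not data from the manuscript.

```python
import math


def error_summary(true_vals, pred_vals, hi=0.5, lo=0.1):
    """Return MAE, RMSE, and the % of samples with |error| > hi or < lo.

    Both sequences must have the same non-zero length. The thresholds
    default to the 0.5 and 0.1 (e)V cut-offs named in Table 7.
    """
    errs = [abs(p - t) for t, p in zip(true_vals, pred_vals)]
    n = len(errs)
    return {
        "mae": sum(errs) / n,
        "rmse": math.sqrt(sum(e * e for e in errs) / n),
        "pct_above_hi": 100.0 * sum(e > hi for e in errs) / n,
        "pct_below_lo": 100.0 * sum(e < lo for e in errs) / n,
    }


if __name__ == "__main__":
    # Illustrative values only.
    true_homo = [-6.0, -5.5, -7.2, -6.8]
    pred_homo = [-5.95, -5.3, -7.8, -5.8]
    print(error_summary(true_homo, pred_homo))
```

RMSE weights large deviations more heavily than MAE, so reporting both, together with the two threshold percentages, gives a rough picture of how the errors are distributed.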


Graphical TOC Entry
