
Article

MultiDK: A Multiple Descriptor Multiple Kernel Approach for Molecular Discovery and Its Application to the Discovery of Organic Flow Battery Electrolytes Sung-Jin Kim, Adrián Jinich, and Alán Aspuru-Guzik J. Chem. Inf. Model., Just Accepted Manuscript • Publication Date (Web): 22 Mar 2017 Downloaded from http://pubs.acs.org on March 22, 2017


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Chemical Information and Modeling

MultiDK: A Multiple Descriptor Multiple Kernel Approach for Molecular Discovery and Its Application to the Discovery of Organic Flow Battery Electrolytes

Sungjin Kim, Adrián Jinich, and Alán Aspuru-Guzik*

Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138

E-mail: [email protected]

Abstract

We propose a multiple descriptor multiple kernel (MultiDK) method for efficient molecular discovery using machine learning. We show that the MultiDK method improves both the speed and accuracy of molecular property prediction. We apply the method to the discovery of electrolyte molecules for aqueous redox flow batteries. Using multiple-type descriptors, as opposed to single-type descriptors, we obtain more relevant features for machine learning. Following the principle of the 'wisdom of the crowds', the combination of multiple-type descriptors significantly boosts prediction performance. Moreover, by employing multiple kernels (more than one kernel function for a set of input descriptors), MultiDK exploits nonlinear relations between molecular structure and properties better than a linear regression approach. The multiple kernels consist of a Tanimoto similarity kernel for a set of binary descriptors and a linear kernel for a set of non-binary descriptors. Using MultiDK, we achieve an average performance of $r^2$ = 0.92 with a test set of molecules for solubility prediction. We also extend MultiDK to predict pH-dependent solubility, and apply it to a set of quinone molecules with different ionizable functional groups to assess their performance as flow battery electrolytes.

Introduction

Aqueous organic flow batteries are emerging as a low-cost alternative for storing renewable energy. 1–5 For example, Huskinson et al., Yang et al., and Liu et al. experimentally showed that high-capacity energy storage can be achieved using earth-abundant organic electrolytes such as quinone molecules. 6,7 Given the vast molecular space covered by all possible quinone molecules, high-throughput computational screening 8–20 has emerged as an important strategy for finding electrolytes that satisfy the stringent requirements of aqueous flow batteries. In particular, the flow battery system requires a redox potential greater than 0.9 V for a catholyte and less than 0.2 V for an anolyte, as well as a solubility greater than one molar for both electrolytes. 1 Moreover, quinone electrolytes operating in acidic (pH = 0) environments were demonstrated by Huskinson et al. 1 and Yang et al., 2 and in alkaline (pH = 14) environments by Lin et al., 3 highlighting the need to predict performance at different pH conditions.

Recent high-throughput computational screening of benzo-, naphtho-, anthra-, and thiophenoquinone libraries 21,22 demonstrated that the reduction potential of these redox couples can be predicted accurately using quantum chemistry methods coupled to linear regression against experimental data sets. Using the free energy of solvation as a proxy descriptor, the molecular solubility of electrolytes was also predicted in both references. Here, we build upon this work by developing a machine learning strategy that results in stronger correlations with experimental solubility data.

The computational prediction of molecular solubility has been a research topic for decades, with most research driven by the field of drug discovery. 6,23,24 However, predicting the solubility of organic electrolytes is particularly challenging, given the stringent target solubilities and the extreme pH values of flow battery electrolyte solutions. 25 While the target solubility of drug molecules is generally less than 0.1 molar, the target for flow battery organic electrolytes can be more than 1 molar. Moreover, molecular libraries for screening potential flow battery electrolytes include extremely acidic 25 or basic organic molecules, 3 while the majority of drug candidates are relatively weak acids and bases. 18,23,26,27

Both machine learning and quantum chemical approaches can be used to estimate molecular solubility. Whereas machine learning approaches predict solubility by training on experimental data, 28–30 quantum chemistry aims to predict solubility from first principles. 21,31–33 Although quantum chemical approaches are preferable for obtaining a mechanistic understanding of underlying principles, 24,31 our focus here is on machine learning approaches, which facilitate high-throughput molecular discovery at low computational cost. 28,34,35

Machine learning approaches can be categorized into three types of methods according to the types of descriptors used: property-based methods, structure-based methods, and functional group-based methods. Property-based methods predict physicochemical values based on molecular properties that can be measured experimentally or obtained from computational approaches. One such property used for solubility estimation is the partition coefficient, the logarithm of which is denoted as logP. 36–39 Several methods have been proposed to calculate logP. 40–43 The general solubility estimation method (GSE), with its extended and modified variants, is an example of a property-based method that estimates logS from logP. 36–39,44 On the other hand, structure-based methods estimate solubility as a function of molecular structure. Structure is usually represented by a binary fingerprint, which captures molecular topology, connectivity, or fragment information. 45,46 Finally, group-based methods partition molecules into functional groups, and the contribution of each group to the value of a physicochemical property is estimated by calibration with available experimental data. 47–49

Property-based methods generally involve fewer regression parameters than the other two approaches, but require additional computation to estimate the intermediate properties included in the descriptor set. If large experimental datasets are available for intermediate properties such as logP, property-based methods can predict solubility for a wider range of molecules than either of the other methods. 50,51 However, a significant gap between logP-based estimation and experimental solubility remains. 38 Considerable effort has been devoted to reducing this gap by adding more input information to the set of descriptors, with a concomitant increase in the complexity of the regressions employed. 23 Two examples of property-based methods, the GSE approach and Delaney's extended GSE (EGSE) approach, rely on two and three fitted parameters, respectively. Delaney shows that GSE and EGSE achieved $r^2$(GSE) = 0.67 and $r^2$(EGSE) = 0.69 for a dataset of 1305 compounds compiled by the authors, 38 which highlights the gap between prediction and experiment for such methodologies.

Structure-based methods predict solubility directly from molecular structural information, which can be encoded by various types of descriptors. 46,52–54 Generally, binary fingerprints offer a good trade-off between simplicity and predictive power. 45,49,55 We recently developed neural fingerprints, which are structure-based and application-specific, with input descriptors generated from a molecular graph of arbitrary size and shape. 54 Zhou et al. predicted molecular solubility using a binary circular fingerprint descriptor. 45 Although they demonstrated a prediction accuracy of $r^2$ = 0.83, the authors had to carefully select the training data set in order to achieve that value. Huuskonen showed that a prediction accuracy of $r^2$ = 0.92 can be achieved by using non-binary descriptors consisting of 53 parameters, including 39 atom-type electro-topological state (E-state) indices. 25 However, non-binary descriptors significantly increase computational cost in both the training and validation stages, especially when feature selection is performed during the regression process. 56,57 A different binary fingerprint approach has been investigated by Lind and Maltseva, in which support vector regression (SVR) employing the Tanimoto similarity kernel is applied in order to overcome the limitations of the multiple linear regression method. 55 The binary approach developed by Lind and Maltseva achieved $r^2$ = 0.88.

The group-based methods sum the contributions of all associated functional groups, each multiplied by the number of times that functional group occurs in a compound: $C_0 + \sum_{i=1}^{N} C_i G_i$, where $G_i$ is the number of times the ith group appears in the compound, $C_0$ is a constant bias parameter, and $C_i$ is the contribution of the ith group. 47 Hou et al. proposed an atom contribution method, which overcomes the 'missing fragment' problem in pure group contribution methods. 58 The atom contribution method categorizes atoms together with their surrounding molecular environment. Cheng et al. used functional key descriptors such as MACCS Keys and PC881 instead of directly counting the number of instances of each functional group. This approach simplifies descriptors by assigning them binary values but still requires large training data sets and can neglect certain molecular fragments. Moreover, Cheng et al. apply these descriptors to a solubility classification task with a much lower solubility requirement, 10 µg/mL, than the threshold values necessary for aqueous flow battery applications.

The ability to carry out solubility predictions that account for pH dependence is critical to discovering molecules for aqueous flow batteries. In addition to requiring very high solubility, the pH at which an organic flow battery system is designed to operate varies depending on the required redox potential values and other experimental considerations. For instance, negative electrolytes of 9,10-anthraquinone-2,7-disulphonic acid (AQDS) 1 and 2,6-dihydroxyanthraquinone (DHAQ) 3 require 1 molar solubility at pH 0 and pH 14, respectively. While prediction methods for intrinsic solubility have been widely discussed, methods to predict pH-dependent solubility have remained less explored. 24,26,27,59–61 In theory, the Henderson-Hasselbalch relationship can be used to predict pH-dependent solubility based on the intrinsic solubility of a molecule. 60 However, the limitations of current pKa prediction accuracies, as well as the salt plateau phenomena of ionic solubility, motivate the use of data-driven approaches. This requires significantly more experimental training data (i.e., solubility as a function of pH) than intrinsic solubility prediction. 27,61 Moreover, the intrinsic solubility of extremely strong acids with a negative pKa value has not been well investigated in the literature.

In high-throughput molecular screening, the development of an accurate and cost-effective property estimation method is a key factor in successfully finding new candidate molecules. 54,62,63 In this work, we develop a fast and accurate property estimation method for high-throughput molecular discovery. We name the proposed approach a multiple descriptor multiple kernel (MultiDK) method. The method relies on combining an ensemble of different descriptors, including fingerprints, functional keys, and other molecular physicochemical properties. We also apply different kernels to different types of descriptors in order to capture nonlinear relations between fingerprints and properties. 55 The MultiDK approach supports intrinsic solubility estimation and, when combined with external tools to predict pH-dependent partition coefficients, can be used to predict pH-dependent solubility.

Methods

Datasets and tools

We evaluated the performance of MultiDK on four different datasets for intrinsic solubility, i.e., equilibrium solubility at pH conditions where the molecules are entirely neutral. The complete datasets are provided as supporting information in the four respective papers. The four datasets are: (i) 496 molecules used by Willighagen et al.; 64 (ii) 1140 molecules used by Delaney; 38 (iii) 1676 molecules used by Wang et al.; 65 and (iv) 3310 molecules collected from the ci800406y_si_002.xls, ci800406y_si_003.xls, ci800406y_si_004.xls, and ci800406y_si_005.xls data sheets used by Wang et al. 39 We generated molecular descriptors from canonical simplified molecular-input line-entry system (SMILES) strings, which were obtained from the SYBYL line notation (SLN), international chemical identifier (InChI), or SMILES strings in the original datasets using the RDKit package. 66 In addition, duplicated molecules with the same experimental solubility value were dropped from each dataset so that cross-validation performance is assessed fairly. The cross-validation was performed using the K-fold approach with K = 20.
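The data preparation described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the code used in this work: it assumes the SMILES strings have already been canonicalized (RDKit is not called here), it keeps the first copy of each duplicated (SMILES, logS) pair, and the function names `drop_duplicates` and `k_fold_indices`, the seed, and the example records are all hypothetical.

```python
import random

def drop_duplicates(records):
    """Keep the first occurrence of each (canonical SMILES, logS) pair."""
    seen = set()
    unique = []
    for smiles, log_s in records:
        if (smiles, log_s) not in seen:
            seen.add((smiles, log_s))
            unique.append((smiles, log_s))
    return unique

def k_fold_indices(n_samples, k=20, seed=0):
    """Return k (train, test) index splits after one shuffle of the data."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# Made-up records: (canonical SMILES, logS). The values are placeholders.
records = [("CCO", 0.5), ("CCO", 0.5), ("c1ccccc1", -1.6), ("CCO", 0.7)]
unique = drop_duplicates(records)            # the exact repeat of ("CCO", 0.5) is dropped
splits = k_fold_indices(n_samples=40, k=20)  # every sample is tested exactly once
```

Note that the record ("CCO", 0.7) is retained, because only entries that agree in both structure and solubility value are treated as duplicates in this sketch.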


Specifically, it uses a nonlinear binary kernel for binary descriptors and a linear kernel for non-binary descriptors. To optimize kernel functions, 71–73 multiple combinatorial kernels have been used in various applications including biomedical data and YouTube video data. 74–76 Here, we use a multiple kernel approach to apply appropriate kernels to different features instead of training the kernel. The binary kernel function $k_B(\cdot)$ exploits a nonlinear relationship between molecular structure and properties. The nonlinear relationships arise primarily because each bit indicates the presence or absence of a pattern rather than a quantitative value. MultiDK uses all training molecules as support vector molecules for kernel processing, similar to support vector machines. We use the Tanimoto similarity kernel, which has been used in a wide range of machine learning applications, such as exploiting binary feature information to recognize white images on a black background 77 as well as serving as a kernel for support vector and Gaussian process regression in molecular property prediction. 8,55

In the MultiDK approach, ensemble learning is employed based on multiple combinational descriptors according to the principle of the 'wisdom of the crowds'. 78 The set of descriptors in MultiDK includes the Morgan circular fingerprints, 53 MACCS Keys, 46 and three non-binary molecular properties. The three types of descriptors represent structural information (atom, path), key patterns (fragments, functional groups), and associated molecular properties. We find that this ensemble combination is effective for predicting molecular properties because both atom and subgroup representations are employed in the set of descriptors together with the related molecular properties. Moreover, we use different kernels for binary and non-binary descriptors. In particular, a binary similarity kernel is applied to the binary descriptor and a linear kernel to the non-binary descriptor.
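To make the binary kernel concrete, the following sketch computes the Tanimoto similarity between one fingerprint and a set of training fingerprints, using the standard definition (bits set in both vectors divided by bits set in either). The toy 4-bit vectors, the helper name `kernel_row`, and the convention for two all-zero vectors are assumptions made for illustration only.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two equal-length 0/1 bit vectors."""
    both = sum(x & y for x, y in zip(a, b))    # bits set in both vectors
    either = sum(x | y for x, y in zip(a, b))  # bits set in at least one vector
    # Two all-zero vectors are treated as identical (a convention assumed here).
    return both / either if either else 1.0

def kernel_row(x, training_set):
    """Similarity of one molecule to every training molecule."""
    return [tanimoto(x, t) for t in training_set]

# Toy 4-bit fingerprints (hypothetical; real fingerprints have thousands of bits).
train_bits = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
query = [1, 1, 1, 0]
features = kernel_row(query, train_bits)  # one kernel output per training molecule
```

Because every training molecule acts as a reference point, the kernel output for a molecule has as many entries as there are training molecules, regardless of the fingerprint length.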
We evaluate the methods with training and cross-validation phases. In the training phase, we optimize the regression parameters using Ridge regularization, which is equivalent to minimizing

$$J = \sum_{i=1}^{L} |y_o(i) - y(i)|^2 + \alpha \sum_{j=1}^{M} w_j^2 \qquad (3)$$

where $y_o(i)$ and $y(i)$ are the ith experimental and predicted values, respectively, $w_j$ is the jth weight belonging to either the binary or the non-binary part, $M$ is the total number of deployed descriptors, and $\alpha$ is the Ridge hyperparameter. 79 The descriptor consists of 4096 binary bits of the Morgan circular fingerprint with radius 6, 117 binary bits of the MACCS Keys, and a few non-binary scalar descriptors. We generate all descriptors using the RDKit tool 66 except for the partition coefficient, which we obtain from cxcalc in the Chemaxon Marvin suite. 80 Before linear regression, we pre-process the 4213 binary bits with the binary similarity kernel by calculating the Tanimoto similarity between an input vector and the set of training vectors. We pass the non-binary descriptors directly to the linear regression stage without pre-processing. Then, the binary kernel output values and the direct non-binary values are entered into the Ridge linear regression stage. We employ the Ridge regression routine in the scikit-learn Python package. 81 The regularization process produces the optimal regression coefficients corresponding to the maximum $R^2$ performance. In the cross-validation phase, a combination vector of the binary kernel outputs and the direct descriptor of a test molecule is multiplied by the coefficients obtained in the training phase.
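The training step can be illustrated with a small closed-form Ridge solver. The sketch below is a dependency-free stand-in for the scikit-learn routine named above, not the routine itself: it assumes an unpenalized intercept obtained by centering the data, and the feature rows (three kernel outputs followed by one scaled non-binary descriptor) and target values are invented for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def ridge_fit(X, y, alpha):
    """Minimize sum_i (y_i - w.x_i - w0)^2 + alpha * sum_j w_j^2 (intercept unpenalized)."""
    n, m = len(X), len(X[0])
    mean_x = [sum(row[j] for row in X) / n for j in range(m)]
    mean_y = sum(y) / n
    Xc = [[row[j] - mean_x[j] for j in range(m)] for row in X]  # centered features
    yc = [v - mean_y for v in y]                                # centered targets
    A = [[sum(r[p] * r[q] for r in Xc) + (alpha if p == q else 0.0)
          for q in range(m)] for p in range(m)]                 # Xc^T Xc + alpha * I
    b = [sum(Xc[i][p] * yc[i] for i in range(n)) for p in range(m)]
    w = solve(A, b)
    w0 = mean_y - sum(wj * mj for wj, mj in zip(w, mean_x))
    return w, w0

def objective(X, y, w, w0, alpha):
    """Value of the penalized squared error for given coefficients."""
    residuals = [yi - (sum(wj * xj for wj, xj in zip(w, row)) + w0)
                 for row, yi in zip(X, y)]
    return sum(r * r for r in residuals) + alpha * sum(wj * wj for wj in w)

# Hypothetical rows: three Tanimoto kernel outputs, then one scaled non-binary descriptor.
X = [[1.00, 0.25, 0.10, 0.46],
     [0.25, 1.00, 0.40, 0.78],
     [0.10, 0.40, 1.00, 1.20]]
y = [-0.5, -1.2, -2.1]  # made-up logS values
w, w0 = ridge_fit(X, y, alpha=0.1)
```

In this layout the number of regression weights equals the number of training molecules plus the number of non-binary descriptors, so the penalty term is what keeps the problem well posed even when there are more weights than samples.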

MultiDK for estimating intrinsic solubility, logS

We predict solubility using MultiDK as follows:

$$\log S = \sum_{i=1,\ldots,L} w_i^{CK}\, k_B(x^{CK}, x_i^{CK}) + (w^{WSP} \cdot x^{WSP}) + w_0 \qquad (4)$$

where $k_B(\cdot)$ is a binary kernel function, $x^{CK}$ is a concatenated binary descriptor of the Morgan circular fingerprint ($x^C$) and the MACCS Keys ($x^K$), and $x^{WSP}$ is a concatenated non-binary descriptor of the molecular weight ($x^W$), Labute's approximate surface area 82 ($x^S$), and logP ($x^P$). Both $w_i^{CK}$ and $w^{WSP}$ are regression coefficients corresponding to $x^{CK}$ and $x^{WSP}$, respectively, and $w_0$ is a regression intercept. We generate all descriptors including $x^C$, $x^K$, $x^W$, $x^S$, and $x^P$ according to the SMILES string of each molecule.
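As a worked illustration of Eq. (4), the sketch below evaluates the kernel term, the linear term, and the intercept for a single molecule. Fingerprints are represented as sets of on-bit indices, which is a representation chosen here for brevity; the function name `predict_log_s`, the training fingerprints, and all coefficient values are hypothetical and do not come from a fitted model.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def predict_log_s(x_ck, x_wsp, train_ck, w_ck, w_wsp, w0):
    """Evaluate Eq. (4): kernel term + linear term + intercept."""
    kernel_term = sum(w * tanimoto(x_ck, t) for w, t in zip(w_ck, train_ck))
    linear_term = sum(w * x for w, x in zip(w_wsp, x_wsp))
    return kernel_term + linear_term + w0

# Hypothetical training fingerprints (on-bit indices) and coefficients.
train_ck = [{0, 1}, {1, 2}]
w_ck = [0.5, -1.0]          # one weight per training molecule
w_wsp = [-0.1, -0.2, -1.0]  # weights for (x_W, x_S, x_P); made-up values
w0 = 0.25
log_s = predict_log_s({0, 1, 2}, [1.0, 2.0, 0.5], train_ck, w_ck, w_wsp, w0)
```

With these inputs both similarities equal 2/3, so the kernel term is (0.5 - 1.0) × 2/3 = -1/3, the linear term is -1.0, and the prediction is -1/3 - 1.0 + 0.25 = -13/12.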


MultiDK for estimating pH-dependent solubility, logS(pH)

In order to predict pH-dependent solubility, we extend the MultiDK method as follows:

$$\log S(\mathrm{pH}) = \log S + \log P - \log D(\mathrm{pH}) \qquad (5)$$

where $\log P$ and $\log D(\mathrm{pH})$ are the n-octanol-to-water partition coefficient and the pH-dependent distribution coefficient, respectively. Since the two coefficients can be approximated as $\log P = \log S_{Oct} - \log S$ and $\log D(\mathrm{pH}) = \log S_{Oct} - \log S(\mathrm{pH})$, 36,37 where $\log S_{Oct}$ is the solubility in octanol, we are able to extend MultiDK as in (5). The octanol solubility is intrinsic and therefore determined regardless of the existence of ionizable groups. 83 We evaluate both $\log P$ and $\log D(\mathrm{pH})$ using the cxcalc plugin in the Chemaxon Marvin suite. 84 Alternatively, ACD (http://acdlabs.com/), an application package, and Chemspider (http://chemspider.com), an online website, can also be used to calculate both $\log P$ and $\log D(\mathrm{pH})$. If precisely estimated pKa values are available, the Henderson-Hasselbalch approach can yield pH-dependent solubility from the intrinsic solubility as well. 27,60,85
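Eq. (5) is simple enough to check numerically. In the sketch below, logD(pH) is not taken from cxcalc; as an assumption for illustration it is replaced by the textbook expression for a monoprotic acid in which only the neutral form partitions into octanol. With that substitution, Eq. (5) reduces to the Henderson-Hasselbalch form mentioned above. The function names and all input values are hypothetical, and the salt plateau is ignored.

```python
import math

def log_s_at_ph(log_s, log_p, log_d):
    """Eq. (5): pH-dependent solubility from intrinsic logS, logP and logD(pH)."""
    return log_s + log_p - log_d

def log_d_monoprotic_acid(log_p, pka, ph):
    """Textbook logD for a monoprotic acid, assuming only the neutral form enters octanol."""
    return log_p - math.log10(1.0 + 10.0 ** (ph - pka))

# Made-up inputs: intrinsic logS = -2.0, logP = 1.5, pKa = 4.0.
log_s, log_p, pka = -2.0, 1.5, 4.0
acidic = log_s_at_ph(log_s, log_p, log_d_monoprotic_acid(log_p, pka, ph=0.0))
neutral = log_s_at_ph(log_s, log_p, log_d_monoprotic_acid(log_p, pka, ph=7.0))
```

Far below the pKa the molecule is un-ionized and the result stays at the intrinsic value, whereas three pH units above the pKa the predicted solubility rises by about three log units.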

Results and Discussion

Cross-validation results

For cross-validation, we consider five approaches, which are summarized with their associated descriptors in Table 1, where x and y in MDxy and MultiDKxy denote the number of binary descriptors and the number of non-binary descriptors employed, respectively.

Table 1: Different approaches with their associated descriptors

Method      x^C  x^K  x^W  x^S  x^P
SD          O
MD21        O    O    O
MD23        O    O    O    O    O
MultiDK10   O
MultiDK21   O    O    O
MultiDK23   O    O    O    O    O

Performance of MultiDK for solubility prediction

We use the distribution of $r^2$ values in a 20-fold cross-validation as the metric of prediction performance. The $r^2$ distribution is obtained by repeating training and testing 20 times, until each of the 20 subsets of data has been used for validation. Figure 3 shows the $r^2$ distribution obtained with each of the methods tested as a function of the Ridge regression hyperparameter α. For this estimation, we used the 1676 unique molecules (SMILES strings with solubility values) used by Wang et al. 65 For efficient comparison, only one non-binary descriptor is considered in this evaluation. Both the MultiDK and the MD methods employ two binary descriptors and one non-binary descriptor: Morgan fingerprints (MFP), MACCS Keys (MACCS), and molecular weight (MolW). Figure 3 shows that MultiDK and MD significantly outperform SD. Moreover, MultiDK is the most robust to changes in the value of α. This result reveals that additional group and property information helps improve prediction accuracy.

In Figure 4, the performances of SD, MD, and MultiDK are compared when the optimal value of α is used, where the SD family includes MFP, MACCS, and MolW. This bar graph shows a clear difference between the SD family, MD, and MultiDK approaches. The best α values were found by grid search, selecting α on the basis of regression performance in the range of $10^{-3}$ to $10^{2}$ with 10 logarithmically equally spaced steps. At each step, regression was evaluated using a 20-fold cross-validation with initial data shuffling. SD (MFP), MD, and MultiDK achieve their best performance of E[$r^2$] ± std($r^2$) = 0.72 ± 0.04, 0.86 ± 0.04, and 0.89 ± 0.03 at α = 10.0, 31.6, and 0.03, respectively, where E[$r^2$] is the mean of the squared correlation coefficients and std($r^2$) is their standard deviation. This result highlights three important points. First, SD with MFP outperforms the other two SD approaches, SD using MACCS and SD using MolW. It


best α. We obtained the following cross-validation summary statistics: mean($r^2$) = 0.91, std($r^2$) = 0.027, root mean squared error (RMSE) = 0.61, mean absolute error (MAE) = 0.45, and median absolute error (MedAE) = 0.33.

We also compare the cross-validation performance of MultiDK with an alternative multiple kernel method, namely multi-kernel fusion. 86–89 The multi-kernel fusion method generates a new kernel by, for example, averaging multiple basic kernels. 89 We consider three basic kernels (linear, Gaussian, and Tanimoto similarity kernels) and generate three fused kernels from them: the linear and Gaussian kernel average, the Gaussian and (modified) Tanimoto similarity kernel average, and the linear, Gaussian, and (modified) Tanimoto similarity kernel average, denoted as Fused-LG, Fused-GT*, and Fused-LGT*, respectively. Note that, in order to use it in the fused method, we modify the Tanimoto similarity kernel as follows:

$$k'_{TM}(x, y) = k_{TM}(f_b(x), f_b(y)) \qquad (6)$$

where $k_{TM}(x, y)$ is the original Tanimoto similarity kernel, and $f_b(x)$ is a function returning only the binary part of an input descriptor. Otherwise, the Tanimoto similarity kernel cannot be used for kernel fusion with ensemble descriptors that combine binary and non-binary descriptors, e.g., the descriptors used in MultiDK23. The $r^2$ performance comparisons are shown in Figure 7, which reveals that MultiDK23 outperforms a general fused method (Fused-LG: $r^2$ = 0.87 ± 0.03) and modified fused methods (Fused-GT*: $r^2$ = 0.90 ± 0.02 and Fused-LGT*: $r^2$ = 0.88 ± 0.02).

Table 2: 20-fold cross-validation performances of the 1676 molecules

Method      Best α   E[r^2]   std(r^2)
SD          1E+1     0.72     0.06
MD21        3E+1     0.86     0.05
MD23        3E+1     0.88     0.03
MultiDK10   1E-3     0.80     0.04
MultiDK21   3E-2     0.89     0.04
MultiDK23   1E-1     0.91     0.03
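The selection of α by grid search over $10^{-3}$ to $10^{2}$ can be sketched as follows. One assumption is made explicit: "10 logarithmically equally spaced steps" is read here as 10 intervals, i.e., 11 grid points at half-decade spacing, because that reading places the values reported for this dataset (0.03, 10.0, and 31.6) on the grid. The per-fold scores in the example are invented, the standard deviation is the population form, and the function names are hypothetical.

```python
import math

def log_grid(lo_exp, hi_exp, n_points):
    """Logarithmically spaced values from 10**lo_exp to 10**hi_exp, inclusive."""
    step = (hi_exp - lo_exp) / (n_points - 1)
    return [10.0 ** (lo_exp + i * step) for i in range(n_points)]

def mean_std(values):
    """Mean and population standard deviation of per-fold scores."""
    mean = sum(values) / len(values)
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def best_alpha(scores_by_alpha):
    """Pick the alpha whose per-fold r^2 scores have the highest mean."""
    return max(scores_by_alpha, key=lambda a: mean_std(scores_by_alpha[a])[0])

alphas = log_grid(-3, 2, 11)  # 10 intervals, half-decade spacing (assumed reading)

# Made-up per-fold r^2 scores for three grid points, for illustration only.
scores = {
    alphas[3]: [0.88, 0.90, 0.92],
    alphas[8]: [0.70, 0.72, 0.74],
    alphas[9]: [0.60, 0.65, 0.70],
}
chosen = best_alpha(scores)
mean_r2, std_r2 = mean_std(scores[chosen])
```

In a full run, each grid point would be scored by the 20-fold cross-validation described earlier, and the mean and standard deviation of the 20 per-fold values would populate the E[r^2] and std(r^2) columns.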


Figure 8: Prediction performance of different methods on the dataset with 496 molecules.

Figure 9: Prediction performance of different methods on the dataset with 1140 molecules.


Figure 10: Prediction performance of different methods on the dataset with 3310 molecules.

Table 3: Performances of solubility prediction for different datasets

             496 molecules               1140 molecules              3310 molecules
Method       Best α   E[r^2]   std(r^2)  Best α   E[r^2]   std(r^2)  Best α   E[r^2]   std(r^2)
SD           1E+1     0.65     0.12      1E+1     0.71     0.09      3E+1     0.66     0.05
MD21         1E+1     0.84     0.07      1E+1     0.87     0.04      3E+1     0.79     0.06
MD23         1E+1     0.88     0.06      1E+1     0.89     0.03      3E+1     0.83     0.02
MultiDK10    3E-3     0.70     0.11      3E-3     0.79     0.05      3E-2     0.77     0.05
MultiDK21    7E-2     0.86     0.06      3E-2     0.90     0.04      1E-1     0.85     0.03
MultiDK23    7E-2     0.89     0.05      3E-2     0.92     0.02      1E-1     0.87     0.04


molecule types or the attached R-groups, all three methods predict the intrinsic solubility (logS) of the molecules to be below zero log-molar. Thus, the intrinsic (un-ionized) solubility of every molecule is lower than the target solubility of aqueous flow battery electrolytes.

Table 4: Intrinsic solubility of 27 quinone molecules predicted by three different methods (MultiDK, VCCLAB, and EGSE). The molecules are benzoquinone (BQ), naphthoquinone (NQ), and anthraquinone (AQ) derivatives covering the available unique positions of R-group attachment.

pH-dependent solubility for single R-group quinones

In Figures 14, 15, and 16, we show the pH-dependent solubility predicted by the extended MultiDK method for BQ, NQ, and AQ family molecules, respectively, and Figure 17 collects them as a heat map. We applied the extended method to the three


solubility, and applied it to various quinones with strongly acidic or alkaline functional groups at different pH values, where the quinones are candidate electrolytes for organic aqueous flow batteries.
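To illustrate why solubility depends on pH for molecules with ionizable groups, the sketch below evaluates the textbook Henderson-Hasselbalch solubility relationship for a monoprotic acid and a monoprotic base. It is not the extended MultiDK method itself. The function names and the numerical values are hypothetical, and the relationship ignores effects such as the solubility limit of the salt form.

```python
import math


def log_s_monoprotic_acid(log_s0, pka, ph):
    """logS(pH) = logS0 + log10(1 + 10**(pH - pKa)) for a monoprotic acid."""
    return log_s0 + math.log10(1.0 + 10.0 ** (ph - pka))


def log_s_monoprotic_base(log_s0, pka, ph):
    """logS(pH) = logS0 + log10(1 + 10**(pKa - pH)) for a monoprotic base.

    Here pka is the pKa of the conjugate acid of the base.
    """
    return log_s0 + math.log10(1.0 + 10.0 ** (pka - ph))


# Hypothetical values, for illustration only (not taken from the paper).
log_s0, pka = -3.0, 4.0
for ph in (1.0, 4.0, 7.0, 14.0):
    print(f"pH {ph:4.1f}: logS = {log_s_monoprotic_acid(log_s0, pka, ph):.2f}")
```

Under this relationship an acidic group leaves logS close to the intrinsic value well below its pKa and raises it by roughly one log unit per pH unit above the pKa. This shows how a molecule whose intrinsic solubility is below the target can still reach it at a suitable pH.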

Associated Content

Supporting Information: The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.xxx. It contains three items: a comparison of MultiDK with other machine learning methods such as support vector machines (SVM) and deep neural networks (DNN), an analysis of the binary kernel function used in MultiDK, and a comparison of the logS(pH) estimates of MultiDK and Cxcalc for seven experimental data points.

Acknowledgement

This work was funded by the U.S. DOE ARPA-E award DE-AR0000348. Computing time was provided by Harvard FAS Research Computing. We thank Roy G. Gordon and Michael J. Aziz for helpful discussions, and Changwon Suh and Rafael Gómez-Bombarelli for their support of this work.
