Machine Learning Derived Quantitative Structure Property Relationship

Jan 29, 2019 - Abstract. Prediction of drug solubility is a crucial problem in pharmaceutical industries for both drug delivery and discovery purposes...
1 downloads 0 Views 1001KB Size
Subscriber access provided by Mount Allison University | Libraries and Archives

Process Systems Engineering

Machine Learning Derived Quantitative Structure Property Relationship (QSPR) to predict drug solubility in binary solvent systems Sivadurgaprasad Chinta, and Raghunathan Rengaswamy Ind. Eng. Chem. Res., Just Accepted Manuscript • DOI: 10.1021/acs.iecr.8b04584 • Publication Date (Web): 29 Jan 2019 Downloaded from http://pubs.acs.org on January 31, 2019

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

Machine Learning Derived Quantitative Structure Property Relationship (QSPR) to predict drug solubility in binary solvent systems Sivadurgaprasad Chinta, Raghunathan Rengaswamy* Department of Chemical Engineering, Robert Bosch Center for Data Science and AI, Indian Institute of Technology Madras, Chennai, Tamil Nadu, India -600036

Abstract: Prediction of drug solubility is a crucial problem in pharmaceutical industries for both drug delivery and discovery purposes. Several theoretical approaches have been proposed to predict drug solubility in mixed solvent systems when the solubility values in pure solvents are known. Quantitative structure property relationship (QSPR) approaches are gaining attention to predict various physical properties due to their robustness and computational tractability. In this article, a machine learning based QSPR based approach is proposed to predict drug solubility in binary solvent systems using structural features, such as molar refractivity, McGowan volume, topological surface area etc. Genetic algorithm (GA) based feature selection procedure is used to check the dependency between the selected features and to obtain the final set of significant features. Initially, solubility is assumed to behave linearly with respect to the structural features and model parameters are estimated using ordinary least squares (OLS) and a weight-based optimization approach. Later, solubility is assumed to be piecewise linear with respect to structural features and multiple model (MM) parameters are identified using a machine learning approach, which is a prediction error based clustering approach. The efficacy of proposed approaches is demonstrated on drug solubility data collected from literature. To compare the efficiency of proposed MM approach, neural network (NN) based nonlinear model with different configurations using Levenberg-Marquardt training algorithm has tested. A novel testing strategy is also proposed to identify a suitable model for a test sample when model parameters are obtained using a prediction error based clustering approach. Key words: Drug solubility in mixed solvents, QSPR approach, multiple models *Corresponding author: Raghunathan Rengaswamy, Ph.no. +91-44-22574159 Email: [email protected]

1

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 28

Introduction: Drug solubility is a major concern in pharmaceutics, for both drug delivery and discovery. In drug delivery, it is important to achieve the desired concentration of drug in circulation for achieving a required pharmacological response. Although the drug can reach the receptors through aqueous media, aqueous soluble drugs are preferred for clinical purposes due to the concern of oral bio-availability. Solubility also plays a crucial role in discovery and development investigations. In the last decade, the number of drugs which fail in commercialization is increasing due to their low aqueous solubility values 1. Owing to this fact, several approaches have been proposed to increase the drug solubility such as usage of cosolvents 2, Deep Eutectic Solvents (DES) 3, solubilizing agents 4, pharmaceutical salts and co-crystals 5, and various other techniques

6,7.

Di et al.

8

and Williams et al.

9

highlighted the major challenges faced by low

solubility drugs such as vitro and vivo assessments during drug discovery and development phases and suggested some possible remedies. Jorgensen and Duffy 10 made an attempt to model the aqueous solubility of drugs as a linear function of features such as solvent-accessible surface area, number of hydrogen bonds etc., which are obtained using Monte Carlo simulations. In a follow-up paper

11,

they quoted three

different ideas to predict the aqueous solubility of drugs. First was a group contribution method, where linear regression is performed on available data to obtain contributions of each fragment. Second, a linear regression based QSPR approach where features are obtained from a chemical structure is proposed. Third, a neural network (NN) based approach which accounts for nonlinear behavior of features in QSPR approach is evaluated. Though NN based approach allows identification of non-linear behavior, the drawback is that the internal processing of the NN is not lucid. Ran and Yalkowsky

12

verified the effectiveness of general solubility equation (GSE)

in predicting aqueous solubility, which merely requires information of the melting point and octanol-water partition coefficient of a drug to predict its aqueous solubility. Delaney 13 reviewed several approaches for predicting aqueous solubility using structural information of the solute and highlighted the challenges in their applicability. Lusci et al.

14

demonstrated the benefits of

machine learning techniques such as deep learning networks over other state-of-the-art methods used to predict the aqueous solubility of drugs. 2

ACS Paragon Plus Environment

Page 3 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

Cosolvency is one of the most feasible solutions for increasing aqueous solubility of drugs 2. Cosolvency of non-aqueous solvents is also important during the drug development phase 15. The major advantage of mixed solvent systems is that the solvent-solvent interactions in some compositions may allow more solute to dissolve than the single solvent systems

16.

One of the

earliest models to predict drug solubility in water-cosolvent systems was proposed by Yalkowsky

17.

Drawing inspiration from the thermodynamic mixing model, Acree

18

derived a

mathematical model to predict solute solubility in binary solvent systems at a fixed temperature when the solubility values in pure solvents are available. Jouyban and Hanaee

19

provided a

better strategy for regressing the variables specified in the model proposed by Acree 18 with a no intercept linear model. This mixed solvent model is further extended by Jouyban and Acree 20 for varying temperatures. Chen and Song

21

proposed a Nonrandom Two-Liquid Segment Activity

Coefficient (NRTL-SAC) model to estimate drug solubility in pure and mixed solvents using the molecular descriptors such as hydrophobicity, polarity, and hydrophilicity in a thermodynamic framework. Mullins et al.

22

proposed COSMO-based (Conductor-like Screening Models)

thermodynamic models to predict solubility in both pure and mixed solvent systems. Sheikholeslamzadeh and Rohani

23

estimated the solubility of three different solutes in different

mixed solvents systems using both experimental and computational studies and concluded that NRTL-SAC model performs better than the other thermodynamic frameworks. Kokitkar and Plocharczyk

24

applied NRTL-SAC model to identify optimal solvents to support crystallization

process. Shu and Lin

25

improved the efficacy of COSMO-SAC (segment activity coefficient)

model by minimizing the error in the prediction of solute-solvent interactions using the solubility data in pure solvents. Valavi et al.

26

extended NRTL-SAC model by introducing temperature

dependent binary interaction parameters, which results in better prediction of solubility values than the original NRTL-SAC model. Jouyban et al.

27

proposed a cosolvency model by consolidating different cosolvency models

proposed as a power series of the volume fraction of the cosolvent but with different assumptions. Jouyban

15

compared several cosolvency models based on multiple prediction

accuracy criteria such as root mean square error (RMSE), mean percentage deviation (MPD) etc. and concluded that Jouyban-Acree model 20 to be the most suitable approach for pharmaceutical purposes. Jouyban et al.

28

made an attempt to generalize the Jouyban-Acree model with 3

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 28

Abraham’s solvent and solute parameters to predict drug solubility in some water-cosolvent binary systems. This solute generalization approach considers solvents to be fixed, i.e., one model for each binary solvent system regardless of the solute used. A detailed review of state-ofthe-art methods for estimating drug solubility using both experimental and computational studies have been consolidated by Jouyban and Fakhree 29. QSPR is a mathematical correlation identified between the physical response of a molecule and its structural information. Structural information is denoted in the form of descriptors/features which are numerical values associated with the chemical constitution of a molecule structure ranging from atom counts to topological surface area. QSPR approaches proved their ability in predicting various physical properties of a molecule from its structural information

30.

Identifying QSPR involves three major steps, i.e., data preparation, data

processing, and model interpretation. Data preparation involves the conversion of chemical structures into a suitable form to calculate structural feature values. Structural features can be obtained from derived mathematical models, experimental analysis and various platforms designed for this purpose such as PaDEL-Descriptor, DRAGON, OpenBabel

31

etc. Data

processing is used for the removal of intercorrelated features and identifying optimal feature set using any feature selection algorithm such as genetic algorithm, stepwise algorithm etc. Feature selection algorithms are intelligent ways of exploring various possible combinations of overall feature set to obtain the most suitable subset of features

32.

Identified feature subset is exposed

using any linear or non-linear modeling toolbox to obtain the correlation between features and response. Interpretation of model requires knowledge about the behavior of features and the response

31.

Roy et al.

33

provided an overview of QSPR/QSAR modeling by consolidating the

details of various structural descriptors along with the QSAR applications. Yousefinejad and Hemmateenejad

34

reviewed various chemo metric approaches used for features selection and

model development for QSPR studies. Selecting a suitable mathematical tool for identifying linear or non-linear behavior is not a trivial task. Multiple linear regression (MLR) and partial least squares are the most efficient methods for predicting linear behavior, whereas neural networks and support vectors are efficient for predicting non-linear behavior. The disadvantage of neural networks is that the model will 4

ACS Paragon Plus Environment

Page 5 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

not be available explicitly. Multiple model identification is one of the alternatives to identify non-linear behavior, using piecewise linear models 35. A detailed review of the state-of-the-art of multiple model framework can be obtained in the recent two-part review 36,37. Among several cosolvency models that have been proposed to predict drug solubility, Jouyban-Acree model 20 draws greater attention due to its efficacy in predicting drug solubility in numerous binary solvent systems at different temperatures. Although this model form is unique, model constants vary for each combination of solute and binary solvent. Correlating drug solubility to structural features of the solute and both solvents through Jouyban-Acree model can lead to a universal model to predict drug solubility in any binary solvent system at any temperature. In the current work, QSPR approach is used to correlate drug solubility in binary solvent systems to structural features such as molar refractivity, molecular weight, McGowan volume etc. of solute and both solvents using a modified version of Jouyban-Acree model. A brief review of various drug solubility prediction methods is provided in section 1, while data preparation and processing steps are discussed in section 2. For feature selection, genetic algorithm is used on selected features to obtain an independent feature set. In section 3, a linear dependency between drug solubility and identified features is anticipated and model coefficients are obtained using MLR and also a weight-based optimization. In section 4, a piecewise linear dependency of drug solubility on identified features is assumed and model coefficients are obtained using a modified prediction error based clustering approach. Finally, this article is concluded with comments on the efficiency of proposed models on drug solubility data collected from literature.

Data preparation and processing Experimental drug solubility data of 63 diverse binary systems with varying solute and solvents is consolidated from various resources. In twenty-five out of sixty-three systems, the solubility data is obtained from water-cosolvent systems for different solutes, whereas it is obtained from non-aqueous solvent systems for the remaining. The data contains twenty-seven different solutes, a majority of which are based on anthracene (twelve binary systems). For 5

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 28

twelve out of these sixty-three systems, solubility data is obtained at two different temperatures, whereas the data is obtained at a single temperature for rest of the systems. Temperatures in the data collected vary from 293K to 308K. Based on the above observations, we can confirm that the data collected is well distributed over different combinations of solute, solvents, and temperature thus is useful for attaining a unique global model to predict drug solubility. The experimental data collected consists of 766 data samples in which 150 samples belong to pure solubility estimations. Jouyban-Acree model is designed in such a way that pure solubility values are always predicted at their experimental values irrespective of the model parameters used. Hence pure solubility samples are not considered for the mixed solubility prediction case studies. The data is further screened such that no data sample considered in this case study has a solute mole fraction less than 0.0001, thus reducing data samples count to 585. Solubility data is provided along with references of the sources in supporting information. PaDEL-Descriptor 38 is an open source software to calculate descriptors of different categories ranging from constitutional descriptors to electrostatic descriptors. Structure files of all the 49 compounds involved in the data are generated in smiles (.smi) format using MarvinSkecth

39.

Structural files generated are processed using PaDEL-Descriptor to obtain the structural features. Molar refractivity (AMR), McGowan characteristic volume (McG_Vol), van der Waals volume (VABC), Molecular weight (MW), sum of atomic polarizabilities (Apol) and first ionization potentials (Si) of both solvents and solute along with topological polar surface area (TopoPSA), solvent accessible surface area (SolvAccSA), excessive molar refraction (MLFER_E), combined polarizability (MLFER_S), overall solute hydrogen bond acidity (MLFER_A) and basicity (MLFER_BH ) values of solute are selected to account for the solvent-solvent and solute-solvent interactions 29. All the 24 descriptors are of different magnitudes and hence descriptors are scaled using their individual mean and standard deviation values (mean centric scaling). The JouybanAcree model is modified by normalizing the temperature term with room temperature and forms the basis for further investigation.  2 i  T  ln X sT   f1 ln X 1T  f 2 ln X 2T   f1 f 2   Qi  f1  f 2      i 0   298 

(1)

6

ACS Paragon Plus Environment

Page 7 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

Where, X sT the solubility (in mole fraction) of the solute in the system at temperature T, f1 and f 2 are the mole factions of solvents in the solvent mixture, X 1T and X 2T are the solute solubility

values (in mole fractions) in pure solvents at temperature T in increasing fashion (i.e. the solvent which has low solute solubility considered as solvent 1), Q0 , Q1 and Q2 are the Jouyban-Acree model constants, which are dependent upon the solute and solvents involved in the system. In this work, these model constants are assumed to be linearly dependent on selected structural features. Assuming Qi as a linear function of selected features, it can be expressed as, N

Qi   ci , j Fj ;

i  0,1, 2;

(2)

j 1

Where, F j is the jth structural feature value and N is the number of structural features. This now makes it possible to generalize the model to be able to predict solubilities for novel systems, systems that are not used in obtaining the model. We now perform some algebraic manipulations to render the equations in a standard multivariate linear model form. Substituting Equation (2) in Equation (1):  2  N  i  T  ln X sT   f1 ln X 1T  f 2 ln X 2T   f1 f 2     ci , j Fj   f1  f 2     i  0 j 1   298     

(3)

Rearranging Equation (3), and expand the right-hand side terms as follows:

 ln X   f ln X T s

1

T 1

 f 2 ln X 2T 



N  N    c0, j Fj   c1, j Fj  f1  f 2   j 1 j 1  T   f1 f 2  N    298  2   c2, j Fj  f1  f 2    j 1  

(4)

Convert Equation (4) in the form of a multivariate linear model as follows:





N

N

N

j 1

j 1

j 1

Y  ln X sT   f1 ln X 1T  f 2 ln X 2T    c0, j j   c1, j  j   c2, j  j ;

(5)

 T  where  j  f1 f 2   Fj ;  j   j  f1  f 2  ;  j   j  f1  f 2   298 

Y is the difference between actual log solute solubility fraction in solvent mixture and sum of the

products of pure log solubility values and their respective mole fractions in the solvent mixture. 7

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 28

For the 585 data samples considered in this study, once values of Y , j ,  j , and  j are known any regression technique can be applied to obtain regression coefficients ( ci , j ).

Feature selection Feature selection is the process of identifying the best possible combination of features using a predefined metric. Data is divided into 5 equal size partitions such that all partitions contain a minimum of 10% data points from all 63 binary systems. Values of Y , j ,  j , and  j are calculated using scaled structural features and the solubility data based on Equation (5). Genetic algorithm (GA) is used to select the optimal feature set. Features are selected by combining Kfold (K as 5) validation with GA. Feature selection procedure is executed for five folds such that each time the data in four different partitions are used to obtain regression coefficients ci , j . Variables for GA are binary ( b j ) i.e. whether a particular feature is selected or not and the objective is weight based RMSE of the whole data. Since log solubility predictions are biased towards low solubility data samples and general solubility predictions favor high solubility samples, a weighted objective of both predictions is considered to be more effective. The weighted objective for optimization problem can be expressed as follows:

  orig pred 2   ND  X n  X n  min  10  bj  ND   n 1 

   

0.5



 N D ln X orig  ln X pred  n   n     n 1 ND  



2

    

0.5

    

(6)

Where N D is the number of data points and ln  X npred  can be estimated as follows:  2  N  i  T  ln X sT   f1 ln X 1T  f 2 ln X 2T   f1 f 2     ci , j b j Fj   f1  f 2     i  0 j 1   298     

(7)

Where b j is a binary decision variable representing whether the jth structural feature is selected or not. Regression coefficients ( ci , j ) are obtained using MLR on training data based on Equation (5) with the features for which b j is one. After obtaining optimal binary variables set for each 8

ACS Paragon Plus Environment

Page 9 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

fold using GA, the final set of features are selected with a 60% probability from overall results i.e. if a particular feature is active for at least 3 folds out of 5 folds then that feature is selected into final feature set. The feature selection approach explained above is employed on consolidated drug solubility data (585 data samples) using the inbuilt ga function in MATLAB. Average and best fitness values at each generation for all the folds can be seen in Figure 1. The details of optimal features obtained in each fold along with different prediction efficacy metrics are provided in Table 1. Table 1 Details of feature selection process using GA Fold

1 2 3 4 5

Features that Weighted are not active objective in optimal Optimal All solution features features 1.4963 1.4963 1, 6, 9, 12, 19, 1.5549 1.8546 22, 23, 24 22 1.5272 1.5281 6 1.5094 1.5096 1.4883 1.4883

MPD (solubility) Opt. All

R2 metric (solubility) Opt. All

R2 metric (log solubility) Opt. All

49.028

49.028

0.311

0.311

0.601

0.601

51.166

49.613

0.345

0.171

0.545

0.322

51.610 50.223 50.899

51.483 50.773 50.899

0.289 0.326 0.336

0.289 0.297 0.336

0.582 0.586 0.600

0.581 0.594 0.600

* Tuned parameters: ConstraintTolerance and FunctionTolerance set to 0 and also provided an initial guess that all variables are 1 (i.e. active). Remaining parameters set to MATLAB default.

In fold 1 and fold 5, optimal solutions denote that all features are essential for drug solubility predictions. Fold 2 has the least active number of variables whereas variable 22 and variable 6 are not active in case of fold 3 and fold 4 respectively. It is interesting to note that in case of fold 3 and fold 4 (bolded values) leaving out certain variables doesn’t contribute significantly to the optimal objective. It can be observed from Figure 1 that for fold 1 and fold 5 the best objective value did not change throughout the generations. Since all variables are active (the solution coding corresponding to all variables are 1) is supplied as initial guess at zeroth generation, this means that all variables are important for these two folds. Though GA terminated at different generations for different folds, the reason for termination in all folds is that the optimal solution did not change for 50 consecutive generations. R2 values of general and log solubility predictions represent that the optimal solutions are more efficient than models consisting of all features. On the other hand, MPD values for the models obtained by considering all features are better than 9

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 28

the MPD values obtained for optimal solutions. This is because the MPD metric is sensitive to predictions of data samples with low solubility values. From the results obtained, it is evident that all features can be selected for further investigation of solubility prediction using modified Jouyban-Acree model.

Figure 1 Generation-wise best and average fitness values for all the folds

Single model approximations In this section, a single linear model i.e. modified version of Jouyban-Acree model given in Equation (3) is investigated to predict the drug solubility in binary solvent systems through two approaches. The first approach is to obtain model coefficients using ordinary least squares (OLS) and the second approach is to use a weight-based optimization approach (WBO) to obtain model coefficients. For this case study, the data is divided into five equal size partitions such that all partitions contain a minimum of 10% data points from all 63 binary systems. The efficacy of single model assumptions is validated using K-fold validation, for a K value of 5. This validation procedure repeats for five (K) times, each time the data in four (K-1) different partitions are used to obtain model coefficients whereas the data in remaining partition is used to test the efficacy of obtained coefficients. Once this procedure is completed, the average of coefficients over K folds are obtained and reported as final model coefficients. The weight based optimization is carried out for the objective specified in Equation (8) using quasi-newton algorithm (fminunc solver in

10

ACS Paragon Plus Environment

Page 11 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

MATLAB) with an initial guess of zero for all variables. The objective for weighted optimization is as follows: 0.5    NTrD ln X orig  ln X pred orig pred 2  n   n    NTrD  X n  X n   min  10     n 1 cij   NTrD NTrD   n 1   





2

   

0.5

    

(8)

Where, ci , j are variables for optimization problem and NTrD is the number of data points in the training dataset. The logarithmic solubility of the solute is estimated using Equation (3). Various metrics for testing the efficacy of model coefficients obtained using both approaches are provided in Table 2. The drug solubility values obtained using the averaged model coefficients over all folds in both approaches are plotted along with log solubility predictions. Parity plots for both types of predictions can be seen in Figure 2. It can be observed from both plots that OLS approach is biased towards samples with low solubility value since residuals are obtained using log solubility values alone. R2 values for log solubility predictions are slightly better for OLS approach than the optimization based approach due to the same reason. However R2 values of general solubility predictions are significantly better for weighted approach than OLS approach. This is due to the fact that the objective is incorporated with residuals of both general and log solubility predictions. It is also interesting to note that for all the samples with solubility fractions more than 0.1, the weighted objective approach predictions are more accurate than the OLS approach i.e. close to the reference line in the parity plot. Although the weighted approach significantly improved the solubility predictions of most of the samples, the MPD values are still poorer than the MPD values for OLS approach. This can be attributed to the fact that MPD metric is sensitive to the predictions of very low solubility samples. From these observations, it can be concluded that weight-based optimization approach, in general, results in better predictions for drug solubility irrespective of magnitudes than the OLS approach. Fine tuning the weights can improve the performance of the obtained model. Since the objective is non-linear (not just quadratic), obtaining a better optimal solution can also improve the overall performance of the model. The details of the model coefficients obtained in both approaches are provided in the supplementary information. Nonetheless, while 11

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 28

the weighted optimization approach is better than OLS, the R2 values for the weighted optimization approach are not very good from a prediction viewpoint.

Figure 2 Parity plots for general and log solubility predictions using both single model approaches

Multiple model approximation The reasons for the poor prediction for linear models might be that a single model cannot capture the behavior for a wide variety of systems. There might be characteristics of systems that group them together, in which case it might not be possible to identify a single global model for these systems. To test this hypothesis and develop a model of higher fidelity, in this section, logarithmic solubility of solute is assumed to be piecewise linearly dependent on structural features i.e. the operating model will be different but linear in different regions of feature space. Identification of models of this form is popularly referred to as multiple model learning (MML). Consider the example depicted in Figure 3. From the figure it can be seen that the output can be characterized using linear relationships that are different in different regions of input space. Identifying a single linear model throughout the input space will result in poor approximation. MML has the possibility of improving the prediction accuracy if the approach automatically identifies the different linearly operating models (four in this case) and their operating regimes.

12

ACS Paragon Plus Environment

Page 13 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

Figure 3 Multiple linear models underlying in different input partitions of data A MML approach that we recently proposed based on prediction error based clustering 35,40 is explored in this work. To examine the robustness of piecewise linear models, a two-layer testing approach is used. The data is divided into five equal size partitions such that all partitions contain a minimum of 10% data points from all 63 binary systems. The data in four partitions are used to identify linear models using K-fold validation approach. The leftover data is set aside as the global test set and used in the second layer testing. In each fold, data in three (K-1) different partitions are used to obtain multiple linear models and data in the remaining partition (K-fold test set) is used to test the obtained models in the first layer testing. Unlike in a single model scenario, predicting the output of a new data point using multiple models is not trivial. K nearest neighbors (KNN) is the most frequently used testing approach to find which model should be used to predict the output of a new data point. Since a prediction error (PE) based clustering approach is used to identify multiple models, a new strategy is proposed to identify a suitable model to predict the output of a new data sample. The details of the clustering approach along with proposed testing strategy are provided below. Kuppuraj and Rengaswamy

35

proposed a prediction error based fuzzy clustering approach to

identify underlying multiple linear models in any given data. The advantage of prediction error based approaches over the traditional Euclidian distance based clustering approaches is that samples are grouped based on their response to output variables thus reducing misclassifications at boundaries. For the sake of brevity, the steps of the algorithm are provided below. More 13

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 28

insights into the PE based algorithm can be obtained from our previous work 35. The objective of PE based clustering algorithm for grouping M data samples into N models is as follows: 2 1 N M q   ij y j  Ci x j 2   2 i 1  j 1 

f 

(9)

Where y j , x j are the response (output) and features vector (input variables) of sample j respectively. Ci represents the model parameters vector of cluster i such that yˆ j  Ci x j . 1. Initialize N models with different parameter values. Each cluster represents a model with different parameter values. 2. Calculate prediction error of sample j with respective to cluster i as follows: PEij  y j  Ci x j

(10)

2

3. Compute membership of sample j to cluster i as follows: ij 

1  PEij   k 1  PEkj N

4.

  

2

q 1

i  1, 2,K N ; j  1, 2,K M ; N  clusters, M  samples

;

(11)

Update cluster centers (model parameters) as follows:

M  Cir 1  Cir   r   ijq  y j  Cir x j  xTj   j 1 

(12)

It is a line search optimization, where search direction can be estimated as follows: g r 

 f  M q    ij  y j  Cir x j  xTj  Ci  j 1 

(13)

The step length of search can be estimated as follows: M

r 

N

   g x   y j 1 i 1 M N

q ij

T

r

j

j

 Cir x j 

 ijq  g r x j   g r x j  T

(14)

j 1 i 1

5. Calculate new prediction errors 6. Calculate root mean square error (RMSE), where b is the best model fit for sample j

14

ACS Paragon Plus Environment

Page 15 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

RMSE 



M j 1

PEb2 j

(15)

M

7. Terminate based on a criterion (RMSE less than predefined limit or number of iterations exceeds the limit) and go to next step else go to step 3 8. Merge like models based on a cosine angle metric and obtain new model parameters using OLS. Finally, report final models and input partitions. The cosine angle metric between clusters i and k:

 Ci CkT  Ci Ck

ik  cos 1 

  

(16)

Testing strategy: Traditional clustering algorithms operate based on Euclidian distance; hence, KNN is a suitable approach to identify an appropriate model for a new data (test) sample. In KNN, first, the Euclidian distances from the test sample to all samples in the training data are evaluated. Then the nearest K samples to test sample and their corresponding models are identified. Now, the model containing the highest number of samples from the nearest K samples is considered to be a suitable model for the test sample. The motivation behind prediction error based clustering is to accurately classify data samples that are close in variable space but are characterized with different models. In such cases finding a suitable model for a test sample using traditional KNN may not be effective. In the proposed strategy we incorporate prediction error along with KNN. First, the Euclidian distance from the test sample to all samples in the training data are evaluated. Next, the K nearest samples for the test sample are identified. Then prediction errors for these K nearest samples with respective to each model are computed and averaged over the models. The model with the least averaged prediction error (for the K nearest samples) is considered the most suitable model for the test sample. The proposed testing strategy is illustrated with an example of single input single output system, which comprises two different linear models in the operating region. It can be observed from Figure 4 that there exist two models in the plotted data. All the blue colored data points belong to model 1, which is represented by blue line, whereas the red data points and red line represent model 2. If traditional KNN approach with a K value as 3 is used to obtain the suitable model for a new test sample with input value as 5.3 (represented in black diamond), model 2 (red 15

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 28

line) will be selected as the suitable model for the test sample since out of the three nearest samples (S1, S2 and S3), two belong to model 2. If proposed testing strategy is used, first the prediction errors for the three nearest samples using both models will be calculated. The average prediction error for both the models will be evaluated and the model with least average prediction error will be identified as the suitable model. From Figure 4. it can be observed that model 1 has the least mean prediction error hence it is the suitable model for the new test sample with input value 5.3.

Figure 4 Prediction error based Knn strategy to identify a suitable model for a test molecule Drug solubility estimation using multiple models: The prediction error based clustering algorithm works on the principle that piecewise linear models of the form yˆ  Ci x characterizes the data in different input regions. Hence, once the partitions are identified, final model coefficients are obtained using OLS. As discussed earlier, in case of single model approximations, the linear model obtained using OLS favor low solubility samples. Hence, a weighted objective was incorporated into PE based clustering to obtain models without any bias towards any particular range of solubility samples. In case of single model approximations, it is also observed that replacing 20% of data every time for a new fold results in a significant deviation in model parameters. In case of multiple models, these deviations can result in ambiguity when the final set of models is reported. To address these two issues, the PE based clustering approach is modified such that final models are obtained using a weight based optimization.

16

ACS Paragon Plus Environment

Page 17 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

Step 1 in PE algorithm is modified such that the models are initialized with the final model parameters obtained in the previous fold. To incorporate the weighted objective, final models in step 8 are obtained using optimization instead of OLS. This optimization is carried out using quasi-newton algorithm (fminunc solver in MATLAB) for the objective specified in Equation(8). The final parameters obtained in the previous fold are used as an initial guess in the current fold so that the models obtained in all the folds will be consistent. This clustering procedure is repeated until models over all folds converge within a predefined similarity metric. The similarity metric is explained in detail with a simple example. Let’s assume that there exists N piecewise linear models in data and obtain model parameters in all K folds. Now compute the cosine angle metric provided in Equation (16) for each model with respect to the other models in remaining folds. Now, for the model i in the kth fold find the minimum angle in between the N models associated with any other fold, repeat the same until minimum angles with respect to all the other folds for that particular model are obtained. Now, identify the maximum angle i ,k  among the K  1 minimum angles obtained for the model i of fold k. Repeat the above procedure for all the models in all the folds. The maximum angle among the angles i ,k  that are associated with all the models in all the folds is defined as the similarity metric   and denoted as follows:   max i ,k  ; where

  

i  N ; k  K



i ,k  max min  k ' ; j  N ; k '  K  k '  k ; i, j

(17) (18)

The flow chart of the modified PE based algorithm is provided in Figure 5. In case of fold 1 in phase 1, since no previous fold exists, models are initialized with random parameters, whereas in step 8 the optimization is initiated with all variables as zeros as the initial guess. The drug solubility prediction is carried out assuming that two piecewise linear models exist in the data and the fuzzifier ‘q’ value is set to 1.5. The maximum number of iterations for PE based clustering algorithm is set to 1000, whereas tolerance for similarity metric   is set to 10o. The models in each fold are tested on two test data sets i.e. K-fold test set and the global test set. The suitable model for a data sample in test set is identified using the proposed testing strategy with K value as five i.e. the model, which has the least averaged prediction error over the five nearest 17

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 28

samples to the test sample is considered as the best suitable model for that particular test sample. Similar models are identified in all the folds and are averaged to test with the global test set. The data separated out for K-fold validation is associated with the averaged models based on the prediction errors, so that any new sample can use these data samples as neighbors for selecting a suitable model. This final pair of models can be considered as global models for predicting drug solubility in binary solvent systems irrespective of the temperature and components involved in the system. The drug solubility profiles estimated using these model coefficients for two different systems at different temperatures are plotted in Figure 7. Solubility profiles estimated using the final models obtained from the single model approximations are also included in this figure to show the efficacy of multiple models explicitly. All models’ information over the folds and phases are provided in the supplementary information. The details of prediction accuracy using various metrics for the models obtained in the final phase are reported in Table 3. It can be observed from the table that the R2 values for both general and log solubility predictions in all the folds are significantly improved using multiple model approximations. It is interesting to note that R2 values for both general and log solubility predictions are similar for all data sets in all the folds thus showing the effectiveness of weighted objective approach, i.e., there is no bias towards any category of samples, unlike OLS approach. The two layer testing statistics proves the robustness of obtained models. The solubility prediction MPD values of K-fold data and global test data set using the averaged parameters (AP) models are significantly low underlining the fact that this final pair of models can be used for drug solubility predictions in any binary solvent systems. It is evident from all three efficacy metrics for K fold test data (K-test) in case of fold 3 (bolded values) multiple models underperform, still much better than any single model metric. This can be attributed to the few misclassifications of low solubility samples in the testing phase. Removing these outliers improve the performance of multiple models.

18

ACS Paragon Plus Environment

Page 19 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Industrial & Engineering Chemistry Research

Table 2 Various efficacy metrics of obtained models using both single model approaches Fold

1 2 3 4 5 AP

MPD (solubility) Train Test 48.800 49.937 48.518 53.991 50.210 56.573 49.685 55.125 50.659 51.860 49.3347

OLS model R2 metric (solubility) Train Test 0.346 0.129 0.199 0.083 0.288 0.286 0.291 0.309 0.350 0.229 0.3291

Optimization model R2 metric MPD (solubility) R2 metric (log s) (solubility) Train Test Train Test Train Test 0.615 0.538 50.721 52.115 0.590 0.372 0.651 -1.07 56.886 60.708 0.754 0.628 0.603 0.496 56.714 64.440 0.721 0.471 0.603 0.556 55.331 59.399 0.700 0.680 0.621 0.518 54.164 56.417 0.656 0.442 0.5963 54.3691 0.6472

R2 metric (log s) Train Test 0.584 0.522 0.575 -1.17 0.533 0.380 0.517 0.444 0.568 0.455 0.5384

Figure 5 Modified PE based clustering algorithm for drug solubility predictions 19

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 28

To benchmark the performance of the multiple model approach with other popular techniques, we trained a neural network (NN) to identify the non-linear behavior. The neural network was trained using a Levenberg-Marquardt backpropagation training algorithm for 3 different configurations i.e. the number of hidden layers used are 1, 2 and 5. The data was divided into five equal size partitions such that all partitions contain a minimum of 10% data points from all 63 binary systems. The data in four partitions are used to train the neural network, whereas the remaining data used for testing. The network is trained using 'trainlm' function available in MATLAB neural network tool box. The efficacy metrics of three different NN configurations are as follows – MPD values are 35.495, 36.296, and 44.001; R2 metric values for solubility are 0.597, 0.900, and 0.838; R2 metric values for log solubility are 0.888, 0.814, and 0.884 correspondingly for hidden layer size of 1, 2 and 5. The superior performance of the NN approach compared to single linear model approaches can be attributed to the ability of neural networks to identify the non-linear behavior. Though the performance of NN approach is better than single model approaches it still significantly underperforms when compared to multiple linear model approach. Table 3 Various efficacy metrics of multiple models obtained using the modified PE approach Fold 1 2 3 4 AP

MPD (solubility) Train K-test G-test 7.426 22.255 17.327 7.440 76.034 18.354 7.590 161.670 19.086 31.395 6.966 48.825 8.070 18.113

R2 metric (solubility) Train K-test G-test 0.996 0.985 0.969 0.995 0.960 0.978 0.998 0.905 0.987 0.958 0.997 0.953 0.9947 0.980

R2 metric (log s) Train K-test G-test 0.993 0.904 0.927 0.990 0.908 0.940 0.991 0.872 0.924 0.890 0.993 0.941 0.991 0.938

To test the efficacy of proposed approach, the MPD values obtained are compared to the MPD values reported in the existing approaches

17,41–43

in Table 4. The reported MPD values of

individual systems include pure solubility samples for a rational comparison with the literature models. In case of naproxen in ethanol - water at 298k, 3 out of the 11 samples have solubility values less than 0.0001 and hence these samples are excluded while reporting the MPD values using the proposed approach. In case of naproxen in ethanol - water at 303k and Propyl phydroxybenzoate in PG - water at 300k, a similar policy was used while reporting MPD values. 20

ACS Paragon Plus Environment

Page 21 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

Yalkowsky equation (model a) is a zero parameter model, which is a linear combination (in terms of compositions) of the pure solubility values of a drug in both solvents. Models b and c correlate solubility parameters to drug solubility, whereas models d and e (proposed multiple model) are two QSPR based approaches with different structural features. Systems with high MPD values (>1000 in case of Yalkowsky equation) represent either the nonlinear interaction of solutes or presence of low solubility values (0.0001 to 0.001) in the data of that binary system. It is interesting to note that the MPD values of sulfanilamide in ethanol and water corresponding to models b and c are poorer than the Yalkowsky equation. We can conclude from Table 4 that except for acetaminophen (paracetamol) in ethanol - water system, MPD values obtained using the proposed approach are significantly better than other models. The MPD values of all 63 individual systems (if the data for a system is obtained at two different temperatures, then the MPD value is calculated for all the data samples together) can be observed in Figure 6. MPD values obtained range from 0.72% to 31.03% with a standard deviation of 6.45%. The minimum MPD value corresponds to system 24 (i.e. Benzoic acid in CCl4 and nHeptane) whereas the maximum MPD value corresponds to system 61 (i.e. Sulfamethazine in water and ethanol). Only 4 out of 63 systems have MPD values greater than 20% whereas 51 systems have MPD values lesser than 10%. These observations demonstrates the ability of the proposed approach in obtaining a generalized model for various binary solvent systems.

Figure 6 MPD values of all 63 binary systems obtained using multiple models approach 21

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 28

Table 4 MPD metrics of various water + cosolvent systems at different temperatures using several approaches Solute

Cosolvent

T(K)

Nd

MPDa

MPDb

MPDc

MPDd

MPDe

Acetaminophen

PG

293

11

115.9

66.9

-

-

22.85

Acetaminophen

PG

303

11

98.37

56.3

-

5.4

2.80

Acetaminophen

Ethanol

298

13

44.78

7.2

18.73

26.8

16.90

Caffeine

Ethanol

298

11

60.43

41.6

-

21.1

8.22

Caffeine

Ethanol

308

11

64.27

50.7

-

-

6.06

Naproxen

Ethanol

298

11

1803.2

23.4

-

-

7.34(8)

Naproxen

Ethanol

303

11

1788.2

20.5

-

-

7.40(9)

Sulfanilamide

Ethanol

298

12

31.24

43.8

31.95

18.2

6.27

Caffeine

Dioxane

298

16

60.19

-

21.85

-

3.77

Methyl phydroxybenzoate

PG

300

11

920.94

-

18.16

14.4

14.07

Methyl paminobenzoate

PG

300

11

669.15

-

15.78

16.5

9.02

Ethyl phydroxybenzoate

PG

300

11

1991.0

-

18.71

9.4

9.17

Ethyl paminobenzoate

PG

300

11

1416.5

-

9.60

18.2

8.71

Propyl phydroxybenzoate

PG

300

11

4739.0

-

19.31

14.1

8.04(10)

Propyl paminobenzoate

PG

300

11

4021.8

-

9.62

22.6

4.89

(a) Yalkowsky equation 17, (b) Solubility prediction using partial solubility parameters 41, (c) solubility prediction using an artificial neural network 42, (d) Jouyban-Acree model – effect of solute structure 43, (e) Multiple models approach

22

ACS Paragon Plus Environment

Page 23 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

Figure 7 Solubility profiles of two distinct binary system at various temperatures The solubility profiles shown in Figure 7 belong to two different systems obtained at various temperatures. The experimental solubility curves in case of system 1 (Acetanilide – Dioxane – Water) are irregular owing to the noise in solubility estimations whereas the predicted solubility curves are smooth indicating the theoretical behavior of solvent interactions in the system. The solubility profiles in system 2 (Caffeine – Ethyl acetate – Ethanol) show a clear dependency of solubility on temperature. The proposed multiple model (MM) approach is able to capture the temperature dependency effectively. It is interesting to note that predictions using multiple models (MM) approach are very accurate, even though magnitudes of the experimental solubility fractions are of different orders in these two systems. It can be concluded from the solubility profiles that multiple models are far superior for prediction of drug solubilities over the models that are available in the literature. This can be seen in the tremendous improvement in the R2 values between these approaches (Tables 2 and 3). 23

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 28

Conclusion In this study, a QSPR based approach using multiple linear models is examined to predict drug solubility. Drug solubility is assumed to behave in piecewise linear fashion in different partitions of structural features. The temperature term in Jouyban-Acree model is normalized with room temperature to avoid the influence of temperature magnitude on model coefficients. For this QSPR approach, various structural features of the solute and both solvents are selected and processed through feature selection using GA. The log solubility values are initially assumed to be linearly dependent on structural features and model coefficients are obtained using OLS and weight based optimization approach. The weight based optimization approach is shown to be better than the OLS approach; however, both models were not of high enough fidelity. Later, log solubility values are assumed to be piecewise linearly dependent on structural features and model coefficients are identified using a modified PE based clustering algorithm. A new testing strategy is also proposed to identify a suitable model for test samples in case of PE based clustering approaches. The prediction efficacy of final pair of models is tested on a global test set. The MPD and R2 values demonstrate that the final set of models can be used for predicting the solubility of drugs irrespective of the solute and solvents involved in the system and temperature.

Details of supporting information: The supporting information provided contains multiple tables. Table S1 contains experimental solubility data collected (along with references) as well as the solubility values predicted using various approaches. Table S2 contains the descriptors’ values of each drug (solute) separately. Table S3 contains the descriptors’ values of all solvents separately. Table S4 contains the descriptors’ values of each system (i.e. descriptors combining solute and both solvents) and the MPD metrics obtained using Yalkowsky equation and proposed multiple models approach along with the mean and standard deviation of descriptors’ values used for scaling the descriptors individually. Table S5 contains model parameters obtained using single model approaches (i.e. 24

ACS Paragon Plus Environment

Page 25 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

OLS and weight-based optimization approaches). Table S6 contains model parameters obtained using the multiple model (piecewise linear) approach in each phase. The highlighted parameters in Table S6 corresponds to the final pair of models for which the prediction metrics are reported in Table 3. The Supporting Information is available free of charge on ACS Publications website at http://pubs.acs.org/. Acknowledgement: We would like to thank Robert Bosch Center for Data Science and Artificial Intelligence for providing computational facilities. References: (1)

Kawabata, Y.; Wada, K.; Nakatani, M.; Yamada, S.; Onoue, S. Formulation Design for Poorly Water-Soluble Drugs Based on Biopharmaceutics Classification System: Basic Approaches and Practical Applications. Int. J. Pharm. 2011, 420 (1), 1–10.

(2)

Nayak, A. K.; Panigrahi, P. P. Solubility Enhancement of Etoricoxib by Cosolvency Approach. ISRN Phys. Chem. 2012, 2012 (Article ID 820653), 5 pages.

(3)

Li, Z.; Lee, P. I. Investigation on Drug Solubility Enhancement Using Deep Eutectic Solvents and Their Derivatives. Int. J. Pharm. 2016, 505 (1), 283–288.

(4)

Loftsson, T. Drug Solubilization by Complexation. Int. J. Pharm. 2017, 531 (1), 276–280.

(5)

Elder, D. P.; Holm, R.; Diego, H. L. de. Use of Pharmaceutical Salts and Cocrystals to Address the Issue of Poor Solubility. Int. J. Pharm. 2013, 453 (1), 88–100.

(6)

Savjani, K. T.; Gajjar, A. K.; Savjani, J. K. Drug Solubility: Importance and Enhancement Techniques. ISRN Pharm. 2012, 2012, 195727.

(7)

Vemula, V. R.; Lagishetty, V.; Lingala, S. Solubility Enhancement Techniques. Int. J. Pharm. Sci. Rev. Res. 2010, 5 (1), 41–51.

(8)

Di, L.; Fish, P. V; Mano, T. Bridging Solubility between Drug Discovery and Development. Drug Discov. Today 2012, 17 (9), 486–495.

(9)

Williams, H. D.; Trevaskis, N. L.; Charman, S. A.; Shanker, R. M.; Charman, W. N.; Pouton, C. W.; Porter, C. J. H. Strategies to Address Low Drug Solubility in Discovery and Development. Pharmacol. Rev. 2013, 65 (1), 315 LP-499.

(10)

Jorgensen, W. L.; Duffy, E. M. Prediction of Drug Solubility from Monte Carlo Simulations. Bioorg. Med. Chem. Lett. 2000, 10 (11), 1155–1158.

(11)

Jorgensen, W. L.; Duffy, E. M. Prediction of Drug Solubility from Structure. Adv. Drug Deliv. Rev. 2002, 54 (3), 355–366.

25

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 28

(12)

Ran, Y.; Yalkowsky, S. H. Prediction of Drug Solubility by the General Solubility Equation (GSE). J. Chem. Inf. Comput. Sci. 2001, 41 (2), 354–357.

(13)

Delaney, J. S. Predicting Aqueous Solubility from Structure. Drug Discov. Today 2005, 10 (4), 289–295.

(14)

Lusci, A.; Pollastri, G.; Baldi, P. Deep Architectures and Deep Learning in Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules. J. Chem. Inf. Model. 2013, 53 (7), 1563–1575.

(15)

Jouyban, A. Review of the Cosolvency Models for Predicting Solubility of Drugs in WaterCosolvent Mixtures. J. Pharm. Pharm. Sci. 2008, 11 (1), 32–58.

(16)

Maitra, A.; Bagchi, S. Study of Solute–solvent and Solvent–solvent Interactions in Pure and Mixed Binary Solvents. J. Mol. Liq. 2008, 137 (1), 131–137.

(17)

Yalkowsky, S. H.; Roseman, T. J. Techniques of Solubilization of Drugs; M. Dekker, 1981.

(18)

Acree Jr, W. E. Mathematical Representation of Thermodynamic Properties: Part 2. Derivation of the Combined Nearly Ideal Binary Solvent (NIBS)/Redlich-Kister Mathematical Representation from a Two-Body and Three-Body Interactional Mixing Model. Thermochim. Acta 1992, 198 (1), 71–79.

(19)

Jouyban-Gharamaleki, A.; Hanaee, J. A Novel Method for Improvement of Predictability of the CNIBS/R-K Equation. Int. J. Pharm. 1997, 154 (2), 245–247.

(20)

Jouyban-Gharamaleki, A.; Acree Jr, W. E. Comparison of Models for Describing Multiple Peaks in Solubility Profiles. Int. J. Pharm. 1998, 167 (1), 177–182.

(21)

Chen, C.-C.; Song, Y. Solubility Modeling with a Nonrandom Two-Liquid Segment Activity Coefficient Model. Ind. Eng. Chem. Res. 2004, 43 (26), 8354–8362.

(22)

Mullins, E.; Liu, Y. A.; Ghaderi, A.; Fast, S. D. Sigma Profile Database for Predicting Solid Solubility in Pure and Mixed Solvent Mixtures for Organic Pharmacological Compounds with COSMO-Based Thermodynamic Methods. Ind. Eng. Chem. Res. 2008, 47 (5), 1707–1725.

(23)

Sheikholeslamzadeh, E.; Rohani, S. Solubility Prediction of Pharmaceutical and Chemical Compounds in Pure and Mixed Solvents Using Predictive Models. Ind. Eng. Chem. Res. 2012, 51 (1), 464–473.

(24)

Kokitkar, P. B.; Plocharczyk, E.; Chen, C.-C. Modeling Drug Molecule Solubility to Identify Optimal Solvent Systems for Crystallization. Org. Process Res. Dev. 2008, 12 (2), 249–256.

(25)

Shu, C.-C.; Lin, S.-T. Prediction of Drug Solubility in Mixed Solvent Systems Using the COSMOSAC Activity Coefficient Model. Ind. Eng. Chem. Res. 2011, 50 (1), 142–147.

(26)

Valavi, M.; Svärd, M.; Rasmuson, Å. C. Prediction of the Solubility of Medium-Sized Pharmaceutical Compounds Using a Temperature-Dependent NRTL-SAC Model. Ind. Eng. Chem. Res. 2016, 55 (42), 11150–11159.

(27)

Jouyban, A.; Chew, N. Y. K.; Chan, H.-K.; Sabour, M.; Acree Jr, W. E. A Unified Cosolvency Model for Calculating Solute Solubility in Mixed Solvents. Chem. Pharm. Bull. 2005, 53 (6), 634– 637.

(28)

Jouyban, A.; Soltanpour, S.; Soltani, S.; Tamizi, E.; Fakhree, M. A. A.; Acree, W. E. Prediction of

26

ACS Paragon Plus Environment

Page 27 of 28 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Industrial & Engineering Chemistry Research

Drug Solubility in Mixed Solvents Using Computed Abraham Parameters. J. Mol. Liq. 2009, 146 (3), 82–88. (29)

Jouyban, A.; Fakhree, M. A. A. Experimental and Computational Methods Pertaining to Drug Solubility; InTech: Rijeka, 2012; p Ch. 9.

(30)

Katritzky, A. R.; Kuanar, M.; Slavov, S.; Hall, C. D.; Karelson, M.; Kahn, I.; Dobchev, D. A. Quantitative Correlation of Physical and Chemical Properties with Chemical Structure: Utility for Prediction. Chem. Rev. 2010, 110 (10), 5714–5789.

(31)

Roy, K.; Kar, S.; Das, R. N. A Primer on QSAR/QSPR Modeling: Fundamental Concepts; Springer, 2015.

(32)

Goodarzi, M.; Dejaegher, B.; Heyden, Y. Vander. Feature Selection Methods in QSAR Studies. J. AOAC Int. 2012, 95 (3), 636–651.

(33)

Roy, K.; Kar, S.; Das, R. N. QSAR/QSPR Modeling: Introduction BT - A Primer on QSAR/QSPR Modeling: Fundamental Concepts; Roy, K., Kar, S., Das, R. N., Eds.; Springer International Publishing: Cham, 2015; pp 1–36.

(34)

Yousefinejad, S.; Hemmateenejad, B. Chemometrics Tools in QSAR/QSPR Studies: A Historical Perspective. Chemom. Intell. Lab. Syst. 2015, 149, 177–204.

(35)

Kuppuraj, V.; Rengaswamy, R. Evaluation of Prediction Error Based Fuzzy Model Clustering Approaches for Multiple Model Learning. Int. J. Adv. Eng. Sci. Appl. Math. 2012, 4 (1–2), 10–21.

(36)

Adeniran, A. A.; Ferik, S. El. Modeling and Identification of Nonlinear Systems: A Review of the Multimodel Approach;Part 1. IEEE Trans. Syst. Man, Cybern. Syst. 2017, 47 (7), 1149–1159.

(37)

Ferik, S. El; Adeniran, A. A. Modeling and Identification of Nonlinear Systems: A Review of the Multimodel Approach;Part 2. IEEE Trans. Syst. Man, Cybern. Syst. 2017, 47 (7), 1160–1168.

(38)

Yap, C. W. PaDEL‐descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32 (7), 1466–1474.

(39)

ChemAxon, L. Instant J Chem/MarvinSketch. 2012.

(40)

Chinta, S.; Abhishek, S.; Raghunathan, R. Prediction Error-Based Clustering Approach for Multiple-Model Learning Using Statistical Testing. Eng. Appl. Artif. Intell. 2019, 77, 125–135.

(41)

Jouyban, A.; Shayanfar, A.; Panahi-Azar, V.; Soleymani, J.; Yousefi, B.; Acree Jr., W.; York, P. Solubility Prediction of Drugs in Mixed Solvents Using Partial Solubility Parameters. J. Pharm. Sci. 2011, 100 (10), 4368–4382.

(42)

Jouyban, A.; Majidi, M.-R.; Jalilzadeh, H.; Asadpour-Zeynali, K. Modeling Drug Solubility in Water–cosolvent Mixtures Using an Artificial Neural Network. Farm. 2004, 59 (6), 505–512.

(43)

Jouyban, A.; Fakhree, M. A. A.; Ghafourian, T.; Saei, A. A.; Acree, W. E. Deviations of Drug Solubility in Water-Cosolvent Mixtures from the Jouyban-Acree Model–effect of Solute Structure. Die Pharm. Int. J. Pharm. Sci. 2008, 63 (2), 113–121.

27

ACS Paragon Plus Environment

Industrial & Engineering Chemistry Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Page 28 of 28

Table of contents (Graphic)

28

ACS Paragon Plus Environment