In Silico Calculation of Infinite Dilution Activity ... - ACS Publications

Jul 6, 2016 - Three new models are proposed, each of them based on distinct machine learning algorithm: stepwise multiple linear regression (SWMLR) ...
0 downloads 0 Views 3MB Size
Subscriber access provided by the University of Exeter

Article

In Silico Calculation of Infinite Dilution Activity Coefficients of Molecular Solutes in Ionic Liquids: Critical Review of Current Methods and New Models Based on Three Machine Learning Algorithms Kamil Paduszy#ski J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00166 • Publication Date (Web): 06 Jul 2016 Downloaded from http://pubs.acs.org on July 17, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

In Silico Calculation of Infinite Dilution Activity Coefficients of Molecular Solutes in Ionic Liquids: Critical Review of Current Methods and New Models Based on Three Machine Learning Algorithms Kamil Paduszyński∗ Department of Physical Chemistry, Faculty of Chemistry Warsaw University of Technology, Noakowskiego 3, 00-664 Warsaw, Poland E-mail: [email protected]

Phone: +48 22 234 56 40

1 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Abstract The aim of the paper is to address all the disadvantages of currently available models for calculating infinite dilution activity coefficients (γ ∞ ) of molecular solutes in ionic liquids (ILs) – a relevant property from the point of view of many applications of ILs, particularly in separations. Three new models are proposed, each of them based on distinct machine learning algorithm: stepwise multiple linear regression (SWMLR), feed-forward artificial neural network (FFANN) and least squares support vector machine (LSSVM). The models were established based on the most comprehensive γ ∞ data bank reported so far (> 34, 000 data points for 188 ILs and 128 solutes). Following the paper published previously [J. Chem. Inf. Model 2014, 54, 1311–1324] the ILs were treated in terms of group contributions, whereas the Abraham solvation parameters were used to quantify an impact of solute structure. Temperature is also included in the input data of the models so that they can be utilized to obtain temperaturedependent data and thus related thermodynamic functions. Both internal and external validation techniques were applied to assess the statistical significance and explanatory power of the final correlations. A comparative study of the overall performance of the investigated SWMLR/FFANN/LSSVM approaches is presented in terms of root mean square error and average absolute relative deviation between calculated and experimental γ ∞ , evaluated for different families of ILs and solutes, as well as between calculated and experimental infinite dilution selectivity for separation problems benzene from n-hexane and thiophene from n-heptane. LSSVM is shown to be a method with the lowest values of both training and generalization errors. It is finally demonstrated that the established models exhibit an improved accuracy compared to the state-of-the-art model, namely, temperature-dependent group contribution linear solvation energy relationship (TDGC-LSER), published in 2011.

2 ACS Paragon Plus Environment

Page 2 of 65

Page 3 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Introduction Ionic liquids (henceforth referred to as ILs) are molten organic salts exhibiting several peculiar characteristics interesting for both fundamental and applied chemical sciences. 1–3 In particular, they usually form liquid phase at room temperature and basically do not vaporize. Besides they disclose long-term thermal and chemical stability over wide range of operating conditions as well as enhanced dissolution capabilities for a great diversity of materials, ranging from simple inorganic salts 4 and gases 5 to complex biomaterials, e.g. APIs, 6 biomass-derived value-added compounds, 7 or (poly)saccharides. 8 Because of these features, ILs have been commonly perceived as “green” replacements for volatile organic compounds (VOCs) traditionally used in chemical industry. Since over 25 years this has been the main reason justifying the importance of knowlegde of the basic physical and thermodynamic characteristics of pure ILs, e.g. phase transitions, specific density, dynamic viscosity, interfacial tension, heat capacity, and so on, but also the properties of mixtures of ILs with VOCs. In fact, accurate data on phase equilibria and excess functions of ILs-based systems are essential as well as far as one considers the applications of ILs in heat and mass transfer based chemical engineering unit operations. These data are crucial indeed, because an utilization of ILs in novel and clean separation processes has attracted a significant amount of attention in recent years. 9 In particular, ILs have been considered as entrainers in liquid-liquid extraction and extractive distillation of azeotropic systems. 10 Both design and optimization of any extraction-involving processes require a lot of information. The most important, however, are these related to the differences in mutual affinity between the solutes to be separated and the solvents used as extractive media. This affinity can be deduced based on different thermodynamic data, for example from partition coefficients, binary or ternary liquidliquid equilibrium (LLE) phase diagrams or infinite dilution activity coefficients of the solutes in the solvent (throughout the paper referred to as γ ∞ ). Although the LLE data are usually much more difficult to measure than γ ∞ , they reflect the real extraction process more closely. On the other hand, γ ∞ values give only a rough piece of information about the real extraction process, but these data can be measured by using much simpler and cheaper chromatographic techniques compared to LLE 3 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

determinations. Besides, based on their thermodynamically precise definition, γ ∞ data allow to evaluate and quantify the differences in molecular interactions between solutes and solvent: given γ ∞ temperature-dependence, limiting values of partial excess functions of mixing (enthalpy and entropy) can be calculated. From a purely utilitarian point of view, some relevant characteristics for extraction like selectivity, capacity and performance index can be derived from γ ∞ . 11 Then, these can be employed to semi-quantitative solvent screening process, i.e. to estimate whether the solvent under consideration (e.g. an IL) is worth to be considered in a given separation problem or not. 12 Summing up, one can conclude that γ ∞ must be perceived as a relevant thermodynamic property and basically this has been the main boost for all the studies summarized in this contribution. A vast amount of various experimental data on ILs-based systems has been reported in the open literature in the last two and a half decades. Most of the data is available free to public in web-based databank ILThermo, 13 maintained by the National Institute of Standards and Technology of the United States (NIST Standard Reference Database 147). On the basis of the experimental evidence collected therein it has been clearly established that chemical structures of both cations and anions significantly affect all the macroscopic properties of bulk ILs. However, despite a huge effort of numerous ILs research communities the quantitative (in some case even qualitative) structureproperty relationship is still not clear or at least not well-understood. Of course, this is due to an enormous number of possible cation-anion combinations and thus ILs. Some authors claim that the number of possible 1:1 ILs is of the order of 106 , whereas this number would reach 1018 if binary and ternary systems were included. 2 To my best knowledge, however, the number of 1:1 ILs that can be combined from the cations and anions described in literature is around 150,000. 14 However, synthesis and physico-chemical measurements for approximately 1500 ILs only have been reported thus far. Hence, only a small fragment of the “ILs chemical space” has been discovered. The remaining regions of the space, i.e. the unknown chemical information on ILs, could be explored by different means. This is beyond a shadow of doubt that the experiments are the source of the most valuable information on the system under study. However, they are usually time-consuming and expensive as they require expertise, apparatus and the skilled staff familiar with the equipment. In

4 ACS Paragon Plus Environment

Page 4 of 65

Page 5 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

the age of exponentially increasing computing power of desktop computers, an alternative for a real experiment may be so-called in silico experiment, or, in other words, modeling. Recently, a vast amount of contributions concerned with the modeling of various physical and thermodynamic properties of ILs-based systems have been reported. Obviously, in the framework of this short research paper it is unfeasible to present a detailed overview of the progress in this particular field. Exhaustive summaries on different modeling techniques applied to ILs can be found in several excellent review papers published elsewhere in the last few years. 15–19 However, since this paper treats γ ∞ in ILs, a more precise overview of the methods and models reported for this property thus far is provided. 12,20–36 Furthermore, this review is intended to highlight the drawbacks and deficiencies of the currently available models and thus explain the idea and motivation of this work more deeply. To my best knowledge, the very first paper dealing solely with γ ∞ in ILs modeling was published by Diedenhofen and co-workers 20 in 2003. The authors applied conductor-like screening model for real solvents (COSMO-RS), a predictive method for the thermodynamic properties of liquid mixtures that employs a statistical mechanics tools based on the results of quantum chemical calculations. 20 The unquestionable advantage of this method is that it does not require any experimental input, but only the information that are needed to be known are the chemical structures (and the optimized conformations) of the molecules constituting the investigated system. The calculations were carried out considering three representative ILs, namely, 1-ethyl-3-methylimidazolium bis(trifluoromethylsulfonyl)imide, 1-ethyl-2,3-dimethylimidazolium bis(trifluoromethylsulfonyl)imide, and 4-methyl-N-butylpyridinium tetrafluoroborate, and 38 molecular solutes (including alkanes, alkenes, alkylbenzenes, alcohols, polar organics and chloromethanes). In total, 330 γ ∞ data points were predicted at T = 314 K and T = 344 K, and then compared to the reference experimental data. The resulting root-mean-square error was 0.421 and 0.393 of natural logarithm-based units (ln-units) at T = 314 K and T = 343 K, respectively. The average absolute relative deviations between calculated and experimental γ ∞ (i.e. not ln γ ∞ ) were around 33% and 25% at lower and higher temperature, respectively. A variant of COSMO-RS, named COSMO-RS

5 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(Oldenburd), was also successfully applied to reproduce experimental γ ∞ data. 37 Nevertheless, the original COSMO-RS approach of Klamt and Eckert still remains a very popular in the studies of γ ∞ of ILs (one may speculate that it is a leading approach). For example, γ ∞ data predicted with COSMO-RS were applied to preselect the ILs suitable for three important separation processes: aromatic/alkanes, alkenes/alkanes, and ethanol/water. 12 Kurnia et al. 21 used the COSMO-RS to design an ionic liquid suitable for absortive cooling based on the predicted γ ∞ of water. Very recently, Coutinho et al. 22–24 demonstrated a systematic study on γ ∞ of some important solutes, like water 22,23 or thiophene. 24 All these studies indicate a reliable prediction of γ ∞ in a wide range of ILs within average absolute relative deviation < 30%. Despite the fact, that the COSMO-RS allows to obtain a deeper insight into the interactions between ILs and molecular compounds, the model itself is quite complicated in use. In fact, its application usually requires some quantum chemical expertise and experience as well as the commercial (thus reliable and validated) software packages. For this reason, simpler models like empirical quantitative structure-property relationships (QSPRs) based on different types of molecular descriptors seem to be more preferable for practical utilization, e.g. in chemical engineering in fast solvent design and screening. The very first contribution on QSPR modeling of γ ∞ was presented by Eike and co-workers 25 in 2004. The ILs considered were the same as in the paper of Diedenhofen et al. 20 The solutes were described by using a set of four molecular descriptors: octanol/water partition coefficient (physico-chemical descriptor), number of hydrogen bonds (fragment-type descriptor), surfaceweighted partial negative surface area (quantum chemical descriptor) and the sum of the E-state values for carbon atoms (topological descriptor). It was shown that for each IL, the proposed set of descriptors is capable to reproduce the experimental data within < 30%, thus the same accuracy was achieved as in the case of previously published COSMO-RS calculations. 20 Nevertheless, this has to be emphasized that COSMO-RS results were based on first principles, whereas the QSPR calculations proposed by Eike et al. 25 were actually simple linear correlations. Moreover, in the QSPR model under consideration a separate set of 10 coefficients (2 coefficients describing per descriptor) were required for each IL, what significantly reduces its potential applications. A very

6 ACS Paragon Plus Environment

Page 6 of 65

Page 7 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

similar study was reported by Tämm and Burk. 26 These authors developed three-parameter QSPR based on alternative set of solutes’ molecular descriptors compared to the reference 25. Although the accuracy of the new QSPR correlations was a little bit better (< 20%), the model suffered from not accounting for temperature-dependence of γ ∞ (i.e. individual coefficients are required for each value of T). Another family of QSPR models that can be used to describe γ ∞ are those based on linear solvation-energy relationship (LSER) involving the Abraham molecular descriptors. 38 Usually, there are five descriptors denoted by E, S, A, B and L implemented in such correlations, whereas each descriptor is related to a specific kind of molecular properties of solute, namely, dipolarity and polarizability (E and S), acidity (A), basicity (B) and lipophilicity (L). Mintz and Acree 27 demonstrated that these descriptors can be used to correlate γ ∞ data with a reasonable accuracy by using simple multiple linear regression:

RT ln γ ∞ = c + eE + sS + a A + bB + l L

(1)

Usually, however, LSER approach is not applied directly to model γ ∞ data, but rather to represent gas-to-IL partition coefficient (log K) or water-to-IL partition coefficient (log P). 27 However, this is not a problem of LSER-based calculation of γ ∞ since the values of log K can be easily transformed into γ ∞ by using the following simple relation:

log K = log

RT

(2)

γ ∞ Psolute VIL

where the symbols R, T, Psolute and VIL stand for universal gas constant, temperature, saturated vapor pressure of solute at T and molar volume of IL at T, respectively. Due to the fact that LSER descriptors are not always easily accessible, Ge and co-workers 28 proposed to replace the traditional descriptors by their theoretical counterparts obtained by means of quantum chemical calculations. However, such step did not improve the overall accuracy of calculations, as the resulting average absolute relative deviation between correlated and experimental γ ∞ was 28% for a set of 34 organic solutes in 1-butyl-3-methylimidazolium trifluoromethanesulfonate and 33% for 7 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

the same set of solutes in 1-propyl-2,3-dimethylimidazolium tetrafluoroborate. It is noteworthy that in some of the papers mentioned so far, 27,28 neither internal nor external validation of the models was performed. Analysis of statistical significance of the obtained coefficients was not carried out as well. For the first time it was done by Xi et al., 29 who divided their whole data pool into two subsets: training set used to fit the QSPR model parameters and cross-validate the model (using the well-known LOO scheme) and test set introduced to test the model predictive capacity. In total, the authors proposed the QSPR correlation incorporating six descriptors (mainly quantum chemical ones). It is noteworthy that reciprocal temperature was explicitly included in the set of descriptors (although in somewhat different way than in reference 20). 29 It has been already mentioned that all the listed QSPR models are limited to prediction of γ ∞ of a solute in a given IL, assuming that the model coefficients are known. The coefficients are IL-specific, so that they are determined on the basis of experimental γ ∞ data. In other words, the proposed models cannot be used to perform a pure in silico design of the structure of IL with desirable values of γ ∞ or any other properties related to it. In order to resolve this problem, several group contribution (GC) approaches for calculating the IL-specific coefficients of LSER-based models were proposed. 30–35 Sprunger et al. 30–32 proposed to separate the IL-specific coefficients e, s, a, b and l from eq (1) into cation-specific and anion-specific contributions: c = ccation + canion , e = ecation +eanion , s = s cation +s anion , . . . , and so on. In the very first version of their model published in years 2007–2009, 32 root-mean-square deviation at the level of 0.145 log-units of log K calculated based on 955 samples covering 9 cations and 8 anions was achieved. 32 It is noteworthy that the LSER anion-specific coefficients (canion , eanion , s anion , . . .) were set equal to zero in the case of bistriflamide ([NTf2 ] – ) anion in order to define a reference point for calculating thermodynamic properties of other ions. 32 As a result, the respective LSER coefficients for some new cations (canion , eanion , s anion , . . .) could be obtained directly as these obtained for the corresponding [NTf2 ]based IL. In turn, the coefficients for new anions could be obtained by substraction based on the IL-specific coefficients fitted directly to thermodynamic or retention data: canion = c − ccation , eanion = e − ecation , s anion = s − s cation , . . .. This promising ion-specific approach for the treatment of 8 ACS Paragon Plus Environment

Page 8 of 65

Page 9 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

the LSER coefficients has been systematically revised and extended since 2009 by the same research group and co-workers. The first significant extension of the model was proposed by Grubbs et al., 39 who summarized the calculations for ILs composed of 19 distnict cations and 10 anions. A detailed list of all the contributions to ion-specific LSER covering years from 2010 to 2013 can be in the references list of the paper of Stephens et al. 40 (see refs. 24–39 therein), who in 2014 provided a comprehensive revision and extension of the model. In this version of the ion-specific LSER, the parameters table covered 40 different cations and 16 different anions, so that the number of binary ILs that can be tested by the model (40 × 16 = 640) increased approximately by a factor of 9 compared to the pioneering paper of Sprunger et al. 30–32 Since the publication of the paper of Stephens et al. 40 the range of applications of the model has additionally increased as the parameters for several novel cationic and anionic structures (like imidazolide anions or quinuclidinium-based cations) have been calculated and reported. 41–43 In order to generalize the ion-specific LSER model to a wider range of cations, Revelli et al. 33 proposed to split the cation-specific contribution into 12 additive group-specific contributions, while leaving the anion-specific coefficients unchanged. The resulted model reproduced the used database of 1450 samples with log K root-mean-square deviation of 0.152 log-units. Grubbs et al. 34 extended this model by introducing a number of new anions. Root-mean-square deviation calculated based on 2347 values of log K was 0.131 log-units. Unfortunately, the authors did not perform neither internal nor external validation of their correlations. Furthermore, the final model reported by Grubbs et al. 34 was developed based on the experimental γ ∞ data only at T = 323 K making the calculations at higher or lower temperatures unfeasible. To resolve this problem, Mutelet et al. 35 proposed temperature-dependent GC-LSER method (originally abbreviated by TDGC-LSER) for predicting log K of various solutes in ILs. The cations’ functional groups in the new model were transferred from the previous GC-LSER models. Surprisingly, the upgraded T-dependent model incorporates less groups compared to T-independent model by Grubbs et al. 34 The effect of temperature was accounted for by dividing each LSER IL-specific coefficient by T. The final value of root-mean-square deviation between experimental and calculated log K (6990

9 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 65

samples for 42 ILs) was 0.130 log-units. The authors tested predictive capacity of their model by exclusion the data for five ILs from the model coefficients determination. Besides, the values of log K returned by the model were used to recalculate respective γ ∞ data by using eq (2). The average absolute relative deviation vary from 9% to 72% depending on IL. In my opinion, the TDGC-LSER model can be seen as the state-of-the-art empirical model for γ ∞ prediction with the widest range of potential applications. All the QSPR models discussed so far were based on a multiple linear relationship between ln γ ∞ (or a related property, mostly log K) and the molecular descriptors. However, more complicated machine-learning algorithms have become very popular in recent years, especially in establishing quantitative structure-activity relationship (QSAR). 44 Some of them found applications in QSPR modeling of physical properties of ILs. In particular, dozens of papers on applications of artificial neural networks (ANNs) to represent diverse properties of different materials can be found in literature. For example, in our previous work we demonstrated that ANN can be applied to accurately reproduce an extensive database of ILs viscosity data over a wide range of temperature and pressure. 14 However, in the case of γ ∞ of molecular solutes in ILs, we found only a single paper by Nami and Deyhimi concerned with ANN modeling of this property. 36 Unfortunately, several important drawbacks of the model proposed by these authors should be highlighted: (1) the range of the database used to develop and test ANN was too narrow (only 16 imidazolium ILs, only 914 samples – there were much more data readily accessible from open literature; limited domain of ILs chemical space available); (2) the ANN topology optimization was not performed, so that there is a high risk that the model is overfitted; (3) the authors did not provide any relevant data regarding the final model, i.e. the values of the model parameters and all the descriptors (actually, some of them have not been precisely defined in the paper!), the ranges of inputs, trace of ANN training and so on. Therefore, drawing any conclusions based on the results presented in reference 36 is not recommended. In this contribution all the disadvantages of the models mentioned in the presented review (first of all, limited predictive capacity) are addressed. New empirical correlations for γ ∞ of molec-

10 ACS Paragon Plus Environment

Page 11 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

ular solutes in ILs obtained solely on the basis of the Abraham descriptors of the solutes and functional group assignments of cations and anions composing IL are proposed. In order to provide a comparative analysis of different modeling methods that can be adopted to compute γ ∞ , the new models employ three different machine learning algorithms: step-wise multiple linear regression (SWMLR), feed-forward artificial neural network (FFANN) and least squares support vector machine (LSSVM). To my best knowledge this is the very first time when such analysis is demonstrated. The remaining sections of the paper cover a detailed description of the models’ development procedure (including: brief description of the applied machine-learning algorithms, their implementation to represent ILs, solutes and γ ∞ data, database used to adjust the parameters, validation methods), the summary of results and their discussion and statistical analysis followed by the the final concluding remarks, outlook and perspectives for future investigations.

11 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 65

Development of the New Models In the following subsections, a description of all the steps made in the process of the new models development is presented. First, the three machine learning algorithms used to develop the new models for γ ∞ calculation are briefly described. The summaries comprise a general overview of mappings f : Rn → R representing a given dataset

D = {(xi, yi ) : xi ∈ Rn ; yi ∈ R; i = 1, . . ., N }

(3)

(where Rn denotes a set of column n-element vectors of real numbers) with the relationships of type

yi = f (xi ; p) + εi = yˆi + εi

(i = 1, . . ., N )

(4)

where yˆ = f (x; p) denotes the value calculated by using the model, ε stands for disturbance term (residual), whereas p is the vector of parameters (coefficients) of the model f . In general, the best model (i.e. the optimum value of p) is selected so that the sum of squares of residuals (or similar objective function) is in minimum. Then, assignments of x and y corresponding to application of the new models to γ ∞ data are proposed and discussed. Finally, the database used in the models’ training and testings as well as the cross-validation scheme are summarized.

Machine Learning Algorithms Stepwise Multiple Linear Regression (SWMLR) Multiple linear regression (MLR) is basically the simplest machine learning algorithm based on an assumption that the scalar response variable y is proportional separately to each component of

12 ACS Paragon Plus Environment

Page 13 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

the input vector x, namely

y = β 0 + xT β + ε

(5)

where β0 is an intercept term, β = [ β1, . . . , βn ]T (“T” denotes matrix transposition) is the vector of model coefficients (regressors). If we redefine x = [1, x 1, . . . , x n ]T and β = [ β0, β1, . . . , βn ]T , then for each sample from D, eq (5) simplifies to

yˆi = xiT β

(i = 1, . . ., N )

(6)

The optimum values of regressors are usually estimated by using ordinary least squares (OLS) approach via minimization of sum of squares of residuals. The resulting formula for OLS estimator of β reads as 45

β = (XT X) −1 XT y

(7)

where y = [y1, . . . , y N ]T and X ∈ RnN (set of N × n matrices) is so-called design matrix defined such that the i-th row of X corresponds to xiT . Note that X should be a full-rank matrix in order to properly apply eq (7). A proper selection of independent variables is crucial to obtain reliable and statistically sound MLR model. One of the methods to select a model with the highest explanatory power is stepwise MLR (in further text referred to as SWMLR). 45 Generally, SWMLR consists of first defining an initial model and then in adding/removing terms this model based on a selected criterion of their statistical significance (e.g. p-value or F-statistic). At each step, the criterion is computed and compared with the fixed entrance value to test models with and without a considered term. The method terminates, when no single step improves the model. MATLAB (MathWorks, Inc.; Matlab version R2014a) implementation of SWMLR within Statistics Toolbox and stepwisefit function 46 was applied in this study to model γ ∞ data. The

13 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 65

utilized algorithm proceeded as follows: 1. Setting initial model as yˆ = β0 and entrance/exit tolerance of a tested coefficient βi as its p-value for null hypothesis H0 : βi = 0, where the critical value was fixed to pc = 0.05 (terms with the p-values lower than pc are assumed to statistically significant, i.e. it is unlikely that they would have zero coefficient if added to the model); go to step 2. 2. Testing the larger models (with an extra term fitted): If for any terms not in the model p < pc , add the one with the smallest p-value and repeat this step; otherwise, go to step 3. 3. Testing the smaller model (without an extra term): If for any terms in the model p > pc , remove the one with the largest p-value and go to step 2; otherwise, end. It should be noted that models resulted from SWMLR are, in general, locally rather than globally optimal. This is due to the fact that the final model may be sensitive towards the initial set of terms, thus also the order in which the terms are included to and excluded from the model. 45

Feed-Forward Artificial Neural Networks (FFANN) Artificial neural network (ANN) is a machine learning algorithm inspired by biological neural systems, particularly by the brain. 47–49 The most important common features of both biological and artificial neural networks is the layered structure of their basic units (neurnons) as well as the collective and parallel processing of the input signals. Due to their flexibility and enhanced generalization capability, ANNs have found so far numerous applications in many areas of science, particularly chemistry. 50 Typical ANN is usually pictured as built up of simple processing units called the nodes. The nodes form different kinds of layers, for example hidden layers, where the input data from the previous layer are processed and passed to another layer, or output layers where the data from the last hidden layer are gathered together, finally processed and eventually returned as the output. Depending on the network topology (i.e. the way of interconnection between layers and nodes) as well as the direction in which the input signal is processed, one can distinguish several 14 ACS Paragon Plus Environment

Page 15 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

types of ANNs. 47–49 In feed-forward ANNs the signal processing move forwards only through a number of hidden layers and the output layer. According to universal approximation theorem there exists a FFANN with a single hidden layer having a finite number of nodes can approximate any multivariative continuous function on compact subsets of Rn . 49 That is why that FFANNs have been widely used in function approximation. 51 Nevertheless, it should be stressed that currently there are some controversies around their robustness in some practical utilizations. 52 In a great majority of the cases (this work as well), FFANNs composed of two layers are used to determine a relationship between input x and output y. The first layer is the hidden layer with an arbitrary number of nodes. In turn, the second layer is the output layer, whereas the number of nodes in this layer is the same as dimension of the output variable. In this work the scalar output are considered, so that there will be a single node in the output layer. Input processing by such two-layer FFANN is as follows. Let s denote the number of nodes in the hidden layer of the network. First, x is passed to the hidden layer, where it is transformed into intermediate output o1 ∈ Rs according to: o1 = ϕ1 (W1 x + b1 )

(8)

where W1 ∈ Rns and b1 ∈ Rs are matrices of the hidden layer’s parameters called, respectively, weights and biases. The mapping ϕ1 : Rs → Rs is called activation (or transfer) function of the 1st hidden layer. By analogy, the output from the second layer, i.e. o2 , is produced on the basis of the output o1 and the output layer transfer function ϕ2

o2 = ϕ2 (W2 o1 + b2 )

(9)

and so on for the another layers if available. Of course, only two layers are pondered in this work, thus o2 = y ∈ R, W2 ∈ R1s and b2 = b2 ∈ R. Combining eq (8) and eq (9) we get the main working

15 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 65

equation for FFANN approximation of y data from D:

  yˆi = ϕ2 W2 ϕ1 (W1 xi + b1 ) + b2

(10)

where i = 1 . . ., N. From eq (10) is clearly seen that before the considered FFANN (denoted by n–s–1) can applied it is necessary to: (1) specify the transfer functions ϕ1 and ϕ2 ; (2) set number of nodes in hidden layer s; (3) determine s(n + 2) + 1 weights and biases, i.e. train the network. In this work, the hyperbolic tangent sigmoid and identity function were used as the activation functions for the hidden and output layer, respectively. Regarding the number of hidden nodes, the problem is not unequivocal due to the fact that strict rules for determining an optimum value of s do not exist. In general, s depends on many factors: the dimensionality of input (n) and output, the number of training examples (N), the amount of noise in the “target” data y, the complexity of the function to be learned, the types of activation functions ϕ1 and ϕ2 , the training algorithm, and so on. A commonly accepted “rule of thumb” for a good value of s is that the final network should result in the lowest generalization error, provided that the ratio of number of training cases to the number of all FFANN weights/biases should be as high as possible. 48,49 MATLAB implementation of eqs (8) to (10) within Neural Networks Toolbox 47 was employed in this work. The network’s weights and biases were determined with Levenberg-Marquardt nonlinear least squares algorithm 53,54 applied to minimize mean squared error (MSE) between calculated and target y values

MSE =

N  1 X yi − yˆi 2 N i=1

(11)

Before the actual training process both inputs x and targets y were uniformly scaled to be within [−1, 1] interval. This is a common practice in ANN modeling because the scaling eliminates artifical biases in transfer functions and MSE, hence the network training converges much faster. In order to avoid overfitted models, “early stopping” approach was additionally incorporated into the network training process. 47 In this method, a subset of D called validation set is excluded 16 ACS Paragon Plus Environment

Page 17 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

from the weight and biases fitting. Two values of MSE corresponding to both training set and validation set are monitored during the optimization. Whereas training MSE never increases (by definition), the validation MSE starts to increase when the model begins to overfit the data. If this increase is observed for a given number of subsequent iterations (in this work, 5), then the training process stops. Initial values of weight and biases were generated followed Nguyen-Widrow algorithm. 55 Finally, considering number of data samples used in the training process, different FFANN architectures with s varying from 1 to 45 were tested to find the optimum number of hidden nodes.

Least Squares Support Vector Machine (LSSVM) The idea of support vector machine (SVM) arises from the problem of data classification, i.e. assigning a numerical label (usually from finite set, e.g. ±1) to a given input datum based on a decision function called a classifier. 56,57 The final form of the classifier is obtained by means of supervised learning, i.e. on the basis of some examples representing a valid behavior of the problem under consideration. SVMs have been also used as efficient tools for nonlinear function estimation. 58 Regarding this particular area of applications, it has been well established in machinelearning community, that SVMs have several advantages over other machine learning algorithms, especially over ANNs. For example, SVMs do not require a step similar to a selection of ANN topology (e.g. number of hidden layers or units), do not suffer from a high risk of local minima or overfitting, are not sensitive to starting points. 59 In 1999, Suykens and Vandewalle 60 proposed an upgraded version of SVM called least squares SVM (LSSVM). In particular, LSSVM slighlty reformulates the original optimization problem of SVM in terms of least squares method and replaces inequality constrains by respective equalities. It has been turned out that these changes significantly simplify the calculations and reduce overall computational cost. Indeed, the final LSSVM model is given as a solution of set of linear equations instead of a quadratic programming problem required in conventional SVM. 61,62 LSSVM has been already proven to be an interesting tool for representing various characteristics of reservoir

17 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 65

fluids, 63–65 also different properties of pure ILs. 66,67 Detailed explanation of all the theoretical and practical aspects of both SVM and LSSVM can be found in extensive monographs 57,61 and numerous online resources. 62 In particular, an interested reader is referred to an exhaustive comparison of SVM vs LSSVM given by Wang and Hu. 58 A concise overview of the LSSVM methodology and its most relevant expressions are given in the following paragraphs. LSSVM considers the problem of approximation of dataset D defined in eq (3) with the following function

yi = wT ϕ(xi ) + b + εi

(i = 1, . . . N )

(12)

where w ∈ Rh is the weight vector from high-dimensional space, b ∈ R is the bias term and ϕ : Rn → Rh . The optimization problem for the final values of w and b is given as follows: N

min

w,b,ε1,...,ε N

1 X 2 1 ε J (w, b, ε 1, . . . , ε N ) = wT w + γ 2 2 i=1 i

(13)

where γ > 0 is so-called regularization constant, subject to eq (12). The constrained optimization problem given in eq (13) is rewritten in the form of unconstrained Lagrangian with multipliers αi (i = 1, . . ., N). Differentiation of the Lagrangian with respect to all the variables followed by simple algebraic rearrangements leads to a compact relationship between the weights vector and Lagrange multipliers, namely

w=

N X

(14)

αi ϕ(xi )

i=1

Further transformations allow to respresent the optimum condition as the following systems of linear equations  1TN  0   1 N Ω + γ −1 I N

   

     b   0   =    α   y  

(15)

18 ACS Paragon Plus Environment

Page 19 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

In eq (15), 1 N = [1, . . ., 1]T , y = [y1, . . ., y N ]T , α = [α1, . . ., α N ]T and Ω ∈ R NN is a symmetric matrix with Ωi j = ϕ(x j ) T ϕ(xi ). Substituting eq (14) into eq (12) (keeping in mind that εi = yi − yˆi ), and introducing K (x, x′ ) ≡ ϕ(x) T ϕ(x′) result in the final working equation of a LSSVM-based model, namely

yˆi =

N X

αi K (xi, x j ) + b

(i = 1, . . . , N )

(16)

j=1

“Kernel trick” allows to avoid using dot products of ϕ function and replace them by a special kernel function. 61 Several kinds of kernel functions are currently in common use in LSSVM modeling. In this work, radial basis function (RBF) kernel is adopted kx − x′ k 2 K (x, x ) = exp − 2σ 2 ′

!

(17)

The parameter σ 2 is considered to be an extra degree of freedom, which is optimized by an external optimization algorithm during the calculations. MATLAB tools for LSSVM calculations included in the LS-SVMlab Toolbox (version 1.8) 62 were used in this work. First, an appropriate algorithm implemented within LS-SVMlab detects and rescales continuous, categorical and binary variables. 62 Then, the parameters of the LSSVM (regularization constant γ and kernel parameter σ 2 ) are determined via minimization of a selected performance measure; this step is called tuning. Default settings of LS-SVMlab, i.e. MSE defined in the same fashion as in eq (11) and Nelder-Mead simplex algorithm, 68 were adopted to tune the LSSVM. Suitable starting points for γ and σ 2 are determined by a state-of-the-art global optimization technique, Coupled Simulated Annealing (CSA). 69 In the framework of the LS-SVMlab tuning the quality of the fit is additionally checked by 5-fold cross-validation. Given eventually tuned γ and σ 2 , α and b (and the final model) are uniquely calculated by solving linear systems in eq (15). This step is referred to as the LSSVM training.

19 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 65

Database For the purpose of this study, an extensive compilation of experimental γ ∞ data was created. The data were extracted form the available open literature sources published since 2001. Table 1 classifies the data collected with respect to the type of the cation. As can be noted, the compilation formed by 33414 data points combined into 7342 data sets was finally obtained. It is also seen that imidazolium salts are the most exhaustively investigated family of ILs as they make up more than a half of all the compounds. The data cover a wide range of temperature form T = (288.15 to 413.15) K, but also relatively wide interval of γ ∞ values from (0.01 to 9810). Both T and ln γ ∞ data are distributed approximately normally around their means, what can be seen in histogram shown in Figure 1. In total, 162 articles reporting the measured γ ∞ values of 128 distinct solutes (alkanes, cycloalkanes, alkenes, alkynes, aromatics, alcohols, aldehydes, ketones, ethers, esters, Cl-F compounds, water, others) in 188 ILs were taken into consideration. The collected data were determined mostly with gas-liquid chromatography based on Everett-Cruickshank formalism. 70,71 A full list of references is given in the Supporting Information. In general, the most comprehensive data bank of γ ∞ data for ILs systems among these reported so far is presented in this work.

20 ACS Paragon Plus Environment

Page 21 of 65

Table 1. Summary of the Infinite Dilution Activity Coefficients Database Used in This Work in Terms of Number of Ionic Liquids, Solutes, References, Data Sets and Data Points, Temperature T and γ ∞ Range. IL Family

Number of the Data Collected ILs

Imidazolium (Im)

Solutes Papers

104

121

19

Quinolinium (Quin)

Data Setsa

ln γ ∞

T (K) Data Pointsa

min

max

min

max

95 3248 (443) 13933 (1454) 288.15 413.15 −4.343

9.191

83

20

644 (6)

3714 (32)

297.00 378.15 −2.408

6.279

2

53

2

89 (0)

505 (0)

308.15 368.15 −0.994

4.654

Pyrrolidinium (Pyr)

16

78

16

787 (17)

3982 (57)

298.15 408.15 −3.079

6.668

Piperidinium (Pip)

7

65

6

324 (0)

1879 (0)

308.15 368.15 −1.457

6.188

Bicyclic (Bicyc)

4

80

3

184 (0)

877 (0)

313.15 363.15 −2.919

4.108

Morpholinium (Mo)

4

64

4

246 (0)

1467 (0)

298.15 368.15 −1.737

6.436

Ammonium (N)

17

102

13

748 (13)

3094 (39)

300.95 396.15 −4.605

8.284

Phosphonium (P)

14

77

14

524 (37)

2006 (125)

298.15 413.15 −3.817

6.799

1

32

1

32 (0)

250 (0)

298.15 368.15 −0.238

4.875

188

128

162 6826 (516) 31707 (1707) 288.15 413.15 −4.605

9.191

Pyridinium (Py)

21

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

Journal of Chemical Information and Modeling

Sulfonium (S) Overall a

Numbers of data sets/points excluded from the models development process are given in parentheses.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 65

Figure 1. Distribution of temperature-dependent infinite dilution activity coefficients (ln γ ∞ vs T) collected in the database presented in this work.

The ILs contained in the database are composed of 124 ions, including 84 distinct cations and 40 distinct anions. A full list of ions, including their names, molecular weights and chemical structures, is given in the Supporting Information. The data cover ten cationic families of ILs, namely, imidazolium, pyridinium, quinolinium, pyrrolidinium, piperidinium, bicyclic cations, morpholinium, ammonium, phosphonium and sulfonium, including simple alkyl-substituted as well as functionalized “task-specific” cations. The ILs based on 1-(n-alkyl)-3-methylimidazolium cations, [Cn C1 Im]+ where n = 2–8 (2 denotes ethyl, 4 denotes butyl, . . . ), are the most often encountered compounds in the database (in particular, there are 27 ILs with [C4 C1 Im]+ cation). Diversity of anion types is much more higher, including small or moderate inorganic moieties like [NTf2 ] – (the most abundant ion, 55 ILs), [BF4 ] – , [(C2 F5 )3 PF3 ] – , [Br] – (11 ILs per each), [CF3 SO3 ] – , [SCN] – (7 ILs per each), [NO3 ] – , [PF6 ] – (6 ILs per each), [Cl] – , [N(CN)2 ] – (5 ILs per each), [B(CN)4 ] – , [C(CN)3 ] – (4 ILs per each) as well as larger and asymmetric organic molecules like alkylsulfates and alkylsulfonates (21 ILs), dialkylphosphates (10 ILs), long-chain carboxylates (4 ILs). Only for

22 ACS Paragon Plus Environment

Page 23 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

53 ions, the data for more than a single IL can be found in the database. In the case of a great majority of IL-solutes binaries, a single data set per system was available. Basically all these systems were included in the modeling, unless they display unclear and abnormal values of γ ∞ . The most representative examples of papers reporting such obviously wrong data are the contributions of Wang et al., 72,73 who measured γ ∞ of different solutes in two common ILs: [C8 C1 Im][PF6 ] 72 and [C6 C1 Im][Cl]. 73 At first sight it can be noticed that γ ∞ for n-alkanes and aromatics follow unusual trends. The authors report that γ ∞ decreases with the molecular mass of alkane, which is in complete disagreement with what we all know about IL-solute molecular interactions. In fact, longer alkyl chain of solute promotes stronger solute-solute rather than ILsolute forces, what is evidenced by increase of γ ∞ . Furthermore, unusually high values of γ ∞ of benzene in [C8 C1 Im][PF6 ] (e.g. 101.9 at T = 343.15 K) and much smaller values of γ ∞ of other aromatics (e.g. 0.7426 at T = 343.15 K) are put together not considering that this data are inconsistent. 72 First, one could be misled that these errors might be a results of typos or wrong word processing. However, similar findings are presented for [C6 C1 Im][Cl] in a paper published approximately one year later. 73 For 521 IL-solute systems, the data from different papers are available and this is a crucial step before the final calculations to compare and evaluate them. Indeed, there is no doubt that erroneous data would result in unreliable models. In order to illustrate the problem, two examples of significant data scatter are shown in Figure 2: n-hexane in [C4 C1 Im][BF4 ] (Figure 2a) 74–78 and ethanol in [C2 C1 Im][C2 SO4 ] (Figure 2b). 79–81 Actually, it is not so trivial task to determine which data are wrong and then to exclude them from further considerations. This is first of all due to the fact that not all the authors report uncertainties associated with the measured γ ∞ . Besides, there are no standarized procedures and recommendations for valid reference data for γ ∞ calculation by using the Everett-Cruickshank formula. 70,71 In particular, different sources for thermophysical data required in γ ∞ calculations based on retention data (e.g., critical properties, second virial coefficient, vapor pressure of solute) may also result in the data scatter. Note that in the example from Figure 2b the data are in contradiction (if we concern their temperature dependence), even

23 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

when published by the same research group. 79,80

24 ACS Paragon Plus Environment

Page 24 of 65

Page 25 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 2. Infinite dilution activity coefficient (γ ∞ ) as a function of temperature (T) for representative systems displaying significant data scatter: (a) n-hexane in 1-butyl-3-methylimidazolium tetrafluoroborate; 74–78 (b) ethanol in 1-ethyl-3-methylimidazolium ethylsulfate. 79–81 Solid lines designated by linear regression of ln γ ∞ vs 1/T and plotted to guide the eye.

In this work two main (but unfortunately not universal) criteria to move a given data set in or out were applied: (1) check if the data set follow a regular trend when compared to similar data sets (e.g. for homologous series); (2) check the sign of the slope of ln γ ∞ vs 1/T and compare to similar systems. For the examples from Figure 2, based on the trends only the data of Bahlmann et al. 74 were accepted for the system n-hexane in [C4 C1 Im][BF4 ], see Figure 2a. In turn, based on the slope the data of Liebert et al. 79 (ethanol in [C2 C1 Im][C2 SO4 ], see Figure 2b) were move out from further considerations. As an overall result of the data revision following these rules, we finally rejected 516 data sets and 1707 data points. A details are given in Table 1 for each cationic family of ILs.

25 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 65

Input/Output for the Proposed Models Following the previous publications concerned with modeling of physical properties of ILs, 14,82 the ILs are represented in terms of group contributions (GCs). The number of functional groups introduced to describe γ ∞ data for all the ILs under study is, however, much smaller than in the case of density (177 groups), 82 or viscosity (242 groups). 14 This is because the number of ILs for which γ ∞ data have been available until the end of 2015 (188) is lower by an order of magnitude compared to density (1028 ILs in 2012) 82 /viscosity (1484 ILs in 2014). 14 Three classes of fragments are defined: cation cores representing the main building block of the cation (or the cation as a whole), anion cores representing the main building block of the anion (or the anion as a whole) and functional groups substituted to the cores. In total, G = 81 different groups are defined in this work to capture the structural diversity of ILs collected in the database: 18 cation cores, 33 anions cores and 30 substituted groups. A full list of groups is given in Table S1 in the Supporting Information along with their symbols and the numbers of their occurrences in the database. Group assignment for each cation and anion can be found in the Supporting Information as well. The cation cores comprise 10 distinct cationic families of ILs listed in previous section and Table 1. In particular, separate groups were considered for different positional isomers. For example, distinct cation cores are defined for 1,3-/1,2,3-substituted imidazolium cations or 1,2/1,3-/1,4-substituted pyridinium cations. The only cation represented as a single cationic group is pyridinium cation [PyH]+ . In the case of ammonium ILs, separate groups were assigned to primary and quaternary ammonium cations. In turn, almost all anions included in the database are described by a single group. This is due to stronger diversity in anionic vs cationic structures. Only in the case of carboxylates, sulfonates, sulfates and phosphates, the corresponding anion cores were defined. We checked that introducing smaller groups for remaining anions would result in the models exhibiting larger generalization errors in predicting γ ∞ for the systems not included in the training data. Substituted functional groups comprise nonpolar saturated and unsaturated fragments like CH3 , CH2 , CH, tert-Bu, CH2 −CH, o-/p-substituted phenyl rings as well as polar 26 ACS Paragon Plus Environment

Page 27 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

fragments like OH, CH3 O, CH2 O, C(O)O, C− −N, Si(CH3 )3 , CF3 , SO3 H, B(OH)2 . For some of the groups, to capture so-called “proximity” effects as well as to differentiate between structural isomers, several variants were defined depending on the position of a group in the cation or anion. For instance, groups substituted to aliphatic chains and aromatic rings are treated individually. Furthermore, groups attached directly to heteroatoms in aromatic rings (like in imidazolium or pyridinium cations) and aliphatic rings (like in pyrroldinium or piperidinium cations) are also distinguished. For each of N samples deposited in the database, one can define a group assignment vector ni = [n1,i, . . . , nG,i ], where n j,i ≥ 0 stands for the number of occurrences of functional group j in the IL for which the sample data point i is reported. Thus, for the whole database one can build N × G matrix N defined so that ni forms the i-th row of N. Before any calculations with SWMLR, FFANN or LSSVM, we first investigated the mutual pairwise correlations between the groups to avoid overfitted models with insignificant degrees of freedom. For this purpose, Gauss-Jordan elimination was performed on N to transform the matrix into reduced row echelon form and to find the pivot variables (MATLAB’s rref function was used). Following this procedure, 15 columns were removed from N and thus, only the remaining 66 were finally accepted for further processing with the machine learning algorithms (18 cation cores, 32 anion cores and 16 substituted groups). The full list of excluded groups is also given in Table S1 in the Supplementary Material. It should be underlined that in our previous studies, 14,82 the groups and the groups assignments were designed in such manner that each atom present in either cation or anion was present in one and only one occurrence of a group. Hence, there were no unassigned fragments in the molecules and there was no overlap of the groups. Unfortunately the elimination of N makes this picture a little bit more complicated, because one has to keep in mind that in the consequence of the elimination of some fragments in chemical structure ILs may be not actually represented by the model. However, the MATLAB program provided in the Supporting Information (see Appendix A) treats the problem automatically, so that a casual end-user does not have to be actually aware of this issue. Solutes are described by using a set of five the Abraham descriptors: 38 E, an excess molar

27 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 65

refraction (with respect to n-alkane having the same molar volume as the considered solute); S, dipolarity-polarizability descriptor obtained based on gas-liquid chromatography retention data on polar stationary phases; A, hydrogen-bond acidity descriptor; B, hydrogen-bond basicity descriptor; L, the solute gas-hexadecane partition coefficient at 298 K. These descriptors are readily accessible, but first of all they have been shown to be nicely correlated with γ ∞ data. 27 The values of E, S, A, B and L used in this work are listed in Table S2 in the Supporting Information. The models to be developed intend to predict not only γ ∞ but also its temperature dependence. Therefore, temperature is also included into the input data. However, on the basis of results of some preliminary calculations it was observed that lower training and generalization error are achieved when a scaled reciprocal temperature 1000/T is used rather than T. Summing up, the input vectors xi and the corresponding output variable targets yi used to numerically describe the γ ∞ data collected in the database within FFANN and LSSVM approaches are respectively defined as

xi = [1/T˜i, Ei, Si, Ai, Bi, Li, ni, ]T ∈ RG+6

(18a)

yi = ln γi∞

(18b)

with T˜i = 10−3Ti /K and i = 1 . . ., N. After the elemination process described above the final dimensionality of input is n = G + 6 = 72. In the case of the SWMLR, an utilization of eq (18a) does not lead to a good model. In fact, that would result in the universal temperature coefficient, while in general the slope of ln γ ∞ vs 1/T can change considerably (including its sign) from system to system. Therefore, the generalized expression Mutelet et al. 35 on TDGC-LSER is preserved, but applied directly to ln γ ∞ . Thus, for SWMLR the input and output are as follows: xi = [1, ni /T˜i, ni Ei /T˜i, ni Si /T˜i, ni Ai /T˜i, ni Bi /T˜i, ni Li /T˜i ]T ∈ R6G+1

(19a)

yi = ln γi∞

(19b)

where “1” corresponds to the intercept term of the model and i = 1, . . . , N. Of course, the initial 28 ACS Paragon Plus Environment

Page 29 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

very high dimensionality of the input n = 6G + 1 = 397 was significantly reduced by the stepwise regression.

Cross-Validation/Testing Schemes Since the main purpose of this study is to propose reliable and (basically first of all) predictive empirical equations for γ ∞ calculation, the standard procedures for both internal and external statistical validation are incorporated into the models’ development. A general view of the training, validation and testing “workflow” is shown in Figure 3 and described in this section.

29 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 65

Figure 3. A flow diagram for training/(cross-)validation/testing procedure applied to develop the models presented in this study (details given in text, in particular, number of folds for crossvalidation k = 9 and validation/testing set ratio p = 10). In the very first step, the whole data pool was revised and all the unaccepted data points and duplicates were dismissed. Then, the data were randomly divided into two subsets: {training + validation} set and testing set. The {training + validation} set was used to fit (train) the model parameters and carry out the internal validation (cross-validation). In particular, k-fold crossvalidation method was adopted in this work. In this technique, the entire set is partitioned into k subsets (folds) of equal size. Then, a single set of all the k folds is retained as the validation data for evaluating generalization capabilities of the model, while the remaining (k − 1) folds are used together as training data. As seen in Figure 3, such cross-validation is repeated k times with each fold used exactly once as validation data. Thus, k distinct models differing in the values of 30 ACS Paragon Plus Environment

Page 31 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

optimized parameters are produced. The external validation (testing) is examined for each model by using the same testing set composed of the data used in neither training nor internal validation. The actual protocol for application of the k-fold cross-validation is specific for each machine learning algorithm, namely: • In the case of SWMLR, the entire {training + validation} set was used to select statistically significant terms in eqs (6) and (19). Then, the model with finally selected coefficients was tested against k-fold cross-validation in a standard fashion. • For FFANN, the weights and biases from eq (10) are fitted only to the training data. The validation set is used within so-called “early stopping” algorithm described earlier. It is noteworthy that the additional degree of freedom is the number of hidden units s, so that kfold cross-validation was examined for different network architectures. The optimum model is selected based on the variation of generalization error against s. • In LSSVM calculations, the cross-validation scheme is somewhat reversed. For each of the k folds, the validation part of the {training + validation} set is used to tune the kernel parameter σ 2 from eq (17) and the regularization constant γ from eq (13). Given γ and σ 2 . the training part was used to solve the system of equations given in eq (15) for the support values from eq (16). MSE defined in eq (11) was used as a measure of quality the model performance in training or prediction. MSE was calculated over the samples from the respective training, validation and testing set and the results were compared. In general, interpretation and assessment of the results of both internal and external validation are not always trivial tasks. Usually, the models are perceived as statistically correct and significant if all three MSE values are of the same order of magnitude and have small variance, whereas the best model is always the one yielding in the lowest testing error. In particular, overfitted model is obtained when generalization (including validation and testing) errors are substantially higher than training errors.

31 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 65

In this work, the total number of data samples after the data revision and duplicates removing was 31704 (only 3 duplicates detected). The training, validation and testing sets covered approximately 90% and 10% of the size of the whole data pool, namely, 28454 and 3250 data points, respectively. It should be emphasized that for a given data set of IL-solute γ ∞ vs T, each data point was assigned to only one set (i.e. the data sets were not shared between {training + validation} and testing). Based on our preliminary examinations, we observed that if this step was not made, artificial downward bias of the MSE and other statistical measures would be observed (mainly, due to interpolation rather than extrapolation of the temperature dependency). In the cross-validation, the same number of k = 9 folds were applied regardless of the machine learning algorithm. Such value of k was fixed to set the size of validation subset approximately equal to testing subset.

32 ACS Paragon Plus Environment

Page 33 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Results and Discussion The Final Models The results of SWMLR-, FFANN- and LSSVM-based models development are given in Table 2, where the values of MSE of training, k-fold cross-validation and testing processes are listed. The models finally selected for further analyses are also highlighted in Table 2. Detailed results are briefly recapitulated in the remaining part of this subsection. The final models along with all the relevant data and the MATLAB subroutines allowing to reproduce the results presented and discussed herein are provided in the Supporting Information as a collection of mat- and m-files. Instructions how to handle these file are summarized in Appendix A. Table 2. Summary of the SWMLR-, FFANN- and LSSVM-Based Models Development Process (Training, Cross-Validation and Testing) in Terms of Mean Squared Error (MSE) Between Calculated and Experimental Values of Infinite Dilution Activity Coefficient, see eq (11). a Fold

SWMLR

FFANN (s = 19)

training validation

test

training validation

LSSVM test

training validation

test

1

0.204

0.201

0.222

0.035

0.041

0.056

0.041

0.050

0.053

2

0.202

0.221

0.221

0.032

0.038

0.053

0.042

0.050

0.053

3

0.203

0.208

0.220

0.030

0.035

0.058

0.029

0.038

0.049

4

0.204

0.197

0.220

0.032

0.034

0.069

0.045

0.047

0.055

5

0.204

0.204

0.216

0.029

0.031

0.065

0.034

0.037

0.053

6

0.204

0.201

0.219

0.029

0.036

0.058

0.047

0.056

0.056

7

0.201

0.222

0.221

0.030

0.044

0.055

0.046

0.058

0.057

8

0.203

0.212

0.220

0.028

0.036

0.068

0.033

0.041

0.050

9

0.202

0.213

0.221

0.036

0.046

0.068

0.031

0.042

0.051

mean

0.203

0.209

0.220

0.031

0.038

0.061

0.039

0.046

0.053

SD

0.001

0.009

0.002

0.003

0.005

0.006

0.007

0.008

0.003

33 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

a

Page 34 of 65

The optimum models are those with MSE values written in bold.

The SWMLR analysis of input/output data given in eq (19) resulted in the final linear model. Based on the selected statistical criteria for significance, dimensionality of the input vector was reduced from 396 to 287, so that more than 100 terms were moved out from the initial model. The stepwise model selection took 341 individual inclusion/exclusion iterations. As seen from Table 2, the mean values of MSE are similar regardless of the subset of the entire data pool. Based on a very small value of standard deviation of generalization error, we conclude that the proposed linear model is statistically valid. In particular, the model associated with fold no. 5 was selected as the final SWMLR model due to the lowest value of testing MSE. The history of the stepwise selection, the values of 288 accepted coefficients (including intercept) as well as more detailed data on statistics of the proposed model (t-statistics and corresponding p-values, standard errors of coefficients and ANOVA-related statistics) are provided in the Supporting Information as a mat-file named model_swmlr.mat. In the case of FFANN approach, the final model selection was based on an investigation of networks having different number of hidden layer nodes s. For the purpose of this study, s varied from 1 to 45 with an increment of 1. The results are depicted in Figure 4. In particular, Figure 4a presents a mean value of training/validation/testing MSE as a function of s. Please note that error bars representing the standard deviation of MSE (over all the k = 9 folds) are additionally plotted. It can be seen that as s increases, training and validation MSE systematically decreases. The difference between training and validation MSE tends to slightly increase for higher s. Testing MSE (i.e., the generalization error) exhibits dissimilar behavior. In fact, it follows the trends revealed in the case of training and validation, but only for low values of s ≤ 19. Starting from approximately s = 19, an increasing trend in MSE is observed. In general, this means that the FFANN-based models with s > 19 overfit the training data. Furthermore, for nets with s > 19 it seems that generalization error strongly depends on the selection of training/validation/testing data. This is reflected by a more noticeable scatter of testing MSE values, thus the higher values

34 ACS Paragon Plus Environment

Page 35 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

of MSE standard deviations. On the basis on this analysis, s = 19 was decided to be picked as the optimum number of hidden units for representing our database. Figure 4b shows the result of the optimization of weights and biases of the final network. In total, the model involves 1407 parameters what corresponds to approximately 18 (training) data samples per parameter. Such value allow to find the developed model as statistically sound. It can be noticed from Table 2 that, the values of MSE observed for FFANN models are one order of magnitude lower than corresponding values produced for SWMLR. This is of course related to much greater number of parameters employed by FFANN compared to SWMLR and hence the better flexibility of the former machine learning algorithm. Based on the comparison of the generalization errors, the model corresponding to the fold no. 2 was finally picked as the optimum one. The optimized weights and biases as well as all the relevant information associated with the final model are provided in the Supporting Information as a mat-file named model_ffann.mat.

35 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 36 of 65

Figure 4. FFANN model development based on the training and generalization (validation/testing) mean square error (MSE) between calculated and experimental values of infinite dilution activity coefficient defined in eq (11): (a) selection process for an optimum number of nodes in hidden layer (s); (b) training, validation and testing of the final model.

One can argue that LSSVM seems to be most accurate approach, when applied to treat the problem of γ ∞ extensive database representation with input/output defined in eq (18). This is confirmed by the lowest values of testing MSE among all listed in Table 2. The model corresponding to fold no. 3 was judged to be an optimum one. The final value of the regularization constant was γ = 9.6372 · 103 , whereas the RBF kernel parameter, see eq (17), was σ 2 = 41.6104. All the model relevant information (including input/output pre- and post-processing schemes and, most importantly, the support values) are provided in the mat-file model_lssvm.mat.

36 ACS Paragon Plus Environment

Page 37 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Overview of the Models’ Performance This section presents the performance of the finally accepted SWMLR-, FFANN- and LSSVMbased models. All the results are analyzed and discussed in terms of two statistical measures, namely, root mean square error (RMSE) and average absolute relative deviation (AARD) between calculated and experimental values of γ ∞ , defined as √ RMSE = MSE

(20a)

N

 100 X AARD = exp yˆi − yi − 1 N i=1

(20b)

MSE is obtained from eq (11), while the y symbol with and without “hat” corresponds to natural logarithm of calculated and experimental value of γ ∞ , cf. eqs (18) and (19). Summation in eq (20) goes over N selected data points (not only the whole database, but, for example, data for a given family of ILs or solutes and so on). Figure 5 demonstrates parity plots (calculated vs experimental) obtained for each model as the results of training, validation and testing. As seen, for each modeling technique the data points are distributed randomly along the diagonal regardless of the subset. However, the scatter observed for FFANN and LSSVM is significantly lower compared to SWMLR. This observation corresponds with the values of RMSE and AARD calculated over the training/validation/testing sets, see Table 3. In fact, RMSE values for SWMLR are more than twice higher than the values obtained for FFANN and LSSVM. In terms of AARD, FFANN and LSSVM turned out to be the most promising approaches as well, with similar overall average deviations of 12.8% and 12.0%, respectively (against 35.1% for SWMLR). LSSVM calculations generate, however, slightly lower testing error. To be more specific, sum of training and validation subset with 28454 LSSVM, FFANN and SWMLR calculations resulted in the lowest absolute error in 12756, 11500 and 4198 samples, respectively. In the case of testing set composed of 3250 data points the corresponding “scores” are as follows: 1406 for LSSVM, 1255 for FFANN and 589 for SWMLR. Therefore, LSSVM predominates the two remaining machine learning algorithms (especially SWMLR) in 37 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 38 of 65

terms of both fitting capability as well as generalization. Table 3. Summary of Calculations Using the Finally Proposed SWMLR-, FFANN- and LSSVM-Based Models for in Terms of Root Mean Squared Error (RMSE) and Average Absolute Relative Deviations (AARD) Between Calculated and Experimental Values of Infinite Dilution Activity Coefficient, see eq (20). Subset

SWMLR

FFANN

LSSVM

RMSE

AARD

RMSE

AARD

RMSE

AARD

0.453

35.2

0.181

12.4

0.174

11.6

validation 0.433

33.6

0.183

12.3

0.169

11.6

testing

0.465

35.6

0.23

16.5

0.222

15.5

overall

0.453

35.1

0.187

12.8

0.179

12.0

training

38 ACS Paragon Plus Environment

Page 39 of 65

39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

Journal of Chemical Information and Modeling

Figure 5. Parity plots of calculated vs experimental values of infinite dilution activity coefficient (γ ∞ ) resulting from the SWMLR-, FFANN- and LSSVM-based models finally proposed in this work: (a) training set; (b) validation set; (c) testing set.

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 40 of 65

More detailed insight into performance of the proposed model is shown graphically in Figure 6 and in tabular form in Tables 4 and 5. In particular, Figure 6 shows a box plot representing distribution of relative deviations between calculated and experimental values of γ ∞ . In Figure 6a the data are grouped into different cationic families of ILs. As can be easily noticed, all the distributions disclose symmetry around median values close to 0 irrespective of type of ILs. The widest bars covering from 25th to 75th percentile were obtained for SWMLR-based model, but in all cases the they are within range [−50, 50]. However, the “whiskers” displayed by this model (extending to the most extreme data points not considered as outliers, i.e. ≈ 99.3 coverage if the data are normally distributed) comprise much broader ranges of deviations, e.g. [−96, 113] for imidazolium ILs, [−91, 108] for ammonium ILs and [−87, 170] for phosphonium ILs. In turn, the distributions of FFANN- and LSSVM-calculated deviations are more sharp compared to SWMLR. In fact, all the boxes and whiskers are within ±10% and ±45%, respectively. The number of outliers also given in Figure 6 for each binary “model + IL family”. As seen, the outliers constitute only a small fraction of the entire data subsets as their percentages are of the order of several percent (except sulfonium ILs modeled with SWMLR). Distributions classified in regard to the solutes are depicted in Figure 6b. As seen, some distributions being outcomes of SWMLR calculations are centered at non-zero values, e.g. alkynes with median at +59%, Cl-F compounds with median at +53% or water with median at −33%. This means that the model tends to produce systematically biased results. The spread of deviation of SWMLR calculated activity coefficients varies from relatively narrow (like [−76, 60] for alkanes) to extremely wide (like [−76, 300] for Cl-F compounds). The respective calculations with FFANN and LSSVM resulted in much more accurate results.

40 ACS Paragon Plus Environment

Page 41 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Table 4. Summary of Calculations Using the Finally Proposed SWMLR-, FFANN- and LSSVM-Based Models for Different Cationic Families of Ionic Liquids in Terms of Root Mean Squared Error (RMSE) and Average Absolute Relative Deviations (AARD) Between Calculated and Experimental Values of Infinite Dilution Activity Coefficient, see eq (20). a IL Familyb

SWMLR

FFANN

LSSVM

RMSE

AARD

RMSE

AARD

RMSE

AARD

Im

0.512

41.4

0.215

15.2

0.205

14.0

Py

0.323

24.8

0.131

9.50

0.123

8.90

Quin

0.248

19.7

0.124

9.69

0.121

9.20

Pyr

0.332

25.5

0.137

9.68

0.142

Pip

0.265

20.0

0.0957

7.27

0.0928

Bicyc

0.318

22.9

0.209

Mo

0.327

25.3

0.137

N

0.507

40.3

0.224

15.0

0.212

13.5

P

0.630

53.9

0.201

14.5

0.181

12.6

S

0.244

19.5

0.0401

a

11.9 9.61

2.95

0.216

10.1 6.99 12.9

0.138

9.66

0.072

5.43

Detailed list of ILs belonging to each family is given in Tables S7 to S16 in the

Supporting Information. b

A key is given in Table 1.

Table 5. Summary of Calculations Using the Finally Proposed SWMLR-, FFANN- and LSSVM-Based Models for Different Families of Solutes in Terms of Root Mean Squared Error (RMSE) and Average Absolute Relative Deviations (AARD) between Calculated and Experimental Values of Infinite Dilution Activity Coefficient, see eq (20). a Solute Family

SWMLR

FFANN

LSSVM

RMSE

AARD

RMSE

AARD

RMSE

Alkanes

0.486

31.7

0.171

11.1

0.142

Cycloalkanes

0.447

27.7

0.211

13.9

0.192

41 ACS Paragon Plus Environment

AARD 8.52 11.9

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 42 of 65

Alkenes

0.379

28.8

0.157

10.6

0.143

9.56

Alkynes

0.535

64.6

0.145

10.3

0.139

9.53

Aromatics

0.37

23.6

0.139

0.121

7.94

Alcohols

0.443

34.9

0.17

12.1

0.162

12

Aldehydes

0.583

66.5

0.409

35.6

0.389

34.1

Ketones

0.357

29.7

0.194

14.9

0.18

13.5

Ethers

0.371

31.7

0.19

14.3

0.191

15.3

Esters

0.327

26.7

0.193

11.8

0.192

11.9

0.29

22.6

0.346

30

Cl, F compounds 0.748

102

9.54

Water

0.712

37.7

0.205

15.7

0.137

Others

0.538

38.3

0.238

16.4

0.255

a

8.16 18.2

Detailed list of compounds belonging to each family is given in Table S2 in the Supporting

Information.

42 ACS Paragon Plus Environment

Page 43 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 6. A box plot representing distributions of relative deviations between calculated and experimental values of infinite dilution activity coefficient (γ ∞ ): (a) classification with respect to cationic family of ILs (a key is given in Table 1); (b) classification with respect to solute family. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, while the whiskers extend to the most extreme data points not considered as outliers (i.e. approximately 99.3 coverage if the data are normally distributed). Number of outliers is shown above each bar/whisker (with the percentage given in parentheses).

Tables 4 and 5 present the results of calculations in terms of RMSE and AARD calculated for all cationic families of ILs as well as for all families of solutes, respectively. More detailed summaries taking into account the whole data pool division into training/validation/testing sets are given in Tables S3 and S4 in the Supporting Information. Furthermore, RMSE and AARD values calculated individually for each IL and each solute contained in the database are given in Tables S5 and S6 in the Supporting Information, respectively. Finally, Tables S7–S16 in the Supporting Information

43 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 44 of 65

present RMSE and AARD obtained for all 6651 IL-solute distinct binary systems (including each of 7342 data sets).

Infinite Dilution Selectivity Calculations Thus far, it was demonstrated that the proposed empirical equations are capable of reproducing the huge database of γ ∞ data. Based on these data, one can calculate some auxiliary properties useful from the utilitarian point of view, for instance infinite dilution selectivity for separation problem “solute 1 from solute 2”, defined as

∞ S12 =

γ2∞ γ1∞

(21)

A solvent (e.g. IL) is perceived as a good candidate for the separation (e.g. via extraction or ∞ extractive distillation), if ln S12 ≫ 1. Since more than 15 years eq (21) has been used to estimate a potential performance of ILs in separations (see more than 150 references in the Supporting Information). In particular, until now “aromatics from aliphatic fraction” and “sulfur compounds form gasoline” have been the most popular separation problem under study. In fact, almost all ∞ of at least one of these problems. the papers concerned with γ ∞ end with the analysis of S12

Therefore, we decided to check how the developed models handle the selectivity. Although this may be seen as trivial task, one should keep in mind that inaccuracies in γ ∞ predictions do not ∞ predictions due to error compensation. As case studies we selected have to yield in erroneous S12

two separation problems, namely “benzene (1) from n-hexane (2)” (problem A) and “thiophene (1) from n-heptane (2)” (problem B). First, we picked the ILs for which the values of γ ∞ for benzene/n-hexane and thiophene/nheptane binaries could be found in the database. It should be emphasized that the data are ∞ typically reported at different temperatures, so that one can suggest to calculate the values of S12

at a fixed reference temperature. Indeed, given γ ∞ vs T data set can be nicely regressed by using approximately linear relationship between ln γ ∞ and 1/T and then the activity coefficient can be

44 ACS Paragon Plus Environment

Page 45 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

interpolated or extrapolated. However, a great care must be taken in order to avoid uncertainties related with the extrapolation. Since the interpolation is more preferable, the final values of γ ∞ used in eq (21) were calculated at T determined as the middle point of an intersect of respective temperature ranges of data available for solutes 1 and 2. Following this process, experimental infinite dilution selectivity for 142 and 94 ILs (from among all 188 ILs) was calculated for separation problems A and B, respectively. The results of calculations are shown in Figure 7 in the form of parity plots. Detailed list of both experimental and calculated values is provided in the Supporting Information, Tables S17 and S18. In general, symmetric scatter of the calculated vs experimental selectivities is observed except SWMLR results which tends to underestimate the selectivity in both of the investigated problems. Again, the quality of calculations deteriorates following the order LLSVM ' FFANN > SWMLR regardless of the separation problem considered. For the problem A, the final AARD ∞ were 28.4%, 16.0% and 12.5% for SWMLR, between predicted and experimental values of S12

FFANN and LSSVM, respectively. The values of the separation problem B are similar, namely, 24.6%, 13.4% and 9.8%.

45 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 46 of 65

Comparison with TDGC-LSER Method The results presented and discussed as yet have proved that, in terms of total RMSE or AARD, the SWMLR model provides as good description as the state-of-the-art TDGC-LSER. 35 As a matter of fact, on the basis of the comparison of the overall statistical measures only one could erroneously state that there is not any improvement in the SWMLR model compared with TDGC-LSER. However, it should be noted that the model of Mutelet and co-workers 35 was developed based on much less number of training data (6990 samples) than in was done in this work (28454 samples). The number of data points per parameter differs significantly as well, namely, 28454/288 ≈ 100 for the new SWMLR and 6990/132 ≈ 53 for the TDGC-LSER. Furthermore, Mutelet and co-workers did not make any statistical assessment of their model. Based on standard errors of the regressors listed in Table 3 in reference 35, it is clearly seen that some terms have not an explanatory power and thus they should have been rejected. On the basis of these arguments we can finally deduce that indeed the linear model proposed in this work offers basically the same accuracy for reproducibility of γ ∞ , but it is more significant in terms of statistics. Besides, the range of the potential applications of the new model is more comprehensive, because greater number of possible cations and anions can be built of the groups defined in this work. In particular, several groups not considered in the paper of Mutelet et al. 35 were introduced (see Table S1 in the Supporting Information). In order to elucidate the differences in the TDGC-LSER and the models proposed in this work more clearly, the ILs that can be represented by all the models were selected and all the experimental data available for them were extracted from the database. As a result a representative data collection of 2239 data sets comprising 63 distinct ILs (including: 29 imidazolium, 7 pyrididnium, 7 pyrrolidinium, 6 piperidinium, 9 ammonium and 5 phosphonium ILs) and 30 solutes was obtained. Then, the values of γ ∞ and the respective RMSE/AARD values for all the models established in this paper and the TDGC-LSER were calculated. Detailed information on statistical measures obtained for each IL can be found in Table S19 in the Supporting Information. The aggregated results given in the form of overall AARD calculated over different families of ILs are briefly illustrated in Figure 8 and discussed in the following paragraph. 46 ACS Paragon Plus Environment

Page 47 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

∞ ), Figure 7. Parity plots for of calculated vs experimental values of infinite dilution selectivity (S12 defined in eq (21), obtained by using the SWMLR-, FFANN- and LSSVM-based models finally proposed in this work: (a) benzene from n-hexane; (b) thiophene from n-heptane.

47 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 48 of 65

Figure 8. Comparison of the SWMLR-, FFANN- and LSSVM-based models finally proposed in this work with the TDGC-LSER approach, 35 expressed in terms of average absolute relative deviations (AARD) between calculated and experimental values of γ ∞ for different families of ILs (a key is given in Table 1). First of all, it should be emphasized that, as expected, the reference data are represented by FFANN and LLSVM with the lowest RMSE and AARD, regardless of the type of ILs. In general, it is seen in Figure 8 that in almost all the cases the approaches proposed in this work display AARD values lower than TDGC-LSER. The exceptions are imidazolium ILs, whose AARD resulted from SWMLR is noticeably higher (45.0% vs 29.0% vs for the TDGC-LSER), but also pyrrolidinium and piperidinium ILs, nevertheless with only slightly higher AARD. At first sight one can notice that TDGC-LSER predictions are very poor in the case of ILs based on ammonium and phosphonium cations. In fact, the values of AARD for ILs belonging to this families are extremely high and reach more than 1000% for 5 from among 9 ILs under study, see Table S19 in the Supporting Information. This particularly pertains to the ILs not included in the training data used by Mutelet et al., 35 but (surprisingly) also these used to fit the model parameters. For example, for [N4111][NTf2 ] included in the training set of TDGC-LSER one gets the average abolute relative deviation > 2500% (see Table S19 in the Supporting Information). Interestingly, Mutelet et al. 35 reported AARD of 31.69% for this ILs. Summing up, TDGC-LSER model does not capture an impact of ammonium cation branching on γ ∞ properly. Similar observations were made for phosphonium ILs. In this case, however, the deviations are relatively lower (although still big). AARD varies from 26% for [P14,666 ][NTf2 ] (the only phosphonium IL studied within TDGC-LSER; note that Mutelet et al. 35 reported approximate 48 ACS Paragon Plus Environment

Page 49 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

value, namely, 29.28%) to 2000% for [P6664][PF6 ].

49 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 50 of 65

Conclusions In this an effort was made to address the following shortcomings of currently available empirical models for calculating activity coefficients of molecular solutes in their infinite diluted mixtures with ILs: (1) limited applicability domain (this particularly refers to IL-specific QSPR models); (2) temperature independence; (3) not always easily accessible molecular descriptors (e.g. topological or quantum chemical ones); (4) rare and scarce data set used in training process; (5) insufficient flexibility of linear models and obscure (or none) tests of statistical significance and explanatory power of their terms. In order to overcome these difficulties, three machine learning algorithms were applied to reproduce an extensive compilation of temperature-dependent γ ∞ data in terms of functional groups making up ions and readily available Abraham molecular descriptors of molecular solutes. The linear model developed with SWMLR approach yields in basically the same overall accuracy, when confronted with the state-of-the-art TDGC-LSER equation. 35 In spite of the similarities between both of the models (first of all, the same type of explanatory variables) it is emphasized that the new model has been preselected and finally accepted in a systematic stepwise scheme based on precisely defined criteria. Moreover, external validation applied to examine predictive capacity of the model showed that generalization errors it exhibits are approximately equal to the training errors. Hence, when there is an interest in simple linear models, then the SWMLR correlation from this paper should be perceived as favored. Furthermore, the results of this study demonstrated that a simple linear relationship between ln γ ∞ , 1/T and group contributions is not enough to reproduce the whole current state of the data within AARD better than ±35%. In order to improve this value, one has to apply more sophisticated machine learning algorithms like neural networks or support vector machines, in particular FFANN and LSSVM. The latter algorithm and its implementation presented herein have turned out to disclose a slightly enhanced accuracy, including training as well as generalization. Therefore, the LSSVM-based model was finally found as the ultimately recommended tool for γ ∞ calculation in an entirely in silico mode. 50 ACS Paragon Plus Environment

Page 51 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Cross-validation allowed to confirm that the finally established models are not overfitted and thus they should return stable results. The distribution of relative deviations between calculated and experimental activity coefficients evidenced that reasonable predictions are highly probable regardless of the family of ILs or solutes. Following some general rules of a good practice of applications of empirical models, it should be underlined that the truly reliable results are expected mostly for systems involving ILs similar to these from the training (or testing) data. Nevertheless, the proposed models can be applied in screening of numerous combinations of cations and anions to look for an IL interesting from standpoint of an application under consideration. Of course, the researchers working in the field of ILs are strongly encouraged to try to conduct such studies. Finally, this is expected that the reported correlations will be updated and extended as soon as remarkable amount of the new γ ∞ data will be uploaded to open literature sources.

51 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 52 of 65

Appendix A Besides Tables S1 to S19, an extra relevant information related to the calculations shown in this work are provided in the Supporting Information as a bundle of MATLAB files. These files make up the simple program called idac, which enables the reader of this paper to use the models proposed in a convenient fashion with MATLAB. In this work, version 2014a of MATLAB was used and thus this version or newer is recommended to perform calculations with idac. Furthermore, the code was carefully designed so that it can be run with the freeware GNU Octave as well. In order to display a short tutorial how to use the program enter the command help idac after copying the file idac.m and all the mat-files in the current working directory. Summary of the content of the files is as follows: idac_auxdata.mat: groups, a cell array with the names of the functional groups (valid names are listed in the Table S1 in the Supporting Information); accgrind, a logical vector specifying which groups are used in the finally accepted models; solutes, a structure of solutes information, including the name (field name, see Table S2 in the Supporting Information) and the Abraham descriptors (field AbrahamDesc). idac_swmlr.mat: InModel, a logical vector specifying which terms are in the final model; StepRegHistory, a structure representing a trace of stepwise linear regression, including the values of coefficients for the models tested at each step and the corresponding values of RMSE; Stats, a structure of additional statistics, whose the most important fields are: B, coefficients for terms in final model, with values for a term not in the model set to the value that would be obtained by adding that term to the model, see β in eq (7); intercept, estimated intercept; SE, standard errors for coefficient estimates; TSTAT, t-statistics for coefficient estimates; PVAL, p-values for coefficient estimates (for a complete list of fields and their meaning see the Statistics Toolbox documentation 46). idac_ffann.mat: W1, W2, b1 and b2, the matrices of weights and biases W1 , W2 , b1 and b2 , see eq (10); X_range and y_range, ranges of input/output data used in data input scaling and output postprocessing; net_final and tr_final, object of class net with the final FFANN model and a trace of 52 ACS Paragon Plus Environment

Page 53 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

its training (optional, not required to perform calculations; for a complete list of the functionalities see Neural Networks Toolbox documentation 47). idac_lssvm.mat: model, a structure of all the data associated with the final LSSVM model, whose the most important fields are: nb_data, number of training data; alpha, Lagrange multipliers, αi with i = 1, . . ., N, see eq (16); gam, the value of the regularization constant γ, see eq (15); kernel_pars, kernel function parameter σ 2 , see eq (17); (a complete list of fields can be found in the LSSVMlab Toolbox documentation 62). idac.m: the main function of the program processing ASCII input file into the values of γ ∞ (an exemplary input is provided in the file sample.txt); contains four auxiliary functions allowing to transform the input file into X matrix, see eqs (18a) and (19a), as well as to perform calculations with the finally proposed SWMLR-, FFANN- and LSSVM-based models.

53 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 54 of 65

Acknowledgement Mr. Marcin Durski, Eng., is greatly acknowledged by fruitful discussions regarding application of artificial neural networks, but also by his technical assistance in creating the database, the data revision and evaluation. Funding for this research was provided by the Ministry of Science and Higher Education in the years 2015–2016 within the framework of the project “Iuventus Plus” no. 0347/IP3/2015/73.

Supporting Information Available (1) The PDF file containing several tables recapitulating the finally accepted models in detail: Table S1, the list of functional groups; Table S2, the list of the Abraham descriptors of molecular solutes; Table S3, the results of training/validation/testing for different families of ILs; Table S4, the results of training/validation/testing for different families of solutes; Table S5, results of calculations for distinct ILs; Table S6, results of calculations for distinct solutes; Tables S7–S16, results of calculations for binary IL-solutes systems for each cationic family of ILs; Table S17, infinite dilution selectivity for benzene/n-hexane separation problem; Table S18, infinite dilution selectivity for thiophene/n-heptane separation problem; Table S19, comparison of SWMLR/FFANN/LSSVM models with TDGC-LSER. (2) MATLAB files idac.m, sample.txt, idac_auxdata.mat, idac_swmlr.mat, idac_ffann.mat, idac_lssvm.mat compressed into a single ZIP archive. material is available free of charge via the Internet at http://pubs.acs.org/.

54 ACS Paragon Plus Environment

This

Page 55 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

References (1) Freemantle, M. An Introduction to Ionic Liquids; RSC Publishing: Cambridge, UK, 2009. (2) Plechkova, N. V.; Seddon, K. R. Applications of Ionic Liquids in the Chemical Industry. Chem. Soc. Rev. 2008, 37, 123–150. (3) Hallet, J. P.; Welton, T. Room-Temperature Ionic Liquids: Solvents for Synthesis and Catalysis. 2. Chem. Rev. 2011, 111, 3508–3576. (4) Pereiro, A.; Araújo, J.; Oliveira, F.; Esperança, J.; Canongia Lopes, J.; Marrucho, I.; Rebelo, L. Solubility of Inorganic Salts in Pure Ionic Liquids. J. Chem. Thermodyn. 2012, 55, 29–36. (5) Lei, Z.; Dai, C.; Chen, B. Gas Solubility in Ionic Liquids. Chem. Rev. 2014, 114, 1289–1326. (6) Mizuuchi, H.; Jaitely, V.; Murdan, S.; Florence, A. T. Room Temperature Ionic Liquids and Their Mixtures: Potential Pharmaceutical Solvents. Eur. J. Pharm. Sci. 2008, 33, 326–331. (7) Passos, H.; Freire, M. G.; Coutinho, J. A. P. Ionic Liquid Solutions as Extractive Solvents for Value-Added Compounds from Biomass. Green Chem. 2014, 16, 4786–4815. (8) Zakrzewska, M. E.; Bogel-Łukasik, E.; Bogel-Łukasik, R. Solubility of Carbohydrates in Ionic Liquids. Energy Fuels 2010, 24, 737–735. (9) Perez De Los Rios, A., Hernandez Fernandez, F. J., Eds. Ionic Liquids in Separation Technology; Elsevier: Amsterdam, 2014. (10) Pereiro, A. B.; Araújo, J. M. M.; Esperança, J. M. S. S.; Marrucho, I. M.; Rebelo, L. P. N. Ionic Liquids in Separations of Azeotropic Systems – A Review. J. Chem. Thermodyn. 2012, 46, 2–28. (11) Anantharaj, R.; Banerjee, T. COSMO-RS Based Predictions for the Desulphurization of Diesel Oil Using Ionic Liquids: Effect of Cation and Anion Combination. Fuel Process. Technol. 2011, 92, 39–52. 55 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 56 of 65

(12) Gutiérrez, J. P.; Meindersma, G. W.; de Haan, A. B. COSMO-RS-Based Ionic-Liquid Selection for Extractive Distillation Processes. Ind. Eng. Chem. Res. 2012, 51, 11518–11529. (13) ILThermo v. 2.0, NIST Standard Reference Database No. 147. Published online, 2016; URL: http://ilthermo.boulder.nist.gov/. Accessed on March 15, 2016. (14) Paduszyński, K.; Domańska, U. Viscosity of Ionic Liquids: An Extensive Database and a New Group Contribution Model Based on a Feed-Forward Artificial Neural Network. J. Chem. Inf. Model. 2014, 54, 1311–1324. (15) Maginn, E. J. Molecular Simulation of Ionic Liquids: Current Status and Future Opportunities. J. Phys. Condens. Matter 2009, 21, 373101. (16) Vega, L. F.; Vilaseca, O.; Llovell, F.; Andreu, J. S. Modeling Ionic Liquids and the Solubility of Gases in Them: Recent Advances and Perspectives. Fluid Phase Equilib. 2010, 294, 15–30. (17) Coutinho, J. A. P.; Carvalho, P. J.; Oliveira, N. M. C. Predictive Methods for the Estimation of Thermophysical Properties of Ionic Liquids. RSC Adv. 2012, 2, 7322. (18) Maia, F. M.; Tsivintzelis, I.; Rodriguez, O.; Macedo, E. A.; Kontogeorgis, G. M. Equation of state modelling of systems with ionic liquids: Literature review and application with the Cubic Plus Association (CPA) model. Fluid Phase Equilib. 2012, 332, 128–143. (19) Das, R. N.; Roy, K. Advances in QSPR/QSTR Models of Ionic Liquids for the Design of Greener Solvents of the Future. Mol. Divers. 2013, 17, 151–196. (20) Diedenhofen, M.; Eckert, F.; Klamt, A. Prediction of Infinite Dilution Activity Coefficients of Organic Compounds in Ionic Liquids Using COSMO-RS. J. Chem. Eng. Data 2003, 48, 475–479. (21) Kurnia, K. A.; Pinho, S. P.; Coutinho, J. A. P. Designing Ionic Liquids for Absorptive Cooling. Green Chem. 2014, 16, 3741.

56 ACS Paragon Plus Environment

Page 57 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(22) Khan, I.; Kurnia, K. A.; Mutelet, F.; Pinho, S. P.; Coutinho, J. A. P. Probing the Interactions between Ionic Liquids and Water: Experimental and Quantum Chemical Approach. J. Phys. Chem. B 2014, 118, 1848–1860. (23) Kurnia, K. A.; Pinho, S. P.; Coutinho, J. A. P. Evaluation of the Conductor-like Screening Model for Real Solvents for the Prediction of the Water Activity Coefficient at Infinite Dilution in Ionic Liquids. Ind. Eng. Chem. Res. 2014, 53, 12466–12475. (24) Matheswaran, P.; Wilfred, C. D.; Kurnia, K. A.; Ramli, A. Overview of Activity Coefficient of Thiophene at Infinite Dilution in Ionic Liquids and their Modeling Using COSMO-RS. Ind. Eng. Chem. Res. 2016, 55, 788–797. (25) Eike, D. M.; Brennecke, J. F.; Maginn, E. J. Predicting Infinite-Dilution Activity Coefficients of Organic Solutes in Ionic Liquids. Ind. Eng. Chem. Res. 2004, 43, 1039–1048. (26) Tämm, K.; Burk, P. QSPR analysis for infinite dilution activity coefficients of organic compounds. J. Mol. Model. 2006, 12, 417–421. (27) Mintz, C.; Acree, W. E. Partition Coefficient Correlations for Transfer of Solutes from Gas Phase and Water to Room Temperature Ionic Liquids. Phys. Chem. Liq. 2007, 45, 241–249. (28) Ge, M.; Xiong, J.; Wang, L. Theoretical Prediction for the Infinite Dilution Activity Coefficients of Organic Compounds in Ionic Liquids. Chin. Sci. Bull. 2009, 54, 2225–2229. (29) Xi, L.; Sun, H.; Li, J.; Liu, H.; Yao, X.; Gramatica, P. Prediction of Infinite-Dilution Activity Coefficients of Organic Solutes in Ionic Liquids Using Temperature-Dependent Quantitative Structure–Property Relationship Method. Chem. Eng. J. 2010, 163, 195–201. (30) Sprunger, L.; Clark, M.; Acree Jr., W. E.; Abraham, M. H. Characterization of RoomTemperature Ionic Liquids by the Abraham Model with Cation-Specific and Anion-Specific Equation Coefficients. J. Chem. Inf. Model. 2007, 47, 1123–1129.

57 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 58 of 65

(31) Sprunger, L.; Proctor, A.; Acree Jr., W. E.; Abraham, M. H. LFER Correlations for Room Temperature Ionic Liquids: Separation of Equation Coefficients into Individual Cation-Specific and Anion-Specific Contributions. Fluid Phase Equilib. 2007, 265, 104–111. (32) Sprunger, L. M.; Gibbs, J.; Proctor, A.; Acree Jr., W. E.; Abraham, M. H.; Meng, C., Y.; Yao; Anderson, J. L. Linear Free Energy Relationship Correlations for Room Temperature Ionic Liquids: Revised Cation-Specific and Anion-Specific Equation Coefficients for Predictive Applications Covering a Much Larger Area of Chemical Space. Ind. Eng. Chem. Res. 2007, 48, 4145–4154. (33) Revelli, A.-L.; Mutelet, F.; Jaubert, J.-N. Prediction of Partition Coefficients of Organic Compounds in Ionic Liquids: Use of a Linear Solvation Energy Relationship with Parameters Calculated through a Group Contribution Method. Ind. Eng. Chem. Res. 2010, 49, 3883–3892. (34) Grubbs, L. M.; Ye, S.; Saifullah, M.; McMillan-Wiggins, M. C.; Acree, W. E.; Abraham, M. H.; Twu, P.; Anderson, J. L. Correlations for Describing Gas-to-Ionic Liquid Partitioning at 323K Based on Ion-Specific Equation Coefficient and Group Contribution Versions of the Abraham Model. Fluid Phase Equilib. 2011, 301, 257–266. (35) Mutelet, F.; Ortega-Villa, V.; Moïse, J.-C.; Jaubert, J.-N.; Acree, W. E. Prediction of Partition Coefficients of Organic Compounds in Ionic Liquids Using a Temperature-Dependent Linear Solvation Energy Relationship with Parameters Calculated through a Group Contribution Method. J. Chem. Eng. Data 2011, 56, 3598–3606. (36) Nami, F.; Deyhimi, F. Prediction of Activity Coefficients at Infinite Dilution for Organic Solutes in Ionic Liquids by Artificial Neural Network. J. Chem. Thermodyn. 2011, 43, 22–27. (37) Kato, R.; Gmehling, J. Systems with Ionic Liquids: Measurments of VLE and γ ∞ Data Prediction of Their Thermodynamic Behavior Using Original UNIFAC, Mod. UNIFAC (Do) and COSMO-RS (Ol). J. Chem. Thermodyn. 2005, 37, 603–619.

58 ACS Paragon Plus Environment

Page 59 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(38) Abraham, M. H. Scales of Solute Hydrogen-Bonding: Their Construction and Application to Physicochemical and Biochemical Processes. Chem. Soc. Rev. 1993, 22, 73. (39) Grubbs, L. M.; Saifullah, M.; De La Rosa, N. E.; Acree Jr., W. E.; Abraham, M. H.; Zhao, Q.; Anderson, J. L. Cation-Specific and Anion-Specific Abraham Model Correlations for Solute Transfer into Ionic Liquid Solvents. Global J. Phys. Chem. 2010, 1, 1–19. (40) Stephens, T. W.; Chou, V.; Quay, A. N.; Shen, C.; Dabadge, N.; Tian, A.; Loera, M.; Willis, B.; Wilson, A.; Acree Jr., W. E.; Twu, P.; Anderson, J. L.; Abraham, M. H. Thermochemical Investigations of Solute Transfer into Ionic Liquid Solvents: Updated Abraham Model Equation Coefficients for Solute Activity Coefficient and Partition Coefficient Predictions. Phys. Chem. Liq. 2014, 52, 488–518. (41) Stephens, T. W.; Hart, E.; Kuprasertkul, N.; Metha, S.; Wadawadigi, A.; Acree Jr., W. E.; Abraham, M. H. Abraham Model Correlations for Describing Solute Transfer into Ionic Liquid Solvents: Calculation of Ion-Specific Equation Coefficients for the 4,5-Dicyano-2(trifluoromethyl)imidazolide Anion. Phys. Chem. Liq. 2014, 52, 777–791. (42) Mutelet, F.; Alonso, D.; Stephens, T. W.; Acree Jr., W. E.; Baker, G. A. Infinite Dilution Activity Coefficients of Solutes Dissolved in Two Trihexyl(tetradecyl)phosphonium Ionic Liquids. J. Chem. Eng. Data 2014, 59, 1877–1885. (43) Ayad, A.; Mutelet, F.; Negadi, A.; Acree Jr., W. E.; Jiang, B.; Lu, A.; Wagle, D. V.; Baker, G. A. Activity Coefficients at Infinite Dilution for Organic Solutes Dissolved in Two 1-Alkylquinuclidinium Bis(trifluoromethylsulfonyl)imides Bearing Alkyl Side Chains of Six and Eight Carbons. J. Mol. Liq. 2016, 215, 176–184. (44) Pirhadi, S.; Shiri, F.; Ghasemi, J. B. Multivariative Statistical Analysis Methods in QSAR. RSC Adv. 2015, 5, 104635–104665. (45) Draper, N. R., Smith, H., Eds. Applied Regression Analysis; John Wiley& Sons: New York, 2014. 59 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 60 of 65

(46) Statistics and Machine Learning Toolbox Documentation; The MathWorks, Inc., 2016; Published online: http://www.mathworks.com/help/stats/. Accessed on March 15, 2016. (47) Neural Network Toolbox Documentation; The MathWorks, Inc., 2016; Published online: http://www.mathworks.com/help/nnet/. Accessed on March 15, 2016. (48) Livingstone, D. J., Ed. Artificial Neural Networks: Methods and Applications; Humana Press: Totowa, NJ, 2008. (49) Haykin, S. O. Neural Networks and Learning Machines; Prentice Hall: New York, 2009. (50) Maltarollo, V. C.; Honório,; Ferreira da Silva, A. B. Applications of Artificial Neural Networks in Chemical Problems; Artificial Neural Networks – Architectures and Applications; InTech, 2013; Chapter 10, pp 203–223, Available from: http://www.intechopen. com/books/artificial-neural-networks-architectures-and-applications/ applications-of-artificial-neural-networks-in-chemical-problems.

Ac-

cessed on March 15, 2016. (51) Ferrari, S.; Stengel, R. Smooth Function Approximation Using Neural Networks. IEEE Trans. Neural Netw. 2005, 16, 24–38. (52) Manallack, D. T.; Livingstone, D. J. Neural Networks in Drug Discovery: Have They Lived up to Their Promise? Eur. J. Med. Chem. 1999, 34, 195–208. (53) Levenberg, K. A Method for the Solution of Certain Problems in Least Squares. Q. Appl. Math. 1944, 2, 164–168. (54) Marquardt, D. An Algorithm for Least-Squares Estimation of Nonlinear Parameters. SIAM J. Appl. Math. 1963, 11, 431–441. (55) Nguyen, D.; Widrow, B. Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights. Proceedings of the International Joint Conference on Neural Networks. 2005; pp 21–26. 60 ACS Paragon Plus Environment

Page 61 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(56) Vapnik, V., Ed. The Nature of Statistical Learning Theory; Spinger-Verlag: New York, 1995. (57) Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learninig Algorithms; Cambridge University Press: Cambridge, New York, 2000. (58) Wang, H.; Hu, D. Comparison of SVM and LS-SVM for Regression. Neural Networks and Brain, 2005. ICNN B ’05. International Conference on. 2005; pp 279–283. (59) Liu, H.; Yao, X.; Zhang, R.; Liu, M.; Hu, Z.; Fan, B. Accurate Quantitative Structure-Property Relationship Model To Predict the Solubility of C 60 in Various Solvents Based on a Novel Approach Using a Least-Squares Support Vector Machine. J. Phys. Chem. B 2005, 109, 20565–20571. (60) Suykens, J. A. K.; Vandewalle, J. Least Squares Support Vector Machine Classifiers. Neural Process. Lett. 1999, 9, 293–300. (61) Suykens, J. A. K.; Van Gestel, T.; De Brabanter, J.; De Moor, B.; Vandewalle, J. Least Squares Support Vector Machines; World Scientific Publishing: River Edge, NJ, 2002. (62) LSSVMlab. Published online. Accessed on March 15, 2016, 2014; URL: http://www. esat.kuleuven.be/sista/lssvmlab/. (63) Mohammadi, A. H.; Eslamimanesh, A.; Richon, D.; Gharagheizi, F.; Yazdizadeh, M.; Javanmardi, J.; Hashemi, H.; Zarifi, M.; Babaee, S. Gas Hydrate Phase Equilibrium in Porous Media: Mathematical Modeling and Correlation. Ind. Eng. Chem. Res. 2012, 51, 1062–1072. (64) Kamari, A.; Khaksar-Manshad, A.; Gharagheizi, F.; Mohammadi, A. H.; Ashoori, S. Robust Model for the Determination of Wax Deposition in Oil Systems. Ind. Eng. Chem. Res. 2013, 52, 15664–15672. (65) Fayazi, A.; Arabloo, M.; Shokrollahi, A.; Zargari, M. H.; Ghazanfari, M. H. State-of-the-Art

61 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 62 of 65

Least Square Support Vector Machine Application for Accurate Determination of Natural Gas Viscosity. Ind. Eng. Chem. Res. 2014, 53, 945–958. (66) Gharagheizi, F.; Sattari, M.; Ilani-Kashkouli, P.; Mohammadi, A. H.; Ramjugernath, D.; Richon, D. A “Non-Linear” Quantitative Structure–Property Relationship for the Prediction of Electrical Conductivity of Ionic Liquids. Chem. Eng. Sci. 2013, 101, 478–485. (67) Gharagheizi, F.; Ilani-Kashkouli, P.; Sattari, M.; Mohammadi, A. H.; Ramjugernath, D.; Richon, D. Development of a LSSVM-GC Model for Estimating the Electrical Conductivity of Ionic Liquids. Chem. Eng. Res. Des. 2014, 92, 66–79. (68) Nelder, J. A.; Mead, R. A Simplex Method for Function Minimization. Comput. J. 1965, 7, 308–313. (69) Xavier-de Souza, S.; Suykens, J.; Vandewalle, J.; Bolle, D. Coupled Simulated Annealing. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2010, 40, 320–335. (70) Everett, D. H. Effect of Gas Imperfection on G.L.C. Measurements: A Refined Method for Determining Activity Coefficients and Second Virial Coefficients. Trans. Faraday Soc. 1965, 61, 1637–1645. (71) Cruickshank, A. J. B.; Gainey, B. W.; Hicks, C. P.; Letcher, T. M.; Moody, R. W.; Young, C. L. Gas-Liquid Chromatographic Determination of Cross-Term Second Virial Coefficients Using Glycerol. Benzene + Nitrogen and Benzene + Carbon Dioxide at 50 ◦ C. Trans. Faraday Soc. 1969, 65, 1014–1031. (72) Deng, L.; Wang, Q.; Chen, Y.; Zhang, Z.; Tang, J. Determination of the Solubility Parameter of Ionic Liquid 1-Octyl-3-Methylimidazolium Hexafluorophosphate by Inverse Gas Chromatography. J. Mol. Liq. 2013, 187, 246–251. (73) Li, X.; Wang, Q.; Li, L.; Deng, L.; Zhang, Z.; Tian, L. Determination of the Thermody-

62 ACS Paragon Plus Environment

Page 63 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

namic Parameters of Ionic Liquid 1-Hexyl-3-Methylimidazolium Chloride by Inverse Gas Chromatography. J. Mol. Liq. 2014, 200, 139–144. (74) Bahlmann, M.; Nebig, S.; Gmehling, J. Activity Coefficients at Infinite Dilution of Alkanes and Alkenes in 1-Alkyl-3-Methylimidazolium Tetrafluoroborate. Fluid Phase Equilib. 2009, 282, 113–116. (75) Foco, G. M.; Bottini, S. B.; Quezada, N.; de la Fuente, J. C.; Peters, C. J. Activity Coefficients at Infinite Dilution in 1-Alkyl-3-methylimidazolium Tetrafluoroborate Ionic Liquids. J. Chem. Eng. Data 2006, 51, 1088–1091. (76) Revelli, A.-L.; Mutelet, F.; Turmine, M.; Solimando, R.; Jaubert, J.-N. Activity Coefficients at Infinite Dilution of Organic Compounds in 1-Butyl-3-Methylimidazolium Tetrafluoroborate Using Inverse Gas Chromatography. J. Chem. Eng. Data 2008, 54, 90–101. (77) Zhang, J.; Zhang, Q.; Qiao, B.; Deng, Y. Solubilities of the Gaseous and Liquid Solutes and Their Thermodynamics of Solubilization in the Novel Room-Temperature Ionic Liquids at Infinite Dilution by Gas Chromatography. J. Chem. Eng. Data 2007, 52, 2277–2283. (78) Zhou, Q.; Wang, L.-S.; Wu, J.-S.; Li, M.-Y. Activity Coefficients at Infinite Dilution of Polar Solutes in 1-Butyl-3-methylimidazolium Tetrafluoroborate Using Gas-Liquid Chromatography. J. Chem. Eng. Data 2007, 52, 131–134. (79) Liebert, V.; Nebig, S.; Gmehling, J. Experimental and Predicted Phase Equilibria and Excess Properties for Systems with Ionic Liquids. Fluid Phase Equilib. 2008, 268, 14–20. (80) Hector, T.; Uhlig, L.; Gmehling, J. Prediction of Different Thermodynamic Properties for Systems of Alcohols and Sulfate-based Anion Ionic Liquids Using Modified UNIFAC. Fluid Phase Equilib. 2013, 338, 135–140. (81) Sumartschenkowa, I. A.; Verevkin, S. P.; Vasiltsova, T. V.; Bich, E.; Heintz, A.; Shevelyova, M. P.; Kabo, G. J. Experimental Study of Thermodynamic Properties of Mixtures 63 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 64 of 65

Containing Ionic Liquid 1-Ethyl-3-methylimidazolium Ethyl Sulfate Using Gas-Liquid Chromatography and Transpiration Method. J. Chem. Eng. Data 2006, 51, 2138–2144. (82) Paduszyński, K.; Domańska, U. A New Group Contribution Method For Prediction of Density of Pure Ionic Liquids over a Wide Range of Temperature and Pressure. Ind. Eng. Chem. Res. 2012, 51, 591–604.

64 ACS Paragon Plus Environment

Page 65 of 65

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Graphical TOC Entry

65 ACS Paragon Plus Environment