Melting Temperature Estimation of Imidazole Ionic Liquids with Clustering Methods
Jorge Alberto Cerecedo-Cordoba, Juan Javier González Barbosa, Juan Frausto Solís,* and Nohra Violeta Gallardo-Rivas
Tecnológico Nacional de México/Instituto Tecnológico de Ciudad Madero, Avenida Primero de Mayo, 89440, Ciudad Madero, Tamaulipas, México
Received: March 7, 2019. Published: June 4, 2019.
ABSTRACT: Ionic liquids (ILs) are ionic compounds with low melting points that can be designed for an extensive set of commercial and industrial applications. However, the design of ILs is limited by the quantity and quality of the data available in the literature; the estimation of the physicochemical properties of ILs by computational methods is therefore a promising way of addressing this problem, since it provides approximations of the real values, resulting in savings of both time and money. We studied two data sets of 281 and 134 ionic liquids based on the imidazole molecule and analyzed them with QSPR techniques. This paper presents a software architecture that uses clustering techniques to improve the robustness of models for estimating the melting point of ILs. The results indicate an error of 6.25% on the previously unmodeled data set and an error of 4.43% on the second data set, an improvement of 1.81% over the best results previously reported for the latter.
INTRODUCTION
Over the last few decades, ionic liquids (ILs) have gained enormous relevance due to their flexible design and various applications.1 ILs can be used as environmentally benign solvents and as substitutes for volatile organic solvents. ILs are salts that are liquid at room temperature; another common definition establishes ILs as salts with melting temperatures lower than 100 °C.2 This sets them clearly apart from conventional salts, which have much higher melting points; NaCl, for example, melts at approximately 801 °C. Additionally, ILs can remain liquid over a wide temperature range and have a negligible vapor pressure. There are currently around 1000 ILs documented in the literature, with approximately 300 commercially available. The design of purpose-specific ILs is possible through a detailed calibration of the anions and cations. However, the design and synthesis of ILs can be time-consuming and costly. Computer-aided methods can produce substantial savings in resources in the design of ILs.3 Nevertheless, the efficiency of these methods is directly linked to the accuracy of the estimation models of the physicochemical properties used. For this reason, it is essential to analyze quantitative structure−property relationships (QSPR), which establish the relationships between molecular structures and the desired properties. The estimation of the melting point (Tm) of ILs has proved to be a complex task for several reasons: experimental data are scarce, and their quality is worrisome due to discrepancies between sources;4 ILs drastically change their physicochemical properties with small variations in the ions;5 and the thermal properties exhibit complex behavior.6,7
An exhaustive search of the relevant literature did not reveal an accurate method to estimate the physicochemical properties of ILs. The machine-learning approaches commonly used are (a) linear regression,6,8,9 (b) regression trees,10 (c) support vector machines,4 and (d) neural networks.4,10,11 Previous work suggests that prediction algorithms by themselves will hardly improve current results by a significant amount. A divide-and-conquer strategy that generates subsets of ILs was proposed in 2007.4 This idea raises the problem of determining the best subsets. One possible solution is the generation of subgroups guided by the opinion of an expert; another is the creation of subsets with clustering algorithms. We explore both alternatives. Clustering has been used in the state of the art of this area as a method to ensure the proper distribution of ionic liquids between training and test sets.12−14 In this paper, clustering algorithms are used to separate ILs into subgroups with similar characteristics, and we show that this is an effective solution for problems with data of a complex nature, at least for the data sets tested. We present an architecture for the approximation of the melting temperature of ILs based on imidazole. We selected this group of ILs due to its popularity in industry and the scientific community; this gives us an experimental base that allows reproducibility and future extension to other families of ILs. The architecture is based on the separation and classification of ILs into clusters; an estimation model is then created for each cluster.
The rest of this paper is organized as follows: the State-of-Art section presents the most relevant works on the melting point estimation of ILs; the Proposed Methodology section describes the architecture for the clustering and estimation of ILs; finally, we present the Analysis of Results and the Conclusions.
STATE-OF-ART
In 2002, Katritzky et al. presented a method named BMLR (best multilinear regression) to predict the melting points of ionic liquids.8 They used as study subjects 104 imidazolium bromides and 45 benzimidazolium bromides. Their methodology consisted of calculating a diverse group of molecular descriptors with the CODESSA program developed by them, and they defined a grouping of the ionic liquids to achieve the estimation. Katritzky's paper is the first to propose the strategy of using subsets of ILs to improve regression accuracy.

In 2004, Carrera and Aires-de-Sousa proposed the use of regression trees and forests, as well as neural networks, to solve this problem.10 For this, they calculated molecular descriptors with the DRAGON software.15 The set of descriptors was processed with a regression tree, and they also experimented with regression forests and counterpropagation neural networks. Their results showed that nonlinear models are an efficient tool.

López-Martín et al.9 estimated the melting temperature of 84 imidazole-based ILs with acceptable results, using molecular descriptors generated by the CODESSA and DRAGON software. They found that the essential descriptors are those related to the size, symmetry, and charge distribution of the ions of each IL.

In 2007, Varnek et al.4 used several machine-learning techniques to predict the melting point of ionic liquids: artificial neural networks (ANNs), support vector machines (SVMs), k-nearest neighbors, and multilinear regression analysis. The best algorithms in their study were SVM and neural networks, regardless of the descriptors used. Although Varnek's work provided an extensive survey of prediction techniques, the predictions have an error of 37.5 K in the best case and 46.4 K in the worst configuration. They suggest reviewing the descriptors used in their methods and exploring other advanced techniques to improve performance, among them multitasking, gray-box modeling, and clustering; in our case, we work with the clustering methods presented in this paper.

In 2008, Torrecilla et al.11 presented an artificial neural network for estimating the melting point of 97 imidazole-based ILs, modeling the cations and anions with molecular descriptors calculated with the CODESSA software. Later, a new model based on the group contribution method was proposed by Huo et al.,6 who assumed that each atom and bond contributes to the melting point of the molecule. They obtained good estimations of Tm (average relative deviation of 5.86%); however, of the 190 ILs used, 31 had a deviation larger than 10%, and the average deviation in this group was 28.2 K.

Genetic algorithms (GAs) were also used in regression models of Tm by Lazzús,7 who combined a GA-based regression with contribution groups to generate an estimation model for different ILs; 200 ILs were used for training and 200 for validation.
PROPOSED METHODOLOGY
Data Preparation. The ILs used in this paper are taken from Zhang et al.16 (data set A) and from Huo et al.6 (data set B). We selected data source A because it is a compendium of ILs reported in the literature; data set B was selected in order to compare our method with the state of the art. The data were limited to cations based on imidazole (Figure 1).

Figure 1. Imidazole molecule.

Data set A is composed of 281 ILs built from 92 cations and 45 anions. The 92 cations used are shown in Table 1; they differ in the substituents on nitrogens 1 and 3 (R1, R3). The 45 anions are shown in Table 2. Data set A has a melting point range from 180.65 to 454.15 K; the full data set is available in the Supporting Information (SI). Data set B is composed of 134 ILs formed from 75 cations and 15 anions, has a melting point range from 243.15 to 533.65 K, and is available in the work of Huo.6 The temperature distributions of data sets A and B are uneven, as shown in Figure 2.

We used the MarvinSketch software17 from the MarvinSuite to digitize the cations and anions separately. The chemical structures were optimized with the Merck molecular force field (MMFF) using OpenBabel.18 The molecular description of the 92 cations and 45 anions was performed with the PaDEL-Descriptor software.19 Molecular descriptors are numbers extracted from the molecular structure that collectively contain detailed information on the chemistry of the molecules analyzed; the software can calculate 1875 different descriptors of types 1D, 2D, and 3D per molecule. For data set A, 1454 and 1508 descriptors were generated for the anions and cations, respectively, for a total of 2962 descriptors. For data set B, 1270 and 1466 descriptors were calculated for the anions and cations, respectively, for a total of 2736 descriptors. Descriptors with a constant value across all molecules were eliminated for their lack of impact on the model, while error values were replaced with the average value of the descriptor.
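A minimal sketch of this cleaning step with pandas, assuming the PaDEL output has been written to a CSV file (the file name is illustrative):

```python
import pandas as pd

# Illustrative file name; one row per ion, one column per PaDEL descriptor.
desc = pd.read_csv("padel_descriptors.csv", index_col=0)

# PaDEL reports failed calculations as non-numeric values; coerce them to NaN.
desc = desc.apply(pd.to_numeric, errors="coerce")

# Drop descriptors that are constant for all molecules (no predictive value).
desc = desc.loc[:, desc.nunique(dropna=True) > 1]

# Replace error values with the average value of each descriptor.
desc = desc.fillna(desc.mean())
```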
Table 1. Cations of Data Set A (substituents at ring positions 1 and 3 for each cation)
First half:
CID: 11, 12, 21, 22, 23, 24, 26, 27, 28, 29, 210, 211, 212, 216, 217, 218, 221, 222, 224, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 243, 246, 249, 251, 253, 255, 259, 260, 2108, 2143, 2144, 2145, 2146
1-substituents: methyl, ethyl, methyl, methyl, ethyl, ethyl, trifluoroethyl, ethyl, propyl, isopropyl, (2-methoxyethyl), butyl, isobutyl, amyl, butyl, vinyl, hexyl, phenylethanoyl, heptyl, butyl, benzyl, (4-methoxyphenyl), octyl, phenethyl, (1-heptoxymethyl), phenylethanoyl, butoxymethyl, nonyl, octyl, hydrocinnamyl, 1-butoxymethyl, octyl, decyl, undecyl, dodecyl, tridecyl, tetradecyl, pentadecyl, hexadecyl, octadecyl, cosyl, methoxymethyl, methylnitrile, ethylnitrile, propylnitrile, butylnitrile
3-substituents: methyl, methyl(b), methyl, methyl(b), methyl, ethyl, methyl, methyl, methyl, methyl, methyl, methyl, ethyl, butyl, methyl, methyl, methyl, butyl, methyl, methyl, methyl, methyl, methyl, propyl, propoxymethyl, methyl, ethyl, methyl, 1-butoxymethyl, propyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl
Second half:
CID: 2114, 2120, 2121, 2122, 2123, 2124, 2125, 2126, 2127, 2128, 2109, 2110, 2111, 2113, 2143, 2144, 2145, 2146, 2149, 2150, 2151, 2152, 2153, 2154, 2162, 2163, 2164, 2172, 2173, 2175, 2176, 2180, 2188, 2189, 2190, 2198, 2199, 2200, 2201, 2202, 2203, 2204, 2205, 2208, 2215, 2223
1-substituents: (1-(R)-ethoxycarbonyl-ethyl), octyloxymethyl, nonyloxymethyl, decyloxymethyl, diundecyloxymethyl, dodecyoxymethyl, (2-hydroxyethyl), methyl, sec-butyl, (3-methyl-butyl), methyl, ethyl, propyl, butyl, ethoxytrimethylsilane, ethoxytrimethylsilane, ethoxytrimethylsilane, 2-ethoxyethyl, butyl, butyl, hexyl, hexyl, octyl, octyl, SF5(CF2)2(CH2)2, SF5(CF2)2(CH2)4, SF5(CF2)4(CH2)2, 3-[diphenyl(c)]propyl, 3-[morpholine(c)]propyl, 3-[imidazole(c)]propyl, 3-bromopropyl, 2-hydroxypropyl, N-butylcarbamoylmethyl, N-butyl-N-methylcarbamoylmethyl, N,N-diethylcarbamoylmethyl, methyl, methyl, methyl, propargyl, allyl, butylnitrile, propylcarboxyl, 2-benzoxylpropyl, 2,3-dibromopropyl, valerylene, benzyl
3-substituents: methyl, octyloxymethyl, nonyloxymethyl, decyloxymethyl, diundecyloxymethyl, dodecyoxymethyl, methyl acetate, methyl, methyl, (1,1,2,2-tetrafluoroethyl), (1,1,2,2-tetrafluoroethyl), (1,1,2,2-tetrafluoroethyl), (1,1,2,2-tetrafluoroethyl), methyl, fluoroethane, 3,3,3-trifluoropropyl, 3,3,3-trifluoropropyl, 2-(d)ethyl, 3-(d)propyl, 2-(d)ethyl, 3-(d)propyl, 2-(d)ethyl, 3-(d)propyl, methyl, methyl, methyl, methyl, methyl, methyl, methyl, butyl, methyl, methyl, methyl, propargyl, allyl, propylcarboxyl, propargyl, allyl, butylnitrile, propylcarboxyl, methyl, methyl, valerylene, benzyl
Footnotes: CID is the identifier used in the data source; (b) methyl in R2; (c) dithiocarbamate; (d) diethoxyphosphinyl.
Feature Selection. We performed feature selection to reduce the number of molecular descriptors, using algorithms from machine learning. The following algorithms were used. Tree feature selection performs a regression analysis with a regression tree and selects the upper nodes as the relevant features. F_regression is a univariate linear test: the correlation between each feature and the response variable is computed and transformed into an F score, and the best k features are selected according to that score. Mutual_info_regression computes the mutual information between the predictive features and the response variable; in other words, it is a univariate test that selects the k features most dependent on the response variable. RFE (recursive feature elimination) iteratively removes the least important features, where importance is determined from the coefficients of a fitted linear model. RFECV is RFE with cross-validation. All the methods used are open-source software from scikit-learn.20

We made a uniform random split of the full data set into training (70%) and validation (30%) sets. Several parameter configurations were proposed to determine the best feature selection: the numbers of variables selected were 5−20, 30, 40, 50, 60, 70, 80, 90, 100, 150, and 200. The feature selection algorithms were executed with only the training set as input. Once finished, we ranked the sets of features by measuring the training and testing error of regression models. The regression models used for this purpose were SVR and regression trees, chosen for their wide usage; they were trained on the training set and tested on the validation set. Table 3 reports the 10 best results, measured in coefficient of determination (R2) and mean absolute percentage error (MAPE), for data set A; likewise, Table 4 shows the results for data set B. The metrics are defined in eqs 1 and 2. The use of R2 as a measure of quality for nonlinear models is discouraged,21 but we report it as a reference to follow common practice in the area. For the rest of the paper, we use the 40 variables selected by the RFE algorithm for data set A, since this configuration obtained the best results regarding MAPE; in the case of data set B, the best 30 variables selected by the tree feature selection algorithm were used.
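A minimal sketch of this selection step with scikit-learn; the descriptor matrix and melting points below are placeholders that only fix the shapes of data set A:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE, SelectKBest, f_regression, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(281, 2962)                    # placeholder: data set A descriptors
y = rng.uniform(180.65, 454.15, size=281)  # placeholder melting points (K)

# 70/30 uniform random split; the selectors see only the training portion.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.7, random_state=0)

k = 40  # number of descriptors to keep (the best setting found for data set A)

# RFE: iteratively eliminate the weakest features of a fitted linear model.
rfe = RFE(LinearRegression(), n_features_to_select=k).fit(X_tr, y_tr)

# Univariate tests: F score and mutual information with the melting point.
f_sel = SelectKBest(f_regression, k=k).fit(X_tr, y_tr)
mi_sel = SelectKBest(mutual_info_regression, k=k).fit(X_tr, y_tr)

# Tree selection: keep the k features with the highest importances.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
tree_features = np.argsort(tree.feature_importances_)[-k:]
```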
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (1)$$

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \quad (2)$$
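The two metrics (and the MAE used later in eq 3) can be computed directly; a minimal sketch with NumPy, assuming arrays of observed and predicted melting points:

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination (eq 1)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y, y_hat):
    """Mean absolute percentage error (eq 2)."""
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

def mae(y, y_hat):
    """Mean absolute error (eq 3, used later for state-of-the-art comparisons)."""
    return np.mean(np.abs(y - y_hat))
```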
Table 2. Anions of Data Set A
CID | name
11 | chloride
12 | bromide
13 | iodide
14 | perchlorate
21 | tetrafluoroborate
31 | bis((trifluoromethyl)sulfonyl)imide
33 | bis((perfluoroethane)sulfonyl)imide
34 | dicyanamide
35 | 2,2,2-trifluoro-N-(trifluoromethylsulfonyl)acetamide
38 | nitrite
39 | nitrate
41 | sulfate
43 | trifluoromethanesulfonate
45 | perfluorobutylsulfonate
47 | tosylate
51 | hexafluorophosphate
54 | tris(pentafluoroethyl)trifluorophosphate
61 | trifluoroacetate
63 | acetate
71 | tris(trifluoromethylsulfonyl)methide
72 | tricyanomethanide
81 | tetrachloroaluminate
83 | hexafluoroniobate
84 | hexafluorotantalate
85 | hexafluoroarsenate
86 | hexafluoroantimonate
88 | oxypentafluorotungstate
217 | tetrakis(3,5-bis(trifluoromethyl)phenyl)borate
220 | tetrakis((4-perfluorohexyl)phenyl)borate
223 | trifluoromethyltrifluoroborate
224 | pentafluoroethyltrifluoroborate
225 | (heptafluoro-n-propyl)trifluoroborate
226 | (nonafluoro-n-butyl)trifluoroborate
227 | 3-(trifluoroborate)-butylnitrile
233 | n-pentyltrifluoroborate
310 | bis(nonafluorobutane-1-sulfonyl)imide
312 | saccharinate
317 | bis(fluorosulfonyl)imide
412 | pentyl sulfate
413 | 4,4,5,5,5-pentafluoropentyl sulfate
414 | 2,2,3,3,4,4,5,5-octafluoropentyl sulfate
417 | methylsulfate
425 | perfluoroethylsulfate
636 | hydrogen phthalate
1503 | thiocyanate
Figure 2. Comparison of data sets.
Table 3. Experimental Results of the Feature Selection on Data Set A
feature selection algorithm | regression algorithm | number of features | training R2 | training MAPE | testing R2 | testing MAPE
RFE | SVR | 40 | 0.78 | 7.77 | 0.70 | 8.11
RFE | SVR | 50 | 0.79 | 7.75 | 0.70 | 8.25
RFE | SVR | 30 | 0.78 | 7.82 | 0.68 | 8.33
RFE | SVR | 100 | 0.77 | 7.94 | 0.68 | 8.38
RFE | SVR | 60 | 0.78 | 7.86 | 0.69 | 8.42
RFE | SVR | 150 | 0.76 | 8.20 | 0.66 | 8.70
F_regression | SVR | 150 | 0.70 | 9.16 | 0.64 | 8.99
F_regression | SVR | 200 | 0.70 | 9.10 | 0.65 | 9.00
RFE | SVR | 200 | 0.75 | 8.37 | 0.65 | 9.16
RFE | SVR | 20 | 0.71 | 8.98 | 0.62 | 9.46

Table 4. Experimental Results of the Feature Selection on Data Set B
feature selection algorithm | regression algorithm | number of features | training R2 | training MAPE | testing R2 | testing MAPE
tree | tree | 30 | 0.95 | 3.73 | 0.85 | 5.44
F_regression | tree | 40 | 0.91 | 4.86 | 0.85 | 5.89
mutual info | tree | 8 | 0.90 | 4.97 | 0.90 | 5.95
RFE | tree | 14 | 0.90 | 5.12 | 0.81 | 6.10
RFE | SVR | 50 | 0.89 | 5.95 | 0.81 | 6.47
F_regression | tree | 30 | 0.94 | 4.05 | 0.70 | 6.69
F_regression | tree | 20 | 0.96 | 2.54 | 0.68 | 6.75
RFE | tree | 12 | 0.89 | 5.21 | 0.80 | 6.87
RFE | tree | 5 | 0.89 | 5.16 | 0.70 | 6.99
RFE | SVR | 20 | 0.87 | 6.14 | 0.88 | 7.05
Table 5. Description of the Molecular Descriptors Used in This Paper (the descriptors are described in the Supporting Information)
Data set A, anions: AATSC2i, AATSC2s, ATS0c, ATS2e, ATS2i, ATS2m, ATSC0i, ATSC0s, AATSC1e, AATSC1s, Bpol, DELS2, ETA_Beta_s, ETA_dBeta, ETA_Epsilon_3, ETA_Epsilon_5, ETA_Eta_B_RC, ETA_Eta_L
Data set A, cations: GATS1v, MATS2p, maxdNH, MLFER_BH, MLFER_BO, MLFER_S, MPC2, nBondsS3, nHBAcc3, SHBa, SP-2, SpMAD_Dzp, SpMin1_Bhs, SpMin2_Bhs, SpMin6_Bhv, VABC, VC-3, Zagreb
Data set B, anions: VE2_DzZ, VE1_DzZ, Mp, ETA_Psi_1, TDB1s, GATS1v, SpMax2_Bhm, SpMax2_Bhs, WTPT-2
Data set B, cations: VE1_Dzm, AATS0v, Mv, ETA_Eta_F_L, a_MIC2, VE1_Dzs, VE2_Dzm, ETA_AlphaP, ASP-1, ASP-4, MATS1c, MATS3c, P2e, ASP-6, E3e, P2u, GATS5i, VE3_Dzi, E3s, P2p, RPCG, AVP-2, AATSC3c, P2s, AATS1i
Table 6. Y-Scrambling Results of Data Set A
trial | training R2 | training MAPE | test R2 | test MAPE
svr_00 | 0.69 | 9.15 | 0.50 | 11.25
svr_01 | 0.20 | 14.91 | −0.07 | 17.49
svr_02 | 0.01 | 16.34 | 0.00 | 19.10
svr_03 | 0.05 | 16.62 | −0.04 | 15.99
svr_04 | 0.28 | 14.14 | −0.25 | 18.13
svr_05 | 0.04 | 16.47 | 0.02 | 18.15
svr_06 | 0.01 | 16.21 | −0.03 | 20.66
svr_07 | 0.00 | 17.16 | 0.00 | 16.20
svr_08 | 0.02 | 17.75 | −0.05 | 14.32
svr_09 | 0.00 | 16.11 | −0.05 | 20.57
svr_10 | 0.00 | 17.35 | 0.00 | 15.48
svr_11 | 0.02 | 16.59 | −0.15 | 18.57
svr_12 | 0.00 | 16.80 | 0.00 | 17.75
svr_13 | 0.00 | 16.40 | −0.01 | 18.85
svr_14 | 0.09 | 16.90 | −0.30 | 17.80
svr_15 | 0.06 | 15.55 | −0.09 | 21.43
svr_16 | 0.02 | 16.77 | −0.06 | 16.19
svr_17 | 0.00 | 17.05 | −0.14 | 15.57
svr_18 | 0.00 | 16.48 | 0.00 | 19.02
svr_19 | 0.00 | 16.29 | 0.00 | 19.45
svr_20 | −0.01 | 17.58 | −0.02 | 14.57
svr_21 | −0.01 | 16.92 | −0.01 | 17.16
svr_22 | 0.02 | 16.39 | −0.07 | 18.94
svr_23 | −0.01 | 17.75 | −0.06 | 13.78
svr_24 | 0.00 | 17.61 | 0.00 | 14.44
svr_25 | 0.00 | 16.74 | 0.00 | 17.86
svr_26 | 0.00 | 17.53 | 0.00 | 14.76
svr_27 | 0.00 | 16.56 | 0.00 | 18.35
svr_28 | 0.01 | 16.59 | −0.09 | 18.34
svr_29 | 0.20 | 14.78 | −0.11 | 17.84
Table 7. Y-Scrambling Results of Data Set B
trial | training R2 | training MAPE | test R2 | test MAPE
svr_00 | 0.78 | 7.33 | 0.85 | 6.47
svr_01 | 0.00 | 19.39 | −0.08 | 18.92
svr_02 | 0.13 | 19.69 | −0.30 | 21.40
svr_03 | 0.16 | 17.84 | 0.03 | 19.11
svr_04 | 0.14 | 17.85 | −0.07 | 18.56
svr_05 | 0.06 | 18.01 | 0.02 | 20.82
svr_06 | 0.01 | 18.90 | −0.06 | 21.12
svr_07 | 0.01 | 19.40 | 0.00 | 16.95
svr_08 | 0.06 | 18.39 | −0.10 | 19.20
svr_09 | 0.30 | 16.44 | −0.02 | 20.07
svr_10 | 0.00 | 19.37 | −0.03 | 18.54
svr_11 | 0.00 | 18.93 | −0.04 | 18.68
svr_12 | 0.06 | 18.67 | −0.04 | 17.88
svr_13 | 0.00 | 18.53 | −0.03 | 22.55
svr_14 | 0.10 | 18.59 | −0.05 | 18.14
svr_15 | 0.07 | 19.09 | −0.10 | 19.03
svr_16 | 0.00 | 19.00 | −0.05 | 17.65
svr_17 | 0.23 | 16.31 | −0.03 | 21.98
svr_18 | 0.03 | 18.57 | −0.14 | 17.97
svr_19 | 0.00 | 18.32 | 0.00 | 22.61
svr_20 | 0.31 | 15.76 | −0.12 | 19.20
svr_21 | 0.00 | 18.82 | −0.06 | 18.72
svr_22 | 0.00 | 18.78 | 0.00 | 19.37
svr_23 | 0.00 | 18.67 | −0.01 | 22.29
svr_24 | 0.00 | 18.57 | −0.03 | 20.16
svr_25 | 0.00 | 19.52 | −0.04 | 18.24
svr_26 | 0.00 | 19.52 | −0.01 | 16.74
svr_27 | 0.28 | 13.59 | −0.71 | 19.85
svr_28 | 0.01 | 19.27 | −0.02 | 17.90
svr_29 | 0.04 | 18.18 | −0.08 | 19.49
The features selected for both data sets are shown in Table 5. For these data sets, the anions have a higher relevance than the cations based solely on the number of selected features; as a consequence, the anions make a more significant contribution to defining the melting temperature in this case. A correlation analysis indicates that high correlations exist between some features. In spite of this, this was the best configuration we found, so we decided to continue with the selected variables.
Validation of Data Sets. To establish the importance of our model, we want to eliminate any doubt that randomness affects the results. Y-scrambling22 determines whether the data sets are useful for predicting the target variable. For this, a model is trained on the molecular descriptors to estimate melting temperatures; we call it the reference model. Multiple models are then trained on randomly shuffled Y values; we denote these as Y-scrambled models. Under the assumption that the molecular description of a molecule is related to its melting temperature, the reference model should outperform the Y-scrambled models. This validation was made with 30 models for each data set, one being the reference model and the rest Y-scrambled models; we denote the reference model as svr_00 and the Y-scrambled models as svr_01 to svr_29. The results for data set A are shown in Table 6 and those for data set B in Table 7. The test shows a clear advantage of the reference models over the Y-scrambled models: on the test sets, the reference model's R2 is higher by 0.48 for data set A and by 0.83 for data set B.
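A minimal sketch of this Y-scrambling test, with an SVR standing in for the reference model and placeholder data fixing only the shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(281, 40)                      # placeholder selected descriptors
y = rng.uniform(180.65, 454.15, size=281)  # placeholder melting points (K)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

# svr_00: the reference model, trained on the true melting points.
reference = SVR().fit(X_tr, y_tr)
scores = [reference.score(X_te, y_te)]  # .score returns R2 on the test set

# svr_01..svr_29: models trained on randomly shuffled Y values.
for seed in range(1, 30):
    y_shuffled = np.random.RandomState(seed).permutation(y_tr)
    scores.append(SVR().fit(X_tr, y_shuffled).score(X_te, y_te))

# The reference model should clearly outperform every scrambled model.
```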
Proposed Architecture. The state of the art consists mainly of single regression models for the estimation of the melting point of ionic liquids. Even though these are powerful prediction tools, it is possible to improve their predictive power with ensemble methods based on clustering, because clustering allows a training phase with models specialized for the features of each group. For this reason, we propose an ensemble method for the prediction of ILs based on clustering analysis. We present in Figure 3 a framework for the estimation of the melting points of ILs.
Figure 3. Framework proposed in this paper for the estimation of the melting point of ILs. All the ILs are preprocessed with the molecular description (MD) of the 40 most relevant features found.
The proposed software architecture is wide-ranging and can be applied to any family of ILs and any physicochemical property. As is common in machine learning, the framework consists of two main phases: (1) the training phase, which in our case consists of the cluster generation, model training, model validation, and model selection modules (in contrast to traditional regression models, these components are based on the clustering of ILs and are described in detail in the following sections), and (2) the estimation phase, which consists of the classification and estimation modules (the former classifies new ILs to assign them to the corresponding cluster, while the latter estimates the melting temperature using the best models selected by the model selection module). A random split of 80%, 10%, and 10% was prepared for training, validation, and testing: the training set was used to train the regression models, the validation set was used to select the best regression method for each generated cluster, and the test set was used to assess the robustness of the model.

Clustering Generation. In the context of this paper, a cluster is a subset of ILs; a clustering method outputs a set of groups, subsets, or clusters, and this set is named a cluster configuration. The cluster creation module generates multiple cluster configurations to be tested in the model validation module. Clustering ILs into subsets is a promising idea and a feasible way to reach a robust estimation model. Dividing ILs into groups is not new; for instance, Katritzky organized a set of 104 ILs into three subgroups.8 Despite the possible benefits of clustering ILs, the creation of subsets raises the problem of creating and selecting the best grouping scheme. The conventional approach in the literature is the manual creation of cluster configurations guided by an expert in the area. In this paper, we create the clusters with machine-learning algorithms operating on the molecular descriptors; in this way, we can test multiple cluster configurations in a fast, reliable way through validation and testing. The machine-learning algorithms used are MeanShift (MS), k-means, and affinity propagation (AP). k-means was executed with the number of groups k in the range [2, 5]; AP was analyzed with preference parameters in the range (−20, 20). We also experimented with MinMax normalization of the molecular descriptors to the range (0, 1), labeling these tests with the postfix MM. Additionally, for data set A we made a manual cluster configuration based on the substituents of the imidazolium cation: cluster 1 is composed of all the ILs whose cations have equivalent chains on N1 and N3, cluster 2 of cations where N3 always has a methyl, cluster 3 of cations with short alkyl chains on N3, and cluster 4 of the remaining cations. For the configurations generated by machine learning, we eliminated those with more than five clusters to avoid groups with very few elements; this produced our best Tm estimations. The parameter configurations of the different algorithms produced 13 cluster configurations for data set A and 9 for data set B.
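A minimal sketch of the cluster-generation module with scikit-learn, under the parameter ranges described above (the descriptor matrix is a placeholder; the AP preference values shown are a coarse sample of the swept range):

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift, AffinityPropagation
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = rng.rand(281, 40)  # placeholder: 281 ILs x 40 selected descriptors

configurations = {}

# k-means with the number of groups k in [2, 5]
for k in range(2, 6):
    configurations[f"{k}-means"] = KMeans(n_clusters=k, random_state=0).fit_predict(X)

# MeanShift and affinity propagation (preference swept over (-20, 20))
configurations["MS"] = MeanShift().fit_predict(X)
for pref in (-20, -10, 0, 10, 20):
    configurations[f"AP_{pref}"] = AffinityPropagation(preference=pref).fit_predict(X)

# MinMax-normalized variants, labeled with the postfix MM
X_mm = MinMaxScaler().fit_transform(X)
configurations["2-means_MM"] = KMeans(n_clusters=2, random_state=0).fit_predict(X_mm)

# Discard configurations with more than five clusters
configurations = {name: labels for name, labels in configurations.items()
                  if len(set(labels)) <= 5}
```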
Table 8. Clustering Algorithms Used in This Paper and Their Evaluation Results on Data Set A
clustering method | cluster | data points | model | training MAE | training MAPE | validation MAE | validation MAPE
AP_MM | 0 | 52 | SVR | 23.99 | 8.52 | 35.43 | 11.93
AP_MM | 1 | 68 | tree | 13.03 | 4.37 | 8.33 | 5.73
AP_MM | 2 | 53 | SVR | 38.19 | 12.23 | 28.91 | 8.74
AP_MM | 3 | 8 | tree | 4.33 | 1.22 | 8.33 | 2.44
AP_MM | 4 | 100 | tree | 2.86 | 1.03 | 24.30 | 14.87
no clustering | 0 | 281 | tree | 20.33 | 7.19 | 20.13 | 7.87
2-means | 0 | 261 | tree | 5.36 | 1.98 | 13.67 | 7.94
2-means | 1 | 20 | tree | 2.59 | 0.87 | 17.25 | 4.92
2-means_MM | 0 | 216 | tree | 5.07 | 1.86 | 15.00 | 11.00
2-means_MM | 1 | 65 | tree | 22.51 | 7.15 | 37.29 | 11.29
3-means | 0 | 151 | SVR | 34.84 | 12.52 | 24.66 | 12.39
3-means | 1 | 19 | SVR | 13.37 | 4.50 | 15.35 | 6.24
3-means | 2 | 111 | tree | 16.89 | 6.16 | 13.98 | 7.70
3-means_MM | 0 | 8 | tree | 4.33 | 1.22 | 8.33 | 2.44
3-means_MM | 1 | 208 | tree | 7.47 | 2.61 | 25.80 | 14.10
3-means_MM | 2 | 65 | tree | 23.99 | 7.70 | 47.09 | 12.68
4-means | 0 | 111 | tree | 24.02 | 8.04 | 16.33 | 10.70
4-means | 1 | 146 | tree | 14.74 | 5.14 | 21.61 | 11.95
4-means | 2 | 8 | tree | 4.33 | 1.22 | 8.33 | 2.44
4-means | 3 | 16 | SVR | 16.34 | 5.94 | 10.85 | 3.66
4-means_MM | 0 | 84 | tree | 10.49 | 3.78 | 20.49 | 11.55
4-means_MM | 1 | 124 | SVR | 23.19 | 8.04 | 26.06 | 13.08
4-means_MM | 2 | 65 | tree | 22.51 | 7.15 | 37.29 | 11.29
4-means_MM | 3 | 8 | tree | 4.33 | 1.22 | 8.33 | 2.44
5-means | 0 | 111 | tree | 24.02 | 8.04 | 16.33 | 10.70
5-means | 1 | 19 | tree | 2.32 | 0.65 | 14.63 | 5.04
5-means | 2 | 45 | SVR | 26.92 | 9.71 | 8.52 | 7.38
5-means | 4 | 106 | SVR | 30.59 | 11.16 | 27.97 | 9.53
5-means_MM | 0 | 55 | SVR | 31.74 | 9.84 | 25.68 | 17.32
5-means_MM | 1 | 8 | tree | 4.33 | 1.22 | 8.33 | 2.44
5-means_MM | 2 | 106 | SVR | 22.05 | 7.48 | 18.59 | 6.19
5-means_MM | 3 | 45 | SVR | 47.59 | 17.18 | 13.81 | 5.18
5-means_MM | 4 | 67 | tree | 16.22 | 6.16 | 14.83 | 4.53
MS | 0 | 228 | tree | 13.38 | 4.67 | 20.58 | 8.04
MS | 1 | 11 | tree | 0.00 | 0.00 | 13.00 | 5.51
MS | 2 | 34 | tree | 17.46 | 6.19 | 6.01 | 8.98
MS | 3 | 8 | tree | 4.33 | 1.22 | 8.33 | 2.44
MS_MM | 0 | 240 | tree | 20.13 | 7.43 | 24.72 | 10.02
MS_MM | 1 | 8 | SVR | 5.00 | 1.41 | 12.00 | 3.51
MS_MM | 2 | 33 | SVR | 37.96 | 10.71 | 30.67 | 6.74
based on structure | 0 | 31 | SVR | 50.48 | 16.12 | 10.21 | 15.48
based on structure | 1 | 34 | SVR | 33.00 | 12.31 | 63.46 | 18.69
based on structure | 2 | 183 | SVR | 28.51 | 10.76 | 18.57 | 7.70
based on structure | 3 | 33 | SVR | 40.35 | 15.43 | 8.22 | 4.17
Model Training. To select the best cluster configuration from the cluster creation module, we built estimation models for each generated cluster. The quality of a cluster was evaluated through the estimated accuracy of its model, and each cluster configuration was evaluated with the average accuracy of its clusters. The models used in the training phase were SVRs and regression trees, trained on the training data set. Hyperparameter optimization was performed with an exhaustive grid search for all the estimators: for the SVR, the kernels tested were linear, polynomial, RBF, and sigmoid, with C in the range from 0.1 to 6 and ε from 0.1 to 6.0; for the regression tree, the max depth was set in the range from 4 to 10. To avoid overfitting, the training phase used cross-validation; in this way, the architecture seeks to prevent the negative impact of discrepancies in the data, and the cross-validation also estimates how accurately the models will perform. This module produces an SVR and a regression tree model for each cluster, but only the one with the lower MAPE is paired with its respective cluster.

Model Validation. In this module, the estimation models created in the model training module are validated to determine their efficiency, and the best estimation models are filtered and selected for each cluster configuration. For this phase, the validation data generated in the clustering phase were used. All the cluster configurations were tested and validated; the results are shown in Table 8 for data set A and Table 9 for data set B. Some R2 values are negative in validation; this indicates that the regression is inadequate and could be a sign of overtraining. The mean absolute error (MAE) is used in addition to the previous metrics for comparisons with the state of the art and is defined in eq 3.

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \quad (3)$$
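A sketch of the per-cluster grid search described above, using scikit-learn (the MAPE scorer string requires a recent scikit-learn version; the data and splits are assumed to come from the earlier modules):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

def train_cluster_model(X_tr, y_tr):
    """Fit an SVR and a regression tree with grid search; keep the better one."""
    svr_grid = {"kernel": ["linear", "poly", "rbf", "sigmoid"],
                "C": np.arange(0.1, 6.1, 0.5),
                "epsilon": np.arange(0.1, 6.1, 0.5)}
    tree_grid = {"max_depth": range(4, 11)}

    candidates = [GridSearchCV(SVR(), svr_grid,
                               scoring="neg_mean_absolute_percentage_error", cv=5),
                  GridSearchCV(DecisionTreeRegressor(random_state=0), tree_grid,
                               scoring="neg_mean_absolute_percentage_error", cv=5)]
    for c in candidates:
        c.fit(X_tr, y_tr)
    # Pair the cluster with whichever model achieved the lower MAPE.
    return max(candidates, key=lambda c: c.best_score_).best_estimator_
```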
Model Selection. The purpose of this stage is the selection of the best configuration of clusters/models. This selection was made with the average validation MAPE of the models of each cluster set, and the cluster sets were ordered by ascending validation MAPE. The clustering methods with the lowest MAPE were MeanShift and k-means with two centroids. The average training, validation, and test MAPE was obtained for each cluster set; these values are reported in Tables 10 and 11, and the model efficiency is determined with the test data set.
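The selection rule reduces to ranking configurations by mean validation MAPE; a small sketch, with the dictionary contents taken from a few rows of Table 8 for illustration:

```python
# Illustrative container: {configuration name: [validation MAPE of each cluster]}
val_mape = {
    "2-means": [7.94, 4.92],
    "MS": [8.04, 5.51, 8.98, 2.44],
    "no clustering": [7.87],
}

# Rank cluster configurations by the average validation MAPE of their clusters.
ranking = sorted(val_mape, key=lambda name: sum(val_mape[name]) / len(val_mape[name]))
best_configuration = ranking[0]  # -> "MS" (mean 6.24, below 2-means at 6.43)
```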
Classification. The classification module is in charge of assigning unseen ILs to their corresponding cluster. For this process, the ILs must be given a molecular description in the same way as the training and validation data sets. The cluster configuration used in this module is the one chosen by the model selection module. For the classification task, the architecture calculates the distance from the new data to the centroids of the clusters, and the new data are assigned to the nearest group. After the new data are labeled, the ILs are sent to the estimation module.

Estimation. This module receives the labeled ILs in the form of molecular descriptors and makes estimations with the corresponding models. The output is the estimation for the previously unseen data, which is denormalized to obtain the melting point in kelvin. The classification and estimation modules are used to estimate new ILs; in comparison, the modules in the training phase work only with the training and validation data sets.
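A minimal sketch of the nearest-centroid assignment performed by the classification module (the centroid values here are illustrative):

```python
import numpy as np

def assign_cluster(x_new, centroids):
    """Assign a new IL (descriptor vector) to the nearest cluster centroid."""
    distances = np.linalg.norm(centroids - x_new, axis=1)
    return int(np.argmin(distances))

# Example with illustrative 2-cluster centroids in a 40-descriptor space:
centroids = np.zeros((2, 40)); centroids[1] += 1.0
label = assign_cluster(np.full(40, 0.9), centroids)  # -> 1
```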
Table 9. Clustering Results for Data Set B
clustering method | cluster | data points | model | training MAE | training MAPE | validation MAE | validation MAPE
AP-10 | 0 | 25 | SVR | 23.10 | 7.08 | 6.24 | 2.03
AP-10 | 1 | 35 | SVR | 14.10 | 5.10 | 10.20 | 3.52
AP-10 | 2 | 14 | SVR | 12.86 | 4.58 | 21.00 | 6.73
AP-10 | 3 | 60 | tree | 14.54 | 3.63 | 19.60 | 5.12
AP-13 | 0 | 25 | SVR | 23.10 | 7.08 | 6.24 | 2.03
AP-13 | 1 | 35 | SVR | 14.10 | 5.10 | 10.20 | 3.52
AP-13 | 2 | 14 | SVR | 12.86 | 4.58 | 21.00 | 6.73
AP-13 | 3 | 60 | tree | 10.44 | 2.39 | 21.29 | 5.27
AP-14 | 0 | 25 | SVR | 23.10 | 7.08 | 6.24 | 2.03
AP-14 | 1 | 35 | SVR | 14.10 | 5.10 | 10.20 | 3.52
AP-14 | 2 | 14 | SVR | 12.86 | 4.58 | 21.00 | 6.73
AP-14 | 3 | 60 | tree | 1.36 | 0.31 | 23.75 | 5.93
AP-15 | 0 | 55 | SVR | 21.82 | 7.24 | 13.50 | 4.36
AP-15 | 1 | 14 | SVR | 12.86 | 4.58 | 21.00 | 6.73
AP-15 | 2 | 65 | tree | 10.76 | 2.58 | 15.76 | 3.30
no clustering | 0 | 134 | tree | 9.34 | 2.68 | 38.45 | 9.18
2-means | 0 | 65 | tree | 8.32 | 2.10 | 25.75 | 5.59
2-means | 1 | 69 | tree | 11.61 | 3.70 | 18.51 | 6.82
3-means | 0 | 55 | SVR | 21.82 | 7.24 | 13.50 | 4.36
3-means | 1 | 65 | tree | 12.96 | 3.15 | 19.55 | 4.17
3-means | 2 | 14 | SVR | 12.86 | 4.58 | 21.00 | 6.73
4-means | 0 | 35 | SVR | 14.10 | 5.10 | 10.20 | 3.52
4-means | 1 | 65 | tree | 11.86 | 2.90 | 24.09 | 5.21
4-means | 2 | 14 | SVR | 12.86 | 4.58 | 21.00 | 6.73
4-means | 3 | 20 | tree | 26.08 | 7.81 | 0.72 | 0.24
5-means | 0 | 20 | tree | 26.08 | 7.81 | 0.72 | 0.24
5-means | 1 | 33 | tree | 14.61 | 3.49 | 15.03 | 3.89
5-means | 2 | 35 | SVR | 14.10 | 5.10 | 10.20 | 3.52
5-means | 3 | 14 | SVR | 12.86 | 4.58 | 21.00 | 6.73
5-means | 4 | 32 | tree | 7.27 | 1.79 | 45.39 | 9.40
ANALYSIS OF RESULTS
Data Set A. The MeanShift (MS) and k-means (with two clusters) algorithms obtained average MAPEs of 5.28 and 4.57, respectively; the estimation over the full data set A has an R2 of 0.78 and a MAPE of 6.26%, as shown in Table 12, and the residual plots are shown in Figure 4. Both models yield better estimations than the nongrouped model, which produced a MAPE of 9.27. However, poorly calibrated groupings can increase the error, as in the case of k-means with five clusters and the grouping based on structure. In comparison with the state of the art, Varnek et al. obtained an MAE of 31.5 on their combined imidazolium data set,4 whereas our best grouped model achieved a test MAE of 18.19. Similarly, the MAPE presented by Katritzky et al.8 is larger than the best MAPE found in our tests. This shows that the presented model has state-of-the-art precision. The results also indicate that grouping ILs has advantages over not using the clustering technique; however, this does not occur in all cases.

Figure 4. Residual plots for melting point temperatures with k-means (left) and MeanShift (center) for data set A and k-means (right) for data set B.

Data Set B. In the case of data set B, the best clustering algorithm found is 4-means, with an average MAPE of 4.47% and a full-data-set prediction error of 4.43% MAPE. Although there are configurations with lower average values, such as AP-15, 4-means has the lowest validation error; therefore, it was selected as the best melting temperature predictor for data set B in this paper. These results indicate an improvement of 1.81% with respect to the results presented by Huo6 in his Supporting Information. Our methodology therefore offers a viable alternative for the estimation of the melting temperatures of ILs based on imidazolium.

Validation of the Clustering Approach. To validate the predictive accuracy offered by the separation of ILs into subgroups, a double cross-validation (DCV)23,24 test was applied. It is important to emphasize that the DCV test is not meant to obtain a final model. DCV, as its name implies, consists of two nested cross-validations, named internal and external: the internal CV is responsible for creating and selecting predictive models, while the external CV is responsible for validating them. To discard any uncertainty that randomness affects the results, the external CV is repeated multiple times until a stable state is reached, each time with a reshuffle and different splits of the original data set; the final error is obtained by averaging all the validation errors. Since we want to compare the variability of our method against a training phase without clustering, the DCV was performed in two scenarios: in the first, DCV was applied to the whole data set B; in the second, DCV was applied to each cluster generated in the Clustering Generation section. In this way, we can assess whether our proposed method is a substantial contribution to the area. The experiment was performed with an internal cross-validation of size 5, a validation ratio of 0.05, and 100 repetitions, each with different random splits. The results are shown in Table 13.
Table 10. Summary of the Clustering Algorithms Used in the Paper and Their Evaluation Results on Data Set A
clustering method | training MAE | validation MAE | test MAE | training MAPE | validation MAPE | test MAPE | average MAPE
MS | 8.79 | 16.81 | 20.11 | 3.02 | 6.24 | 6.59 | 5.28
2-means | 3.98 | 21.62 | 16.42 | 1.43 | 6.43 | 5.85 | 4.57
MS_MM | 21.03 | 22.03 | 33.77 | 6.52 | 6.76 | 12.64 | 8.64
5-means_MM | 24.39 | 20.25 | 29.66 | 8.38 | 7.13 | 10.21 | 8.57
4-means | 14.86 | 21.92 | 27.18 | 5.09 | 7.18 | 9.21 | 7.16
no clustering | 20.33 | 23.68 | 37.30 | 7.19 | 7.87 | 12.74 | 9.27
5-means | 20.96 | 24.44 | 45.92 | 7.39 | 8.16 | 15.56 | 10.37
AP_MM | 16.48 | 28.04 | 23.86 | 5.47 | 8.74 | 7.76 | 7.33
3-means | 21.70 | 23.94 | 33.45 | 7.73 | 8.77 | 11.83 | 9.45
4-means_MM | 15.13 | 28.70 | 31.30 | 5.05 | 9.59 | 10.26 | 8.30
3-means_MM | 11.93 | 30.77 | 34.03 | 3.84 | 9.74 | 11.00 | 8.19
2-means_MM | 13.79 | 35.88 | 31.93 | 4.51 | 11.15 | 11.51 | 9.06
Table 11. Summary of the Clustering Algorithms Used in the Paper and Their Evaluation Results on Data Set B
clustering method | training MAE | validation MAE | test MAE | training MAPE | validation MAPE | test MAPE | average MAPE
4-means | 16.23 | 14.00 | 13.91 | 5.09 | 3.92 | 4.40 | 4.47
AP-10 | 16.15 | 14.26 | 28.66 | 5.10 | 4.35 | 8.36 | 5.94
AP-13 | 15.13 | 14.68 | 27.21 | 4.79 | 4.39 | 7.99 | 5.72
AP-14 | 12.86 | 15.30 | 32.27 | 4.27 | 4.55 | 9.24 | 6.02
5-means | 14.98 | 18.47 | 25.95 | 4.55 | 4.75 | 6.59 | 5.30
AP-15 | 15.15 | 16.75 | 11.06 | 4.80 | 4.80 | 3.30 | 4.30
3-means | 15.88 | 18.02 | 13.09 | 4.99 | 5.09 | 3.85 | 4.64
2-means | 9.96 | 22.13 | 21.71 | 2.90 | 6.20 | 6.38 | 5.16
no clustering | 9.34 | 38.45 | 35.29 | 2.68 | 9.18 | 8.94 | 6.93
Table 12. Results of the Best Models Found for the Full Data Sets A and B
data set | clustering algorithm | MAE | R2 | MAPE
A | 2-means | 18.19 | 0.78 | 6.26
B | 4-means | 15.07 | 0.93 | 4.43
Table 13. Experimental Results of the DCV on Data Set B
subset | training MAE | test MAE | test MAPE
full data set | 20.033 | 28.209 | 8.087
cluster 0 | 14.544 | 16.684 | 6.023
cluster 1 | 23.447 | 29.803 | 7.704
cluster 2 | 12.341 | 11.715 | 4.043
cluster 3 | 26.261 | 29.296 | 8.651
average over clusters | 19.148 | 21.875 | 6.606
The DCV results of the clustering approach show a lower error than the DCV over the complete data set, which demonstrates the value of separating ILs into smaller groups. This could indicate a possible subclassification of ILs, derived from the descriptors, that better reflects their properties, but exploring it is out of the scope of this paper. The generalized error of our method under DCV is 6.60% MAPE; in comparison, Huo reports a MAPE of 6.28% for his best-found result without DCV.
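A sketch of the double cross-validation loop under the settings above (5-fold internal CV, 0.05 validation ratio, 100 repetitions); the regression tree stands in for whichever model the internal CV would select:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

def double_cv(X, y, repeats=100, val_ratio=0.05):
    """Repeated external split; an internal 5-fold CV selects the model."""
    errors = []
    for rep in range(repeats):
        # External CV: a fresh reshuffle and split on every repetition.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=val_ratio, random_state=rep)
        # Internal CV: create and select the predictive model.
        inner = GridSearchCV(DecisionTreeRegressor(random_state=0),
                             {"max_depth": range(4, 11)}, cv=5)
        inner.fit(X_tr, y_tr)
        y_hat = inner.predict(X_val)
        errors.append(100.0 * np.mean(np.abs((y_val - y_hat) / y_val)))
    return np.mean(errors)  # generalized MAPE over all repetitions
```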
CONCLUSIONS
This paper presents an architecture that groups ionic liquids into clusters to improve the robustness of models for the estimation of physicochemical properties, in particular the melting point. The presented QSPR models were constructed with the 40 and 30 most relevant molecular descriptors found, selected with the RFE algorithm and the tree feature selection algorithm, respectively. To the best of our knowledge, this paper shows for the first time a clustering architecture for ILs. We analyzed 281 and 134 ionic liquids split into training, validation, and test sets; the sets of ILs cover a wide range and include multiple anions and cations. For data set A, one grouping was made on the basis of the structural form of the cations, and 13 cluster configurations were produced with machine-learning algorithms operating on the molecular descriptors; in the case of data set B, nine configurations were created. The proposed model is a combination of regression trees and SVR; with the presented method, it is possible to determine the best regression method for each subset based only on the characteristics of the ILs.

From the results shown in the paper, the conclusions are the following: (1) The selection of the anion is of great importance, at least within the same family of ILs; this conclusion was also reached with the Torrecilla method. (2) The tuning of the clustering method for ILs plays a vital role in the robustness of the final model; the differences in MAPE between the various groups support this notion. (3) The best cluster configurations exhibit the distinct pattern of having one large subgroup. These observations indicate that clustering algorithms are a useful tool for training robust regression models; moreover, clustering also helps to identify outliers. The proposed clustering-based estimation can predict the melting temperatures of ILs more accurately than other models available in the literature.

A possible improvement of the system is the reduction of the relevant variables by eliminating those with high correlation. It is also possible to perform a secondary phase of feature selection in which the selection is made for each subgroup of ILs. Another significant activity is to include more ILs in a future study, since a larger volume of data could positively affect the robustness of the presented algorithms, for example, the accuracy of the groups with the lowest number of ILs.
ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.9b00203.
Melting points of the data sets used and relevant features found, with descriptions (PDF)
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected].
ORCID
Juan Frausto Solís: 0000-0001-9307-0734
Notes
The authors declare no competing financial interest.
ACKNOWLEDGMENTS
The authors thank, with appreciation and gratitude, CONACYT and TECNM for their support. We also thank Instituto Tecnológico de Ciudad Madero (ITCM) for the support (6443.18-P). J.A.C.-C. would like to thank the CONACYT for support (434694).
REFERENCES
(1) Plechkova, N. V.; Seddon, K. R. Applications of Ionic Liquids in the Chemical Industry. Chem. Soc. Rev. 2008, 37, 123−150.
(2) Wasserscheid, P.; Welton, T. Ionic Liquids in Synthesis; John Wiley & Sons, 2008.
(3) Karunanithi, A. T.; Mehrkesh, A. Computer-Aided Design of Tailor-Made Ionic Liquids. AIChE J. 2013, 59, 4627−4640.
(4) Varnek, A.; Kireeva, N.; Tetko, I. V.; Baskin, I. I.; Solov'ev, V. P. Exhaustive QSPR Studies of a Large Diverse Set of Ionic Liquids: How Accurately Can We Predict Melting Points? J. Chem. Inf. Model. 2007, 47, 1111−1122.
(5) Aguirre, C. L.; Cisternas, L. A.; Valderrama, J. O. Melting-Point Estimation of Ionic Liquids by a Group Contribution Method. Int. J. Thermophys. 2012, 33, 34−46.
(6) Huo, Y.; Xia, S.; Zhang, Y.; Ma, P. Group Contribution Method for Predicting Melting Points of Imidazolium and Benzimidazolium Ionic Liquids. Ind. Eng. Chem. Res. 2009, 48, 2212−2217.
(7) Lazzús, J. A. A Group Contribution Method to Predict the Melting Point of Ionic Liquids. Fluid Phase Equilib. 2012, 313, 1−6.
(8) Katritzky, A. R.; Jain, R.; Lomaka, A.; Petrukhin, R.; Karelson, M.; Visser, A. E.; Rogers, R. D. Correlation of the Melting Points of Potential Ionic Liquids (Imidazolium Bromides and Benzimidazolium Bromides) Using the CODESSA Program. J. Chem. Inf. Model. 2002, 42, 225−231.
(9) López-Martín, I.; Burello, E.; Davey, P. N.; Seddon, K. R.; Rothenberg, G. Anion and Cation Effects on Imidazolium Salt Melting Points: A Descriptor Modelling Study. ChemPhysChem 2007, 8, 690−695.
(10) Carrera, G.; Aires-de-Sousa, J. Estimation of Melting Points of Pyridinium Bromide Ionic Liquids with Decision Trees and Neural Networks. Green Chem. 2005, 7, 20.
(11) Torrecilla, J. S.; Rodríguez, F.; Bravo, J. L.; Rothenberg, G.; Seddon, K. R.; López-Martín, I. Optimising an Artificial Neural Network for Predicting the Melting Point of Ionic Liquids. Phys. Chem. Chem. Phys. 2008, 10, 5826−5831.
(12) Gharagheizi, F.; Ilani-Kashkouli, P.; Mohammadi, A. H. Computation of Normal Melting Temperature of Ionic Liquids Using a Group Contribution Method. Fluid Phase Equilib. 2012, 329, 1−7.
(13) Sattari, M.; Gharagheizi, F.; Ilani-Kashkouli, P.; Mohammadi, A. H.; Ramjugernath, D. Estimation of the Heat Capacity of Ionic Liquids: A QSPR Approach. Ind. Eng. Chem. Res. 2013, 52, 13217−13221.
(14) Farahani, N.; Gharagheizi, F.; Mirkhani, S. A.; Tumba, K. Ionic Liquids: Prediction of Melting Point by Molecular-Based Model. Thermochim. Acta 2012, 549, 17−34.
(15) Mauri, A.; Consonni, V.; Pavan, M.; Todeschini, R. DRAGON Software: An Easy Approach to Molecular Descriptor Calculations. MATCH Commun. Math. Comput. Chem. 2006, 56, 237−248.
(16) Zhang, S.; Lu, X.; Zhou, Q.; Li, X.; Zhang, X.; Li, S. Ionic Liquids: Physicochemical Properties; Elsevier, 2009; p 478.
(17) ChemAxon. MarvinSketch, 2018. https://chemaxon.com/products/marvin (accessed Mar 7, 2019).
(18) O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An Open Chemical Toolbox. J. Cheminf. 2011, 3, 33.
(19) Yap, C. W. PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466−1474.
(20) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, É. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
(21) Spiess, A.-N.; Neumeyer, N. An Evaluation of R2 as an Inadequate Measure for Nonlinear Models in Pharmacological and Biochemical Research: A Monte Carlo Approach. BMC Pharmacol. 2010, 10, 6.
(22) Rücker, C.; Rücker, G.; Meringer, M. y-Randomization and Its Variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345−2357.
(23) Baumann, D.; Baumann, K. Reliable Estimation of Prediction Errors for QSAR Models under Model Uncertainty Using Double Cross-Validation. J. Cheminf. 2014, 6, 47.
(24) Roy, K.; Ambure, P. The "Double Cross-Validation" Software Tool for MLR QSAR Model Development. Chemom. Intell. Lab. Syst. 2016, 159, 108−126.