Article pubs.acs.org/IECR

Toward Navigating Chemical Space of Ionic Liquids: Prediction of Melting Points Using Generative Topographic Maps

Natalia Kireeva,*,†,‡ Sergey L. Kuznetsov,† and Aslan Yu. Tsivadze†

† Institute of Physical Chemistry and Electrochemistry RAS, Leninsky pr-t 31, 119071 Moscow, Russian Federation
‡ Laboratoire d’Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4 rue B. Pascal, Strasbourg 67000, France



ABSTRACT: In this work, we apply generative topographic maps as a universal approach for data visualization and structure−property modeling of melting points (mp), one of the most important physical properties for the design and application of ionic liquids (ILs) as green solvents. Data visualization is part of a more general concept of chemography, a relatively new field dealing with visualization of chemical data, representation of chemical space, and navigation in this space. This field has received much attention from chemists because it may help to analyze and to intuitively comprehend relevant molecular features and relationships. In this study, to our knowledge for the first time, we propose a universal approach that can be used both for visualizing the chemical space of ILs according to their melting point values and for developing classification models able to predict the melting points of novel ILs. A structurally diverse data set of 717 ILs (bromides of nitrogen-containing organic cations), including 126 pyridinium bromides (PYR), 384 imidazolium and benzoimidazolium bromides (IMZ), and 207 quaternary ammonium bromides (QUAT), was used for model development. The study was carried out in several descriptor spaces to analyze the impact of descriptor choice. Clear criteria for data visualization and classification quality were used to assess the performance of the developed models.

Received: August 15, 2012
Revised: September 28, 2012
Accepted: October 9, 2012
Published: October 9, 2012
© 2012 American Chemical Society
dx.doi.org/10.1021/ie3021895 | Ind. Eng. Chem. Res. 2012, 51, 14337−14343

1. INTRODUCTION
Ionic liquids (ILs) are compounds that are composed of ions, incorporate at least one organic ion in an ion pair, and melt at relatively low temperatures. Their green and tunable properties make ILs an alternative to traditional (volatile) organic solvents1−3 and motivate their use in a wide range of applications, including chemical synthesis, separation, and catalysis. Careful choice of the cation/anion combination allows the design of ILs with physical and chemical properties well fitted to a specific problem. According to Katritzky et al.,4 there exist approximately 10^18 combinations of ions that could lead to useful ILs, which makes it impractical to search for ILs with specific properties by trial and error and creates a need for predictive computational tools that allow one to design new ILs with desirable properties. In previous QSPR (quantitative structure−property relationship) studies concerning the prediction of the physical properties of ILs, models were obtained for ionic conductivities,5−7 viscosities,5−9 densities,10 surface tensions,11 solubility of ILs in water,12 heats of fusion,13 and polarity.14 One of the most important physical properties for the design and application of ILs as green solvents is the melting point (mp), which has been the subject of numerous studies.1 The melting point, which characterizes the transition from the solid to the liquid state, has a very complex relationship with the structure of the constituent ions because of many different factors.15 Thus, in both the solid and liquid phases, various types of interactions between ions should be taken into account: electrostatic and van der Waals interactions, hydrogen bonds, and aromatic π−π stacking. The symmetry and conformational flexibility of individual species play an important role because they affect the crystal packing and, hence, the melting point. Another problem is related to the phase content of the solids. Unlike high-melting salts, certain types of IL (i.e., halides of imidazolium cations16) melt from eutectic mixtures of several crystalline polymorphs at temperatures considerably lower than the mp’s of the individual polymorphs. Vitrification can also take place.17 In the latter case, mp represents the glass transition temperature, which is rather different from the mp of the corresponding crystalline state.

In previous studies, QSPR models for the prediction of melting points were obtained for different sets of nitrogen-containing organic cations: pyridinium bromides,18−20 imidazolium bromides,4,19,21 chlorides,21 tetrafluoroborates and hexafluorophosphates,22 benzoimidazolium bromides,4,19 and quaternary ammonium cations.23 Several linear and nonlinear machine learning methods have been used for model development in these studies: multilinear regression techniques,4,21−26 decision trees,27 different types of neural networks,20,21,26−28 and projection pursuit regression.19 In this work, we apply generative topographic maps (GTM)29−31 as a universal approach for data visualization and structure−property modeling of melting points for a structurally diverse data set of 717 ILs (bromides of nitrogen-containing organic cations), including 126 pyridinium bromides (PYR), 384 imidazolium and benzoimidazolium bromides (IMZ), and 207 quaternary ammonium bromides (QUAT). Data visualization is part of a more general concept of chemography. Chemography32 is a relatively new field dealing with visualization of chemical data, representation of chemical space, and navigation in this space. This field has received much attention from chemists because it may help to analyze and to intuitively

comprehend relevant molecular features and relationships. The main problem in visualizing high-dimensional data concerns their representation in two or three dimensions with minimal information loss. A comparative analysis of various dimensionality reduction techniques is given in books33,34 and reviews.35−37 In particular, applications of GTM to the visualization of chemical data have recently been discussed in refs 38 and 39. Owen et al.39 considered two modifications of the GTM method, the latent trait model (LTM) and the linear latent trait model (LTM-LIN), which are specially tailored to deal with binary data such as molecular fingerprints. The utility of GTM as a universal and efficient tool for data visualization, structure−activity modeling, and database comparison has recently been demonstrated in ref 40. In this study, to our knowledge for the first time, we propose a universal approach that can be used both for visualizing the chemical space of ILs according to their melting point values and for developing classification models able to predict the melting points of novel ILs. The study was carried out in several descriptor spaces to analyze the impact of descriptor choice. Clear criteria for data visualization and classification quality were used to assess the performance of the developed models.

2. METHOD
GTM considers the data in the initial space as generated from objects situated in a 2D latent space. Any stochastic data generation can be viewed as sampling of a random variable that describes a data distribution law by means of a probability distribution function.43 Thus, GTM is an unsupervised machine learning approach that describes the data probability distribution in the data space $\mathbb{R}^D$ by means of a mixture of Gaussians nonlinearly embedded in this space from the latent space through a transformation carried out by an RBF (radial basis function) neural network (Figure 1):

$$y = y(x; W) = W\phi(x) \tag{1a}$$

$$y_d = \sum_{h=1}^{H} W_{hd}\,\phi_h(x) = \sum_{h=1}^{H} W_{hd} \exp\left(-\frac{\lVert x - x_h \rVert^{2}}{2\sigma^{2}}\right) \tag{1b}$$

Here, W is the matrix of RBF weights for the connections between the H hidden and D output units, and ϕ(x) are the Gaussian activation functions of the hidden units, which can be considered as the basis functions for this nonlinear transformation. The centers of these Gaussian basis functions form a regular grid in the latent space, and their number H and variance (width factor) σ are the method parameters. The coordinates of the Gaussians are computed as a linear combination of Gaussian basis functions. The images of this mapping form a 2D manifold, which can be considered as a flexible “rubber sheet” embedded in the data space (Figure 1). The smoothness of the manifold is controlled by the value of the parameter σ. Points that are neighbors in the latent space remain neighbors in the data space.

Figure 1. GTM describes the data probability distribution in the data space by means of a mixture of embedded Gaussians situated on a manifold (a two-dimensional “rubber sheet”) located in the high-dimensional data space so as to provide the best fit of the data distribution. Each Gaussian is generated by the nonlinear transformation y(x, W) from the grid nodes (●) in the latent space. The GTM map is the planar map resulting from the manifold’s unbending. The location of the data on the map is defined by the probability distribution. This presentation is inspired by Figure 2 in the original publication by Bishop et al.29,38,55 and by the figure in ref 40.

The GTM is often considered as a probabilistic extension of self-organizing maps (SOM),29,30 a popular and commonly used approach that has, however, some limitations. In particular, SOM has no well-defined objective function to be optimized in the course of training,30,31,44,45 and, therefore, no theoretical framework can be defined to prove its convergence and to select the method’s parameters. This leads to some ambiguity in the selection of the “best” SOMs. Another serious drawback of SOM is that it is not a probabilistic approach; that is, the corresponding models do not define probability density functions for the data distribution. As a consequence, it is rather difficult or even impossible to assess the robustness of the information contained in such data maps. For example, the assignment of a compound to a SOM node can be chemically meaningful (if the node is populated by its analogues) or can be an artifact. The latter typically happens for structural outliers, which nevertheless have to be assigned somewhere, or for molecules that are almost equidistant from different neurons and may therefore accidentally fall into an irrelevant neuron. The generative topographic mapping approach (GTM)30,31,44,46 not only overcomes all of the above-mentioned drawbacks but also offers some additional advantages resulting from the rigorous probabilistic character of the 2D maps.

Like SOM, GTM operates with a grid of K nodes, which can be considered as analogues of the nodes in SOM. Defining a probability distribution over the latent space induces the corresponding distribution over the manifold in the data space and thus imposes probabilistic relationships between the two spaces. The iterative expectation-maximization algorithm (EM algorithm) is used to find the parameters of the RBF network (W and β) that maximize the so-called log likelihood function, which measures the correspondence between the data distribution and the model.
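The mapping of eqs 1a and 1b and the log likelihood maximized by the EM algorithm can be illustrated with a minimal Python sketch. It is not the Netlab implementation used in this work. The function names, the toy grid, and the weights are hypothetical, and the noise model is assumed to be the isotropic Gaussian with inverse variance β of the original GTM formulation, which the text above does not write out explicitly.

```python
import math

def rbf_basis(x, centers, sigma):
    """Gaussian activations phi_h(x) of one latent point x (the exponential in eq 1b)."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2.0 * sigma ** 2))
            for c in centers]

def map_to_data_space(x, centers, sigma, W):
    """y_d = sum_h W[h][d] * phi_h(x): image of latent point x in data space (eq 1b)."""
    phi = rbf_basis(x, centers, sigma)
    n_dims = len(W[0])
    return [sum(W[h][d] * phi[h] for h in range(len(phi))) for d in range(n_dims)]

def log_gaussian(t, y, beta):
    """ln p(t | x_i, W, beta), ASSUMING an isotropic Gaussian centred at y = y(x_i; W)."""
    dist2 = sum((a - b) ** 2 for a, b in zip(t, y))
    return 0.5 * len(t) * math.log(beta / (2.0 * math.pi)) - 0.5 * beta * dist2

def log_likelihood(data, node_images, beta):
    """Sum over compounds of ln[(1/K) * sum_i p(t_n | x_i, W, beta)]."""
    K = len(node_images)
    total = 0.0
    for t in data:
        logs = [log_gaussian(t, y, beta) for y in node_images]
        m = max(logs)  # log-sum-exp trick for numerical stability
        total += m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(K)
    return total

def responsibilities(t, node_images, beta):
    """Posterior probability of each node given compound t (Bayes' theorem, equal node priors)."""
    logs = [log_gaussian(t, y, beta) for y in node_images]
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

# Toy example (all values hypothetical): 3 x 3 latent grid (K = 9),
# H = 4 basis centers, D = 2 data dimensions.
grid = [(-1.0 + i, -1.0 + j) for i in range(3) for j in range(3)]
centers = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]  # arbitrary H x D weights
node_images = [map_to_data_space(x, centers, 1.0, W) for x in grid]
data = [[0.2, 0.1], [0.9, -0.3]]
print(log_likelihood(data, node_images, beta=2.0))
print(responsibilities(data[0], node_images, beta=2.0))
```

In a full EM loop, the responsibilities would be recomputed in each E-step and W and β updated in the M-step; the sketch stops at evaluating both quantities for fixed parameters. The responsibilities are also what places a compound on the 2D map once training is finished.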

The log likelihood function is

$$\mathcal{L}(W, \beta) = \sum_{n=1}^{N} \ln\left\{ \frac{1}{K} \sum_{i=1}^{K} p(t_n \mid x_i, W, \beta) \right\} \tag{2}$$

where $\mathcal{L}$ is the log likelihood function, β is the inverse of the variance, W is the matrix of output weights of the RBF network, K is the number of nodes, and N is the number of compounds. The visualization of the data is performed after the RBF network is trained, when the inverse mapping from the data space to the latent space (unbending the flexible sheet into the rectangular 2D map) is performed using Bayes’ theorem. Thus, for each molecule, GTM calculates its probability to be located at a given point of the map represented by the latent space and visualizes the molecule according to this probability.

3. EXPERIMENTS
3.1. Data and Descriptors. Data. The calculations were performed on a structurally diverse data set of 717 bromides of nitrogen-containing organic cations containing 126 pyridinium bromides (PYR), 384 imidazolium and benzoimidazolium bromides (IMZ), and 207 quaternary ammonium bromides (QUAT). Experimental values of the melting point (mp, °C) were taken from ref 26, where all of the details of data preparation are given. Typical structures of the ILs are shown in Figure 2, and the corresponding distributions of melting point values for the data sets are given in Figure 3. Several types of molecular descriptors were used in model development.

Figure 2. Typical structures of quaternary nitrogen-containing organic bromides in the data set involved in model development (the figure was borrowed from ref 26).

Molecular Operating Environment 2D Descriptors (MOE 2D).47 2D descriptors comprising different physical properties, subdivided surface areas, atom and bond counts, Kier and Hall connectivity and kappa shape indices, adjacency and distance matrix descriptors, pharmacophore feature descriptors, and partial charge descriptors were used in model development.

ISIDA Substructural Molecular Fragments (SMF). Substructural molecular fragments48 are subgraphs of a molecular graph, and their occurrences are the descriptor values. The subclass of SMF descriptors consisting of the shortest topological paths with representation of atoms and bonds was used, where the minimal (nmin) and maximal (nmax) numbers of atoms varied from 2 to 15. The Floyd algorithm49 was applied for

finding the shortest paths in the molecular graphs. Single, double, triple, and aromatic bonds were recognized.

Figure 3. Distribution of melting point values in the data sets of bromides of quaternary nitrogen-containing organic cations: (a) PYR, 126 compounds, mp = 30−200 °C; (b) IMZ, 384 compounds, mp = 5.5−319 °C; and (c) QUAT, 207 compounds, mp = 39−281 °C.

Fingerprints and Descriptors of RCDK Package. Fingerprints are a popular type of molecular representation. The CDK (Chemistry Development Kit) MACCS keys were used in this work for model development. The RCDK package50 of the R software51 was used for their calculation.

3.2. Computational Procedure. The basic implementation of the GTM approach was taken from the Netlab package (MATLAB toolbox for neural networks and pattern recognition, version 3.3).52,53 Additional MATLAB procedures were written to apply GTM to the classification problem. Principal component analysis (PCA)56 was used as a preprocessing step in model development.

Grid Optimization of Selected GTM Parameters. Several parameters have to be adjusted in GTM design: the number of RBF basis functions, the number of latent points, and several hyperparameters, namely, the inverse variance of the prior over the weights (α), the inverse noise variance (β), and the width of the Gaussian basis functions (σ). A grid search was performed to adjust these parameters. Because the discriminatory power of the GTM was considered as the quality criterion, the parameters were selected by optimizing the accuracy obtained in the 5-fold external cross-validation procedure (5-CV).

Classification Models. Classification models can be developed using the values of the class-conditioned probability density function p(t|Ck) computed for each class Ck, where t is the molecular descriptor vector of a compound. Such a function can be built, for each class, by training a separate GTM model on the data belonging to class Ck. The class-conditioned probabilities p(t|Ck) can be used to compute the posterior probabilities of class membership P(Ck|t) for a given compound using Bayes’ theorem:

$$P(C_k \mid t) = \frac{p(t \mid C_k)\, P(C_k)}{p(t)} \tag{3}$$

where P(Ck) = Nk/Ntot is the prior probability of class membership (Nk is the number of compounds belonging to class Ck; Ntot is the total number of compounds), whereas p(t), the marginal probability density function, is the normalization factor:

$$p(t) = \sum_{k} p(t \mid C_k)\, P(C_k) \tag{4}$$

The latter ensures that the estimated posterior probabilities are normalized. By applying eq 3 to each class Ck, one can assess the posterior probability of class membership for each compound.

According to statistical decision theory,54 the optimal class assignment is determined by the maximal value of the posterior class probabilities P(Ck|t). The performance of classification models can be measured by accuracy:

$$\mathrm{Acc} = \frac{tp + tn}{P + N} \tag{5}$$

calculated as a function of the numbers of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) assessed in the cross-validation procedure, where P and N are the numbers of positives and negatives, respectively.

4. RESULTS AND DISCUSSION
4.1. Visualization of the Data. GTM is an unsupervised approach (the information about the compounds’ property is not used in model development). Depending on the selected descriptors, different chemical spaces can be generated for the same set of molecules. In this work, four different descriptor types were used to generate different chemical spaces. The performance of the data visualization models was assessed with the quantitative parameter recently introduced in ref 41 as the Γ-score. This score characterizes the ability of a model to produce similar-structure clustering in a visualization and can be calculated for a data set if information about the classes is available. The greater the value of the Γ-score, the better the separation of the objects in the visualization. For the obtained maps, this value ranges from 0.64 (PYR with SMF descriptors) to 0.71 (PYR with MOE descriptors) (for the model parameters, see Table 1). Figure 4 shows the obtained map of the distribution of ionic liquids according to their melting point values for the IMZ ILs using SMF descriptors. Here, the color corresponds to the average melting point value of the node, while the white background marks the “empty” nodes. It should be noted that almost all of the obtained maps have “high resolution”, with each node containing a small number of compounds, which is related to the fact that these maps demonstrated the best statistical parameters of classification in the parameter optimization. The “holes” in the obtained maps (the “empty” regions) provide ample opportunities for the correct mapping of new data, even data structurally different from those represented before. The optimized parameters (see Computational Procedure) of the “best” models for each descriptor−data set combination are given in Table 1.

In some cases, GTM interpretation is not straightforward, namely, when areas of low mp values neighbor regions of high mp values. Two different interpretations related to this


described in ref 40. In other cases, it can be explained by the curved shape of the manifold, in which two corners are in close vicinity. Another possible peculiarity of GTM maps is demonstrated in Figure 4. Structurally quite similar compounds whose small structural differences nevertheless lead to different mp values can be projected onto exactly the same area of the map. In this case, the coloration of the node is not directly informative because of the significant difference in mp values (for the example shown, the difference is on the order of 100 °C). The considered example is quite rare and is shown here to demonstrate the possible limitations of the approach. Despite its seemingly negative character, this peculiarity can be used as an additional advantage: it offers a simple and intuitive way to identify outliers (ILs with incorrectly measured mp values).

4.2. Predictive Performance of Structure−Property Modeling. To develop the classification models, we categorized the ionic liquids into classes according to their melting points with a step of 30 °C. Thus, IMZ, PYR, and QUAT contained 11, 6, and 9 classes, respectively. The accuracy (Acc) of the models is shown as a function of data set and descriptor type in Figure 5 and varies from 0.81 to 0.87.
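The class definition and the Bayesian decision rule of section 3.2 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and placing the bin edges at multiples of 30 °C is an assumption, since the text gives only the step. Under that assumption, the mp ranges reported in Figure 3 span 6 (PYR), 11 (IMZ), and 9 (QUAT) bins, which agrees with the class counts quoted above. For more than two classes, accuracy is taken here as the fraction of correctly assigned compounds, which reduces to (tp + tn)/(P + N) in the two-class case.

```python
import math

def mp_class(mp_celsius, step=30.0):
    """Class index of a melting point, ASSUMING bin edges at multiples of `step`."""
    return math.floor(mp_celsius / step)

def classes_spanned(mp_min, mp_max, step=30.0):
    """Number of bins between the lowest and highest melting point of a data set."""
    return mp_class(mp_max, step) - mp_class(mp_min, step) + 1

def posteriors(densities, priors):
    """Bayes' theorem: P(Ck|t) = p(t|Ck) P(Ck) / sum_k p(t|Ck) P(Ck)."""
    joint = [p * q for p, q in zip(densities, priors)]
    evidence = sum(joint)  # marginal density p(t), the normalization factor
    return [j / evidence for j in joint]

def predict_class(densities, priors):
    """Assign the class with the maximal posterior probability."""
    post = posteriors(densities, priors)
    return max(range(len(post)), key=post.__getitem__)

def accuracy(y_true, y_pred):
    """Fraction of compounds whose predicted class equals the true class."""
    hits = sum(1 for a, b in zip(y_true, y_pred) if a == b)
    return hits / len(y_true)

# mp ranges taken from the caption of Figure 3
print(classes_spanned(30.0, 200.0))   # PYR
print(classes_spanned(5.5, 319.0))    # IMZ
print(classes_spanned(39.0, 281.0))   # QUAT
```

In the approach described above, the class-conditioned densities passed to `posteriors` would come from one GTM model per class, and the priors would be the class frequencies Nk/Ntot.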

Table 1. Optimized Parameters of the Best GTM Models

descriptor   N princ.   map resolution               N RBF     width       model
type         comp.      (number of latent points)    centers   factor, σ   number

Data Set: PYR
MOE          35         15 × 15                      16        2           1
SMF          35         25 × 25                      25        3           2
MACCS        30         35 × 35                      16        2.5         3
RCDK         20         35 × 35                      16        0.5         4

Data Set: IMZ
MOE          25         20 × 20                      25        1           5
SMF          30         20 × 20                      25        3           6
MACCS        30         20 × 20                      25        3           7
RCDK         25         20 × 20                      25        1           8

Data Set: QUAT
MOE          30         15 × 15                      25        0.5         9
SMF          35         20 × 20                      16        0.5         10
MACCS        35         30 × 30                      25        1           11
RCDK         30         20 × 20                      25        3           12
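The parameter selection that leads to a table like Table 1 can be illustrated with a generic grid search scored by 5-fold cross-validated accuracy. The sketch below is an assumption-laden outline rather than the MATLAB procedure used in this work: `fit_predict` is a hypothetical stand-in for training the per-class GTM models and predicting classes, and the fold assignment is a simple shuffled split.

```python
import itertools
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_accuracy(fit_predict, X, y, params, k=5, seed=0):
    """Accuracy pooled over the k held-out folds for one parameter setting."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx], [y[i] for i in train_idx],
                            [X[i] for i in test_idx], params)
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)

def grid_search(fit_predict, X, y, grid, k=5):
    """Try every combination in `grid`; return (best parameters, best accuracy)."""
    names = sorted(grid)
    best_params, best_acc = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        acc = cross_validated_accuracy(fit_predict, X, y, params, k)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc

# Hypothetical grid built from values that appear in Table 1; the full grid
# actually scanned in the study is not given in the text.
example_grid = {
    "n_principal_components": [20, 25, 30, 35],
    "map_side": [15, 20, 25, 30, 35],
    "n_rbf_centers": [16, 25],
    "sigma": [0.5, 1, 2, 2.5, 3],
}
```

Calling `grid_search(fit_predict, X, y, example_grid)` with a real GTM classifier in place of `fit_predict` would evaluate 4 × 5 × 2 × 5 = 200 settings and keep the one with the highest cross-validated accuracy.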

Figure 5. Prediction performance (accuracy, see section 3.2) of GTM classification models involving different types of descriptors: (a) PYR, (b) IMZ, and (c) QUAT (5-fold cross-validation procedure).

One can see that the obtained models are still far from efficient prediction of the properties of novel ILs. Nevertheless, the results can be considered acceptable for developing a universal predictive tool for navigation in the chemical space of ionic liquids, in view of the objective factors that hinder the development of such a system. The first factor concerns the quality of the available experimental data. The literature quite often reports significantly different mp values for the same compound. One should also avoid using data for which the temperature of decomposition or glass transition is reported instead of the melting point. Another factor is the difficulty of taking into account the structural features of ILs in the solid state (polymorphic effects, eutectics, glass formation). It seems that further improvement of the performance of the

Figure 4. Considered example of GTM map (map corresponds to model 6 in Table 1): in some cases, nodes can contain similar compounds with different mp values.

observation can be considered. Typically, this is observed when the compound has fragments in common with other compounds located in other regions of the map. This case was already


the performance of the developed models (Γ-score41 and accuracy,42 respectively). The impact of descriptor choice on the predictive ability of the developed models was analyzed. The performance of the obtained models (accuracy) varies from 0.81 to 0.87 depending on the descriptor type and data set involved in model development. External validation of the models confirmed the possibility of using GTM as an approach for in silico design of novel ionic liquids.

models for mp of ionic liquids could be achieved by moving first in these directions.

4.3. Impact of Descriptor Type. Four different descriptor types were used for model development in this study (see Data and Descriptors). The pertinence of the corresponding descriptor spaces for structure−property modeling was evaluated for all descriptor types. In most cases, the models developed with MOE and RCDK descriptors demonstrated the best predictive performance. For the other types of descriptors, the results are comparable, and the variations in predictive performance are insignificant.

4.4. External Validation of GTM Models. To further confirm the predictive performance of the developed models, external validation was carried out. For this purpose, at the initial stage of data preparation, 20% of the compounds of the data sets were set aside as an external test set. The remaining compounds were used as a training set with the best parameters estimated in the 5-fold cross-validation procedure. The accuracy (Acc) of the models in external validation is shown as a function of data set and descriptor type in Figure 6. These values vary from 0.78 to



AUTHOR INFORMATION

Corresponding Author

*Tel.: (+7) 916-8257704. Fax: (+7) 495-9520462. E-mail: [email protected], [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS
We thank the Russian Foundation for Basic Research (project no. 11-03-00161), GDRE SupraChem, the ARCUS project, CNRS, and the French Embassy in Russia for support. We acknowledge Prof. Alexandre Varnek (Université de Strasbourg) and Prof. Igor Baskin (Moscow State University).



REFERENCES

(1) Ionic Liquids in Synthesis; Wiley-VCH Verlag GmbH & Co: New York, 2002. (2) Sheldon, R. A. Green Solvents for Sustainable Organic Synthesis: State of the Art. Green Chem. 2005, 7, 267. (3) Welton, T. Room-Temperature Ionic Liquids. Solvents for Synthesis and Catalysis. Chem. Rev. 1999, 99, 2071. (4) Katritzky, A. R.; Jain, R.; Lomaka, A.; Petrukhin, R.; Karelson, M.; Visser, A. E.; Rogers, R. D. Correlation of the Melting Points of Potential Ionic Liquids (Imidazolium Bromides and Benzoimidazolium Bromides) Using CODESSA Program. J. Chem. Inf. Comput. Sci. 2002, 42, 225. (5) Bini, R.; Malvaldi, R.; Pitner, W. R.; Chiappe, C. QSPR Correlation for Conductivities and Viscosities of Low-Temperature Melting Ionic Liquids. J. Phys. Org. Chem. 2008, 21, 622. (6) Matsuda, H.; Yamamoto, H.; Kurihara, K.; Tochigi, K. Computer-Aided Reverse Design for Ionic Liquids by QSPR Using Descriptors of Group Contribution Type for Ionic Conductivities and Viscosities. Fluid Phase Equilib. 2007, 261, 434. (7) Tochigi, K.; Yamamoto, H. Estimation of Ionic Conductivity and Viscosity of Ionic Liquids Using a QSPR Model. J. Phys. Chem. C 2007, 111, 15989. (8) Billiard, I.; Marcou, G.; Ouadi, A.; Varnek, A. In Silico Design of Ionic Liquids Based on Quantitative Structure-Property Relationship Models of Ionic Liquid Viscosity. J. Phys. Chem. B 2011, 115, 93. (9) Mirkhani, S. A.; Gharagheizi, F. Predictive Quantitative Structure-Property Relationship Model for the Estimation of Ionic Liquid Viscosity. Ind. Eng. Chem. Res. 2012, 51, 2470. (10) Lazzus, J. A. r(T,p) Model for Ionic Liquids based on Quantitative Structure-Property Relationship Calculations. J. Phys. Org. Chem. 2009, 22, 1193. (11) Gardas, R. L.; Coutinho, J. A. P. Applying a QSPR Correlation to the Prediction of Surface Tensions of Ionic Liquids. Fluid Phase Equilib. 2008, 265, 57. (12) Freire, M. G.; Neves, C.; Ventura, S.; Pratas, M. J.; Marrucho, I. M.; Oliveira, J.; Coutinho, J. A. P.; Fernandes, A. M. Solubility of Non-Aromatic Ionic Liquids in Water and Correlation Using a QSPR Approach. Fluid Phase Equilib. 2010, 294, 234. (13) Bai, L.; Zhu, J.; Chen, B. Quantitative Structure-Property Relationship Study on Heat of Fusion for Ionic Liquids. Fluid Phase Equilib. 2011, 312, 7. (14) Palomar, J.; Torrechilla, J. S.; Lemus, J.; Ferro, V. R.; Rodriguez, F. A COSMO-RS based Guide to Analyze/Quantify the Polarity of Ionic

Figure 6. External predictive performance (accuracy, see section 3.2) of GTM classification models involving different types of descriptors: (a) PYR, (b) IMZ, and (c) QUAT (external test set).

0.85 depending on the descriptor type and the data set involved in model development, which is in agreement with accuracy values obtained using the 5-fold cross-validation procedure. The latter confirms the possibility to use GTM as an approach for in silico design of novel ionic liquids.
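The external validation protocol described in section 4.4 (set aside 20% of the compounds, train on the rest with the parameters chosen by cross-validation, score on the held-out part) can be sketched as follows. This is a hedged outline: the random, unstratified split is an assumption because the text does not say how the external compounds were chosen, and `fit_predict` is again a hypothetical placeholder for the GTM classifier.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Randomly set aside a fraction of the sample indices as an external test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]  # (training indices, test indices)

def external_accuracy(fit_predict, X, y, best_params, test_fraction=0.2, seed=0):
    """Train on the remaining compounds and score on the held-out external set."""
    train_idx, test_idx = train_test_split_indices(len(X), test_fraction, seed)
    preds = fit_predict([X[i] for i in train_idx], [y[i] for i in train_idx],
                        [X[i] for i in test_idx], best_params)
    hits = sum(p == y[i] for p, i in zip(preds, test_idx))
    return hits / len(test_idx)

# Sizes of the external sets if 20% is taken from each data set separately
for name, size in [("PYR", 126), ("IMZ", 384), ("QUAT", 207)]:
    train_idx, test_idx = train_test_split_indices(size)
    print(name, len(train_idx), len(test_idx))
```

The important design point is that the external compounds take no part in the grid search, so the external accuracy is an independent check on the cross-validated value.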

5. CONCLUSIONS
In this study, to our knowledge for the first time, we proposed a universal approach that can be used both for visualizing the chemical space of ILs according to their melting point values and for developing classification models able to predict the melting points of novel ILs. The models were built using several types of descriptors for three structurally diverse data sets containing 717 bromides of nitrogen-containing organic cations: 126 pyridinium bromides (PYR), 384 imidazolium and benzoimidazolium bromides (IMZ), and 207 quaternary ammonium bromides (QUAT). Clear criteria for data visualization and classification quality were used to assess


Liquids and Their Mixtures with Organic Cosolvents. Phys. Chem. Chem. Phys. 2010, 12, 1991. (15) Katritzky, A. R.; Jain, R.; Lomaka, A.; Petrukhin, R.; Maran, U.; Karelson, M. Perspective on the Relationships between Melting Points and Chemical Structure. Cryst. Growth Des. 2001, 1, 261. (16) Holbrey, J. D.; Reichert, W. M.; Nieuwenhuyzen, M.; Johnston, S.; Seddon, K. R.; Rogers, R. D. Crystal Polymorphism in 1-Butyl-3methylimidazolium Halides: Supporting Ionic Liquid Formation by Inhibition and Crystallization. Chem. Commun. 2003, 1636. (17) Xu, W.; Cooper, E. I.; Angell, C. A. Ionic Liquids: Ion Mobilities, Glass Temperatures and Fragilities. J. Phys. Chem. B 2003, 107, 6170. (18) Katritzky, A. R.; Lomaka, A.; Petrukhin, R.; Jain, R.; Karelson, M.; Visser, A. E.; Rogers, R. D. QSPR Correlation of the Melting Point for Pyridinium Bromides, Potential Ionic Liquids. J. Chem. Inf. Comput. Sci. 2002, 42, 71. (19) Ren, Y.; Qin, J.; Liu, H.; Yao, X.; Liu, M. QSPR Study on the Melting Points of a Diverse Set of Potential Ionic Liquids by Projection Pursuit Regression. QSAR Comb. Sci. 2009, 28, 1237. (20) Bini, R.; Chiappe, C.; Duce, C.; Micheli, A.; Solaro, R.; Starita, A.; Tine, M. R. Ionic Liquids: Prediction of Their Melting Points by a Recursive Neural Network Model. Green Chem. 2008, 10, 306. (21) Yan, C.; Han, M.; Wan, H.; Guan, G. QSAR Correlation of the Melting Points for Imidazolium Bromides and Imidazolium Chlorides Ionic Liquids. Fluid Phase Equilib. 2010, 292, 104. (22) Sun, N.; He, X.; Dong, K.; Zhang, X.; Lu, X.; He, H.; Zhang, S. Prediction of the Melting Points for Two Kinds of Room Temperature Ionic Liquids. Fluid Phase Equilib. 2006, 246, 137. (23) Eike, D.; Brennecke, J.; Maginn, E. Predicting Melting Points of Quaternary Ammonium Ionic Liquids. Green Chem. 2003, 5, 323. (24) Trohalaki, S.; Pachter, R. Prediction of Melting Points for Ionic Liquids. QSAR Comb. Sci. 2005, 24, 485. (25) Trohalaki, S.; Pachter, R.; Drake, G.; Hawkins, T. 
Quantitative Structure-Property Relationships for Melting Points and Densities of Ionic Liquids. Energy Fuels 2005, 19, 279. (26) Varnek, A.; Kireeva, N.; Tetko, I. V.; Baskin, I. I.; Solov’ev, V. P. Exhaustive QSPR Studies of a Large Diverse Set of Ionic Liquids: How Accurately Can We Predict Melting Points? J. Chem. Inf. Model. 2007, 47, 1111. (27) Carrera, G.; Aires-de-Sousa, J. Estimation of Melting Points of Pyridinium Bromides Ionic Liquids with Decision Trees and Neural Networks. Green Chem. 2004, 7, 20. (28) Carrera, G.; Branco, L. C.; Aires-de-Sousa, J.; Afonso, C. Exploration of Quantitative Structure-Property Relationships (QSPR) for the Design of New Guanidinium Ionic Liquids. Tetrahedron 2008, 64, 2216. (29) Bishop, C. M.; Svensen, M. GTM: The Generative Topographic Mapping. Neural Computation 1998, 10, 215. (30) Bishop, C. M.; Svensen, M.; Williams, C. L. I. GTM: A Principled Alternative to the Self-Organizing Map. Tech. Report. Neural Comput. Res. Group, 1997. (31) Svensen, M. GTM: The Generative Topographic Mapping; Aston University, 1998. (32) Oprea, T. I.; Gottfries, J. Chemography: The art of navigating in chemical space. J. Comb. Chem. 2001, 3, 157. (33) Gorban, A. N.; Kegl, B.; Wunsch, D. C.; Zinovyev, A. Principal Manifolds for Data Visualisation and Dimension Reduction; Springer: Berlin−Heidelberg−New York, 2007. (34) Lee, J. A.; Verleysen, M. Nonlinear Dimensionality Reduction; Springer: New York, 2007. (35) Balakin, K. V., Ed. Pharmaceutical Data Mining: Approaches and Applications for Drug Discovery; Wiley: Hoboken, NJ, 2010. (36) Ivanenkov, Y. A.; Bovina, E. V.; Balakin, K. V. Nonlinear Mapping Techniques for Prediction of Pharmacological Properties of Chemical Compounds. Russ. Chem. Rev. 2009, 78, 465. (37) Ivanenkov, Y. A.; Savchuk, N. P.; Ekins, S.; Balakin, K. V. Computational Mapping Tools for Drug Discovery. Drug Discovery Today 2009, 14, 767.

(38) Maniyar, D. M.; Nabney, I. T.; Williams, B. S.; Sewing, A. Data Visualization during the Early Stages of Drug Discovery. J. Chem. Inf. Model. 2006, 46, 1806. (39) Owen, J. R.; Nabney, I. T.; Medina-Franco, J. L.; López-Vallejo, F. Visualization of Molecular Fingerprints. J. Chem. Inf. Model. 2011, 51, 1552. (40) Kireeva, N.; Baskin, I. I.; Gaspar, H. A.; Horvath, D.; Marcou, G.; Varnek, A. Generative Topographic Maps (GTM): universal tool for data visualization, structure-activity modeling and database comparison. Mol. Inf. 2012, 31, 301. (41) Owen, J. R.; Nabney, I.; Medina-Franco, J. L.; Lopez-Vallejo, F. Visualization of Molecular Fingerprints. J. Chem. Inf. Model. 2011, 51, 1552. (42) Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond Accuracy, FScore and ROC: A Family of Discriminant Measures for Performance Evaluation. Adv. Artif. Intell. 2006, 4304, 1015. (43) Jaynes, E. T. Probability Theory. The Logic of Science; Cambridge University Press: Cambridge, 2003; p 727. (44) Bishop, C. M.; Svensén, M.; Williams, C. K. I. GTM: The Generative Topographic Mapping. Neural Comput. 1998, 10, 215. (45) Erwin, E.; Obermayer, K.; Schulten, K. Self-Organizing Maps Ordering, Convergence Properties and Energy Functions. Biol. Cybernetics 1992, 67, 47. (46) Bishop, C. M.; Svensén, M.; Williams, C. K. I. Developments of the Generative Topographic Mapping. Neurocomputing 1998, 21, 203. (47) http://www.chemcomp.com/journal/descr.htm. (48) Varnek, A.; Fourches, D.; Horvath, D.; Klimchuk, O.; Gaudin, C.; Vayer, P.; Solov’ev, V.; Hoonakker, F.; Tetko, I. V.; Marcou, G. ISIDA Platform for virtual screening based on fragment and pharmacophoric descriptors. Curr. Comput.-Aided Drug Des. 2008, 4, 191. (49) Swamy, M. N. S.; Thulasiraman, K. Graphs, Networks, and Algorithms; John Wiley & Sons: New York, 1981. (50) Guha, R. Chemical Informatics Functionality in R. J. Stat. Software 2007, 18, 1−16. (51) R project. http://www.r-project.org/foundation/. (52) Netlab. 
http://www1.aston.ac.uk/eas/research/groups/ncrg/ resources/netlab/. (53) Nabney, I. Algorithms for Pattern Recognition; Springer: London, 2002. (54) Bishop, C. M. Pattern Recognition and Machine Learning; Springer: New York, 2006. (55) Bishop, C. M.; Svensen, M.; Williams, C. L. I. GTM: A Principled Alternative to the Self-Organizing Map. Technical Report. Neural Computing Research Group, 1997. (56) Jolliffe, I. T. Principal Component Analysis; Springer: New York, 2002.
