Ind. Eng. Chem. Res. 2009, 48, 1678–1682
CORRELATIONS

Estimation of Aniline Point Temperature of Pure Hydrocarbons: A Quantitative Structure-Property Relationship Approach

Farhad Gharagheizi,*,†,‡ Behnam Tirandazi,† and Reza Barzin§

Department of Chemical Engineering, Faculty of Engineering, University of Tehran, P.O. Box 11365-4563, Tehran, Iran, Department of Chemical Engineering, Medicinal Plants and Drug Research Institute, Shahid Behesti University, Evin, Tehran, Iran, and Department of Computer Science & Engineering, University of California San Diego, La Jolla, California 92093
In the present work, a quantitative structure-property relationship (QSPR) study is performed to predict the aniline point temperature of pure hydrocarbon components. As a powerful tool, genetic algorithm-based multivariate linear regression (GA-MLR) is applied to select the most statistically effective molecular descriptors for the aniline point temperature of pure hydrocarbon components. Also, a three-layer feed forward neural network (FFNN) is constructed to account for the nonlinear behavior of the molecular descriptors appearing in the GA-MLR result. The obtained results show that the constructed FFNN can accurately predict the aniline point temperature of pure hydrocarbon components.

Introduction

One of the usual problems in science and engineering is the selection of a proper solvent for a particular application. This selection implies that the solvent must form a thermodynamically stable mixture with the desired solute over the whole range of practical conditions. Such selection can be facilitated by the use of a numerical criterion of solvent power. Various numerical criteria have been proposed to estimate the solvent power of materials, but only a few of them are widely used. The solubility parameter,1-6 Hansen solubility parameters,7-10 Kauri butanol value,11 and aniline point11,12 are the most widely used parameters to estimate the solubility of a solute in a solvent.

A parameter that is widely used for estimating the solubility of paints, varnish, and lacquer thinners in the paint industry is the aniline point.11-13 This parameter is also widely used in the petroleum industry to classify petroleum cuts.11-13 The aniline point is defined as the lowest temperature at which equal volumes of aniline and the sample become completely miscible (ASTM D611). Below the aniline point, aniline/sample phase separation occurs. In other words, the aniline point indicates the phase separation temperature of the mixture.
Since the aniline molecule is both polar and readily polarizable, there exists a strong molecular cohesion between aniline and other aromatic compounds. As a result, the aniline point can be used as a parameter for evaluating the degree of aromaticity of a sample.

Prediction of the physicochemical properties of materials from their molecular structures has long been a goal of scientists and engineers. One of the useful methods applied for this purpose is the quantitative structure-property relationship (QSPR).14-18 A QSPR is a mathematical model that predicts the physical, mechanical, or chemical properties of materials from their chemical structures. The main goal of QSPR studies is to find a relationship between the structure of a compound, expressed in terms of numeric characteristics associated with its chemical structure (called molecular descriptors), and the properties of interest. Once a correlation between structure and the desired property is found, any number of compounds, including those not yet synthesized, can be readily screened on a computer in order to select structures with the desired properties. Thus the QSPR approach conserves resources and accelerates the development and application of new molecules for any purpose. Also, since only theoretical descriptors derived solely from the molecular structure are involved, the relation should, in principle, be applicable to any chemical structure. There are certain, rather obvious limitations to its use: (i) the family of compounds used to derive the QSPR (the “training set”) should be chemically similar and (ii) realistic predictions can only be made for compounds that are chemically related to some of those from which the QSPR model was derived; i.e., predictions should be interpolations or short extrapolations.18

In this work, a QSPR study is performed to develop an accurate model for the prediction of the aniline point of hydrocarbons. For this purpose, genetic algorithm-based multivariate linear regression (GA-MLR) and feed forward neural networks (FFNN) are used.

* To whom correspondence should be addressed. Fax: +98 21 66957784. E-mail: [email protected] and [email protected].
† University of Tehran. ‡ Shahid Behesti University. § University of California San Diego.

Methodology

Data Set Preparation. Before beginning our study, we need a data set that contains the experimental aniline point temperature
for hydrocarbon components. One of the useful compilations for this purpose is the API Technical Data Book.12 In this data book, 126 pure hydrocarbon components were found, and their aniline point temperatures were extracted.

Figure 1. Schematic structure of three layer feed forward neural network used in this study.

10.1021/ie801212a CCC: $40.75 2009 American Chemical Society Published on Web 12/15/2008
Ind. Eng. Chem. Res., Vol. 48, No. 3, 2009

Table 1. Five Molecular Descriptors Entered into the Best Obtained Multilinear Equation and Their Physical Meanings

ID  molecular descriptor  type                       definition
1   nR06                  constitutional descriptor  number of 6-membered rings
2   AAC                   information index          mean information index on atomic composition
3   BEHp1                 Burden eigenvalue          highest eigenvalue no. 1 of Burden matrix weighted by atomic polarizability
4   Mor07m                3D-MoRSE descriptor        3D-MoRSE-signal 07 weighted by atomic masses
5   DP16                  Randic molecular profile   molecular profile no. 16

Table 2. Correlation Matrix for Five Selected Descriptors

         nR06    AAC    BEHp1   Mor07m  DP16
nR06     1
AAC      0.37    1
BEHp1    0.36    0.48   1
Mor07m   0.024   0.02   0.03    1
DP16     0.21    0.69   0.61    0.02    1

Table 3. Weight and Bias Matrices of the Best Obtained Three-Layer Feed Forward Neural Network

W1                                                   b1
-0.2038    3.2168   -2.3064   -0.7902   -1.6375      1.8724
-2.7662    1.7227    2.7985    2.1783    2.4763      0.354
 2.3615   -1.0876   -0.2066    1.5375   -0.0132     -0.6796
 1.3475   -0.9686   -1.5965   -1.4502   -1.3849     -0.3563
-1.609    -3.6439    0.7598   -1.4204    0.3587     -1.5907
 2.0126   -2.409     4.1988   -2.1286   -2.7964     -1.1576
-0.7705    1.3437    0.6338   -0.5016    0.3492     -1.434
-0.1157    3.4544    0.3461   -0.0354   -2.2688      2.3786

W2                                                                              b2
2.1466   -1.1286   1.3064   -1.9461   0.7988   0.3074   -1.1748   -2.5172      0.7044

Determination of Molecular Descriptors. In this step, the molecular structures of all 126 components were drawn in HyperChem software19 and optimized using the MM+ molecular mechanics force field. Thereafter, molecular descriptors were calculated from these optimized molecular structures using Dragon software.20 These molecular descriptors have also been calculated for about 234 000 pure compounds using Dragon software and are freely accessible from the web site of the Milano Chemometrics and QSAR Research Group (http://michem.disat.unimib.it/mole_db). For every molecule, 1664 molecular descriptors were calculated using Dragon software. For more information about the types of molecular descriptors that Dragon can calculate, refer to the Dragon software user's guide.20

GA-MLR Calculations. In QSPR studies, after calculating the molecular descriptors from the optimized chemical structures of all components in the data set, the problem is to find a linear equation that predicts the desired property with the fewest variables and the highest accuracy. In other words, the problem is to find a subset of variables (the most statistically effective molecular descriptors for the aniline point) from all available variables (all molecular descriptors) that can predict the aniline point of pure components with minimum error in comparison with the experimental data. A generally accepted method for this problem is the GA-MLR technique, in which a genetic algorithm is used to select the best subset of variables with respect to an objective function. This algorithm was first presented by Leardi et al.21

There are many standard fitness functions, such as $R^2$, adjusted $R^2$, $Q^2$, the Akaike information criterion, the LOF function, and so on, which are used as the objective function in the GA-MLR technique.22 The RQK fitness function is a new fitness function for model searching, proposed to avoid unwanted model properties such as chance correlation, the presence of noisy variables in the models, and other model pathologies that cause a lack of model prediction power.22 It is a constrained fitness function based on the $Q_{Loo}^2$ (leave-one-out cross-validated variance) statistic and four tests that must be fulfilled simultaneously.22 $Q_{Loo}^2$ is defined as

$$Q_{Loo}^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_{ic})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (1)$$
where $y_i$, $\bar{y}$, and $\hat{y}_{ic}$ are the experimental aniline point of the ith component of the data set, the mean value of the aniline point over all components, and the response of the ith object estimated by a model obtained without using the ith object, respectively. Since many conditions are checked during this algorithm, we can be reasonably sure that the final model is valid, has prediction power, and is not a chance correlation. These conditions decrease the possibility of obtaining undesired models. In this study, the RQK function is used. In order to perform GA-MLR, a program was written in MATLAB (Mathworks Inc.). This program was used in the author's previous works.6,23-32

Before performing GA-MLR, the data set must be divided into two subsets: one for training and one for testing. The best model is found using the training set, and the prediction power of this model is then checked on the test set, which serves as an external data set. In this work, 80% of the database was used for the training set and 20% for the test set (in each run of the program, 101 of the 126 components are in the training set and 25 are in the test set). The selection was done randomly from the groups that have predominant descriptors already existent in the training set.

FFNN Calculations. Three-layer feed forward neural networks with the sigmoidal (hyperbolic tangent) transfer function have been the standard technique used in QSPR modeling.33,34 Therefore, in order to consider the nonlinear behavior of the molecular descriptors appearing in the GA-MLR result, three-layer feed forward neural networks (FFNN) are used. Neural networks are good at fitting functions, and there is a proof that a simple neural network can fit any data set very well. As a result, a test set is needed to check the prediction power of the neural network and to prevent overfitting. A test set is only used for checking the produced neural network and is not used to train it. In this section, the same training set and test set used in the GA-MLR section are used to construct a three-layer FFNN.
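The leave-one-out statistic of eq 1 and the 80/20 division of the data set can be sketched in a few lines. The following is only an illustration under stated assumptions: it is written in Python with NumPy rather than the MATLAB program used by the authors, the function names (`fit_mlr`, `predict_mlr`, `q2_loo`, `random_split`) are hypothetical, the split is a plain random shuffle rather than the group-aware selection described above, and the additional RQK constraints are not implemented.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predict responses for the rows of X from a fitted coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def q2_loo(X, y):
    """Leave-one-out cross-validated Q^2 following eq 1."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # drop the ith object
        coef = fit_mlr(X[keep], y[keep])              # model built without object i
        y_hat_ic = predict_mlr(coef, X[i:i + 1])[0]   # response estimated for object i
        press += (y[i] - y_hat_ic) ** 2
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot

def random_split(n, test_fraction=0.2, seed=0):
    """Plain random train/test split (an assumption, not the authors' grouping)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

# With 126 components, a 20% test fraction gives 101 training and 25 test objects.
train_idx, test_idx = random_split(126)
```

Applying the same ratio to the test objects only, with the model fitted on the training set, gives an external statistic of the kind reported later as the external validation coefficient.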
Figure 2. Comparison between the best multilinear results obtained by GA-MLR (eq 2) and the experimental data.
Figure 3. Comparison between the best-obtained FFNN results and the experimental data.
The schematic structure of the three-layer FFNN used in this work is shown in Figure 1. Three-layer feed forward neural networks are available in the Neural Network Toolbox of MATLAB, and all programming for this section was performed in the MATLAB workspace. Usually, all inputs and outputs of an FFNN are normalized between -1 and +1 to decrease the error of calculations. In this work, we normalize the inputs and outputs by means of the minimum and maximum values of each molecular descriptor in the input matrix. The values of W1, W2, b1, and b2 (see Figure 1) are obtained by minimizing an objective function, commonly the sum of squared errors between the outputs of the neural network and the target values (aniline points of hydrocarbons). This minimization is usually performed by the Levenberg-Marquardt algorithm, which is rapid and accurate in training neural networks.33,34
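The normalization and the role of W1, b1, W2, and b2 described above can be illustrated with a short forward-pass sketch. This is a sketch under stated assumptions, not the authors' MATLAB code: the array shapes, the linear output layer, and the function names are assumptions, and the weights and descriptor values below are random placeholders, not fitted values. Training by the Levenberg-Marquardt algorithm is not shown.

```python
import numpy as np

def scale_to_pm1(x, x_min, x_max):
    """Map x linearly from [x_min, x_max] onto [-1, +1]."""
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def unscale_from_pm1(z, x_min, x_max):
    """Inverse of scale_to_pm1: map z from [-1, +1] back to [x_min, x_max]."""
    return 0.5 * (z + 1.0) * (x_max - x_min) + x_min

def ffnn_forward(z, W1, b1, W2, b2):
    """Three-layer network: inputs -> tanh hidden layer -> single output.

    Assumed shapes: z (n_samples, n_inputs), W1 (n_hidden, n_inputs),
    b1 (n_hidden,), W2 (n_hidden,), b2 scalar. The linear output layer
    is an assumption; the text only specifies the hidden transfer function.
    """
    hidden = np.tanh(z @ W1.T + b1)   # (n_samples, n_hidden)
    return hidden @ W2 + b2           # (n_samples,)

# Hypothetical data: 6 compounds, 5 descriptors (placeholder values only).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(6, 5))
x_min, x_max = X.min(axis=0), X.max(axis=0)
Z = scale_to_pm1(X, x_min, x_max)

# Placeholder weights for a 5-8-1 structure.
n_hidden = 8
W1 = rng.normal(size=(n_hidden, 5))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=n_hidden)
b2 = 0.1
y_scaled = ffnn_forward(Z, W1, b1, W2, b2)
```

Since the targets are normalized in the same way, a network output would be mapped back to an aniline point with `unscale_from_pm1` and the minimum and maximum of the training targets.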
In most cases, the number of neurons in the hidden layer (n) is fixed, and a neural network is then trained to predict the target values as accurately as possible. This is repeated until the best neural network structure is obtained. In many cases, especially for three-layer FFNNs, it is better to also optimize the number of neurons in the hidden layer as a complementary step. In this study, a three-layer FFNN is produced for predicting the aniline point of pure hydrocarbon components, and the number of neurons in its hidden layer is then optimized. This procedure was used in the author's previous works.25,28,30,33,35

Results and Discussion

By the presented procedure, using the GA-MLR technique, the best multivariate linear equation was obtained. To obtain this equation, first, the best one-descriptor model was obtained. Then the best two-descriptor model was obtained. This procedure was repeated to obtain the best models with three, four, five, and more molecular descriptors. The best multivariate linear model has five descriptors because a further increase in the number of molecular descriptors does not have any considerable effect on the accuracy of the model. This equation and its statistical parameters are presented as follows:

$$AP = 2214.72876(\pm 122.6869) + 20.02071(\pm 3.67065)\,nR06 - 2804.15625(\pm 94.97153)\,AAC + 115.54459(\pm 21.92706)\,BEHp1 + 5.77585(\pm 0.79122)\,DP16 + 81.73048(\pm 13.82693)\,Mor07m \qquad (2)$$

$n_{training} = 101$; $n_{test} = 25$; $R^2 = 0.9660$; $Q_{Loo}^2 = 0.9613$; $Q_{L15O}^2 = 0.9313$; $Q_{Boot}^2 = 0.9599$; $Q_{EXT}^2 = 0.9479$; $s = 11.95$; $a(R^2) = 0.006$; $F = 8229.781$
∆K = 0.030; ∆Q = 0.001; RP = 0.004; RN = 0.001

where ∆K, ∆Q, RP, and RN are the four constraints of the RQK fitness function, which must be equal to or greater than zero, as stated by Todeschini et al.22 The molecular descriptors and their physical meanings are presented in Table 1. Table 2 presents the correlation matrix, which shows that the five selected molecular descriptors are not highly correlated. The terms $n_{training}$ and $n_{test}$ are the numbers of components in the training set and test set, respectively.

To further check the validity of the model, the bootstrap, y-scrambling, and external validation techniques were used. Bootstrapping was repeated 5000 times, and y-scrambling was repeated 300 times. As can be seen, the small differences between $Q_{Loo}^2$, $Q_{Boot}^2$, $Q_{EXT}^2$ (which is defined by eq 1 but over the test objects), and $R^2$ show that the obtained model has good prediction power. When the number of objects in the data set is quite large (such as in this work), the predictive ability obtained by leave-one-out cross validation is too optimistic, because the perturbation of the data is too small when only one object is left out. Therefore, in these types of problems, the leave-more-out cross validation technique is used. In this work, leave-15-out cross validation was used. This technique was repeated 100 times, and the average value of the cross validation coefficient was 0.9321 ($Q_{L15O}^2$). Also, the intercept of the y-scrambling technique has a low value ($a(R^2) = 0.006$), which supports the validity of the model (Todeschini et al.22 presented the procedure of y-scrambling). Also, the values of the four constraints of the model are equal to or greater than zero, which shows that the model is valid and is not a chance correlation. All the validation techniques show that the obtained model is valid and can be used to predict the aniline point of pure components.

“nR06” is the number of six-membered rings in a molecule.
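Because eq 2 is linear in the five descriptors, the sign of each coefficient gives the direction in which the predicted AP moves when one descriptor changes and the others are held fixed, which is the basis of the descriptor-by-descriptor discussion here. A minimal sketch of evaluating eq 2 follows; the coefficients are copied from eq 2, while the function name and the descriptor values in the example are hypothetical placeholders, not Dragon output for any real compound.

```python
# Coefficients of eq 2, copied from the text (standard errors omitted).
EQ2_INTERCEPT = 2214.72876
EQ2_COEFFS = {
    "nR06": 20.02071,
    "AAC": -2804.15625,
    "BEHp1": 115.54459,
    "DP16": 5.77585,
    "Mor07m": 81.73048,
}

def aniline_point_eq2(descriptors):
    """Evaluate eq 2 for a mapping from descriptor name to value."""
    return EQ2_INTERCEPT + sum(
        coeff * descriptors[name] for name, coeff in EQ2_COEFFS.items()
    )

# Hypothetical descriptor values, chosen only to illustrate the arithmetic.
base = {"nR06": 1, "AAC": 1.0, "BEHp1": 3.0, "DP16": 2.0, "Mor07m": 0.5}
one_more_ring = dict(base, nR06=2)

# With all other descriptors fixed, one additional six-membered ring changes
# the prediction by exactly the nR06 coefficient.
delta_ring = aniline_point_eq2(one_more_ring) - aniline_point_eq2(base)
```

In a real molecule the descriptors are not independent (adding a ring also changes AAC, BEHp1, DP16, and Mor07m), so this one-at-a-time reading is only a guide to the direction of each effect.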
Some of these rings are aromatic rings (such as benzene), and as a result, an increase in AP with an increase in “nR06” is expected. “AAC” is a measure of atomic composition. When the molecule is smaller and its elemental composition is simpler, this descriptor decreases and therefore the AP increases. “BEHp1” gives information about the chemical similarity/diversity of the considered molecules. When this descriptor increases, the AP increases. “Mor07m” is based on the idea of obtaining information from the 3D atomic coordinates by the transform used in electron diffraction studies for preparing theoretical curves.38 Therefore this descriptor gives information about the chemical similarity/diversity of the 3D structures of the considered molecules. “DP16” is a measure of atomic distances in a molecule. When the average distances between atoms in a molecule decrease, the polarizability of that molecule increases, and as a result, the AP of that molecule increases.

The values of the aniline point predicted by eq 2 are compared with the experimental data in Figure 2. The predicted and experimental values are also presented as Supporting Information, together with the values of the descriptors and the status of all components (training set or test set).

To obtain the optimal number of neurons, several three-layer FFNNs were checked, with the number of neurons in the hidden layer varied between 1 and 15. The best three-layer FFNN obtained for prediction of the aniline point of pure hydrocarbons has a 5-8-1 structure. The values of W1, W2, b1, and b2 for this three-layer FFNN are presented in Table 3. The values of the aniline point predicted by this FFNN are compared with the API experimental data in Figure 3. This comparison shows that the FFNN can predict the aniline point with a squared correlation coefficient of 0.9888 ($R^2 = 0.9888$). The average absolute error of the model over all 126 compounds is 11.39%. Thus, the constructed FFNN predicts the aniline point of pure hydrocarbon components more accurately than eq 2.

Conclusion

In this paper, a QSPR analysis was performed for prediction of the aniline point of pure hydrocarbons. This QSPR study contains two parts: a linear study and a nonlinear study. The result of the first part is the input to the second part.
The linear part consists of selecting the most statistically effective molecular descriptors for the aniline point of pure hydrocarbon components. This part was performed using the GA-MLR technique based on 1664 molecular descriptors. The result of this part is a five-parameter multilinear equation. All five parameters of this equation can be calculated solely from the chemical structure of the components. The nonlinear part of this study consists of generating a neural network on the five parameters of the multilinear model obtained by GA-MLR in the first part. In this part, the five parameters of the multilinear model were used as inputs to an FFNN, and finally, an optimized FFNN with a 5-8-1 structure was presented for prediction of the aniline point of pure components. This optimized FFNN is accurate and can be applied for prediction of the aniline point of any regular pure hydrocarbon.

Supporting Information Available: Table showing the 126 pure components used in this study, their aniline points extracted from the API Technical Data Book, and the values predicted using eq 2 and the optimized neural network. The five molecular descriptors used to predict the aniline point are also presented. This material is available free of charge via the Internet at http://pubs.acs.org.

Literature Cited

(1) Hildebrand, J. H.; Scott, R. L. Solubility of Non-Electrolytes, 3rd ed.; Reinhold: New York, 1964.
(2) Hildebrand, J. H.; Scott, R. L. Regular Solutions; Prentice-Hall: Englewood Cliffs, NJ, 1962.
(3) Hildebrand, J. H.; Scott, R. L. Regular and Related Solutions; van Nostrand-Reinhold: Princeton, NJ, 1970.
(4) Barton, A. F. M. Handbook of Solubility Parameters and Other Cohesion Parameters; CRC Press: Boca Raton, FL, 1983.
(5) Barton, A. F. M. Handbook of Polymer-Liquid Interaction Parameters and Solubility Parameters; CRC Press: Boca Raton, FL, 1990.
(6) Gharagheizi, F. QSPR Studies for Solubility Parameter by Means of Genetic Algorithm-Based Multivariate Linear Regression and Generalized Regression Neural Network. QSAR Comb. Sci. 2008, 27, 165.
(7) Hansen, C. M. Hansen Solubility Parameters: A User's Handbook; CRC Press: Boca Raton, FL, 2000.
(8) Gharagheizi, F. New Procedure to Calculate Hansen Solubility Parameters of Polymers. J. Appl. Polym. Sci. 2007, 103, 31.
(9) Gharagheizi, F.; Angaji, M. T. A New Improved Method for Estimating Hansen Solubility Parameters of Polymers. J. Macromol. Sci. 2006, B45, 285.
(10) Gharagheizi, F.; Sattari, M.; Angaji, M. T. Effect of Calculation Method on the Values of Hansen Solubility Parameters of Polymers. Polym. Bull. 2005, 57, 377.
(11) Wypych, G., Ed. Handbook of Solvents; ChemTech Publishing: Toronto, 2001.
(12) API Technical Data Book-Petroleum Refining, 7th ed.; Epcon International and The American Petroleum Institute, 2005.
(13) Shoemaker, B. H.; Bolt, J. A. Determination of Mixed Aniline Points of Hydrocarbon Solvents. Ind. Eng. Chem. 1942, 14, 200.
(14) Katritzky, A. R.; Lobanov, V.; Karelson, M. QSPR: The Correlation and Quantitative Prediction of Chemical and Physical Properties from Structure. Chem. Soc. Rev. 1995, 24, 279.
(15) Karelson, M.; Lobanov, V.; Katritzky, A. R. Quantum-Chemical Descriptors in QSAR/QSPR Studies. Chem. Rev. 1996, 96, 1027.
(16) Katritzky, A. R.; Maran, U.; Lobanov, V.; Karelson, M. Structurally Diverse Quantitative Structure-Property Relationship Correlations of Technologically Relevant Physical Properties. J. Chem. Inf. Comput. Sci. 2000, 40, 1.
(17) Katritzky, A. R.; Dobchev, D. A.; Karelson, M. Physical, Chemical, and Technological Property Correlation with Chemical Structure: The Potential of QSPR. Z. Naturforsch. 2006, B61, 373.
(18) Katritzky, A. R.; Fara, D. C. How Chemical Structure Determines Physical, Chemical, and Technological Properties: An Overview Illustrating the Potential of Quantitative Structure-Property Relationships for Fuels Science. Energy Fuels 2005, 19, 922.
(19) HyperChem, Release 7.5 for Windows, Molecular Modeling System, Hypercube, Inc., 2002.
(20) Talete srl, Dragon for Windows (Software for Molecular Descriptor Calculation), version 5.4, 2006; http://www.talete.mi.it/.
(21) Leardi, R.; Boggia, R.; Terrile, M. Genetic Algorithms as a Strategy for Feature Selection. J. Chemom. 1992, 6, 267.
(22) Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. Detecting “bad” Regression Models: Multicriteria Fitness Functions in Regression Analysis. Anal. Chim. Acta 2004, 515, 199.
(23) Gharagheizi, F. QSPR Analysis for Intrinsic Viscosity of Polymer Solutions by Means of GA-MLR and RBFNN. Comput. Mater. Sci. 2007, 40, 159.
(24) Gharagheizi, F.; Mehrpooya, M. Prediction of Standard Chemical Exergy by a Three Descriptors QSPR Model. Energy Convers. Manage. 2007, 48, 2453.
(25) Gharagheizi, F. A New Neural Network Quantitative Structure-Property Relationship for Prediction of θ (Lower Critical Solution Temperature) of Polymer Solutions. e-Polymers 2007, Article Number 114.
(26) Gharagheizi, F. A New Molecular-Based Model for Prediction of Enthalpy of Sublimation of Pure Components. Thermochim. Acta 2008, 469, 8.
(27) Gharagheizi, F. A Simple Equation for Prediction of Standard Net Heat of Combustion of Pure Chemicals. Chemom. Intell. Lab. Syst. 2008, 91, 177.
(28) Gharagheizi, F.; Fazeli, A. Prediction of Watson Characterization Factor of Hydrocarbon Compounds from Their Molecular Properties. QSAR Comb. Sci. 2008, 27, 758.
(29) Gharagheizi, F.; Alamdari, R. F. Prediction of Flash Point Temperature of Pure Components Using a Quantitative Structure-Property Relationship Model. QSAR Comb. Sci. 2008, 27, 679.
(30) Gharagheizi, F.; Alamdari, R. F. A Molecular-Based Model for Prediction of Solubility of C60 Fullerene in Various Solvents. Fuller. Nanotub. Car. N. 2008, 16, 40.
(31) Vatani, A.; Mehrpooya, M.; Gharagheizi, F. Prediction of Standard Enthalpy of Formation by a QSPR Model. Int. J. Mol. Sci. 2007, 8, 407.
(32) Sattari, M.; Gharagheizi, F. Prediction of Molecular Diffusivity of Pure Components into Air: A QSPR Approach. Chemosphere 2008, 72, 1298.
(33) Taskinen, J.; Yliruusi, J. Prediction of Physicochemical Properties Based on Neural Network Modelling. Adv. Drug Deliv. Rev. 2003, 55, 1163.
(34) Karelson, M.; Dobchev, D. A.; Kulshyn, O. V. Neural Networks Convergence Using Physicochemical Data. J. Chem. Inf. Model. 2006, 46, 1891.
(35) Gharagheizi, F.; Alamdari, R. F.; Angaji, M. T. A New Neural Network-Group Contribution Method for Estimation of Flash Point. Energy Fuels 2008, 22, 1628.
(36) Gharagheizi, F. Quantitative Structure-Property Relationship for Prediction of Lower Flammability Limit of Pure Compounds. Energy Fuels 2008, 22, 3037.
(37) Gharagheizi, F.; Mehrpooya, M. Prediction of Some Important Physical Properties of Sulfur Compounds Using QSPR Models. Mol. Divers. 2008, 12, 143-155.
(38) Schuur, J. H.; Selzer, P.; Gasteiger, J. The Coding of the Three-Dimensional Structure of Molecules by Molecular Transforms and Its Application to Structure-Spectra Correlations and Studies of Biological Activity. J. Chem. Inf. Comput. Sci. 1996, 36, 334.
Received for review August 6, 2008
Revised manuscript received October 16, 2008
Accepted November 4, 2008
IE801212A