ARTICLE pubs.acs.org/EF

Flash Point and Cetane Number Predictions for Fuel Compounds Using Quantitative Structure Property Relationship (QSPR) Methods

Diego Alonso Saldana,† Laurie Starck,† Pascal Mougin,† Bernard Rousseau,‡ Ludivine Pidol,† Nicolas Jeuland,† and Benoit Creton*,†

† IFP Energies nouvelles, 1 et 4 Avenue de Bois-Preau, 92852 Rueil-Malmaison, France
‡ Laboratoire de Chimie Physique, Universite Paris Sud 11, UMR 8000 CNRS, 91405 Orsay Cedex, France

ABSTRACT: In the present work, we report the development of models for the prediction of two fuel properties, flash points (FPs) and cetane numbers (CNs), using quantitative structure–property relationship (QSPR) approaches. The compounds within the scope of the QSPR models are those likely to be found in alternative jet and diesel fuels, i.e., hydrocarbons, alcohols, and esters. A database containing FPs and CNs for these types of molecules has been built using experimental data available in the literature. Various approaches have been used, ranging from those leading to linear models, such as genetic function approximation and partial least squares, to those leading to nonlinear models, such as feed-forward artificial neural networks, general regression neural networks, support vector machines, and graph machines. Except for the graph machine method, for which the only inputs are the simplified molecular input line entry specification (SMILES) formulas, the approaches listed above work on molecular descriptors and functional group count descriptors and were used to build specific models for FPs and CNs. For each property, the individual predictive models return slightly different responses for each molecular structure. Thus, final models labeled as “consensus models” were built by averaging the predicted values of selected individual models. Predicted results were compared to experimental data and to the predictions of existing models in the literature. The models were used to predict FPs and CNs of molecules for which, to the best of our knowledge, there are no experimental data in the literature. Using information in the database, the evolution of the properties with an increasing number of carbon atoms in families of compounds was studied.

Received: May 31, 2011
Revised: July 22, 2011
Published: July 25, 2011
dx.doi.org/10.1021/ef200795j | Energy Fuels 2011, 25, 3900–3908
© 2011 American Chemical Society

1. INTRODUCTION

The problem of climate change implies that greenhouse gas emissions should be reduced. Moreover, fuel availability at a reasonable cost seems more and more uncertain. While energy demand remains high, peak oil is expected in the coming years.1 In that context, alternative fuels, and especially biofuels meeting sustainability criteria, seem to be a promising solution. Biofuels can be thought of as mixtures of “renewable” molecules, such as normal and isoparaffins, naphthenic and aromatic compounds, normal and iso-olefins, alcohols, and/or esters. Industrial processes, such as Fischer–Tropsch (FT) synthesis and hydrotreatment of vegetable oils (HVO), can provide normal and isoparaffinic compounds.2,3 Naphthenic and aromatic compounds can be obtained from the liquefaction or pyrolysis of biomass.4,5 Fermentation processes lead to the synthesis of alcohols, and transesterification processes enable the production of esters from raw vegetable oils, such as rapeseed, sunflower, palm, or jatropha.6,7 The introduction of these molecules, whose chemistry differs from that of petroleum derivatives, requires a large amount of research and development work. In fact, it seems essential to understand the impact of the introduction of these compounds on the physical properties of alternative fuels. The knowledge of fuel properties is extremely important because they drive the conditions for storage and transportation as well as the combustion quality. Depending on the fuel (gasoline, jet, or diesel), the properties and their required specifications can be different.

Safety-related properties are crucial for jet fuel, and this is made apparent in the specification by a limit on the flash point (FP). The FP is the lowest temperature at which the vapor of a volatile liquid ignites when brought into contact with a flame. In the specific case of jet fuels, and more precisely Jet A-1, the American Society for Testing and Materials (ASTM) standard D1655 requires the FP to be at least 38 °C.8 Liaw et al. have proposed a methodology to predict the FP of mixtures based on Le Chatelier’s rule, the Antoine equation, and the composition of the vapor phase.9 During the past decade, it has been shown that mixtures can exhibit non-ideal FP behaviors (with a minimum or maximum), and a large effort has been devoted to evaluating activity coefficients, mainly using the universal functional activity coefficient (UNIFAC), universal quasichemical (UNIQUAC), Wilson, or non-random two-liquid (NRTL) models, to account for this non-ideal behavior.10–14 Within a blend of many compounds, such as a fuel, it has been observed that the FP of the mixture is, in first approximation, governed by that of the molecule exhibiting the smallest FP value.15 Thus, the knowledge of the FP of individual compounds remains essential during the formulation of alternative fuels. Measurements can be difficult to carry out when complex and/or difficult to isolate molecules are considered.

Correlative models that combine various mathematical forms with experimental properties as coefficients have been developed to predict the FP of pure compounds and mixtures.16,17 Most of the predictive models developed during the last few years are based on quantitative structure–property relationship (QSPR) approaches.18 Linear regressions have been established to predict the FP of esters using a genetic algorithm (GA) approach.19,20 Recent studies have shown that the use of artificial neural network (ANN) approaches leads to a significant improvement of models when predicting the FP.18,21–26 Most of these models differ in the chemical families considered; e.g., some predictive models are trained over data sets containing only hydrocarbons,22,23 while others are intended to estimate the FP of diverse organic compounds, including those containing heteroatoms, such as nitrogen, oxygen, sulfur, etc.21,24–26 A neural network model based on group contributions has been developed by Gharagheizi et al. to predict the FP, considering 1294 diverse organic compounds in the data set and reporting a squared correlation coefficient, R², of 0.97 computed over a test set.27 Carroll et al. have recently proposed a method to predict the FP of hydrocarbons using experimental boiling points and additive groups, reporting an average absolute deviation of 2.9 K for their model.28 Batov et al. have proposed an additive group technique to estimate the FP of alcohols, ketones, and esters; however, the authors considered only 89 compounds to fit the parameters.29 Carroll et al. have very recently adapted their model to account for various chemical compounds using a database containing 1000 molecules.30 Moreover, Gharagheizi et al. have developed a similar empirical method trained over a database containing 1471 compounds.31 Absolute average errors reported by Carroll et al. and Gharagheizi et al. are 2.5 and 2.4 K, respectively.
Nevertheless, these empirical models need boiling point values as input parameters, and no external test set of molecules seems to have been used to validate the models, which leaves their predictive power undiscussed.

With regard to diesel fuel, one of the most stringent properties linked to combustion is the control of ignition, which is expressed by the cetane number (CN). The CN reflects the tendency of a molecule to autoignite when exposed to heat and pressure, as happens in a diesel engine under working conditions. Two standard compounds are used to define the CN scale: isocetane (2,2,4,4,6,8,8-heptamethylnonane, also called HMN) and cetane (n-hexadecane), whose CNs are fixed at 15 and 100, respectively. The CN, which is used to quantify the combustion quality in diesel engines, is required by ASTM D97532 to be at least 40 and by EN59033 to be at least 51. CN measurements are time-consuming and require a large volume of sample (about 1 L) when obtained by running the product in a single-cylinder cooperative fuel research (CFR) engine according to ASTM D61334 or EN ISO 5165.35 Empirical equations whose parameters are physical property values have been developed to predict the CN of diesel fuels.36,37 Because of the variety of compounds in fuels, recent studies concern the simplification of the composition, leading to surrogate fuels.38,39 CNs of these mixtures are estimated using a linear volume fraction mixing rule, which requires knowledge of the CN of each surrogate component.40

Some models dedicated to pure hydrocarbons have been reported in the literature.18,41–44 Yang et al. have developed a predictive model for the CN using a neural network trained over 21 n- and isoparaffins, reporting an R² of 0.97.41 A few years later, Santana et al. extended the database to 147 hydrocarbons and developed neural network models to predict the CN.42 The hydrocarbons in the database were divided into two subgroups, saturated and unsaturated compounds, with both models showing standard errors of 8 CN units. Recently, a QSPR approach based on genetic function approximation was used by Creton et al. to predict the CN of hydrocarbons.43 To improve the prediction (absolute average error between 2.3 and 6.9 CN units), the authors divided the database into four classes of compounds, each one corresponding to a specific chemical family: (i) n- and isoparaffins, (ii) naphthenes, (iii) aromatics, and (iv) n- and iso-olefins. To the best of our knowledge, only one of the models presented in the literature accounts for oxygenated compounds, which are thought to be the main components of alternative fuels.44

The aim of the present work is to develop models for the prediction of the FP and CN of compounds envisaged in alternative jet fuels and diesel fuels, respectively. Thus, the FP and CN models presented hereafter are dedicated to hydrocarbons and oxygenated compounds. Models have been established using various QSPR approaches. The paper is organized as follows: in section 2, we detail the databases and present the QSPR methodologies followed; in section 3, we detail the predictive models, and the calculated values are discussed and compared to other models and/or experimental results, when available. The paper ends with section 4, which gives the conclusions.
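The two mixture rules mentioned in the introduction (the lowest component FP as a first approximation for a blend, and the linear volume fraction mixing rule for the CN of surrogate fuels) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example blend are assumptions and do not come from the paper; only the CN values of the two reference compounds (15 and 100) are taken from the text.

```python
from typing import Sequence


def blend_cetane_number(volume_fractions: Sequence[float],
                        cetane_numbers: Sequence[float]) -> float:
    """Linear volume-fraction mixing rule: CN_mix = sum_i v_i * CN_i."""
    if len(volume_fractions) != len(cetane_numbers):
        raise ValueError("one cetane number is needed per component")
    total = sum(volume_fractions)
    if total <= 0.0:
        raise ValueError("volume fractions must sum to a positive value")
    # Normalizing lets the caller pass fractions, percentages, or volumes.
    weighted = sum(v * cn for v, cn in zip(volume_fractions, cetane_numbers))
    return weighted / total


def blend_flash_point_first_approx(flash_points_k: Sequence[float]) -> float:
    """First approximation for a multicomponent blend: the lowest component FP."""
    if not flash_points_k:
        raise ValueError("at least one component is required")
    return min(flash_points_k)


# Reference compounds of the CN scale (values stated in the text).
CN_ISOCETANE = 15.0   # 2,2,4,4,6,8,8-heptamethylnonane (HMN)
CN_CETANE = 100.0     # n-hexadecane

# Hypothetical 60/40 (v/v) cetane/HMN blend: 0.6*100 + 0.4*15 = 66.
cn_blend = blend_cetane_number([0.6, 0.4], [CN_CETANE, CN_ISOCETANE])
print(f"CN of blend: {cn_blend:.1f}")
```

Both rules need pure-compound values as inputs, which is why pure-compound FP and CN predictions are the focus of the models developed here.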

2. MATERIALS AND METHODS

2.1. Experimental Data. The predictive quality of QSPR models is largely influenced by the size and quality of the database used to train them. Our FP database was built on the basis of experimental data gathered from QSPR studies available in the literature,28,45,46 and additional experimental FPs were gathered from sources such as the Design Institute for Physical Properties (DIPPR) database, websites of chemical companies, the chemical database of the University of Akron, and Yaws’ handbook.47–52 Only experimental FPs of molecules belonging to a family of interest were considered. In some cases, the experimental FP values reported for a molecule vary depending upon the source of data; only one value has been retained for each molecule. Priority was given to values reported as “accepted” by the DIPPR staff members. The complete database containing FPs of 625 hydrocarbons and oxygenated compounds can be extracted from the Supporting Information.

The database used by Creton et al. and the compendium by Murphy et al. were used to build the database of CNs of hydrocarbons and oxygenated compounds, respectively.43,53 However, because most of the data were acquired several decades ago, an uncertainty is associated with the CN measurement for many of these compounds.42 This error is mainly due to (i) the impurities in the chemical products, (ii) the way that the CN was measured (pure molecule or in a mixture with a fuel), and (iii) in some cases, the CN being calculated from reported cetene or octane number values. For many compounds, multiple reported values may differ by 5 or 10 CN units,53 and the median value was taken into account. A complete database containing CNs of 299 hydrocarbons was established, and all these data can be extracted from the Supporting Information.
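The rule of keeping the median when several CN values are reported for one compound can be written as a short helper. This is a minimal sketch; the compound labels and values below are hypothetical placeholders, not entries of the actual database.

```python
from statistics import median
from typing import Dict, Mapping, Sequence


def consolidate_cn(reported: Mapping[str, Sequence[float]]) -> Dict[str, float]:
    """Keep a single CN per compound: the median of all reported values."""
    return {name: median(values) for name, values in reported.items() if values}


# Hypothetical example: reported values spread over 5 to 10 CN units.
reported_values = {
    "compound A": [40.0, 45.0, 50.0],
    "compound B": [20.0, 30.0],
}
print(consolidate_cn(reported_values))
```

With an even number of reports, `statistics.median` returns the mean of the two central values, so a single outlying report has limited influence on the retained CN.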
In the case of data for which the reported experimental value is highly doubtful, compounds were set aside, and comparisons will be performed with the predictions of our model.

2.2. Molecular Structures. Simplified molecular input line entry specification (SMILES) formulas were assigned to each molecule. Most of the SMILES were extracted from the DIPPR database, but for molecules outside the DIPPR database as well as for molecules having particularly complex specifications, such as chirality or configurations around double bonds, SMILES formulas were created manually. The Python library Pybel was used to convert all SMILES formulas to canonical SMILES forms.54 Three-dimensional (3D) structures of molecules were automatically generated from the SMILES formulas using Pybel. All geometries were then optimized using the Forcite module of the Materials Studio 5.0 software55 with the condensed-phase optimized molecular potentials for atomistic simulation studies (COMPASS) force field, and atomic charges were attributed using the Gasteiger method.56–58 Finally, all structures were translated and rotated to minimize moments of inertia around the x, y, and z axes of the Cartesian coordinate system.

Table 1. Functional Groups Identified To Be Relevant(a)

number   group      SMARTS notation
1        H          [H]
2        CH3        [CX4H3]
3        CH2        [CX4H2]
4        >CH        [CX4H1]
5        >C<        …
…        (>CH)R     [CX4H1R]
13       (>C<)R     …
…        (>C=O)R    [CX3H0R]=[O]
…        CHO        [CX3H1]=[O]
26       COOH       [CX3H0](=[O])[OX2H1]
27       COO        [CX3H0](=[O])[OX2H0]
28       aaCa       [cX3H0](:*)(:*):*

(a) Symbols are defined as follows: s, a, and R stand for simple bond, aromatic bond, and aliphatic ring, respectively.

2.3. Molecular Descriptor Calculation and Selection. On the basis of the minimized geometries, the Materials Studio software was used to compute a large number of one-dimensional (1D), two-dimensional (2D), and 3D molecular descriptors.43 Because an exhaustive search over all descriptor subsets is not feasible, heuristic methods have been developed to select descriptors of particular importance.59 The main goals of this preselection stage can be summarized as (i) maximizing the usefulness of the descriptors and (ii) minimizing the presence of descriptors that are excessively intercorrelated. Linear correlation coefficients, such as the well-known Pearson correlation coefficient, are commonly used and are included in off-the-shelf QSPR software packages, such as the QSAR module of Materials Studio. Katritzky et al. have successfully used multilinear regression to explore the descriptor space in QSPR problems.60 For nonlinear dependencies, it is possible to use a nonlinear model built on a single variable as a replacement for linear correlation.61 A different family of approaches, known as wrapper methods, consists of descriptor selection methods “wrapped around” learning algorithms.62 Because of the convenience of support vector machines (SVMs),63 a number of SVM-based wrapper methods, such as recursive feature elimination (RFE)64 and GAs,65 have been reported in the literature. Moreover, some simpler descriptor selection methods, such as forward selection and backward elimination, are considered effective.66 Some of the learning algorithms used and described hereafter include built-in feature selection or dimensionality reduction methods. For methods having no implicit feature selection, such as ANNs, SVMs, and general regression neural networks (GRNNs), forward selection wrapped around an SVM fit was applied to find the best set of descriptors. Descriptions of the selected molecular descriptors and their values computed for each molecule belonging to the databases are given in the Supporting Information.

The decomposition of molecular structures into functional groups is also a well-known technique used to generate so-called functional group count descriptors.67 In this study, on the basis of the functional groups listed by Pan et al.,67 28 functional groups were identified to be relevant by analyzing the chemical structures in our databases. These 28 functional groups are given in Table 1. The substructure search was carried out using the SMARTS language over the SMILES formulas of each compound belonging to the database. The results of the substructure search can be found in the Supporting Information.

2.4. Data Partitioning. Before building the QSPR models, the main database was split into three data subsets: (i) The training set is used to train each learning algorithm, and during cross-validation, it is split into several folds, which are used as cross-validation training and cross-validation test sets. (ii) The validation set serves as an additional tool to estimate the predictive error of individual QSPR models, as well as an indicator to determine which models generalize better. (iii) The test set is used to estimate the predictive error of the final model. The total number of compounds in the FP and CN databases allows us to build training, validation, and test sets consisting of 70, 20, and 10% of the entire database, respectively. The split was carried out using the random sampling functions implemented in the R statistical environment. A random selection was preferred over other methods in order not to favor one of the approaches.
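A random 70/20/10 partition of this kind can be sketched as follows. The paper used the random sampling functions of R; the Python version below is a stand-in written for illustration, and the function name, the seed, and the integer identifiers are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(items: Sequence[T],
                  fractions: Tuple[float, float, float] = (0.70, 0.20, 0.10),
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Randomly split items into training, validation, and test subsets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)          # seeded for reproducibility
    n_train = int(fractions[0] * len(shuffled))    # truncated toward zero
    n_valid = int(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]            # remainder goes to the test set
    return train, valid, test


compounds = list(range(625))  # stand-in identifiers for the 625 FP compounds
train, valid, test = split_dataset(compounds)
print(len(train), len(valid), len(test))  # 437 125 63
```

Because the split ignores the property values, it should be followed by a check that the three subsets cover similar ranges of FP and CN.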
Figure 1 shows that the distributions of FPs and CNs are similar for all of the data subsets. This ensures that all subsets are representative of the original data set in terms of FP and CN values.

2.5. Multivariate Analysis. 2.5.1. Linear Models. Genetic function approximation (GFA), as implemented in Accelrys Materials Studio,55 was chosen because of its ability to find linear relationships over large numbers of features without overfitting. GFA uses a series of selection and reproduction cycles coupled with an objective function, known as the Friedman lack of fit (LOF) score, to choose the best fitting models while controlling the number of descriptors. Partial least squares (PLS) regression, a well-known multivariate data analysis technique, was carried out using R. PLS builds new variables, known as latent variables, by constructing orthogonal sets of linear combinations of descriptors, resulting in a new space with fewer dimensions.

2.5.2. Nonlinear Models. ANNs are widely used not only in QSPR but also in machine learning in general.68 An ANN consists of a number of nodes, also called neurons, connected by edges, also called synapses. ANNs can be used for both supervised classification and regression. There are many different types of ANN models. For this study, the specific types of ANN models used were feed-forward artificial neural networks69 (FF-ANNs), graph machines70 (GMs), and GRNNs, a kernel-based method.71 Another widely used nonlinear regression method, SVM regression, was also used to build predictive models.
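To make the "kernel-based" nature of the GRNN concrete, the sketch below implements its usual textbook form, in which the prediction is a Gaussian-kernel weighted average of the training targets. It is an illustration under assumptions: the function name, the single smoothing parameter `sigma`, and the toy data are not taken from the paper, and the actual models may have been built differently.

```python
import math
from typing import Sequence


def grnn_predict(x_query: Sequence[float],
                 x_train: Sequence[Sequence[float]],
                 y_train: Sequence[float],
                 sigma: float) -> float:
    """Return the Gaussian-kernel weighted average of the training targets."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    weights = []
    for x in x_train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x_query, x))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from the training data for this sigma")
    return sum(w * y for w, y in zip(weights, y_train)) / total


# Toy one-descriptor data set (hypothetical values).
x_train = [[0.0], [1.0], [2.0]]
y_train = [10.0, 20.0, 30.0]
print(grnn_predict([1.0], x_train, y_train, sigma=0.5))   # symmetric neighbors
print(grnn_predict([0.1], x_train, y_train, sigma=0.05))  # nearly nearest neighbor
```

A small `sigma` makes the model behave like a nearest-neighbor lookup, while a large `sigma` pulls every prediction toward the mean of the training targets, so this parameter is the natural one to tune by cross-validation.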

3. RESULTS AND DISCUSSION

Tables 2 and 4 show values of various statistical parameters computed with respect to the validation set for predictive models of FP and CN obtained using methods such as GFA, PLS, FF-ANN, GRNN, SVM, and GM and two families of descriptors: molecular descriptors and functional group count descriptors. For all models, with the exception of GFA and GM, 10-fold cross-validation was applied for both fine tuning of model parameters and predictive error estimation.72 For GFA and GM, cross-validation took the form of leave-one-out (LOO) validation using the cross-validated R² statistic and virtual-leave-one-out (V-LOO) cross-validation,70 respectively.

Figure 1. Distributions of molecules in the FP and CN databases and in their respective data subsets.

Table 2. Statistical Parameters Computed over the Validation Set for the FP Predictive Models

             GFA     PLS     FF-ANN   GRNN    SVM
Molecular Descriptors
R²           0.902   0.930   0.957    0.772   0.907
RMSE (K)     16.3    13.7    10.7     22.8    16.3
AAE (K)      10.7    9.4     7.7      13.8    8.4
AARE (%)     3.3     2.9     2.4      4.2     2.5
Functional Group Count Descriptors
R²           0.775   0.811   0.957    0.427   0.951
RMSE (K)     23.1    22.3    10.9     31.4    11.6
AAE (K)      16.9    15.7    7.6      23.3    7.6
AARE (%)     5.5     5.0     2.3      7.4     2.3

             GM
R²           0.922
RMSE (K)     15.1
AAE (K)      10.4
AARE (%)     3.2

3.1. FP Predictions. 3.1.1. Selection of the Best Predictive Models. The statistical parameters presented in Table 2 indicate that three of our models stand out in terms of predictive power. These models are FF-ANN–MD, FF-ANN–FGCD, and SVM–FGCD, where MD and FGCD denote molecular descriptors and functional group count descriptors, respectively. The three models are roughly equivalent, exhibiting an R² value close to 1 and the lowest values of root-mean-square error (RMSE), average absolute error (AAE), and average absolute relative error (AARE) among all predictive models. With the exception of the FF-ANN and SVM approaches, the use of functional group count descriptors seems to lead to less accurate models than the use of molecular descriptors. The statistical coefficients obtained for GM are not far from those of the best models; however, its RMSE is quite high compared to that of the three selected models, which explains our choice to set this model aside.

3.1.2. Consensus Modeling. The QSPR models mentioned above are based on a variety of approaches, each one having a slightly different response for molecular structures in terms of FP. Values returned by some of these QSPR models may underestimate experimental data, while others may lead to accurate or overestimated predictions. Thus, a final model, hereafter labeled as the “consensus model”, was built by averaging the predicted values of the models. Consensus modeling has previously shown improved generalization and prediction compared to individual predictive models.73,74 A script has been created to find the best consensus model, minimizing the sum of RMSE, AAE, and absolute bias over all combinations of single predictive models. The validation set was used to select the best consensus model. The consensus model obtained in this way is defined as the average of the predicted values of the following models: FF-ANN–MD, FF-ANN–FGCD, SVM–FGCD, and PLS–MD. The prediction ability of the consensus model is summarized in Table 3 through statistical parameters and is illustrated in Figure 2. The RMSE obtained for the consensus model with respect to the validation set is smaller than the RMSE values of all individual models. None of the RMSE values computed for the three subsets is excessively greater than the experimental uncertainty, which is commonly assumed to be 10 K, and the RMSE obtained for the test set is close to that obtained for the training set. AAE values are well below the experimental uncertainty. These results confirm that consensus modeling improves the generalization with respect to the individual models.

3.1.3. Comparisons to Models of the Literature.
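The consensus selection of section 3.1.2 (averaging the predictions of a subset of models and keeping the subset that minimizes the sum of RMSE, AAE, and absolute bias on the validation set) can be sketched as an exhaustive search. This is not the authors' script: the function names, the sign convention of the bias (predicted minus experimental), and the toy predictions are assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def rmse(y_exp: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((p - e) ** 2 for e, p in zip(y_exp, y_pred)) / len(y_exp))


def aae(y_exp: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(p - e) for e, p in zip(y_exp, y_pred)) / len(y_exp)


def bias(y_exp: Sequence[float], y_pred: Sequence[float]) -> float:
    # Assumed convention: mean of (predicted - experimental).
    return sum(p - e for e, p in zip(y_exp, y_pred)) / len(y_exp)


def consensus(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Average the predictions of several models, compound by compound."""
    return [sum(values) / len(values) for values in zip(*predictions)]


def best_consensus(y_exp: Sequence[float],
                   model_predictions: Dict[str, Sequence[float]]
                   ) -> Tuple[Tuple[str, ...], float]:
    """Score every non-empty subset of models and return the best one."""
    best_names: Tuple[str, ...] = ()
    best_score = math.inf
    names = sorted(model_predictions)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            y_cons = consensus([model_predictions[n] for n in subset])
            score = rmse(y_exp, y_cons) + aae(y_exp, y_cons) + abs(bias(y_exp, y_cons))
            if score < best_score:
                best_names, best_score = subset, score
    return best_names, best_score


# Hypothetical validation data (K): one model overestimates, one underestimates.
y_validation = [300.0, 320.0, 340.0, 360.0]
models = {
    "model_high": [310.0, 330.0, 350.0, 370.0],
    "model_low": [290.0, 310.0, 330.0, 350.0],
    "model_noisy": [330.0, 290.0, 370.0, 330.0],
}
print(best_consensus(y_validation, models))  # (('model_high', 'model_low'), 0.0)
```

The toy case shows why averaging can beat every individual model: errors of opposite sign cancel, which is the behavior described for models that underestimate or overestimate the experimental FP. With k candidate models the search visits 2^k - 1 subsets, which stays cheap for the dozen or so models considered here.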
Figure 2 shows a comparison between experimental and predicted FPs for the three data subsets. Predicted values are in good agreement with experimental data, with the exception of a few molecules, e.g., the molecule having the highest experimental FP (533 K), 2,2-bis(hydroxymethyl)propane-1,3-diol. Although this molecule belongs to the training set, the four selected models individually underestimate the experimental FP, and the consensus model leads to 470 ± 7 K. For 2,2-bis(hydroxymethyl)propane-1,3-diol, ref 48 indicates a rough value (FP greater than 423 K), other websites such as www.chemblink.com and www.chemicalbook.com return 513 K, and ref 50 gives the value of 473 K. Thus, a large experimental uncertainty exists for the FP value of this molecule, and our model, after learning over all compounds of the database, indicates a value lower than that accepted by the DIPPR staff members. With regard to methylcyclopentadiene (experimental, 322 K; predicted, 263 K), the latest version of the DIPPR database advocates the use of a predicted value of 255 K, in agreement with the value obtained with our consensus model. One can also mention the case of methyl-(Z)-docos-13-enoate (experimental, 306 K; predicted, 372 K); the website www.chemnet.com reports an experimental value of 357.75 K, in better agreement with our prediction. These comparisons suggest that the consensus model is resistant to noise.

Figure 2. Comparison between experimental and predicted FPs using the consensus model intended for hydrocarbons, alcohols, and esters. The dashed blue line is the ideal prediction, and the gap between blue lines denotes the experimental uncertainty on measurements (commonly assumed to be 10 K).

Table 3. Statistical Parameters for the FP Consensus Model

             training   validation   test    total
Consensus Model (All Compounds)
R²           0.960      0.967        0.944   0.959
RMSE (K)     10.9       9.4          13.2    10.9
AAE (K)      7.2        6.3          8.4     7.1
AARE (%)     2.2        1.9          2.5     2.2
bias (K)     0.3        0.7          0.9     0.2
n            437        125          63      625
Consensus Model (Hydrocarbons)
R²           0.970      0.977        0.965   0.971
RMSE (K)     8.2        5.8          8.1     7.8
AAE (K)      5.3        4.3          5.1     5.1
AARE (%)     1.9        1.5          1.7     1.8
bias (K)     0.1        0.1          1.5     0.1
n            111        35           16      162
Consensus Model (Alcohols)
R²           0.883      0.930        0.917   0.896
RMSE (K)     15.4       11.1         11.0    14.4
AAE (K)      10.0       8.0          9.6     9.5
AARE (%)     2.6        2.0          2.9     2.5
bias (K)     1.1        3.4          9.6     2.1
n            49         15           4       68
Consensus Model (Esters)
R²           0.900      0.905        0.857   0.896
RMSE (K)     13.3       13.8         21.2    14.3
AAE (K)      9.4        9.5          15.8    10.0
AARE (%)     2.8        2.8          4.4     2.9
bias (K)     1.5        3.9          3.3     2.1
n            227        63           38      328

Table 3 presents statistical parameters computed for specific chemical families of the database. In the case of hydrocarbons, our model leads, for the test set, to an RMSE of 8.1 K and an AAE of 5.1 K. Using the values presented by Tetteh et al., we have found that their neural network model leads to an RMSE of 44.0 K and an AAE of 8.1 K for the proposed test set, which contained 19 hydrocarbons.24 Pan et al. have reported, for a neural network model developed especially for n- and isoparaffins, an absolute average error of 4.8 K computed over 15 molecules of a test set.23 Comparisons between experimental and predicted values in the revised work by Patel et al. lead to a high RMSE of 101.9 K and an AAE of 19.2 K when considering only the hydrocarbons of the test set.46 These values seem quite high compared to those of our model and others. Among the 10 compounds in the test set of Patel et al., the predictions for two molecules (tricosane and but-2-yne) differ strongly from the proposed experimental values. We have found in the literature experimental data that are in better agreement with the predicted values for these two compounds; when these two outliers are set aside, the RMSE and AAE of the model by Patel et al. become 22.1 and 6.8 K, respectively. Comparisons between AAE values show that our model is in line with the predictive power of models intended for the prediction of the FPs of hydrocarbons. From the RMSE values of the models, it appears that the use of consensus modeling improves the stability of predictions.

In the case of alcohols, Table 3 shows that the RMSE and AAE of our model for the test set are 11.0 and 9.6 K, respectively. Using the experimental and predicted data proposed in the paper by Tetteh et al., the RMSE and AAE are 41.3 and 7.3 K, respectively, when considering the 14 alcohols in the test set.24 From the experimental and predicted data presented in the revised work by Patel et al., the computed RMSE and AAE are 83.1 and 17.3 K, respectively, for compounds in the test set.46 The AAE values show that our model has a predictive power similar to that of the model by Tetteh et al.; moreover, the use of consensus modeling leads to a low RMSE value.

In the case of esters, Table 3 indicates that, for compounds belonging to the test set, our model leads to an RMSE of 21.2 K and an AAE of 15.8 K.
The slightly higher values of RMSE and AAE compared to those obtained for hydrocarbons and alcohols can be related to the higher uncertainty of measurements observed for fatty acid methyl esters (reproducibility of about 15 K).75 Using the experimental and predicted values proposed in the paper by Tetteh et al., we have computed the RMSE and AAE for the esters in the test set, which are 34.2 and 7.2 K, respectively.24 Taking into account the slightly larger range of temperatures considered in our study (253–466 K, against 253–421 K in the study by Tetteh et al.), our consensus model leads to a higher average error than the model by Tetteh et al.; however, their model seems to be less stable than ours. Khajeh et al. have developed QSPR models intended only for the prediction of the FPs of esters, using two approaches: GFA and adaptive neuro-fuzzy inference system (ANFIS).76 The models that result from these two approaches present similar RMSE and AAE values: 21.8 and 16.7 K (GFA) and 20.2 and 16.2 K (ANFIS). The statistical parameters computed for our model, which has been trained on various chemical families, are similar to those obtained for the models by Khajeh et al.

3.1.4. Prediction of FPs for Some Compounds. The consensus model was used to predict FPs of compounds for which, to the best of our knowledge, no experimental value exists in the literature. The evolution of FP values with increasing n is presented in Figures 3 and 4; n denotes either the number of carbon atoms along the principal chain or the position of a specific functional group. Figures 3 and 4 were drawn using information contained in the database (see the Supporting Information), and because of the uncertainty of predictions, the data have been fit using polynomial equations. The resulting tendencies presented in Figure 3 show that moving a group (methyl or ethyl) from position 2 to positions 3, 4, and beyond on the main chain has scarcely any impact on the FP value. Figure 3 shows that the FP of nC-cyclohexane (a single linear chain attached to a cyclohexane) is slightly lower than that of nC-benzene when n is lower than 4, while the contrary holds for larger values of n. We have drawn curves of constant total number of carbon atoms (iso-carbon-number curves) using the FPs of the molecules considered in Figure 3.
The FP appears to depend mainly upon the total number of carbon atoms in the molecule, in agreement with empirical correlations linking the FP with the boiling point, which is itself correlated to the number of carbon atoms.18 In the case of jet fuel applications, hydrocarbons satisfying an FP of at least 311.15 K (38 °C, which is the limit required by ASTM D1655) are those containing a total of 10 or more carbon atoms. Figure 4, which is devoted to alcohols, indicates that, for short principal chains, the FPs of linear alcohols are higher than those of n-paraffins. Figure 4 also shows that, at a fixed number of carbon atoms, the FP decreases in the order primary alcohol > secondary alcohol > tertiary alcohol. Furthermore, molecules with two alcohol functions are above the limit of 311.15 K (38 °C), and their FP seems to increase with the distance between the two alcohol functions. Because most experimental and/or predicted FPs of esters are greater than the limit fixed by ASTM, we have chosen not to include FP tendencies of esters in this paper. Although triesters are not represented in the training set, we have attempted to predict FPs for some compounds, such as 2,3-di(dodecanoyloxy)propyl dodecanoate (predicted FP of 563 ± 231 K) and 2,3-bis[[(Z)-octadec-9-enoyl]oxy]propyl (Z)-octadec-9-enoate (predicted FP of 571 ± 388 K). These huge uncertainties come from the PLS–MD model, which leads to an FP of about 1000 K for these molecules. Considering only the predictions of the FF-ANN–MD, FF-ANN–FGCD, and SVM–FGCD models, the predicted FPs are 448 ± 45 and 379 ± 72 K for 2,3-di(dodecanoyloxy)propyl dodecanoate and 2,3-bis[[(Z)-octadec-9-enoyl]oxy]propyl (Z)-octadec-9-enoate, respectively.

3.2. CN Predictions. 3.2.1. Selection of the Best Predictive Models. The statistical parameters presented in Table 4 indicate the predictive power of the QSPR models developed using various approaches.
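The choice of error statistics matters for the CN because a relative error is undefined whenever the experimental value is zero, whereas an absolute error is always defined. The sketch below makes this explicit; the function names and the numerical values are hypothetical and serve only as an illustration.

```python
from typing import Optional, Sequence


def average_absolute_error(y_exp: Sequence[float], y_pred: Sequence[float]) -> float:
    """AAE: mean of |predicted - experimental|, in the unit of the property."""
    return sum(abs(p - e) for e, p in zip(y_exp, y_pred)) / len(y_exp)


def average_absolute_relative_error(y_exp: Sequence[float],
                                    y_pred: Sequence[float]) -> Optional[float]:
    """AARE in percent, or None when an experimental value of 0 makes it undefined."""
    if any(e == 0 for e in y_exp):
        return None
    ratios = [abs((p - e) / e) for e, p in zip(y_exp, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical FP values in K: the relative error is well defined.
fp_exp, fp_pred = [300.0, 400.0], [303.0, 396.0]
print(average_absolute_relative_error(fp_exp, fp_pred))  # about 1.0 (%)

# Hypothetical CN values including CN = 0: only the AAE can be reported.
cn_exp, cn_pred = [0.0, 50.0, 100.0], [5.0, 45.0, 110.0]
print(average_absolute_error(cn_exp, cn_pred))           # about 6.67 CN units
print(average_absolute_relative_error(cn_exp, cn_pred))  # None
```

Returning `None` rather than skipping the zero-valued compounds keeps the statistic from being silently computed on a different subset of the data.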
Because of experimental values of CN equal to 0 in the database, we have chosen not to consider AARE for this

Figure 3. Tendencies of the FP evolution for some hydrocarbons when increasing n. The parameter n denotes the number of carbon atoms along the principal chain in some hydrocarbons. Curves of constant total carbon atom number are shown as dashed gray lines.

Figure 4. Tendencies of the FP evolution for some alcohols with the parameter n. The parameter n denotes either the number of carbon atoms along the principal chain or the position of a specific functional group. The dotted line presents FP values for n-paraffins.

property. Among all models presented in Table 4, six seem to stand out in terms of predictive power: SVM–MD, GM, FF-ANN–MD, GFA–MD, SVM–FGCD, and FF-ANN–FGCD. The GFA approach working on a set of molecular descriptors appears to be an accurate approach for developing a QSPR model for the prediction of the CN. This is in line with the work by Creton et al., who chose this kind of approach to model the CN of pure hydrocarbons and reported absolute average errors lower than 6.9.43 The fact that the AAE of our GFA–MD model is 9.3 can be attributed (i) to the inclusion of oxygenated compounds in the database and (ii) to the fact that the database was not subdivided into subsets representative of chemical families. These points are supported by the work of Taylor et al., who used a GA approach on a database including hydrocarbons and oxygenated compounds and reported a standard error of 9.1 for their predictive model.44 Taylor et al. obtained lower standard errors after classifying their database by chemical family. Table 4 shows that the use of nonlinear approaches, such as GM, SVM, and FF-ANN, leads to absolute average errors of about 8.4 CN units. Taking into account that our database includes oxygenated compounds, the predictive ability

dx.doi.org/10.1021/ef200795j |Energy Fuels 2011, 25, 3900–3908

Table 4. Statistical Parameters Computed over the Validation Set for the Global CN Predictive Models

parameter   GFA     PLS     FF-ANN  GRNN    SVM
Molecular Descriptors
R²          0.794   0.635   0.845   0.435   0.897
RMSE        12.0    15.1    11.3    16.0    8.6
AAE         9.3     11.4    8.2     11.6    6.9
Functional Group Count Descriptors
R²          0.123   0.428   0.733   0.601   0.802
RMSE        16.9    15.3    12.5    18.1    12.4
AAE         13.7    11.9    9.3     14.2    8.4
GM
R²          0.852
RMSE        11.0
AAE         8.1

Table 5. Statistical Parameters for the Consensus Global Model of CN

parameter   training  validation  test    total
R²          0.950     0.883       0.913   0.934
RMSE        5.5       8.7         6.5     6.3
AAE         4.1       6.5         5.7     4.7
bias        0.0       0.5         0.4     0.0
n           164       46          19      229
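The statistical parameters reported in Tables 4 and 5 can be computed from paired experimental and predicted values. The following Python sketch assumes the usual definitions (R² as the coefficient of determination, bias as the mean signed error, predicted minus experimental). These conventions, the function name `regression_stats`, and the numerical values are illustrative assumptions, not the exact formulas or data of this work.

```python
import math


def regression_stats(y_exp, y_pred):
    """Return R2, RMSE, AAE, bias, and n for paired experimental/predicted values."""
    n = len(y_exp)
    residuals = [p - e for e, p in zip(y_exp, y_pred)]  # predicted minus experimental
    mean_exp = sum(y_exp) / n
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,   # assumed: coefficient of determination
        "RMSE": math.sqrt(ss_res / n),
        "AAE": sum(abs(r) for r in residuals) / n,
        "bias": sum(residuals) / n,    # assumed sign convention
        "n": n,
    }


# Hypothetical CN values, for illustration only (not taken from the database).
cn_exp = [10.0, 30.0, 50.0, 70.0]
cn_pred = [12.0, 28.0, 53.0, 69.0]
stats = regression_stats(cn_exp, cn_pred)
```

Applying the same function to the training, validation, and test subsets separately would give one column of such a table per subset.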

of our nonlinear models developed using GM, SVM, and FF-ANN methods is consistent with models previously obtained by Santana et al.42 Indeed, these authors reported an AAE of 8 CN units for an ANN model intended for the prediction of the CN of hydrocarbons.
3.2.2. Consensus Modeling. Our script dedicated to the search for the best consensus model among all linear combinations of single predictive models was used. The consensus model was selected by comparing predicted and experimental data in the validation set. The resulting consensus model is built by averaging the predicted values of FF-ANN–MD, GRNN–MD, SVM–MD, SVM–FGCD, and GM. The predictive ability of the consensus model is summarized in Table 5 through statistical parameters and is illustrated in Figure 5. None of the RMSE values computed for the three subsets is excessively greater than the experimental uncertainty, which is commonly assumed to be 5 CN units for hydrocarbons and greater for oxygenated compounds, and the RMSE obtained for the test set is close to that computed for the training set. The AAE values are comparable to the experimental uncertainty, and they are lower than those reported in refs 42 and 44, which are 8 and 9 CN units, respectively. These results confirm that consensus modeling improves the generalization with respect to individual models.
3.2.3. Prediction of CN for Some Compounds. Figure 5 presents a comparison between experimental and predicted CNs for the three data subsets plus one that gathers all compounds set aside because of doubts about the reported experimental values. The agreement between the results of the consensus model and experimental values is good. However, large standard deviations can be observed, which indicate that

Figure 5. Comparison between experimental and predicted CNs using the consensus model intended for hydrocarbons, alcohols, and esters. The dashed blue line is the ideal prediction, and the gap between blue lines denotes the experimental uncertainty on measurements (commonly assumed to be 5 CN units).

Figure 6. Tendencies of the CN evolution for some families of compounds with the parameter n. The parameter n denotes either the number of carbon atoms along the principal chain or the position of a specific functional group.

individual predictive models do not agree. It is important to note that the CNs of both isocetane (2,2,4,4,6,8,8-heptamethylnonane) and cetane (n-hexadecane) are well-predicted by our model. Predicted CNs of compounds placed in the prediction set seem to agree with reported experimental values. However, some large deviations between predicted and experimental values are observed in the extreme parts of Figure 5. As an example, for n-propane, the reported experimental value is 20, while our model leads to 29 ± 11 CN units. Considering all experimental CNs of n-paraffins, a CN of 29 seems to be more reliable than the reported experimental value (see Figure 6). Octyloleate and dodecyloleate were set aside because of their large reported experimental CNs of 131 and 134, respectively. The consensus model, like all of its constituent predictive models, leads to values well below the experimental values for octyloleate and dodecyloleate (86 ± 7 and 94 ± 12, respectively).
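The consensus construction of section 3.2.2, and the ± values quoted alongside consensus predictions, can be sketched as follows. This minimal Python example assumes equal-weight averaging over subsets of models, selection of the subset by validation-set RMSE, and a sample standard deviation over the individual model predictions as the reported spread. The search script itself is not reproduced in this paper, so the function names, model labels, and numbers below are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def consensus(preds_by_model, models):
    """Average the predictions of the selected models, compound by compound."""
    return [mean(values) for values in zip(*(preds_by_model[m] for m in models))]


def spread(preds_by_model, models):
    """Sample standard deviation of the individual predictions, per compound."""
    return [stdev(values) for values in zip(*(preds_by_model[m] for m in models))]


def rmse(y_exp, y_pred):
    """Root-mean-square error between experimental and predicted values."""
    return sqrt(mean((p - e) ** 2 for e, p in zip(y_exp, y_pred)))


def best_consensus(preds_by_model, y_exp, min_size=2):
    """Try every subset of at least min_size models; keep the lowest RMSE."""
    best = None
    names = sorted(preds_by_model)
    for size in range(min_size, len(names) + 1):
        for subset in combinations(names, size):
            score = rmse(y_exp, consensus(preds_by_model, subset))
            if best is None or score < best[0]:
                best = (score, subset)
    return best


# Hypothetical validation data: three models (A, B, C), three compounds.
cn_val = [20.0, 40.0, 60.0]
val_preds = {
    "A": [24.0, 44.0, 64.0],  # overestimates by 4
    "B": [16.0, 36.0, 56.0],  # underestimates by 4
    "C": [30.0, 50.0, 70.0],  # overestimates by 10
}
score, subset = best_consensus(val_preds, cn_val)
```

In this toy case the errors of models A and B cancel, so their average is selected even though neither is accurate alone, while the spread between them remains large. This is the situation described above, in which a good consensus value can coexist with a large standard deviation.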

The consensus model was used to predict CNs of compounds for which, to the best of our knowledge, no experimental value exists in the literature. Using predicted CNs, one can observe that a methyl group in position 2 on the principal chain leads to a slightly higher CN than the same group in positions 3, 4, and more, e.g., for methyloctane, methylnonane, or methyldecane.43 The evolutions of CN when increasing a parameter n are presented in Figure 6; n denotes either the number of carbon atoms along the principal chain or the position of a specific functional group on the principal chain. Figure 6 was drawn using information contained in the database (see the Supporting Information). Because of the uncertainty associated with predictions, data have been fit with polynomial equations. From the resulting tendencies, the evolution of the CN of n-paraffinic compounds when increasing the number of carbon atoms appears similar to previously published curves.43,77 Figure 6 presents a comparison between the CNs of nC-cyclohexane and nC-benzene compounds, with n in the range of 1–20. For each value of n, the difference between the CNs of these two families of compounds is roughly constant, at 25 CN units. Figure 6 also shows the impact of the presence of an alcohol group in a linear paraffin: the CNs of nC-1-ol compounds are about 20 CN units lower than those of the corresponding n-paraffins. A difference of 5 CN units is observed when the OH group is moved from position 1 to position 2. Predictions in the database show that, at a fixed number of carbon atoms, the CN of primary alcohols (nC-1-ol) > the CN of secondary alcohols (nC-2-ol) > the CN of tertiary alcohols (2-methyl-nC-2-ol or 3-methyl-nC-3-ol). In the case of molecules with two alcohol groups, the position of the second group does not seem to impact the CN value, e.g., the case of C5-1,n-diol compounds.
With regard to esters that can be written as ROC(=O)R′, Figure 6 shows that the CN does not depend upon the location of the main carbon chain, i.e., whether nC is located on R or R′.
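The smoothing step used for Figures 3, 4, and 6 (fitting property values against n with a polynomial) can be illustrated with a least-squares fit. The sketch below is written in plain Python so that it is self-contained. The polynomial degree, the helper names `polyfit` and `polyval`, and the synthetic data are assumptions, since the degree actually used is not stated here.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + ..."""
    m = degree + 1
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            f = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        s = sum(a[i][k] * coeffs[k] for k in range(i + 1, m))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic trend (not FP or CN data): y = 2 + 3n - 0.5n^2 for n = 1..6.
ns = [1, 2, 3, 4, 5, 6]
ys = [2 + 3 * n - 0.5 * n ** 2 for n in ns]
coeffs = polyfit(ns, ys, degree=2)
```

With real predictions, the fitted curve would not pass through every point; the residuals then reflect the prediction uncertainty that motivated the fit. A library routine such as `numpy.polyfit` performs the same kind of fit and would normally be preferred.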

4. CONCLUSION
QSPR models have been developed to model the FP and CN of molecules likely to be found in alternative fuels, i.e., hydrocarbons, alcohols, and esters. Experimental values of FP and CN were gathered from different sources and/or data available in the literature. For both properties, various approaches were investigated, from linear modeling methods, such as GAs and PLS, to nonlinear methods, such as FF-ANNs, GRNNs, SVMs, and GMs. For both properties, none of the obtained models was significantly more accurate than the others; thus, consensus modeling was used because this methodology is known to improve generalization and predictive power compared to individual predictive models. For the two properties of interest, the results computed using the predictive consensus models are in good agreement with experimental data, and average absolute deviations are similar to the experimental uncertainties. For FP, the predictive power of our consensus model was compared to that of models in the literature. We have shown that our model reproduces the FPs of molecules belonging to the target families of compounds at least as well as other QSPR models in the literature specifically developed for each chemical family. Moreover, deviations for each chemical family matched their respective experimental uncertainty, particularly in the case of hydrocarbons and esters. Consequently, we have used our model to estimate the FPs of compounds for which to the best of our knowledge no

experimental FP can be found in the literature. Using these values, we have extracted information about the evolution of the FP when increasing the number of carbon atoms or moving the position of a specific functional group. Thus, the FP appears to be mainly dependent upon the total number of carbon atoms in the molecule. It is noteworthy that correlations between the FP and boiling point have been established, with this latter property being correlated to the number of carbon atoms or molecular weight.18 In the case of CN, only a few works in the literature deal with the prediction of the CNs of pure hydrocarbons and oxygenated compounds. The predictive power of our consensus model is at least equal to that of models in the literature, which are often devoted only to hydrocarbons. Predictions were made for compounds for which, to our knowledge, no experimental CNs have been measured. Using this information, we have shown how the CN evolves when adding one or two alcohol groups to a carbon chain and when moving these groups along the carbon chain. This work shows that, when using good quality databases and various QSPR approaches and when applying consensus modeling, the resulting predictive models are powerful tools to estimate the property values of complex and/or difficult to isolate hydrocarbons and oxygenated compounds. Such a tool can be useful to select a molecule according to its properties. As an example, the first linear alcohol that satisfies the limit of 311.15 K (38 °C) is pentan-1-ol. Ethanol, which is used as an alternative automotive fuel, has a FP of 286 K and is therefore not suitable for aeronautic uses. This study is to be extended to other stringent properties of gasoline, jet, and diesel fuels.

ASSOCIATED CONTENT

Supporting Information. Brief description of the molecular descriptors in QSPR models and values of molecular descriptors computed for each molecule belonging to the training, validation, and test sets. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION
Corresponding Author

*E-mail: [email protected].

REFERENCES
(1) International Energy Agency (IEA). World Energy Outlook 2010; IEA: Paris, France, 2010; http://www.iea.org.
(2) Schulz, H. Appl. Catal., A 1999, 186, 3.
(3) Murata, K.; Liu, Y.; Inaba, M.; Takahara, I. Energy Fuels 2010, 24, 2404.
(4) Weiss, W.; Dulot, H.; Quignard, A.; Charon, N.; Courtiade, M. Proceedings of the International Pittsburgh Coal Conference; Istanbul, Turkey, Oct 11–14, 2010.
(5) Carlson, T. R.; Tompsett, G. A.; Conner, W. C.; Huber, G. W. Top. Catal. 2009, 52, 241.
(6) Demirbas, A. Energy Convers. Manage. 2009, 50, 14.
(7) Demirbas, M. F. Appl. Energy 2009, 86, S151.
(8) American Society for Testing and Materials (ASTM). ASTM D1655, Standard Specification for Aviation Turbine Fuels; ASTM: West Conshohocken, PA, 2011.
(9) Liaw, H.-J.; Lee, Y.-H.; Tang, C.-L.; Hsu, H.-H.; Liu, J.-H. J. Loss Prev. Process Ind. 2002, 15, 429.
(10) Vidal, M.; Rogers, W. J.; Mannan, M. S. Process Saf. Environ. Prot. 2006, 84, 1.
(11) Catoire, L.; Paulmier, S.; Naudet, V. Process Saf. Prog. 2006, 25, 33.

(12) Liaw, H.-J.; Lin, S.-C. J. Hazard. Mater. 2007, 140, 155.
(13) Liaw, H.-J.; Gerbaud, V.; Chiu, C.-Y. J. Chem. Eng. Data 2010, 55, 134.
(14) Liaw, H.-J.; Gerbaud, V.; Li, Y.-H. Fluid Phase Equilib. 2011, 300, 70.
(15) Pidol, L.; Lecointe, B.; Jeuland, N. SAE Int. J. Fuels Lubr. 2009, No. 2009-01-1807.
(16) Vidal, M.; Rogers, W. J.; Holste, J. C.; Mannan, M. S. Process Saf. Prog. 2004, 23, 47.
(17) Liu, X.; Liu, Z. J. Chem. Eng. Data 2010, 55, 2943.
(18) Katritzky, A. R.; Kuanar, M.; Slavov, S.; Hall, C. D. Chem. Rev. 2010, 110, 5714.
(19) Gramatica, P.; Navas, N.; Todeschini, R. Trends Anal. Chem. 1999, 18, 461.
(20) Khajeh, A.; Modarress, H. J. Hazard. Mater. 2010, 179, 715.
(21) Zhokhova, N. I.; Baskin, I. I.; Palyulin, V. A.; Zefirov, A. N.; Zefirov, N. S. Russ. Chem. Bull. 2003, 52, 1885.
(22) Mathieu, D. J. Hazard. Mater. 2010, 179, 1161.
(23) Pan, Y.; Jiang, J.; Wang, Z. J. Hazard. Mater. 2007, 147, 424.
(24) Tetteh, J.; Suzuki, T.; Metcalfe, E.; Howells, S. J. Chem. Inf. Comput. Sci. 1999, 39, 491.
(25) Katritzky, A. R.; Stoyanova-Slavova, I. B.; Dobchev, D. A.; Karelson, M. J. Mol. Graphics Modell. 2007, 26, 529.
(26) Gharagheizi, F.; Alamdari, R. F.; Angaji, M. T. Energy Fuels 2008, 22, 1628.
(27) Gharagheizi, F.; Abbasi, R. Ind. Eng. Chem. Res. 2010, 49, 12685.
(28) Carroll, F. A.; Lin, C.-Y.; Quina, F. H. Energy Fuels 2010, 24, 4854.
(29) Batov, D. V.; Mochalova, T. A.; Petrov, A. V. Russ. J. Appl. Chem. 2011, 84, 54.
(30) Carroll, F. A.; Lin, C.-Y.; Quina, F. H. Ind. Eng. Chem. Res. 2011, 50, 4796.
(31) Gharagheizi, F.; Eslamimanesh, A.; Mohammadi, A. H.; Richon, D. Ind. Eng. Chem. Res. 2011, 50, 5877.
(32) American Society for Testing and Materials (ASTM). ASTM D975, Standard Specification for Diesel Fuel Oils; ASTM: West Conshohocken, PA, 2011.
(33) European Standards Organization (CEN). EN590 Standard Specification on the Quality of European Diesel Fuel; CEN: Brussels, Belgium, 2009.
(34) American Society for Testing and Materials (ASTM). ASTM D613, Standard Test Method for Cetane Number of Diesel Fuel Oil; ASTM: West Conshohocken, PA, 2010.
(35) European Standards Organization (CEN). EN ISO 5165, Standard Test Method for Cetane Number of Diesel Fuel Oil; CEN: Brussels, Belgium, 2009.
(36) De la Paz, C.; Rodríguez, J. E.; Valentin, C. P.; Ramos, E. R. Pet. Sci. Technol. 2007, 25, 1225.
(37) Özdemir, D. Pet. Sci. Technol. 2008, 26, 101.
(38) Pitz, W. J.; Mueller, C. J. Prog. Energy Combust. Sci. 2011, 37, 330.
(39) Huber, M. L.; Lemmon, E. W.; Bruno, T. J. Energy Fuels 2010, 24, 3565.
(40) Ghosh, P.; Jaffe, S. B. Ind. Eng. Chem. Res. 2006, 45, 346.
(41) Yang, H.; Fairbridge, C.; Ring, Z. Pet. Sci. Technol. 2001, 19, 573.
(42) Santana, R. C.; Do, P. T.; Santikunaporn, M.; Alvarez, W. E.; Taylor, J. D.; Sughrue, E. L.; Resasco, D. E. Fuel 2006, 85, 643.
(43) Creton, B.; Dartiguelongue, C.; de Bruin, T.; Toulhoat, H. Energy Fuels 2010, 24, 5396.
(44) Taylor, J. D.; McCormick, R. L.; Clark, W. Report on the Relationship between Molecular Structure and Compression Ignition Fuels, Both Conventional and HCCI; National Renewable Energy Laboratory (NREL): Golden, CO, 2004; MP-540-36726.
(45) Catoire, L.; Naudet, V. J. Phys. Chem. Ref. Data 2004, 33, 1083.
(46) (a) Patel, S. J.; Ng, D.; Mannan, M. S. Ind. Eng. Chem. Res. 2009, 48, 7378. (b) Patel, S. J.; Ng, D.; Mannan, M. S. Ind. Eng. Chem. Res. 2010, 49, 8282.

(47) Rowley, R. L.; Wilding, W. V.; Oscarson, J. L.; Yang, Y.; Zundel, N. A.; Daubert, T. E.; Danner, R. P. Design Institute for Physical Properties (DIPPR) Data Compilation of Pure Compound Properties; DIPPR, American Institute of Chemical Engineers (AIChE): New York, 2003.
(48) http://www.sigmaaldrich.com/france.html
(49) http://www.alfa.com/fr/gh100w.pgm
(50) http://www.lookchem.com/
(51) http://ull.chemistry.uakron.edu/erd/
(52) Yaws, C. L. Chemical Properties Handbook: Physical, Thermodynamic, Environmental, Transport, Safety and Health Related Properties for Organic and Inorganic Chemicals; McGraw-Hill: New York, 1999.
(53) Murphy, M. J.; Taylor, J. D.; McCormick, R. L. Compendium of Experimental Cetane Number Data; National Renewable Energy Laboratory (NREL): Golden, CO, 2004; SR-540-36805.
(54) O’Boyle, N. M.; Morley, C.; Hutchison, G. R. Chem. Cent. J. 2008, 2, 5.
(55) Accelrys Software, Inc. Materials Studio, Release 5.0; Accelrys Software, Inc.: San Diego, CA, 2009.
(56) Sun, H. J. Phys. Chem. B 1998, 102, 7338.
(57) Sun, H.; Ren, P.; Fried, J. R. Comput. Theor. Polym. Sci. 1998, 8, 229.
(58) Gasteiger, J.; Marsili, M. Tetrahedron 1980, 36, 3219.
(59) Katritzky, A. R.; Lobanov, V.; Karelson, M. CODESSA: Reference Manual; University of Florida: Gainesville, FL, 1996.
(60) Katritzky, A. R.; Kuanar, M.; Dobchev, D. A.; Vanhoecke, B. W. A.; Karelson, M.; Parmar, V. S.; Stevens, C. V.; Bracke, M. E. Bioorg. Med. Chem. 2006, 14, 6933.
(61) da Costa Couto, M. P. Neural Comput. Appl. 2009, 18, 891–901.
(62) Kohavi, R.; John, G. Artif. Intell. 1997, 97 (1–2), 273–324.
(63) (a) Vapnik, V. N. The Nature of Statistical Learning Theory; Springer: Berlin, Germany, 1995. (b) Vapnik, V. N. Statistical Learning Theory; John Wiley and Sons: New York, 1998.
(64) Xue, Y.; Li, Z. R.; Yap, C. W.; Sun, L. Z.; Chen, X.; Chen, Y. Z. J. Chem. Inf. Comput. Sci. 2004, 44, 1630–1638.
(65) Pan, Y.; Jiang, J.; Wang, R.; Cao, H.; Cui, Y. J. Hazard. Mater. 2009, 168, 962–969.
(66) Guyon, I.; Elisseeff, A. J. Mach. Learn. Res. 2003, 3, 1157.
(67) Pan, Y.; Jiang, J.; Wang, R.; Cao, H.; Zhao, J. QSAR Comb. Sci. 2008, 27, 1013.
(68) Simoes, M. IEEE Trans. Ind. Electron. Control Instrum. 2003, 50, 585.
(69) Billings, S. Int. J. Control 1992, 56, 319.
(70) Goulon, A.; Duprat, A.; Dreyfus, G. Lect. Notes Comput. Sci. 2006, 4135, 1.
(71) Niwa, T. J. Chem. Inf. Comput. Sci. 2003, 43, 113.
(72) Ambroise, C.; McLachlan, G. J. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 6562.
(73) Tropsha, A. Mol. Inf. 2010, 29, 476.
(74) Gramatica, P.; Giani, E.; Papa, E. J. Mol. Graphics Modell. 2007, 25, 755.
(75) American Society for Testing and Materials (ASTM). ASTM D3828, Standard Test Method for Flash Point; ASTM: West Conshohocken, PA, 2009.
(76) Khajeh, A.; Modarress, H. J. Hazard. Mater. 2010, 179, 715.
(77) Ghosh, P. Energy Fuels 2008, 22, 1073.
