Multiobjective Feature Selection Approach to Quantitative Structure Property Relationship (QSPR) Models for Predicting the Octane Number of Compounds Found in Gasoline

Zhefu Liu†, Linzhou Zhang*,†, Ali Elkamel‡, Dong Liang†, Suoqi Zhao†, Chunming Xu†, Stanislav Y. Ivanov§ and Ajay K. Ray§

† State Key Laboratory of Heavy Oil Processing, China University of Petroleum, Beijing, 102249, P. R. China
‡ Department of Chemical Engineering, University of Waterloo, Waterloo, N2L 3G1, Canada
§ Western University, London, N6A 3K7, Canada

Abstract

Octane number is one of the most important factors in determining the price of gasoline. The increasing popularity of molecular models in petroleum refining has made predicting key properties of pure components more important. In this paper, quantitative structure property relationship (QSPR) models are developed to predict the research octane number (RON) and motor octane number (MON) of pure components using two databases. The databases include oxygenated and nitrogen-containing compounds as well as hydrocarbons collected from published data. QSPR models are widely utilized because they effectively characterize molecular structures with a variety of descriptors, especially different isomeric structures. Feature subset selection is an important step for improving the performance and reducing the complexity of a QSPR model by removing redundant and irrelevant descriptors. A two-step feature selection method is developed to identify appropriate subsets of descriptors from a multiobjective perspective: (1) a filter using the Boruta algorithm to remove noise features and (2) a multiobjective wrapper to simultaneously minimize the number of features and maximize the model accuracy. The multiobjective wrapper accounts for both the complexity and the generalizability of the models and thus resists the overfitting that commonly occurs when a single-objective feature selection method is used. In the proposed procedure, the optimized subsets of descriptors are used to build the final QSPR models, which predict the RON and MON of pure components via support vector machine regression. The proposed models are competitive with other models found in the literature.

Keywords: Octane number; Pure components; QSPR; Overfitting; Multiobjective feature selection


1. Introduction

The octane number reflects the antiknock properties of gasoline in an engine's cylinder and is a determining factor in the price of gasoline. Engine knocking is a self-detonation phenomenon of a fuel-air mixture that occurs in the absence of a spark. Low knocking resistance of a fuel may result in increased fuel consumption, loss of power and even engine damage.1 Two of the most commonly utilized approaches for measuring antiknock properties in the laboratory are the research and motor methods, which utilize different test engine working modes.2,3 The outcomes of these tests are quantitatively expressed as the research octane number (RON) and the motor octane number (MON).

The engine tests used to assess the octane number are expensive and time consuming. Therefore, in recent decades, intensive research has focused on developing correlation models to predict the RON and MON. Researchers have combined multivariate calibration with different analytical techniques, such as near-infrared spectroscopy and gas chromatography, to predict the RON and MON of mixtures. Kelly et al. utilized near-infrared spectral features in the range of 660–1215 nm to predict the octane number of 43 unleaded gasoline samples.4 They adopted partial least squares (PLS) for the spectral correlation of octane numbers and obtained satisfactory results. Cooper et al. used Fourier transform (FT)-Raman spectroscopy coupled with PLS to determine the octane number of commercial petroleum fuels.5 Bao and Dai proposed a novel method based on PLS to detect outliers in spectral data collected via near-infrared and Raman spectroscopy.6 Nikolaou et al. developed a nonlinear model to calculate the RON of isomerization gasoline using gas chromatography data.7 They calculated the weight factor representing nonlinearity based on published pure component and blend data.

Due to the application of molecular models in petroleum refining, predicting key properties for pure components has become more important. The models utilize an ensemble of molecular components, rather than a bulk mixture of all components in a petroleum fraction, to simulate the refining processes on a molecular level.8 Accurately predicting the properties, such as RON and MON, of individual compounds in the product fractions, together with the appropriate mixing rules, is necessary for constructing effective molecular models. These models enable the design of more effective
methods for gasoline blending to produce higher octane number molecules in the target products while saving time and energy during the refining process. The octane number of pure components is highly dependent on the molecular structure. For example, the octane numbers of aromatics are affected by the ring type (five-member ring or six-member ring) and the location and branching degree of the alkyl groups attached to the rings. These factors are associated through nonlinear relationships to determine the target properties. Empirical models have been widely adopted to correlate chemical properties with descriptors characterizing molecular structures. Kahrs et al. proposed that correlation models can be divided into two groups: group/bond contribution methods and “most significant common features” methods.9,10 The first group utilizes predefined fragments (groups) in molecules to calculate the property of interest. Albahri proposed a structural group contribution method to predict the octane number of pure hydrocarbon liquids.11 Two types of models comprising 22 and 33 structural groups were built based on more than 200 hydrocarbons, and the best predicted average deviations of RON and MON were 4.0 and 5.7 units, respectively. However, the group contribution methods neglect the positions of the group connections within a molecule, which can make partitioning isomers difficult. In the second group of methods, molecular descriptors are calculated using group contributions, quantum chemical methods, simulated molecular mechanics, and the topology of the molecules.12 These methods are broadly designated quantitative structure property relationship (QSPR) models. QSPR approaches avoid the shortcomings of group contribution methods because they account for the molecular structures by using a variety of descriptors to characterize the differences between isomeric structures.13 QSPR models are highly promising for assessing target properties based on descriptors and are derived solely from molecular structures by fitting available experimental data in the absence of domain knowledge. The models have been successfully used to predict pure component properties of interest to researchers. Saldana et al. adopted QSPR methods to predict the flash point and cetane number for fuel compounds, including oxygenated compounds.14 The boiling points,
specific gravities and refractive indices of hydrocarbons were predicted by Ha et al.13 Additionally, Al-Fahemi et al.15 developed a QSPR model to predict the RON of hydrocarbons. They constructed an eight-parameter linear model using multiple linear regression (MLR) based on 40 training molecules to predict the RON of 25 molecules with a determination coefficient of R2 = 0.942 and a standard error of s = 6.328 units. However, the relations between the descriptors and the octane number are complicated, and linear models do not always provide satisfactory outcomes. To determine the MON of paraffins, Zhu et al.16 developed a nonlinear QSPR model using support vector machine (SVM) techniques based on 72 molecules, with a determination coefficient of R2 = 0.934 and an average absolute error of 3.961 units. Six descriptors were selected using PLS wrapped in a genetic algorithm. Al-Fahemi and Zhu both used small databases, which restricts the application range of their models. This study develops an accurate general QSPR model to predict the RON and MON of pure components based on large and diverse databases that include oxygenates, nitrogen compounds and hydrocarbons.

Thousands of descriptors are often included as independent variables in QSPR models. However, only a small subset carries the necessary information related to the target property. Although it is possible to construct QSPR models that include all the descriptors, there are advantages to minimizing the number of descriptors: (i) potentially improving the prediction accuracy by removing redundant and irrelevant descriptors, (ii) increasing the computing speed by decreasing the number of inputs, and (iii) decreasing the degrees of freedom when the number of descriptors is much larger than the number of available samples.17

The two main strategies used to reduce the number of descriptors are the filter and wrapper methods.18 The filter method is used to remove features that do not greatly contribute to the model, and this technique is independent of the regression algorithm (predictor). The filter method considers performance evaluation measures computed directly from the data without feedback from predictors.19 In most cases, relevance scores between features and properties are calculated using a function, such as the correlation coefficient or information content (mutual information), and low-scoring
features are removed. The filter method is substantially more computationally efficient than other methods since it avoids the evaluation of feature subsets; however, this method can produce bias in the prediction as it neglects the inter-correlations among features. Thus, the filter method is prone to generating subsets that contain variables with high collinearity, which may lead to losing important inputs.20 The wrapper method avoids the problems of the filter method by considering bias during the feature subset evaluation.21 The core of the wrapper comprises two components. An induction algorithm based on the training samples is used as a supervised evaluation model to assess the candidates of the feature subset, and a feature subset selection search searches for more accurate feature subsets based on the evaluation of the induction algorithm. The feature selection algorithm runs as a wrapper around the induction algorithm to optimize the feature subsets. The wrapper can achieve higher accuracy than the filter, but it always adopts a large number of variables for a small incremental improvement in model performance, which can result in an overfitted model. Occam’s razor, or the principle of parsimony, states that all necessary information should be used for modeling and nothing more; thus, it is undesirable to include unnecessary features in the model.22 Feature selection presents a multi-criterion optimization issue: the prediction accuracy and the complexity of the model (e.g., number of features). The approach of combining different criteria into one fitness function, which is used in genetic algorithms, has been studied to search for appropriate solutions for feature subsets.23,24 In one fitness function, a cost term representing the complexity of candidate feature subsets is summed with the accuracy term based on a particular weight. The main drawback of the fitness function is the difficulty of establishing an appropriate trade-off between the different terms in one fitness function. To avoid this problem, Emmanouilidis et al.25 proposed multiobjective optimization to simultaneously maximize the solution performance and minimize the number of features. Multiobjective optimization has been shown to be effective on various commonly used data sets in the field of feature selection. Wrapper methods based on multiobjective optimization prefer solutions with a smaller number of
variables, in contrast to the larger number of variables selected by single-objective wrappers. Therefore, multiobjective optimization is more powerful in resisting overfitting and has better generalizability in practical applications.26,27 Since the relationships between the RON and MON are not explicit for pure components, it is necessary to build separate models for RON and MON. In this paper, after data preprocessing, two steps of feature selection are used to predict the octane numbers of pure components: (1) Filter - the Boruta algorithm is used to conservatively filter out redundant descriptors to avoid losing important inputs and (2) Multiobjective wrapper - a wrapper method is applied using two objective functions for the number of descriptors and the cross-validation (CV) accuracy of the training data sets. The optimized solutions are tested on validation subsets, and the most suitable feature subsets are selected. 2. Materials 2.1 Experimental Data The quality and size of a database are critical for building and evaluating a QSPR model. In this study, two databases are built with 279 and 273 samples for the RON and MON, respectively, using API 4528 and the API technical data book,29 and approximately 30 heteroatom compounds are included from reference 30. The databases contain not only pure hydrocarbons but also a number of oxygenates and nitrogen compounds. We aim to create universal QSPR models to predict octane numbers for compounds that could exist in gasoline fractions or potential gasoline alternatives. The octane numbers in the databases range from negative values (for a few compounds) to more than 120. The distribution of the octane numbers is shown in Figure 1. The two dashed red lines indicate the average values calculated from the RON and MON databases. The RON and MON databases are randomly divided into three groups, the training set, validation set and test set, which contain 70%, 20% and 10% of the samples, respectively. The training set is used to train the machine learning algorithms and to find optimal subsets of descriptors during the feature selection procedure. The validation set is used to determine whether and when overfitting occurs during the
2.2 Molecular Structures

Simplified molecular input line entry specification (SMILES) strings are manually written for each molecule in our database, considering complicated cases such as chirality and configurations around double bonds. Open Babel (free software) is then used to convert the SMILES into canonical form.31 Next, the SMILES file is input into E-dragon,32,33 a free online version of the Dragon software, for the descriptor calculations. E-dragon supports three-dimensional (3D) atom coordinate calculations via CORINA34 and optimized 3D structure calculations using HYDROGENS online. A total of 1667 descriptors is generated for each molecule.

3. Methodology

The RON and MON databases are represented by 279×1667 and 273×1667 matrices, respectively. Problems can arise when the number of variables is greater than the number of samples; therefore, it is necessary to reduce the dimensions by selecting subsets of the variables. However, obtaining suitable descriptor subsets is difficult because the computational complexity of an exhaustive search is O(2^n), where n is the number of descriptors. When n is large, an exhaustive search is impractical. Thus, feature subset selection is a key procedure for constructing QSPR models.

Figure 2 presents an overall flow chart of the proposed methodology. The data are pretreated before feature selection. In the first stage of feature selection, a filter is adopted to reduce the matrix dimensions since it is computationally fast. A multiobjective wrapper is then utilized to simultaneously refine the prediction accuracy and further decrease the number of descriptors. Furthermore, the elite solutions (optimized feature subsets) existing in the non-dominated fronts identified by the multiobjective optimization algorithm are collected to provide more options for the decision makers. The elite solutions are checked for overfitting using the validation data set because the CV used during the training process is not always able to overcome this intricate problem.35,36 Finally, the best-performing groups of descriptors
on both the training and validation data sets are employed as inputs to a machine learning algorithm to assess the generalizability to independent test subsets.

3.1 Data Preprocessing

Before feature selection, data preprocessing is performed to remove all low-information descriptors from the two databases, including constant, low-variance and pairwise linearly correlated descriptors. (1) Constant descriptors (standard deviation less than 0.0001) and descriptors with missing values are removed. (2) Low-variance descriptors, which may become constant descriptors when the data are split into the training, validation and test subsets, are removed. (3) Pairwise linearly correlated descriptors are identified by the Pearson correlation coefficient ρ_{m,n}, which is commonly used in quantitative structure activity relationship (QSAR)/QSPR models.14,37 When the absolute value |ρ_{m,n}| for two descriptors is greater than or equal to a threshold value, the two descriptors are matched as a pair, and the one that is more highly correlated with all the other descriptors is deleted. The threshold value is set to 0.98, as recommended by the Dragon software.38 The Pearson correlation coefficient, which is used to identify collinearity among descriptors, is calculated as

ρ_{m,n} = cov(m, n) / (σ_m σ_n)        (1)

where ρ_{m,n} is the correlation coefficient between descriptors m and n, cov(m, n) is their covariance, and σ_m and σ_n are the standard deviations of descriptors m and n, respectively. Some of the correlation coefficients between the descriptors derived from the RON database are shown in Table S1 in the supporting information; this table shows that two descriptors, Sp and Sv, are paired successfully. After data preprocessing, 698 and 682 of the 1667 total descriptors remain in the RON and MON databases, respectively. Each remaining descriptor is centered and scaled to a mean of 0 and a standard deviation of 1 to eliminate dimensional differences among descriptors.
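These three rules can be summarized in a minimal R sketch (illustrative only, not the exact implementation used in this work; `X` is a placeholder for a numeric samples-by-descriptors matrix, and the correlated pairs are resolved greedily, one pair at a time):

```r
## Illustrative preprocessing: drop missing/constant descriptors, remove one
## descriptor from each highly correlated pair (eq 1), then center and scale.
preprocess_descriptors <- function(X, sd_tol = 1e-4, cor_cutoff = 0.98) {
  X <- X[, colSums(is.na(X)) == 0, drop = FALSE]     # descriptors with missing values
  X <- X[, apply(X, 2, sd) >= sd_tol, drop = FALSE]  # (near-)constant descriptors
  rho <- abs(cor(X))                                 # |Pearson correlations|
  diag(rho) <- 0
  while (ncol(X) > 1 && max(rho) >= cor_cutoff) {
    pair <- which(rho == max(rho), arr.ind = TRUE)[1, ]  # most correlated pair
    drop <- pair[which.max(rowMeans(rho)[pair])]     # delete the one more correlated
    X    <- X[, -drop, drop = FALSE]                 # with all other descriptors
    rho  <- rho[-drop, -drop, drop = FALSE]
  }
  scale(X)                                           # mean 0, standard deviation 1
}
```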
3.2 Filter Step

The Boruta algorithm39 efficiently identifies noisy variables from the inputs via a variable importance procedure computed by multiple runs of a random forest40 (RF) algorithm. The algorithm randomly generates a copy variable (shadow attribute or permuted attribute) for each descriptor and ranks the original descriptors and shadow attributes together based on a Z-score. The Z-scores can be used to evaluate the variable importance distribution adopted by the RF. Only original features with Z-scores exceeding the highest Z-score of the shadow attributes are considered relevant features in the prediction models, and the other features are omitted. The computational complexity of the Boruta algorithm depends on that of the RF and increases linearly with the number of features. Thus, this algorithm is suitable for problems in which the number of variables is larger than the number of objects, such as the case considered in this study.

The post-filter descriptors are used as inputs to various machine learning algorithms to select the best algorithm for constructing the regression model. That algorithm is then used as the induction algorithm for the multiobjective wrapper in the next step. Linear and nonlinear machine learning algorithms are involved in the comparison, including PLS, feed forward neural networks41 (FFNNs), RF40 and SVM42, which have all been shown to be efficient in QSAR/QSPR.
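The filter step can be sketched in a few lines of R with the Boruta package46 (illustrative only; `X_train` and `ron` are placeholders for the preprocessed training descriptors and the measured RON values):

```r
library(Boruta)

set.seed(1)
bor <- Boruta(x = X_train, y = ron, maxRuns = 500)   # repeated RF importance runs

## As described above, the Tentative descriptors are kept together with the
## Confirmed ones to avoid losing important inputs.
keep   <- getSelectedAttributes(bor, withTentative = TRUE)
X_filt <- X_train[, keep, drop = FALSE]
```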
3.3 Multiobjective Wrapper Step

As mentioned in the Introduction, the wrapper method is composed of two components, an induction algorithm and a feature subset optimization search. In the latter, the candidate feature subsets are encoded as strings of “0” and “1”, and the length of a string is equal to the number of features. When a descriptor is selected for a candidate feature subset, the position corresponding to the descriptor is set to “1” in the string; otherwise, it is set to “0”. These strings are decoded into feature subsets and imported into the induction algorithm to assess and rank the candidate feature subsets based on a predefined objective function. Depending on the evaluation results, the optimization component in the wrapper conducts a search to find better strings. The optimization component runs around the induction algorithm until the optimization process converges or the objective function reaches the threshold. For regression problems, the common measures used for induction algorithms are the mean absolute error (MAE), mean squared error, and root mean squared error (RMSE).

Single-objective wrappers are prone to selecting feature subsets comprising a large number of variables with little improvement in training accuracy, which can decrease the model performance on hold-out samples and result in overfitting. In this study, two objective functions are simultaneously evaluated and optimized using the wrapper method. In addition to the common measure of the MAE, a function to assess the feature subset size is added and conveniently implemented by summing the “1” values in each string. The multiobjective optimization search generates a convex Pareto front, where each solution is not dominated by any other solution. The Pareto front reveals the relationship between the size of a feature subset and the performance on the training data set. A curve can be generated by evaluating the optimized feature subsets corresponding to the solutions over a validation data set to verify the generalizability of the Pareto-optimal solutions. The Pareto front and validation curve can be compared to verify whether and when overfitting occurs during feature selection optimization. Furthermore, a multiobjective wrapper can generate an ensemble of feature subsets with different lengths rather than the single subset produced by a single-objective wrapper. The ensemble of feature subsets provides more options for decision makers depending on their demands.

The non-dominated sorting genetic algorithm II (NSGA-II) is utilized in this study to search for the multiobjective global optima. The method was developed by Deb et al.43 and has been successfully implemented to solve multiobjective optimization problems in a wide range of domains. A detailed description can be found in the original report.44
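A compact R sketch of the wrapper using the mco and e1071 packages follows (illustrative only, not the exact implementation; `X_filt` and `y` are placeholders for the post-filter training descriptors and octane numbers, the SVM and NSGA-II settings are those reported later in section 4.3, and nsga2() searches over real-valued vectors, so a value of at least 0.5 is decoded as a selected descriptor):

```r
library(mco)     # NSGA-II (ref 49)
library(e1071)   # SVM induction algorithm (ref 48)

set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(X_filt)))   # fixed 5-fold assignment

objectives <- function(bits) {
  sel <- bits >= 0.5                                   # decode the "0"/"1" string
  if (!any(sel)) return(c(Inf, Inf))                   # reject empty subsets
  cv_mae <- mean(sapply(1:5, function(k) {             # objective 1: 5-fold CV MAE
    fit <- svm(X_filt[folds != k, sel, drop = FALSE], y[folds != k],
               gamma = 0.02, cost = 256)
    mean(abs(predict(fit, X_filt[folds == k, sel, drop = FALSE]) - y[folds == k]))
  }))
  c(cv_mae, sum(sel))                                  # objective 2: subset size
}

res <- nsga2(objectives, idim = ncol(X_filt), odim = 2,
             lower.bounds = rep(0, ncol(X_filt)),
             upper.bounds = rep(1, ncol(X_filt)),
             popsize = 100, generations = 200,
             cprob = 0.7, mprob = 0.3, cdist = 5, mdist = 5)
```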
3.4 Software

The research is implemented in the R open source programming language and software environment.45 The following packages are used in this study: Boruta46 for the Boruta algorithm, randomForest47 for RF, e107148 for PLS, FFNN and SVM, and mco49 for NSGA-II. The ggplot250 package is used for plotting, and doMC and foreach51 are used for multi-core parallel computing in an Ubuntu 14.04 LTS operating system.

4. Results and Discussion

4.1 Filter Feature Selection Results

In the Boruta algorithm, the Z-score of each descriptor is computed based on 500 RF runs. The default values of the parameters in the RF algorithm are utilized as suggested by Svetnik et al.52 The parameters are set to 1000 for ntree and to one-third of the number of imported descriptors for mtry (233 for RON, 227 for MON). In the RF algorithm, ntree represents the number of “trees” used in the model, and mtry is the number of variables randomly sampled at each split.

Boruta divides the inputs into three categories: Rejected, Tentative and Confirmed. These categories represent different ways of handling the inputs: deleting, pending and retaining. In Figure 3, the blue boxes represent the shadow variables that are randomly produced by Boruta, and the rightmost box represents the shadow variable with the highest Z-score. Descriptors with statistically lower Z-scores than that of the rightmost shadow variable are marked by red boxes and can be safely deleted. Descriptors marked with green boxes should be included in the model. Yellow boxes indicate Z-scores that are close to that of the rightmost shadow variable and for which the Boruta algorithm cannot decide their status. In this study, to avoid losing important descriptors, all the descriptors in the Tentative group are retained and combined with those in the Confirmed group as inputs of the multiobjective wrapper. The distributions of the descriptors in the three categories are shown in Table 1. Table 1 shows that the descriptor distributions for the RON and MON databases are similar. Most of the descriptors in the two databases are in the Rejected group. After combining the descriptors in the Tentative and Confirmed categories, 180 and 185 descriptors remain in the RON and MON databases, respectively.

4.2 Selection of Regression Models

In this section, several regression algorithms are implemented and compared on the post-filtered databases to determine the most effective algorithm for predicting the
RON and MON. Only the most effective algorithm is used to construct the regression models. The tested algorithms include PLS, FFNN, RF and SVM. Five-fold cross-validation (CV) over the training set is adopted to tune the regression parameters. The samples included in each fold are the same for each method to compare the regression approaches under identical conditions. The validation and test subsets are used to measure the generalizability of the regression models.

Five molecules with large residuals are found while training the models for both the RON and MON databases. These molecules are identified as outliers and deleted from the databases. One of the main causes of outliers is that the databases contain various compounds with a large range of molecular weights. Thus, due to sparsity in the data space, the outliers are isolated from the other samples, which increases the difficulty of precisely extrapolating the values using machine learning algorithms. The outliers are not invalid samples, and they will be studied in our future work. The molecule distributions of the training, validation and test subsets in the databases after deleting the outliers are presented in Table 2.

The performance of the regression methods is dependent on the tuning parameters. The following parameter ranges are used for tuning: 1 to 10 components for PLS; 5 to 60 (in increments of 5) neurons in the hidden layer for the FFNN; 1000 “trees” and one-third of the descriptors for mtry in RF; and a grid search for SVM, where epsilon (ε) of the insensitive-loss function ranges from 0.1 to 0.8 in increments of 0.1, the cost of constraints violation C takes the values [2^3, 2^4, …, 2^8], and gamma (γ), which represents the width of the radial basis function, ranges from 0.005 to 0.020 in increments of 0.005. The global regression model performance for each subset of the two databases is shown in Table 3 and Table 4. The numbers in the brackets indicate the final optimized values of the tuning parameters. The MAE and R2 are used to compare the model performance. The CV errors, as an internal validation measure of the training set fit, are combined with the external validation errors calculated for the validation and test subsets to determine the total error, which is displayed in the last row of these tables.
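The SVM grid search described above can be written with e1071 (illustrative only; `X_train` and `y_train` are placeholders for the training descriptors and octane numbers):

```r
library(e1071)

tuned <- tune.svm(X_train, y_train,
                  epsilon = seq(0.1, 0.8, by = 0.1),       # insensitive-loss width
                  cost    = 2^(3:8),                       # constraint-violation cost C
                  gamma   = seq(0.005, 0.020, by = 0.005), # RBF kernel width
                  tunecontrol = tune.control(sampling = "cross", cross = 5))
tuned$best.parameters   # optimized (gamma, cost, epsilon) combination
```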
Comparing the last rows in Tables 3 and 4 shows that the regression model performance decreases in the following order for both RON and MON predictions: SVM > RF > FFNN > PLS. The MAEs of the SVM are superior to those of RF, which is the second best model, by 1.30 and 1.79 units for the RON and MON predictions, respectively. The differences among the other models are less than 1 unit. With the exception of the test subset of the MON database, SVM outperforms all other methods over all database subsets. Therefore, whether the entire database or only a partial database is used, SVM is the ideal induction algorithm for the multiobjective wrapper.

To elaborate on the order of superiority, the linear PLS is inferior to the three nonlinear methods since there is a highly nonlinear relationship between the descriptors and the octane number. SVM is preferable to FFNN because SVM utilizes structural risk minimization rather than the empirical risk minimization used by artificial neural networks, which helps it avoid overfitting.53 For SVM and RF, there is no definite theory to explain which algorithm is better. Statnikov et al.54 showed that SVM generally outperformed RF on 22 microarray data sets. In this study, the data sparsity is an additional factor that favors SVM. Additionally, SVM utilizes support vectors instead of all the training samples to construct the regression model, which can decrease the impact of unrelated samples.

Furthermore, the performance of SVM relies on selecting the model parameters, which is time consuming. The 3D (ε, γ and C) scatterplots of SVM for RON and MON prediction using post-filter descriptors are shown in Figure S1 and Figure S2 in the supporting information. The average RMSE values in the five-fold CV are used to evaluate the parameter combinations. The scale is expressed by the color bars on the right side of the two plots. Both figures show that the average RMSE decreases with decreasing gamma (γ) and increasing cost (C). The impact of epsilon (ε) on the performance of the SVM model can be neglected since the colors of all points in Figure S1 and Figure S2 are constant in the vertical direction. Thus, epsilon can be neglected during wrapper optimization, which improves the computing time.

4.3 Results of the Multiobjective Wrapper

NSGA-II uses the following parameters: mutation and crossover probabilities of 0.3 and 0.7, respectively; mutation and crossover distribution indices of 5; 100 individuals for each generation; and 200 total generations for each run.
The impact of the NSGA-II parameters on performance is not remarkable, and the procedure for tuning the parameters is not described in detail. In practice, it is difficult to guarantee that the solutions are truly Pareto optimal; thus, a set of non-dominated solutions forming a non-dominated front is utilized to approximate the Pareto front with acceptable trade-offs between the objective functions.

The performance of the multiobjective wrapper using SVM regression coupled with NSGA-II is shown in Figures 4 and 5 for the RON and MON databases, respectively. The x axis shows the number of descriptors in the candidate feature subsets (i.e., the complexity of the model), and the y axis shows the errors. To distinguish the two curves in the figures calculated over the training and validation subsets, different measures are utilized to estimate the model performance over the two subsets. The mean RMSE averaged over the five training folds is used for the training error, and the MAE is used for the validation error. The non-dominated fronts generated by optimization over the training subsets are displayed as solid red lines, and the non-dominated solutions over the validation data sets are shown as dashed blue lines.

Figures 4 and 5 show that the solid red lines decrease with an increasing number of descriptors; however, the rate of improvement diminishes as the number of descriptors grows. Therefore, adding descriptors to a feature subset enhances the internal prediction ability over the training data, but this improvement decreases with increasing feature subset length. In contrast, the dashed blue lines indicate that the external generalizability of a model is not as monotonic as the red lines. Figure 4 shows a clear valley in the blue line, while Figure 5 shows the blue line fluctuating along the x axis and increasing after the dashed black line, providing strong evidence of overfitting caused by the optimization process. One of the main causes of overfitting is that the wrapper method introduces an excessive number of descriptors into the feature subsets to maximize the internal accuracy of the training samples. Even CV fails to fully overcome this issue.35,36 The multiobjective wrapper successfully identifies the overfitting problem.
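The validation curve can be obtained by re-scoring each solution on the non-dominated front over the validation subset, sketched below (illustrative only; `res`, `X_filt` and `y` follow the wrapper sketch in section 3.3, and `X_val` and `y_val` are placeholders for the validation data):

```r
front <- res$value[res$pareto.optimal, , drop = FALSE]   # (training CV MAE, size)
sets  <- res$par[res$pareto.optimal, , drop = FALSE] >= 0.5

val_mae <- apply(sets, 1, function(sel) {                # external MAE per subset
  fit <- svm(X_filt[, sel, drop = FALSE], y, gamma = 0.02, cost = 256)
  mean(abs(predict(fit, X_val[, sel, drop = FALSE]) - y_val))
})
## Plotting val_mae against front[, 2] gives the dashed validation curve.
```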
To avoid overfitting, the validation subsets are used to re-evaluate the solutions on the non-dominated fronts. Based on the performance over the validation subsets, feature subsets of 12 and 23 descriptors, indicated by the dashed black lines in Figure 4 and Figure 5, are selected to build the regression models for the RON and MON data, respectively. The feature subsets simultaneously provide the highest accuracies in the validation data sets and acceptable internal prediction capacity for the training samples. These options are not uniquely defined, allowing researchers to make their own decisions based on the two figures. For example, the point indicated by the triangle in Figure 5 has a slightly higher value than the lowest value over the validation subset, but this feature subset has only 13 descriptors. Thus, its length is much shorter than that of the subsets selected in this paper. The shorter subset for MON prediction is a good option in terms of simplicity.

Furthermore, the optimization search depth can impact the quality of the non-dominated fronts generated by NSGA-II.55 The non-dominated fronts calculated by the 10th, 50th, 100th, 150th, and 200th generations of NSGA-II over the RON and MON databases are displayed in Figures 6 and 7, respectively. The non-dominated fronts calculated by the 200th generation are displayed as purple curves, and the solutions produced in preceding generations are shown in other colors. The figures show that the points corresponding to different generations gradually approach the purple curves until the points of the 150th generation almost coincide with those of the 200th generation. The optimization is stopped at the 200th generation because the improvement is negligible in the final 50 generations. The determination of the optimal time to stop the search is beyond the scope of this work.

The feature subsets selected by the multiobjective wrapper (MOW) over the RON and MON databases are coupled with the SVM (MOW_SVM) to construct the final regression models, and their performance is summarized in Table 5 and Table 6. The results of the regression models computed by the post-filter features combined with SVM (F_SVM) are provided for comparison. The bracketed values represent the number of descriptors used in the model. The parameters used in the MOW_SVM
models to predict RON are γ = 0.02 and C = 256, and those used to predict MON are γ = 0.02 and C = 128, which are optimized via 5-fold CV. Tables 5 and 6 show that the performance of the feature subsets selected by the multiobjective wrapper is superior to that of the post-filter descriptors with respect to the total accuracy (in the second to last rows) calculated over the entire data sets. The multiobjective wrapper shows an improvement of 1.44 and 1.30 units for RON and MON, respectively. Considering the stability and regularization of the SVM,56 this difference is not produced by random variation of the models. In Table 5, the MOW_SVM significantly improves the CV accuracy while showing similar performance to that of F_SVM over the validation and test data sets. In addition, MOW_SVM drastically reduces the complexity of the model, as reflected by the lower number of descriptors included in the models. For the MON database, as shown in Table 6, MOW_SVM increases the accuracies for all data subsets using only 23 descriptors compared to the 185 descriptors used in F_SVM. As shown in Tables 3 and 4, SVM outperforms the other regression methods; thus, MOW_SVM not only possesses the most accurate prediction ability but also significantly decreases the model complexity.

The weighted average of the MAEs calculated over the validation and test subsets, which represents the prediction MAE and reflects the external generalizability of the constructed models, is shown in the last rows of Tables 5 and 6. In these tables, the prediction MAEs of the MOW_SVM models for RON and MON are 3.90 and 4.24 units, respectively. This performance is superior to that obtained by Albahri,11 who used a group contribution method to obtain prediction MAEs of 4.0 and 5.7 units for RON and MON, respectively. To the best of our knowledge, the model of Albahri11 is the only one reported in the literature that includes more than 200 molecules in the database to predict the octane number of pure components. The scale of the database impacts the reliability and application range of the constructed models. The RON and MON databases used in this paper include more than 260 molecules and are larger than those in reference 11. Furthermore, our databases contain oxygenated and nitrogenous compounds and are more diverse than those in the previously mentioned research. Al-Fahemi et al.15 developed a QSPR model based on MLR to predict the RON; 40
training molecules were used to predict 25 other compounds, and the standard error was 6.328 units. The MAE calculated from the results was 5.341 units (since the MAE was not provided in the original paper). The results of this paper are superior to those of Al-Fahemi et al. because nonlinear models coupled with appropriate feature selection techniques are utilized instead of linear models. The MAE of the QSPR model used by Zhu et al.16 to predict MON was 3.96 units, which is slightly smaller than that in this work. However, that result was calculated over a smaller database of only 72 molecules. In addition, the model focused only on the MON of paraffins without considering other critical types of hydrocarbons, such as olefins and aromatics. Compared with the model of Zhu et al., the model presented in this paper has similar accuracy and a much wider application range, including hydrocarbons and oxygenated and nitrogen-containing compounds. The optimization time of the multiobjective wrapper was 13.3 hours for one run on 10 server cores for parallel computing. Although the computing time is longer than that of a wrapper using a single-objective optimization algorithm (e.g., 4.8 hours for a genetic algorithm), this time is acceptable when considering the satisfactory results produced by the multiobjective wrappers. 4.4 Confidence Interval using Bootstrap Parity plots comparing the predicted values of RON and MON computed by MOW_SVM and experimental data are displayed in Figures 8 and 9, respectively. The error bars in the two plots are computed using bootstrapping. The prediction provided by SVM regression is a point estimate; thus, the uncertainty assessment must be completed using other methods. Bootstrap methods are applied in SVM regression to obtain prediction intervals without making assumptions about the probability distribution.57 These methods tend to be more accurate on small data sets58,59 since they rely on a limited distribution constructed from the data rather than asymptotic results.60 There are two groups of bootstrap techniques that can be distinguished by sampling a regression model based on pairs or residuals.60 Percentile confidence intervals computed by bootstrap sampling based on pairs are used in this study, and the details of the algorithm are described in references 57 and 60. Each interval is calculated
based on 2000 runs. In Figures 8 and 9, some of the white points representing the training samples have large intervals, especially those with octane numbers between 0 and 25. This result occurs in part because during the CV, only four-fifths of the molecules in the training set are used in the bootstrapping to estimate the confidence intervals of the other one-fifth of the samples. In contrast, all the molecules in the training set are adopted to estimate the confidence intervals of the molecules in the validation and test subsets. Larger confidence intervals are generally produced when a smaller subset is used for bootstrapping; using the entire training set yields narrower intervals and better prediction accuracy. The other reason for this result is that the distribution of molecules with octane numbers between 0 and 25 is sparse, and this sparsity increases the size of the confidence intervals. Both of these effects contribute to the large confidence intervals for these points. However, most of the points are close to the diagonal line with reasonable confidence interval scales. The parity plots comparing the predicted values of RON and MON computed by the post-filter descriptors and the experimental values are displayed in Figure S3 and Figure S4 in the supporting information.
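The percentile intervals from bootstrap sampling based on pairs can be sketched as follows (illustrative only; `X_tr`, `y_tr` and `X_new` are placeholders for the training pairs and the molecules being predicted, and the SVM settings are the RON values from section 4.3):

```r
bootstrap_interval <- function(X_tr, y_tr, X_new, B = 2000, level = 0.95) {
  preds <- replicate(B, {
    i   <- sample(nrow(X_tr), replace = TRUE)        # resample (x, y) pairs
    fit <- svm(X_tr[i, , drop = FALSE], y_tr[i], gamma = 0.02, cost = 256)
    predict(fit, X_new)                              # point predictions per run
  })
  a <- (1 - level) / 2
  apply(preds, 1, quantile, probs = c(a, 1 - a))     # percentile bounds per molecule
}
```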
4.5 Meaning of the Selected Descriptors

The 12 and 23 descriptors selected by the multiobjective wrapper for RON and MON are presented in Table 7 along with their meaning. The meanings of the descriptors are in agreement with petroleum chemistry knowledge about anti-knocking properties. The branching degree of hydrocarbons positively influences the octane number, and the descriptors ZM2V, SPAN, nCp and IC1 describe the branching of molecules from different viewpoints.61 For a straight-chain alkane and an isoparaffin with the same number of carbon atoms, the straight-chain molecule has a larger diameter (SPAN), a smaller number of terminal primary carbons (nCp), and less structural complexity per vertex (IC1).

For a given set of molecules, the difference in the descriptor values among molecules in an optimized feature subset should reflect the octane number differences of the molecules and can be used to judge the rationality of the optimized feature subset. For example, the descriptors ALOGP2 and ALOGPS_logS represent the hydrophobic and hydrophilic properties of the compounds, respectively. Because oxygenated compounds are contained in the RON database, the two descriptors can distinguish oxygenates from hydrocarbons to provide accurate predictions for the regression models. Compared to 2,2,4-trimethylpentane, isopropanol possesses weaker hydrophobicity and stronger hydrophilicity. As mentioned previously, the values of the descriptors are pretreated to eliminate dimensional differences in the databases. The ALOGP2 value of isopropanol is 0.04, which is much lower than the value of 4.49 for 2,2,4-trimethylpentane. Furthermore, the ALOGPS_logS values of isopropanol and 2,2,4-trimethylpentane are 0.80 and -4.14, respectively. The gap in these values reflects the difference in hydrophilicity between isopropanol and 2,2,4-trimethylpentane.

One of the main difficulties in accurately predicting the octane number is the large number of isomers in the boiling range of gasoline, especially those that differ only in configuration, such as cis- and trans-1,3-dimethylcyclohexane. These two molecules have similar values for most of the descriptors, but their RON values are 71.7 and 66.9 units, respectively. This gap in RON values is far from negligible, but most of the descriptors do not reflect it. The Mor03u values for the two molecules are -7.181 and -5.517, and this gap adequately reflects the difference in the RON values of the two isomers. 3D-MoRSE descriptors can successfully handle cases in which a QSAR study includes a set of structurally similar compounds. The corresponding values can be computed using a few pairs of neighboring atoms,62 which explains why many 3D-MoRSE descriptors are selected by our methods, since a large number of isomers are included in the databases. The 3D-MoRSE descriptor Mor03v was also selected by Zhu et al.16 to predict the MON of paraffins. If a QSPR model lacks proper feature selection, the useful descriptors are diluted by the large number of irrelevant descriptors. The examples mentioned above serve to verify the efficiency of the feature selection methods developed in this paper.
5. Conclusion

QSPR models are developed to predict the pure component RON and MON. Two databases composed of published data include the MON and RON values of oxygenated and nitrogen-containing compounds as well as hydrocarbons. Feature subset selection is used to improve the performance and reduce the complexity of a QSPR model by removing redundant and irrelevant descriptors. In this study, a two-step feature selection method is utilized to optimize both the length of the feature subsets and the generalizability of the models. Based on multiple runs of RF, the Boruta algorithm is used as a filter to remove irrelevant descriptors from the databases. After the filtering step, 180 and 185 descriptors remain and are used as inputs for the multiobjective feature selection method. Based on the post-filtered descriptors, several linear and nonlinear regression methods, including PLS, FFNN, RF and SVM, are compared, and SVM regression is selected as the evaluation algorithm. In the second step, a multiobjective optimization algorithm (NSGA-II) runs around the SVM as a wrapper to simultaneously minimize the lengths of the feature subsets and maximize the model performance. Thus, feature subsets with 12 and 23 descriptors are selected to predict the RON and MON based on the performance over the validation data sets. The RON and MON regression models for pure components are constructed using the two optimized subsets of descriptors coupled with the SVM regression. These models have higher generalizability and are able to avoid the overfitting problem that is commonly produced by single-objective feature selection. The prediction MAEs of RON and MON are 3.90 and 4.24 units, respectively. Compared with previous studies, the models developed in this work achieve higher accuracies, even though they are based on more diverse databases. Furthermore, bootstrap techniques are used to calculate the confidence intervals of the estimations produced by the SVM regression models.

Associated Content

Table S1 shows part of the correlation coefficients among the descriptors derived from the RON database. Figures S1 and S2 display the parameter scatter plots of SVM for the RON and MON data using post-filter descriptors. Figures S3 and S4 show the parity plots of the RON and MON prediction using post-filter descriptors. The RON and MON databases are also included in the supplementary material in CSV format.
This material is available free of charge via the Internet at http://pubs.acs.org.

Author Information

Corresponding Author
*Email: [email protected] (L. Zhang).

Notes
The authors declare no competing financial interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 21506254 and U1162204) and the Science Foundation of China University of Petroleum, Beijing (No. 2462014YJRC020).

References

(1) Reid, R. C.; Prausnitz, J. M.; Poling, B. E. The Properties of Gases and Liquids; 1987.
(2) American Society for Testing and Materials (ASTM). ASTM D2699-15a, Standard Test Method for Research Octane Number of Spark-Ignition Engine Fuel; ASTM: West Conshohocken, PA, 2015.
(3) American Society for Testing and Materials (ASTM). ASTM D2700, Standard Test Method for Motor Octane Number of Spark-Ignition Engine Fuel; ASTM: West Conshohocken, PA, 2011.
(4) Kelly, J. J.; Barlow, C. H.; Jinguji, T. M.; Callis, J. B. Anal. Chem. 1989, 61, 313–320.
(5) Cooper, J. B.; Wise, K. L.; Groves, J.; Welch, W. T. Anal. Chem. 1995, 67, 4096–4100.
(6) Bao, X.; Dai, L. Fuel 2009, 88, 1216–1222.
(7) Nikolaou, N.; Papadopoulos, C. E.; Gaglias, I. A.; Pitarakis, K. G. Fuel 2004, 83, 517–523.
(8) Neurock, M.; Nigam, A.; Trauth, D.; Klein, M. T. Chem. Eng. Sci. 1994, 49, 4153–4177.
(9) Kahrs, O.; Brauner, N.; Cholakov, G. S.; Stateva, R. P.; Marquardt, W.; Shacham, M. Comput. Chem. Eng. 2008, 32, 1397–1410.
(10) Wakeham, W. A.; St. Cholakov, G.; Stateva, R. P. J. Chem. Eng. Data 2002, 47, 559–570.

ACS Paragon Plus Environment

Page 22 of 44

Page 23 of 44 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

(11) Albahri, T. Ind. Eng. Chem. Res. 2003, 42, 657–662. (12) St. Cholakov, G.; Wakeham, W. A.; Stateva, R. P. Fluid Phase Equilib. 1999, 163, 21–42. (13) (14)

Ha, H. Z.; Ring, Z.; Liu, S. Energy and Fuels 2005, 19, 152–163. Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Pidol, L.; Jeuland, N.;

Creton, B. Energy and Fuels 2011, 25, 3900–3908. (15) Al-Fahemi, J. H.; Albis, N. A.; Gad, E. A. M. J. Theor. Chem. 2014, 2014. (16) Zhu, X.; Juncheng, J.; Yong, P.; Rui, W. Nat. Gas Chem. Ind. C1 Chem. Chem. Eng. 2011, 36, 54–57. (in Chinese) (17) Shahlaei, M. Chem. Rev. 2013, 113, 8093–8103. (18) Dutta, D.; Guha, R.; Wild, D.; Chen, T. J. Chem. Inf. Model. 2007, 47, 989– 997. (19) Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L. A. Feature Extraction: Foundations and Applications; Springer-Verlag New York , 2006. (20) Saeys, Y.; Inza, I.; Larrañaga, P. Bioinformatics 2007, 23, 2507–2517. (21) Kohavi, R.; John, G. Artif. Intell. 1997, 97, 273–324. (22) Hawkins, D. M. J. Chem. Inf. Comput. Sci. 2004, 44, 1–12. (23) Yang, J.; Honavar, V. Intell. Syst. their Appl. IEEE 1998, 13, 44–49. (24) Chaikla, N.; Qi, Y. In Systems, Man, and Cybernetics, 1999. IEEE SMC’99 Conference Proceedings. 1999 IEEE International Conference on; IEEE, 1999; Vol. 5, pp 538–540. (25) Emmanouilidis, C.; Hunter, A.; MacIntyre, J. In Evolutionary Computation, 2000. Proceedings of the 2000 Congress on; 2000; Vol. 1, pp 309–316 vol.1. (26) Deb, K.; Raji Reddy. BioSystems 2003, 72, 111–129. (27) Oliveira, L. S.; Sabourin, R.; Bortolozzi, F.; Suen, C. Y. Int. J. Pattern Recognit. Artif. Intell. 2003, 17, 903–929. (28) American Society for Testing and Materials (ASTM), A. Knocking characteristics of pure hydrocarbons : developed under American Petroleum Institute Research Project 45; 1958. (29) Daubert, T. E.; Danner, R. P. API technical data book-petroleum refining. Am. Pet. Inst. (API), Washington DC; 1997. (30) Yanowitz, J.; Christensen, E.; McCormick, R. Utilization of renewable oxygenates as gasoline blending components; 2011. (31) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. J. Cheminform. 2011, 3, 33. (32) Tetko, I. V; Gasteiger, J.; Todeschini, R.; Mauri, A.; Livingstone, D.; Ertl, P.; Palyulin, V.; Radchenko, E. V.; Zefirov, N. S.; Makarenko, A. S.; Tanchuk, V. Y.; Prokopenko, V. V. J. Comput. Aided. Mol. Des. 2005, 19, 453–463.


(33) VCCLAB, Virtual Computational Chemistry Laboratory. http://www.vcclab.org, 2005.
(34) Sadowski, J.; Gasteiger, J.; Klebe, G. J. Chem. Inf. Comput. Sci. 1994, 34, 1000–1008.
(35) Radtke, P. V. W.; Wong, T.; Sabourin, R. In Proceedings of the 2006 IEEE International Joint Conference on Neural Networks; 2006; pp 3327–3334.
(36) Oliveira, L. S.; Sabourin, R.; Bortolozzi, F.; Suen, C. Y. In Proceedings of the International Conference on Pattern Recognition; 2002; Vol. 1, pp 568–571.
(37) Mansouri, K.; Ringsted, T.; Ballabio, D.; Todeschini, R.; Consonni, V. J. Chem. Inf. Model. 2013, 53, 867–878.
(38) Mauri, A.; Consonni, V.; Pavan, M.; Todeschini, R. MATCH Commun. Math. Comput. Chem. 2006, 56, 237–248.
(39) Kursa, M. B.; Jankowski, A.; Rudnicki, W. R. Fundam. Inform. 2010, 101, 271–285.
(40) Breiman, L. Mach. Learn. 2001, 45, 5–32.
(41) Chen, S.; Billings, S. A. Int. J. Control 1992, 56, 319–346.
(42) Vapnik, V. N. The Nature of Statistical Learning Theory; Springer, 1995.
(43) Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
(44) Deb, K. Multi-Objective Optimization Using Evolutionary Algorithms; Wiley, 2001.
(45) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2012.
(46) Kursa, M. B.; Rudnicki, W. R. J. Stat. Softw. 2010, 36, 1–13.
(47) Liaw, A.; Wiener, M. R News 2002, 2, 18–22.
(48) Dimitriadou, E.; Hornik, K.; Leisch, F.; Meyer, D.; Weingessel, A. e1071; R package, 2008.
(49) Mersmann, O. mco: Multiple Criteria Optimization Algorithms and Related Functions; R package, 2014.
(50) Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: New York, 2009.
(51) Revolution Analytics; Weston, S. foreach: Provides Foreach Looping Construct for R; R package, 2015.
(52) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958.
(53) Kecman, V. Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models; MIT Press, 2001.
(54) Statnikov, A.; Wang, L.; Aliferis, C. F. BMC Bioinformatics 2008, 9, 319.
(55) Loughrey, J.; Cunningham, P. In Research and Development in Intelligent Systems XXI; 2005; pp 33–43.
(56) Xu, H.; Caramanis, C.; Mannor, S. J. Mach. Learn. Res. 2009, 10, 1485–1510.
(57) Lins, I. D.; Droguett, E. L.; Moura, M. D. C.; Zio, E.; Jacinto, C. M. Reliab. Eng. Syst. Saf. 2015, 137, 120–128.
(58) Efron, B. Ann. Stat. 1979, 7, 1–26.
(59) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; Chapman & Hall: New York, 1993.
(60) De Brabanter, K.; De Brabanter, J.; Suykens, J. A. K.; De Moor, B. IEEE Trans. Neural Netw. 2011, 22, 110–120.
(61) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, 2000.
(62) Devinyak, O.; Havrylyuk, D.; Lesyk, R. J. Mol. Graph. Model. 2014, 54, 194–203.


Tables

Table 1. Distribution of the Number of Descriptors in the Three Categories Derived from the Boruta Algorithm

Category     RON    MON
Rejected     518    497
Tentative     54     66
Confirmed    126    119
Total        698    682


Table 2. Molecular Distributions over the Subsets in the RON and MON Databases

Subset       RON    MON
Training     197    192
Validation    51     51
Test          26     25
Total        274    268
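The counts above correspond to roughly a 72/19/9% split. As a purely illustrative sketch (the paper's actual splitting scheme is described in the main text), a random split reproducing the RON counts could be written as follows, with ron_data a placeholder data frame of molecules and descriptors.

```r
# Random three-way split matching the RON counts in Table 2 (197/51/26).
set.seed(1)
idx   <- sample(seq_len(nrow(ron_data)))   # shuffle row indices
train <- ron_data[idx[1:197], ]
valid <- ron_data[idx[198:248], ]          # next 51 molecules
test  <- ron_data[idx[249:274], ]          # remaining 26 molecules
```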


Table 3. Performance of the Global Regression Models over Each Subset in the RON Database Based on the Post-Filtered Descriptors

              PLS               FFNN            RF                     SVM
              (8 components)    (30 neurons)    (1000 trees,           (ε = 0.1, γ = 0.005,
                                                 mtry = 60)             C = 256)
Data set      MAE     R2        MAE     R2      MAE     R2             MAE     R2
5-fold CV     9.25    0.75      9.08    0.75    8.49    0.79           7.20    0.78
Validation    5.84    0.88      6.87    0.86    4.33    0.93           3.66    0.96
Test          8.18    0.65      4.73    0.85    6.76    0.62           4.23    0.88
Total (a)     8.51    0.74      8.25    0.75    7.55    0.79           6.25    0.81

(a) The total row gives the performance over the combined training, validation, and test data sets; the training set is included because its contribution is based on the 5-fold CV, which is a form of internal validation.
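To make the 5-fold CV column concrete, the sketch below fits an SVM with the hyperparameters listed for RON (ε = 0.1, γ = 0.005, C = 256) using the e1071 package (ref 48) and computes the CV MAE by hand; X_train and y_train are placeholders, and the fold assignment is a simplification rather than the authors' exact protocol.

```r
library(e1071)

set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(X_train)))   # random 5-fold split

cv_mae <- mean(sapply(1:5, function(k) {
  fit <- svm(x = X_train[folds != k, ], y = y_train[folds != k],
             type = "eps-regression", kernel = "radial",
             epsilon = 0.1, gamma = 0.005, cost = 256)
  # MAE on the held-out fold
  mean(abs(predict(fit, X_train[folds == k, ]) - y_train[folds == k]))
}))
cv_mae   # comparable in spirit to the 7.20 reported in the SVM column
```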


Table 4. Performance of the Global Regression Models over Each Subset in the MON Database Based on the Post-Filtered Descriptors

              PLS               FFNN            RF                     SVM
              (7 components)    (35 neurons)    (1000 trees,           (ε = 0.8, γ = 0.005,
                                                 mtry = 62)             C = 128)
Data set      MAE     R2        MAE     R2      MAE     R2             MAE     R2
5-fold CV     8.66    0.77      7.62    0.75    7.82    0.74           5.44    0.86
Validation    6.37    0.80      6.90    0.80    5.98    0.81           5.50    0.83
Test          6.75    0.70      5.83    0.81    4.59    0.83           4.66    0.82
Total (a)     7.95    0.73      7.32    0.75    7.17    0.75           5.38    0.86

(a) The total row gives the performance over the combined training, validation, and test data sets; the training set is included because its contribution is based on the 5-fold CV, which is a form of internal validation.
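The RF column's settings map directly onto the randomForest package (ref 47). A short sketch, again with placeholder X_train and y_train standing in for the post-filtered MON data:

```r
library(randomForest)

set.seed(1)
# 1000 trees with mtry = 62 candidate descriptors per split, as in Table 4.
rf_fit <- randomForest(x = X_train, y = y_train, ntree = 1000, mtry = 62)

# Out-of-bag predictions give an internal error estimate analogous to CV.
mean(abs(rf_fit$predicted - y_train))   # OOB MAE
```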


Table 5. Performance of the Models Obtained by Feature Selection Methods on RON

Data set        F_SVM (180 descriptors)    MOW_SVM (12 descriptors)
                MAE      R2                MAE      R2
5-fold CV       7.20     0.78              5.95     0.90
Validation      3.66     0.95              3.86     0.94
Test            4.23     0.88              3.97     0.85
Total (a)       6.25     0.81              5.37     0.90
Pred. MAE (b)   3.85     0.94              3.90     0.92

(a) The total row gives the performance over the combined training, validation, and test data sets; the training set is included because its contribution is based on the 5-fold CV, which is a form of internal validation.
(b) The prediction MAE is the weighted average of the MAEs calculated over the validation and test data sets.
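Footnote b can be verified with the subset sizes from Table 2 (51 validation and 26 test molecules for RON). For the F_SVM column,

Pred. MAE = (n_val × MAE_val + n_test × MAE_test) / (n_val + n_test)
          = (51 × 3.66 + 26 × 4.23) / (51 + 26) ≈ 3.85,

which reproduces the tabulated value.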


Table 6. Performance of the Models Obtained by Feature Selection Methods on MON

Data set        F_SVM (185 descriptors)    MOW_SVM (23 descriptors)
                MAE      R2                MAE      R2
5-fold CV       5.44     0.86              4.02     0.93
Validation      5.50     0.83              4.39     0.92
Test            4.66     0.82              3.94     0.86
Total (a)       5.38     0.86              4.08     0.92
Pred. MAE (b)   5.22     0.82              4.24     0.90

(a) The total row gives the performance over the combined training, validation, and test data sets; the training set is included because its contribution is based on the 5-fold CV, which is a form of internal validation.
(b) The prediction MAE is the weighted average of the MAEs calculated over the validation and test data sets.
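The MOW_SVM columns come from the multiobjective wrapper, which minimizes subset size and CV error simultaneously with NSGA-II (refs 43, 49). A compressed sketch with mco's nsga2 follows; the real-valued gene encoding, the 0.5 inclusion threshold, and the cv_mae_svm helper (e.g., the CV loop sketched after Table 3 wrapped in a function) are illustrative assumptions, not the paper's implementation.

```r
library(mco)

p <- ncol(X_filtered)   # descriptors surviving the Boruta filter

# Two objectives per candidate: (1) number of descriptors used and
# (2) 5-fold CV MAE of an SVM trained on that subset.
objectives <- function(w) {
  sel <- which(w > 0.5)                     # genes above 0.5 switch a descriptor on
  if (length(sel) == 0) return(c(p, 1e6))   # penalize empty subsets
  c(length(sel), cv_mae_svm(X_filtered[, sel, drop = FALSE], y_train))  # hypothetical helper
}

pareto <- nsga2(objectives, idim = p, odim = 2,
                lower.bounds = rep(0, p), upper.bounds = rep(1, p),
                popsize = 100, generations = 200)
pareto$value   # (subset size, CV MAE) pairs along the front
```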


Table 7. List of Molecular Descriptors Selected by Multiobjective Feature Selection

RON (12 descriptors):
ZM2V: second Zagreb index by valence vertex degrees; describes molecular branching [Topological indices]
SPAN: size descriptor [Geometrical descriptors]
nCp: number of terminal primary C [Functional group counts]
IC1: information content index (neighborhood symmetry of order 1); structural complexity per vertex [Information indices]
Mor03u: signal 03 / unweighted [3D-MoRSE descriptors]
Mor31m: signal 31 / weighted by mass [3D-MoRSE descriptors]
Mor10p: signal 10 / weighted by polarizability [3D-MoRSE descriptors]
Mor16m: signal 16 / weighted by mass [3D-MoRSE descriptors]
Mor29v: signal 29 / weighted by van der Waals volume [3D-MoRSE descriptors]
RDF020m: radial distribution function 020 / weighted by mass [RDF descriptors]
ALOGP2: squared Ghose-Crippen octanol-water partition coefficient (logP^2); lipophilic contributions of the atoms in the molecule [Molecular properties]
ALOGPS_logS: water solubility (hydrophilicity descriptor) [Molecular properties]

MON (23 descriptors):
nAB: number of aromatic bonds [Constitutional indices]
C.001: CH3R / CH4 [Atom-centered fragments]
C.002: CH2R2 [Atom-centered fragments]
SPAN: size descriptor [Geometrical descriptors]
MEcc: molecular eccentricity [Geometrical descriptors]
JGI2: mean topological charge index of order 2 [2D autocorrelations]
X0Av: average valence connectivity index of order 0; accounts for the presence of heteroatoms in the molecule as well as double and triple bonds [Connectivity indices]
SIC2: structural information content index (neighborhood symmetry of order 2); a normalized form of the information content that removes the influence of graph size [Information indices]
RDF020m: radial distribution function 020 / weighted by mass [RDF descriptors]
RDF020u: radial distribution function 020 / unweighted [RDF descriptors]
RDF050m: radial distribution function 050 / weighted by mass [RDF descriptors]
Mor06m: signal 06 / weighted by mass [3D-MoRSE descriptors]
Mor26m: signal 26 / weighted by mass [3D-MoRSE descriptors]
Mor31m: signal 31 / weighted by mass [3D-MoRSE descriptors]
Mor25v: signal 25 / weighted by van der Waals volume [3D-MoRSE descriptors]
Mor29p: signal 29 / weighted by polarizability [3D-MoRSE descriptors]
Mor04u: signal 04 / unweighted [3D-MoRSE descriptors]
Mor18m: signal 18 / weighted by mass [3D-MoRSE descriptors]
Mor29m: signal 29 / weighted by mass [3D-MoRSE descriptors]
Mor23v: signal 23 / weighted by van der Waals volume [3D-MoRSE descriptors]
Mor15e: signal 15 / weighted by Sanderson electronegativity [3D-MoRSE descriptors]
JhetZ: Balaban-like index from the Barysz matrix weighted by atomic number; accounts for heteroatoms and multiple bonds [2D matrix-based descriptors]
ESpm12u: spectral moment of order 12 from the edge adjacency matrix [Spectral moments]


Figure Captions

Figure 1. Distributions of the octane numbers: (a) RON; (b) MON. The dashed red line indicates the average octane number.
Figure 2. Flow chart of the research process.
Figure 3. Z-scores of the descriptors selected using the Boruta algorithm. Blue boxes represent the shadow variables randomly generated by Boruta; red boxes, the rejected (deleted) descriptors; yellow boxes, the tentative (pending) descriptors; and green boxes, the confirmed (retained) descriptors.
Figure 4. Performance of multiobjective feature selection on the RON database. MAE is shown by the blue line and RMSE by the red line.
Figure 5. Performance of multiobjective feature selection on the MON database. MAE is shown by the blue line and RMSE by the red line.
Figure 6. Generation-by-generation evolution of NSGA-II on RON.
Figure 7. Generation-by-generation evolution of NSGA-II on MON.
Figure 8. Parity plot comparing the predicted values and experimental data of RON based on SVM using 12 descriptors.
Figure 9. Parity plot comparing the predicted values and experimental data of MON based on SVM using 23 descriptors.
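If the Boruta object from the filter step is at hand, the package's plot method produces the kind of Z-score boxplot described for Figure 3, with the same color coding (blue shadows, red rejected, yellow tentative, green confirmed); the call below is a sketch that assumes the boruta_res object from the snippet after Table 1.

```r
# Boxplots of importance Z-scores per descriptor, colored by Boruta decision.
plot(boruta_res, las = 2, cex.axis = 0.5, xlab = "", ylab = "Z-score")
```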

[Figure 1. Paired histograms of the RON (panel a_RON) and MON (panel b_MON) distributions; x-axis: Rating (0 to 120), y-axis: count.]

[Figure 2. Flow chart: 1667 original descriptors → data pretreating (698 left) → Boruta filter (generate a copy of the variables, compare the Z-score values of all variables, delete irrelevant features) → multiobjective wrapper (the optimization component passes a feature subset (FS) to the induction algorithm, which returns the length of the FS and the evaluated accuracy) → model validation → elites of feature subsets.]

[Figure 3. Z-score boxplots from the Boruta run.]

[Figure 4. Errors vs. number of descriptors for RON; curves labeled Cross Validation and Prediction.]

[Figure 5. Errors vs. number of descriptors for MON; curves labeled Cross Validation and Validation.]

[Figure 6. RMSE vs. number of descriptors for RON; NSGA-II fronts at generations 10, 50, 100, 150, and 200.]

[Figure 7. RMSE vs. number of descriptors for MON; NSGA-II fronts at generations 10, 50, 100, 150, and 200.]

[Figure 8. Parity plot of prediction vs. experiment for RON; series: Training, Validation, Test.]

[Figure 9. Parity plot of prediction vs. experiment for MON; series: Training, Validation, Test.]