Article Cite This: Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX

pubs.acs.org/IECR

Developing Quantitative Structure−Property Relationship Models To Predict the Upper Flammability Limit Using Machine Learning

Shuai Yuan,†,‡ Zeren Jiao,†,‡ Noor Quddus,† Joseph Sang-II Kwon,‡ and Chad V. Mashuga*,†,‡

†Mary Kay O’Connor Process Safety Center, ‡Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, Texas 77843, United States


Supporting Information

ABSTRACT: In this study, machine learning algorithms, such as support vector machine (SVM), k-nearest-neighbors (KNN), and random forest (RF), are applied to improve the accuracy of quantitative structure−property relationship (QSPR) models that predict the upper flammability limit (UFL) of pure organic compounds. Ten molecular descriptors are used to develop the QSPR models. The experimental data set contains 79 chemicals and is split into a 70% training set and a 30% test set in order to conduct cross-validation. The multiple linear regression (MLR) QSPR model of the denary logarithm of the UFL obtained in this study has six molecular descriptors and an overall root-mean-square error (RMSE) of 0.145. The other four descriptors are eliminated because they are statistically insignificant. The QSPR models aided by SVM and RF improve the prediction of the UFL, as indicated by their overall RMSEs of 0.118 and 0.095, respectively. However, the QSPR model aided by KNN showed the poorest performance, with an overall RMSE of 0.163.

© XXXX American Chemical Society

1. INTRODUCTION
The lower flammability limit (LFL) and the upper flammability limit (UFL) are important characteristics for assessing the fire and explosion hazards of organic compounds.1−5 Below the LFL the mixture is too fuel lean to be ignited, and above the UFL it is too fuel rich, with insufficient air. Compared with the LFL, the UFLs of fuels span a wider range. For instance, the UFL can be as high as 80 vol % fuel in air for acetylene and as low as 5.4 vol % fuel in air for decane.6 The UFL of a specified fuel is often determined through standard experimental measurement (i.e., ASTM E681).7 In addition to experimental measurement, several methods have been proposed in the literature to predict the UFL of organic compounds. Shimy developed mathematical equations to predict the UFLs of hydrocarbons and alcohols based on the number of carbon and hydrogen atoms.8 Mashuga et al. used a calculated adiabatic flame temperature (CAFT) to predict the flammability zone of methane and ethylene.9 Suzuki et al. and Gharagheizi correlated the UFL with functional groups and thermochemical properties.10−12 Because of the complexity of considering the kinetics and the dynamics of combustion at the molecular level, there is no first-principles model to predict the UFL. Alternatively, empirical models, such as quantitative structure−property relationship (QSPR) models that correlate the UFL with the molecular structure of organic compounds, have been investigated by Gharagheizi and Pan et al.13,14 Both studies applied the genetic algorithm combined with multiple linear regression (GA-MLR) to develop a mathematical relationship between the UFL and the selected molecular descriptors. In recent years, machine learning algorithms, such as support vector machine (SVM), have been implemented in the development of QSPR models. For example, Pan et al. developed QSPR models to predict the LFLs, autoignition temperatures, and flash points of organic compounds using SVM.15−17 Wang et al. implemented SVM to build QSPR models to predict the minimum ignition energy (MIE) of hydrocarbon fuels and the self-accelerating decomposition temperature of organic peroxides.18,19 He et al. used an SVM-based

Received: November 29, 2018
Revised: January 24, 2019
Accepted: February 1, 2019
Published: February 1, 2019
DOI: 10.1021/acs.iecr.8b05938

Article

Industrial & Engineering Chemistry Research

QSPR model to predict organic peroxide self-accelerating decomposition temperatures.20 In addition to the safety-related characteristics, SVM-based QSPR models have been widely used in predictions of glass transition temperatures of polymers, the acidic dissociation constant, and O−H bond dissociation energy.21−23 In addition to the SVM technique, other machine learning algorithms have been used for QSPR model development, such as k-nearest-neighbors (KNN) and random forest (RF) techniques.24−27 KNN-based QSPR models are widely used in drug development, such as prediction of the metabolic stability of drug formulations.24,25 RF-based QSPR models are developed for many purposes, such as to predict aqueous solubility, aquatic toxicity, and standard enthalpy of formation of hydrocarbons.26,27 However, based on the literature reviewed, neither KNN-based nor RF-based QSPR models have been applied to predict safety-related parameters. Traditionally, the molecular descriptors used in the development of QSPR models are calculated through commercial software, such as Gaussian and Dragon.15,18 As an alternative, this work applied three open source Python modules, Mordred, RDKit, and Psi4, to calculate the molecular descriptors.28−30 In this work, a UFL data set with 79 organic compounds was split into a training set (70%) and a test set (30%).38 The QSPR models were generated from the training set and then cross-validated by the test set. Ten molecular descriptors, calculated by the Python modules Mordred, RDKit, and Psi4, are used in this study.28−30 The linear relationship between the molecular descriptors and the denary logarithm of the UFL was obtained through multiple linear regression (MLR). The nonlinear QSPR models used to predict the UFL were developed with the Python module Scikit-Learn, using SVM, KNN, and RF techniques.31 This study found that SVM-based and RF-based QSPR models improve the performance of the UFL prediction.

2. METHODOLOGY
2.1. Data Set. The UFL data were extracted from Appendix B in Chemical Process Safety: Fundamentals with Applications.6 The data set of UFL values for 79 organic compounds was randomly separated into a training set of 55 compounds, used to develop the QSPR models, and a test set of the remaining 24 compounds, used to conduct cross-validation. The organic compounds studied in this work can be found in the Supporting Information. As shown in Figure 1, the raw data set of observed UFL values is strongly positively skewed. Therefore, a data set transformation is necessary in this situation. Among the traditional data set transformation methods, such as logarithm, square root, cubic root, and normalization, the logarithmic transformation has the best performance for this data set. Thus, the logarithms of the observed UFL values are taken and used to prepare the data set.

Figure 1. Distribution of UFL values and the log-transformed UFL values.

Table 1. Molecular Descriptors Used To Predict the UFL

descriptor  type                     definition
IC0         information indices      information content index (neighborhood symmetry of 0-order)
PJ3         geometrical descriptors  3D Petitjean shape index
SIC0        information indices      structural information content index (neighborhood symmetry of 0-order)
GATS1v      2D autocorrelations      Geary autocorrelation of lag 1 weighted by van der Waals volume
DM          electronic information   dipole moment
HOMO        electronic information   energy of the highest occupied molecular orbital
LUMO        electronic information   energy of the lowest unoccupied molecular orbital
μ           electronic information   chemical potential
η           electronic information   hardness
ω           electronic information   electrophilicity index

2.2. Molecular Descriptors. Quantum chemical descriptors are those parameters in Table 1 which originate from quantum mechanics. There are two molecular descriptor categories: quantum chemical descriptors and nonquantum chemical descriptors. The quantum chemical descriptors used in this study are the dipole moment (DM), energy of the highest occupied molecular orbital (HOMO), energy of the lowest unoccupied molecular orbital (LUMO), chemical potential (μ), hardness (η), and electrophilicity index (ω). These six descriptors are calculated by the Psi4 module. Nonquantum chemical descriptors are the geometry- and structure-related descriptors. The nonquantum chemical descriptors used in this study are the information content index (IC0), 3D Petitjean shape index (PJ3), structural information content index (SIC0), and Geary autocorrelation of lag 1 weighted by van der Waals volume (GATS1v). These four descriptors are calculated by the Mordred module. The type and definition of the descriptors are presented in Table 1.14,15,18 The selection of the descriptors considers both the geometry influence and the electron influence. The geometry influence is considered by nonquantum descriptors (PJ3) as studied by Gharagheizi and Pan et al., who examined the UFL.13,15 The electron influence


Table 2. Excerpt from the Supporting Information Data Table Showing the Values of the 10 Molecular Descriptors Considered in This Study

no.  name        IC0     PJI3    SIC0    GATS1v  DM      HOMO     LUMO    μ        η       ω
1    ethane      0.8113  0.2970  0.2704  2.0000  0.0000  −0.3395  0.1036  −0.1180  0.4431  0.0157
2    propane     0.8454  0.4953  0.2444  1.8333  0.0656  −0.3258  0.0958  −0.1150  0.4216  0.0157
3    butane      0.8631  0.3775  0.2267  1.7500  0.0652  −0.3156  0.0905  −0.1126  0.4061  0.0156
4    isobutane   0.8631  0.4950  0.2267  1.7500  0.0917  −0.3207  0.0861  −0.1173  0.4068  0.0169
5    pentane     0.8740  0.4911  0.2138  1.7000  0.0708  −0.3033  0.0930  −0.1052  0.3963  0.0139
6    isopentane  0.8740  0.3165  0.2138  1.7000  0.0614  −0.3078  0.0868  −0.1105  0.3946  0.0155
7    neopentane  0.8740  0.4958  0.2138  1.7000  0.0002  −0.3151  0.0760  −0.1196  0.3911  0.0183
8    hexane      0.8813  0.4220  0.2039  1.6667  0.0341  −0.2984  0.0915  −0.1035  0.3899  0.0137
9    heptane     0.8865  0.4974  0.1960  1.6429  0.0490  −0.2922  0.0910  −0.1006  0.3832  0.0132
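The three derived electronic descriptors in Table 2 appear to follow directly from the frontier orbital energies. As a hedged illustration, the sketch below assumes the finite-difference relations μ = (HOMO + LUMO)/2, η = LUMO − HOMO, and ω = μ²/(2η). These formulas are not stated in the text; they are an assumption, chosen because they reproduce the tabulated μ, η, and ω to within rounding for the rows checked. The function name `derived_descriptors` is hypothetical.

```python
def derived_descriptors(homo, lumo):
    """Return (mu, eta, omega) from the frontier orbital energies.

    Assumed relations (not stated in the paper; they match Table 2):
      mu    = (HOMO + LUMO) / 2
      eta   = LUMO - HOMO
      omega = mu**2 / (2 * eta)
    """
    mu = (homo + lumo) / 2.0
    eta = lumo - homo
    omega = mu ** 2 / (2.0 * eta)
    return mu, eta, omega


# HOMO, LUMO, and the tabulated (mu, eta, omega) for three rows of Table 2.
TABLE_2_ROWS = {
    "ethane": (-0.3395, 0.1036, (-0.1180, 0.4431, 0.0157)),
    "propane": (-0.3258, 0.0958, (-0.1150, 0.4216, 0.0157)),
    "butane": (-0.3156, 0.0905, (-0.1126, 0.4061, 0.0156)),
}

for name, (homo, lumo, tabulated) in TABLE_2_ROWS.items():
    mu, eta, omega = derived_descriptors(homo, lumo)
    print(f"{name}: mu={mu:.4f} eta={eta:.4f} omega={omega:.4f} tabulated={tabulated}")
```

If these relations hold for the full data set, only DM, HOMO, and LUMO need to be taken from the quantum chemical calculation; μ, η, and ω then carry no independent information beyond HOMO and LUMO, which may be relevant when interpreting which descriptors survive the MLR significance test.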

is considered by quantum descriptors because studies have shown the effect of the HOMO, the LUMO, and the electrophilicity index (ω) on reactivity.39 This reactivity prediction is important in predicting the minimum ignition energy and decomposition temperature.18,19 Therefore, this study utilizes both nonquantum and quantum descriptors. Because molecular structures affect the molecular descriptors, a geometry optimization should be conducted to obtain the molecular structure with the lowest energy before calculating the molecular descriptors. The chemical compounds studied in this work are pure chemical compounds (isomers) in the gas phase. Protonation states and tautomers are not considered. The simplified molecular-input line-entry system (SMILES) string for each organic compound was obtained from PubChem and translated to a molecular structure using RDKit.32 The molecular structure was first optimized with the MMFF94 force field in RDKit and then with the B3LYP density functional method and the 6-31G(d) basis set, which are available in the Psi4 module.33 The open source Python modules Mordred, RDKit, and Psi4 were used to calculate the molecular descriptors. An excerpt of the Supporting Information data for the calculated molecular descriptors is shown in Table 2.

2.3. Model Development and Assessment. QSPR models were developed through MLR, KNN, SVM, and RF. MLR captures the linear relationship between the molecular descriptors and the UFL in an explicit linear equation:

$$y = a_0 + a_1 x_1 + \cdots + a_n x_n \tag{1}$$

where y represents the UFL of organic compounds and x_1, ..., x_n represent the molecular descriptors. Machine learning algorithms, such as KNN, SVM, and RF, have been applied to solve regression problems.34 KNN regression calculates the average of the k nearest neighbors. SVM regression is a technique which depends on kernel functions. RF is a tree-based method, which constructs multiple decision trees when training.35 SVM-based QSPR models have been utilized for the study of safety-related characteristics, such as flammability limit, minimum ignition energy, and self-decomposition temperature.15,18,20 Based on the literature reviewed, KNN regression and RF regression have not been tested for their performance in the development of QSPR models to predict safety-related parameters. Machine-learning-based regression functions can be very complicated, nondifferentiable, or even discontinuous, and explicit equations are often not produced.

The QSPR models are developed using the training set and then cross-validated using the test set. The models are assessed by calculating the following statistical values:

coefficient of determination

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{2}$$

average absolute error

$$\mathrm{AAE} = \frac{\sum_{i=1}^{n} |\hat{y}_i - y_i|}{n} \tag{3}$$

root-mean-square error

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \tag{4}$$

where ŷi are the predicted values, yi are the observed values in the data set, ȳ is the mean of the observed values, and n is the number of observations in the data set. Leave-one-out cross-validation (Q²LOO) is used to assess the internal robustness of the QSPR models developed in this work. Q²LOO is calculated as follows:

leave-one-out cross-validation

$$Q^2_{\mathrm{LOO}} = 1 - \frac{\sum_{i=1}^{\mathrm{training}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{\mathrm{training}} (y_i - \bar{y})^2} \tag{5}$$

where yi, ŷi, and ȳ are the observed, predicted, and mean experimental values of the training set, respectively. Q²ext is used to assess the predictive ability of the QSPR model and is calculated as follows:

predictive ability

$$Q^2_{\mathrm{ext}} = 1 - \frac{\sum_{i=1}^{\mathrm{test}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{\mathrm{test}} (y_i - \bar{y}_{\mathrm{tr}})^2} \tag{6}$$

where yi and ŷi are the observed and predicted UFL values in the test set and ȳtr is the mean observed UFL value of the data in the training set.
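The assessment statistics in eqs 2−6 can be written as a few short functions. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the single `q_squared` helper is assumed to serve both eq 5 (with leave-one-out predictions and the training mean) and eq 6 (with test-set predictions and the training mean), since the two equations share the same form. The sample values are the observed and MLR-predicted UFLs of the nine compounds excerpted in Table 4, converted to denary logarithms as in the rest of the study.

```python
import math


def r_squared(y_obs, y_pred):
    """Eq 2: explained sum of squares divided by total sum of squares."""
    y_bar = sum(y_obs) / len(y_obs)
    explained = sum((yp - y_bar) ** 2 for yp in y_pred)
    total = sum((yo - y_bar) ** 2 for yo in y_obs)
    return explained / total


def aae(y_obs, y_pred):
    """Eq 3: average absolute error."""
    return sum(abs(yp - yo) for yo, yp in zip(y_obs, y_pred)) / len(y_obs)


def rmse(y_obs, y_pred):
    """Eq 4: root-mean-square error."""
    return math.sqrt(sum((yp - yo) ** 2 for yo, yp in zip(y_obs, y_pred)) / len(y_obs))


def q_squared(y_obs, y_pred, reference_mean):
    """Eqs 5 and 6: 1 - (residual sum of squares) / (spread about a reference mean).

    For eq 5 pass leave-one-out predictions of the training set; for eq 6 pass
    test-set predictions. In both cases reference_mean is the training-set mean.
    """
    residual = sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_pred))
    spread = sum((yo - reference_mean) ** 2 for yo in y_obs)
    return 1.0 - residual / spread


# Observed and MLR-predicted UFL (vol %) for the nine compounds excerpted in Table 4.
observed = [12.5, 9.5, 8.5, 8.4, 7.8, 7.6, 7.5, 7.5, 7.0]
predicted = [10.31, 11.47, 8.66, 10.16, 9.52, 7.34, 9.52, 8.18, 8.71]
log_obs = [math.log10(v) for v in observed]
log_pred = [math.log10(v) for v in predicted]
print("AAE :", aae(log_obs, log_pred))
print("RMSE:", rmse(log_obs, log_pred))
```

Because these nine alkanes are only an excerpt of the 79-compound data set, the printed values are not expected to match the statistics reported for the full training and test sets.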

3. RESULTS AND DISCUSSION
3.1. Multiple Linear Regression. JMP 12 software is applied to conduct the MLR.36 The equation that correlates the denary logarithm of the UFL with the molecular descriptors is shown below:

$$\log(\mathrm{UFL}) = 0.0551 + 0.0167(\mathrm{IC0}) + 0.6340(\mathrm{PJI3}) + 2.3412(\mathrm{SIC0}) + 0.0789(\mathrm{GATS1v}) - 0.0876(\mathrm{DM}) - 2.2185\,\omega \tag{7}$$

The MLR QSPR model obtained in this study contains six molecular descriptors: IC0, PJ3, SIC0, GATS1v, DM, and ω. The remaining four descriptors are removed because they are found to be statistically insignificant, with P-values greater than 0.05. Figure 2a compares the predicted values with the observed values. The MLR QSPR model has the statistical values R = 0.812, R² = 0.659, AAE = 0.104, and RMSE = 0.148 for the training set. For the test set, AAE = 0.095 and RMSE = 0.140. The overall AAE = 0.101 and overall RMSE = 0.145 for the entire data set. The internal robustness, Q²LOO, for this model is 0.659 and the Q²ext is 0.534. The criteria for a good QSPR model are R² > 0.6 and Q²LOO > 0.5.37 Thus, the MLR QSPR model is statistically valid to predict the UFL of organic compounds. The residuals of this model are displayed in Figure 2b. The residuals lie on both sides of the zero line.

Figure 2. (a) Predicted value (MLR) vs observed value. (b) Residuals (MLR) vs observed value.

Figure 3. (a) Predicted value (KNN) vs observed value. (b) Residuals (KNN) vs observed value.

Figure 4. (a) Predicted value (SVM) vs observed value. (b) Residuals (SVM) vs observed value.

3.2. k-Nearest Neighbors. The KNN regression was conducted through the Python Scikit-Learn module.31 The auto algorithm is selected, and the optional parameters are leaf size = 30, number of neighbors = 3, and p = 1. Figure 3a compares the predicted values with the observed values. The KNN-based QSPR model has the statistical values R =


Figure 5. (a) Predicted value (RF) vs observed value. (b) Residuals (RF) vs observed value.

Table 3. Calculated Performance Parameters of QSPR Models Based on MLR, KNN, SVM, and RF Algorithms

       training set                        test set              overall set
model  R      R²     AAE    RMSE   Q²LOO   AAE    RMSE   Q²ext   AAE    RMSE
MLR    0.812  0.659  0.104  0.148  0.659   0.095  0.140  0.534   0.101  0.145
KNN    0.821  0.675  0.087  0.144  0.675   0.137  0.199  0.061   0.102  0.163
SVM    0.885  0.783  0.085  0.118  0.783   0.080  0.119  0.662   0.084  0.118
RF     0.961  0.924  0.042  0.070  0.924   0.088  0.136  0.561   0.057  0.095
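The overall-set columns of Table 3 can be cross-checked against the training and test columns. The sketch below assumes the overall statistics are size-weighted pools over the 55 training and 24 test compounds; this pooling rule is an inference, not something the text states, and the function names are hypothetical. The SVM test AAE is taken as 0.080, the value given in Section 3.3.

```python
import math

N_TRAIN, N_TEST = 55, 24  # split sizes reported in Section 2.1


def pooled_aae(aae_train, aae_test):
    """Size-weighted mean of the training and test AAE."""
    return (N_TRAIN * aae_train + N_TEST * aae_test) / (N_TRAIN + N_TEST)


def pooled_rmse(rmse_train, rmse_test):
    """Pool the squared errors of both sets, then take the root of the mean."""
    sum_sq = N_TRAIN * rmse_train ** 2 + N_TEST * rmse_test ** 2
    return math.sqrt(sum_sq / (N_TRAIN + N_TEST))


# (train AAE, train RMSE, test AAE, test RMSE, overall AAE, overall RMSE) from Table 3.
TABLE_3 = {
    "MLR": (0.104, 0.148, 0.095, 0.140, 0.101, 0.145),
    "KNN": (0.087, 0.144, 0.137, 0.199, 0.102, 0.163),
    "SVM": (0.085, 0.118, 0.080, 0.119, 0.084, 0.118),
    "RF": (0.042, 0.070, 0.088, 0.136, 0.057, 0.095),
}

for model, (a_tr, r_tr, a_te, r_te, a_all, r_all) in TABLE_3.items():
    print(model,
          "AAE pooled/reported:", round(pooled_aae(a_tr, a_te), 3), a_all,
          "RMSE pooled/reported:", round(pooled_rmse(r_tr, r_te), 3), r_all)
```

Under this assumption the pooled values agree with the reported overall AAE and RMSE to within about 0.001, which is consistent with rounding of the three-decimal inputs.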

Table 4. Excerpt from the Supporting Information Data Table Showing the Prediction of the UFL Using MLR, SVM, KNN, and RF Models(a)

                      UFL (vol %)
no.  name        observed  predicted (MLR)  predicted (KNN)  predicted (SVM)  predicted (RF)  status
1    ethane      12.5      10.31            10.90            10.32            11.18           training
2    propane     9.5       11.47            8.61             10.06            9.19            training
3    butane      8.5       8.66             7.70             8.87             8.35            test
4    isobutane   8.4       10.16            8.61             9.24             8.61            test
5    pentane     7.8       9.52             7.39             8.57             7.82            training
6    isopentane  7.6       7.34             7.70             8.04             7.71            training
7    neopentane  7.5       9.52             7.39             8.02             7.85            test
8    hexane      7.5       8.18             6.98             7.81             7.40            training
9    heptane     7.0       8.71             6.26             7.65             7.03            training

(a) The remaining 70 compounds in the data set are available through the publisher online.
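The MLR column of Table 4 can be reproduced from eq 7 and the descriptor values in Table 2. The short sketch below evaluates eq 7 and undoes the denary-logarithm transform; the function name `predict_ufl` and the dictionary layout are illustrative assumptions, while the coefficients and descriptor values are copied from eq 7 and Table 2.

```python
# Intercept and coefficients of eq 7 (one weight per retained descriptor).
INTERCEPT = 0.0551
COEFFS = {
    "IC0": 0.0167,
    "PJI3": 0.6340,
    "SIC0": 2.3412,
    "GATS1v": 0.0789,
    "DM": -0.0876,
    "omega": -2.2185,
}


def predict_ufl(descriptors):
    """Evaluate eq 7 and back-transform log10(UFL) to UFL in vol %."""
    log_ufl = INTERCEPT + sum(COEFFS[name] * descriptors[name] for name in COEFFS)
    return 10.0 ** log_ufl


# Descriptor values for the first three compounds of Table 2.
TABLE_2 = {
    "ethane": {"IC0": 0.8113, "PJI3": 0.2970, "SIC0": 0.2704,
               "GATS1v": 2.0000, "DM": 0.0000, "omega": 0.0157},
    "propane": {"IC0": 0.8454, "PJI3": 0.4953, "SIC0": 0.2444,
                "GATS1v": 1.8333, "DM": 0.0656, "omega": 0.0157},
    "butane": {"IC0": 0.8631, "PJI3": 0.3775, "SIC0": 0.2267,
               "GATS1v": 1.7500, "DM": 0.0652, "omega": 0.0156},
}

for name, descriptors in TABLE_2.items():
    print(f"{name}: predicted UFL = {predict_ufl(descriptors):.2f} vol %")
```

For these three compounds the results agree with the MLR predictions listed in Table 4 (10.31, 11.47, and 8.66 vol %) to within the rounding of the tabulated descriptors.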

0.821, R² = 0.675, AAE = 0.087, and RMSE = 0.144 for the training set. For the test set, AAE = 0.137 and RMSE = 0.199. The overall AAE = 0.102 and overall RMSE = 0.163 for the entire data set. The internal robustness, Q²LOO, for this model is 0.675, and the Q²ext is 0.061. Based on these statistical values, the KNN-based QSPR model overfits the training set, and its performance is worse than that of the MLR QSPR model. The residuals for the KNN-based QSPR model are shown in Figure 3b, which shows large deviations between the residual points and the zero line.

3.3. Support Vector Machine. The SVM regression was conducted through the Python Scikit-Learn module.31 The kernel function used in this work was the RBF kernel function, and the optimal parameters were chosen as C = 30 and g = 0.15. The number of support vectors is 79. Figure 4a compares the predicted values with the observed values. The SVM-based QSPR model has the statistical values R = 0.885, R² = 0.783, AAE = 0.085, and RMSE = 0.118 for the training set. For the test set, AAE = 0.080 and RMSE = 0.119. The overall AAE = 0.084 and overall RMSE = 0.118 for the entire data set. The internal robustness, Q²LOO, for this model is 0.783, and the Q²ext is 0.662. Based on these statistical values, the SVM-based QSPR model performs better than the MLR-based QSPR model. The residuals for the SVM-based QSPR model are shown in Figure 4b.

3.4. Random Forest. The random forest (RF) regression was conducted through the Python Scikit-Learn module.31 The number of decision trees selected for this study is 50. Figure 5a compares the predicted values with the observed values. The RF-based QSPR model has the statistical values R = 0.961, R² = 0.924, AAE = 0.042, and RMSE = 0.070 for the training set. For the test set, AAE = 0.088 and RMSE = 0.136. The overall AAE = 0.057 and overall RMSE = 0.095 for the whole data set. The internal robustness, Q²LOO, for this model is 0.924, and the Q²ext is 0.561. Based on these statistical values, the RF-based QSPR model improves the performance of the UFL prediction. The residuals for the RF-based QSPR model are shown in Figure 5b; they reside on both sides of, and very near, the zero line. The results shown in Figures 2−5 have been summarized in Table 3. They suggest that all four models have difficulties in predicting the UFL for the compounds with high UFL in the current data set. One reason for this difficulty may be that the data set has a sparse population of values with a high UFL.
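To make the KNN settings of Section 3.2 concrete, the following is a minimal pure-Python sketch of KNN regression with three neighbors and p = 1 (Manhattan distance). It is an illustration under stated assumptions rather than the study's implementation: the function `knn_predict` is hypothetical, only four descriptors are used for brevity, and the training data are limited to the six excerpted compounds marked "training" in Table 4, so the output is not expected to match the KNN column of Table 4.

```python
import math


def knn_predict(train_x, train_y, query, k=3, p=1):
    """Average the targets of the k training points closest to `query`.

    Distances use the Minkowski metric of order p (p=1 is the Manhattan distance).
    """
    def distance(a, b):
        return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)

    ranked = sorted(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# (IC0, PJI3, SIC0, GATS1v) from Table 2 and observed UFL (vol %) from Table 4
# for the excerpted compounds whose status is "training".
TRAIN = {
    "ethane": ((0.8113, 0.2970, 0.2704, 2.0000), 12.5),
    "propane": ((0.8454, 0.4953, 0.2444, 1.8333), 9.5),
    "pentane": ((0.8740, 0.4911, 0.2138, 1.7000), 7.8),
    "isopentane": ((0.8740, 0.3165, 0.2138, 1.7000), 7.6),
    "hexane": ((0.8813, 0.4220, 0.2039, 1.6667), 7.5),
    "heptane": ((0.8865, 0.4974, 0.1960, 1.6429), 7.0),
}
train_x = [x for x, _ in TRAIN.values()]
train_y = [math.log10(ufl) for _, ufl in TRAIN.values()]  # targets are log10(UFL)

butane = (0.8631, 0.3775, 0.2267, 1.7500)  # a "test" compound in Table 4
log_ufl = knn_predict(train_x, train_y, butane)
print(f"butane: predicted UFL = {10 ** log_ufl:.2f} vol %")
```

In Scikit-Learn, the reported settings would presumably correspond to `KNeighborsRegressor(n_neighbors=3, p=1, leaf_size=30, algorithm="auto")`, `SVR(kernel="rbf", C=30, gamma=0.15)`, and `RandomForestRegressor(n_estimators=50)`; reading the reported g as the `gamma` argument is an assumption.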


Comparing these four models, the SVM-based QSPR model and the RF-based QSPR model perform better than the MLR-based QSPR model. However, the KNN-based QSPR model does not improve the performance of the UFL prediction. Figure 5a,b shows that, among these four models, the RF-based QSPR model gives the best prediction for organic compounds with UFL values less than 22 vol % fuel in air. Table 4 shows an excerpt from the Supporting Information data table, comparing the observed and predicted UFL values for the QSPR models based on MLR, KNN, SVM, and RF. Additionally, the assignment of each compound to the randomly selected training set (70%) or test set (30%) is shown in the right-most column.

3.5. Statistical Performance. The statistical performances of the four methods are listed in Table 3. Comparing the training set RMSE for the MLR and the three machine learning algorithms shows that the RF regression performs significantly better than the other models. The MLR and KNN algorithms have a similar RMSE, while SVM has a slightly better RMSE than either. The RMSE for the test set shows that the SVM algorithm has the best performance of the four methods, while KNN performs worse than MLR. For the overall set, the RMSEs of MLR and KNN are similar, with slightly better performance from SVM. The RF has the best RMSE performance at 0.095. One possible reason the RF-based QSPR model performs better than the MLR-, KNN-, and SVM-based models is that random forest is more robust than the other methods with a skewed data set like the one in this study. Q²LOO represents the robustness of the model, and Q²ext represents the predictive ability of the model. Similar to the trends for RMSE among the four methods, the RF algorithm shows the highest Q²LOO at 0.924. This means the RF model is the most robust of the four models. However, for the test set the SVM algorithm has the highest Q²ext, 0.662.

4. CONCLUSIONS
In this study, four QSPR models were developed and evaluated for the prediction of the UFL. In addition to the traditional MLR method, three machine learning techniques were evaluated: KNN, SVM, and RF. The reported criteria for a good QSPR model are a coefficient of determination of R² > 0.6 and an internal robustness of Q²LOO > 0.5.37 The MLR-based QSPR model developed in this work surpassed these criteria. The KNN-based model does not improve the performance of the UFL prediction over that of the MLR-based model (overall RMSE of 0.163 versus 0.145). The SVM-based model improves the performance of the UFL prediction over that of the MLR- and KNN-based models. However, the RF-based QSPR model performs better than the MLR-, KNN-, and SVM-based QSPR models. A possible reason is that random forest is more robust than the other methods, so the positively skewed data set has less impact on the model performance. The RF-based QSPR model could perform better if the UFL values in the training and test data sets were restricted to less than 22 vol % fuel in air. This is due to the limitations of the current data set, in which data above 22 vol % fuel in air are sparse. Direct use of the models developed in this work is not feasible because explicit equations for the machine learning models are difficult to obtain and implement. It is recommended that researchers approach QSPR modeling with an expanded database of the property to be predicted. Based on the results from this work, we recommend that algorithms beyond MLR and SVM be utilized. Both KNN and RF should be considered and critically evaluated as presented in this work. Additional UFL data for compounds with higher UFL values would improve the QSPR models. The RF regression is recommended for the development of QSPR models for the prediction of other safety-related characteristics, such as minimum ignition energy and self-decomposition temperature.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.iecr.8b05938.

Experimental and predicted UFL values and the values of the molecular descriptors (XLSX)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Shuai Yuan: 0000-0002-6155-1872
Joseph Sang-II Kwon: 0000-0002-7903-5681
Chad V. Mashuga: 0000-0003-0413-843X

Notes

The authors declare no competing financial interest.

REFERENCES

(1) Lees’ loss prevention in the process industries, 4th ed.; Mannan, M. S., Ed.; Butterworth-Heinemann: Oxford, U.K., 2012; Vol. 2.
(2) Mashuga, C. V.; Crowl, D. A. Derivation of Le Chatelier’s Mixing Rule for Flammable Limits. Process Saf. Prog. 2000, 19, 112−117.
(3) Vidal, M.; Rogers, W. J.; Holste, J. C.; Mannan, M. S. A review of estimation methods for flash points and flammability limits. Process Saf. Prog. 2004, 23 (1), 47−55.
(4) Mashuga, C. V.; Crowl, D. A. Application of the flammability diagram for evaluation of fire and explosion hazards of flammable vapors. Process Saf. Prog. 1998, 17 (3), 176−183.
(5) Rowley, J. Flammability Limits, Flash Points, and Their Consanguinity: Critical Analysis, Experimental Exploration, and Prediction. Ph.D. Thesis, Brigham Young University, Provo, UT, 2010.
(6) Crowl, D. A.; Louvar, J. F. Chemical Process Safety: Fundamentals with Applications; Prentice Hall PTR: Upper Saddle River, NJ, 2002.
(7) ASTM E681-09. Standard Test Method for Concentration Limits of Flammability of Chemicals (Vapors and Gases); American Society for Testing and Materials International (ASTM): West Conshohocken, PA, 2015.
(8) Shimy, A. A. Calculating flammability characteristics of hydrocarbons and alcohols. Fire Technol. 1970, 6 (2), 135−139.
(9) Mashuga, C. V.; Crowl, D. A. Flammability Zone Prediction Using Calculated Adiabatic Flame Temperatures. Process Saf. Prog. 1999, 18 (3), 127−134.
(10) Suzuki, T.; Koide, K. Short Communication: Correlation between Upper Flammability Limits and Thermochemical Properties of Organic Compounds. Fire Mater. 1994, 18, 393−397.
(11) Suzuki, T.; Ishida, M. Neural Network Techniques Applied to Predict Flammability Limits of Organic Compounds. Fire Mater. 1995, 19, 179−189.
(12) Gharagheizi, F. A Chemical Structure-based Model for Estimation of Upper Flammability Limit of Pure Compounds. Energy Fuels 2010, 24, 3867−3871.

(13) Gharagheizi, F. Prediction of upper flammability limit percent of pure compounds from their molecular structures. J. Hazard. Mater. 2009, 167, 507.
(14) Pan, Y.; Jiang, J. C.; Wang, R.; Cao, H. Y.; Cui, Y. Prediction of the Upper Flammability Limits of Organic Compounds from Molecular Structures. Ind. Eng. Chem. Res. 2009, 48, 5064−5069.
(15) Pan, Y.; Jiang, J.; Wang, R.; Cao, H.; Cui, Y. A novel QSPR model for prediction of lower flammability limits of organic compounds based on support vector machine. J. Hazard. Mater. 2009, 168, 962−969.
(16) Pan, Y.; Jiang, J.; Wang, R.; Cao, H.; Zhao, J. Quantitative Structure-Property Relationship Studies for Predicting Flash Points of Organic Compounds using Support Vector Machines. QSAR Comb. Sci. 2008, 27, 1013−1019.
(17) Pan, Y.; Jiang, J.; Wang, R.; Cao, H. Advantages of support vector machine in QSPR studies for predicting auto-ignition temperatures of organic compounds. Chemom. Intell. Lab. Syst. 2008, 92, 169−178.
(18) Wang, B.; Zhou, L.; Xu, K.; Wang, Q. Prediction of minimum ignition energy from molecular structure using quantitative structure−property relationship (QSPR) models. Ind. Eng. Chem. Res. 2017, 56, 47−51.
(19) Wang, B.; Yi, H.; Xu, K. L.; Wang, Q. S. Prediction of the Self-Accelerating Decomposition Temperature of Organic Peroxides Using QSPR Models. J. Therm. Anal. Calorim. 2017, 128, 399.
(20) He, P.; Pan, Y.; Jiang, J. C. Prediction of the self-accelerating decomposition temperature of organic peroxides based on support vector machine. Procedia Eng. 2018, 211, 215−225.
(21) Yu, X. Support vector machine-based QSPR for the prediction of glass transition temperatures of polymers. Fibers Polym. 2010, 11 (5), 757−766.
(22) Goudarzi, N.; Goodarzi, M. Prediction of the acidic dissociation constant (pKa) of some organic compounds using linear and nonlinear QSPR methods. Mol. Phys. 2009, 107 (14), 1495−1503.
(23) Xue, C. X.; Zhang, R. S.; Liu, H. X.; Yao, X. J.; Liu, M. C.; Hu, Z. D.; Fan, B. T. An accurate QSPR study of O-H bond dissociation energy in substituted phenols based on support vector machines. J. Chem. Inf. Comput. Sci. 2004, 44, 669−677.
(24) Shen, M.; Xiao, Y.; Golbraikh, A.; Gombar, V. K.; Tropsha, A. Development and validation of k-nearest neighbour QSPR models of metabolic stability of drug candidates. J. Med. Chem. 2003, 46, 3013−3020.
(25) Palmer, D. S.; O’Boyle, N. M.; Glen, R. C.; Mitchell, J. B. O. Random Forest Models To Predict Aqueous Solubility. J. Chem. Inf. Model. 2007, 47, 150−158.
(26) Rodgers, A. D.; Zhu, H.; Fourches, D.; Rusyn, I.; Tropsha, A. Modeling liver-related adverse effects of drugs using k-nearest neighbor quantitative structure-activity relationship method. Chem. Res. Toxicol. 2010, 23, 724−732.
(27) Teixeira, A. L.; Leal, J. P.; Falcao, A. O. Random Forests for Feature Selection in QSPR Models - An Application for Predicting Standard Enthalpy of Formation of Hydrocarbons. J. Cheminf. 2013, 5, 9.
(28) Moriwaki, H.; Tian, Y. S.; Kawashita, N.; Takagi, T. Mordred: a molecular descriptor calculator. J. Cheminf. 2018, 10 (1), 4.
(29) RDKit: Open-source cheminformatics; http://www.rdkit.org.
(30) Turney, J. M.; Simmonett, A. C.; Parrish, R. M.; Hohenstein, E. G.; Evangelista, F. A.; Fermann, J. T.; Mintz, B. J.; Burns, L. A.; Wilke, J. J.; Abrams, M. L.; Russ, N. J.; Leininger, M. L.; Janssen, C. L.; Seidl, E. T.; Allen, W. D.; Schaefer, H. F.; King, R. A.; Valeev, E. F.; Sherrill, C. D.; Crawford, T. D. Wiley Interdisciplinary Reviews: Computational Molecular Science. 2012, 2, 556.
(31) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, É. Scikit-Learn: Machine Learning in Python. J. Machine Learning Res. 2011, 12, 2825−2830.
(32) The PubChem Project. https://pubchem.ncbi.nlm.nih.gov/#opennewwindow (accessed October 2018).
(33) Nikolaienko, T. Y.; Bulavin, L. A.; Hovorun, D. M. Can we treat ab initio atomic charges and bond orders as conformation-independent electronic structure descriptors? RSC Adv. 2016, 6 (78), 74785−74796.
(34) James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, 2013.
(35) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; et al. Random Forest: A Classification and Regression tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958.
(36) SAS Institute. JMP: Statistics and Graphics Guide; SAS Institute: Cary, NC, 2000.
(37) Golbraikh, A.; Tropsha, A. Beware of q2! J. Mol. Graphics Modell. 2002, 20 (4), 269−276.
(38) Witten, I. H.; Frank, E.; Hall, M. A.; Pal, C. J. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: Burlington, MA, 2016.
(39) Ruiz-Morales, Y. HOMO−LUMO gap as an index of molecular size and structure for polycyclic aromatic hydrocarbons (PAHs) and asphaltenes: A theoretical study. I. J. Phys. Chem. A 2002, 106, 11283−11308.
