A Quantitative Model for the Prediction of Sooting Tendency from

Jul 24, 2017 - National Renewable Energy Laboratory, Golden, Colorado 80401, United States. ‡ Yale University, New ... Fortunately, the search for s...
1 downloads 32 Views 785KB Size
Subscriber access provided by UNIV OF NEWCASTLE

Article

A quantitative model for the prediction of sooting tendency from molecular structure Peter C. St. John, Paul M. Kairys, Dhrubajyoti D. Das, Charles Stewart Mcenally, Lisa D. Pfefferle, David J Robichaud, Mark R Nimlos, Bradley T. Zigler, Robert L. McCormick, Thomas D. Foust, Yannick J Bomble, and Seonah Kim Energy Fuels, Just Accepted Manuscript • DOI: 10.1021/acs.energyfuels.7b00616 • Publication Date (Web): 24 Jul 2017 Downloaded from http://pubs.acs.org on July 26, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Energy & Fuels is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

A quantitative model for the prediction of sooting tendency from molecular structure Peter C. St. John,† Paul Kairys,† Dhrubajyoti D. Das,‡ Charles S. McEnally,‡ Lisa D. Pfefferle,‡ David J. Robichaud,† Mark R. Nimlos,† Bradley T. Zigler,† Robert L. McCormick,† Thomas D. Foust,† Yannick J. Bomble,∗,† and Seonah Kim∗,† †National Renewable Energy Laboratory, Golden CO 80401 ‡Yale University, New Haven CT 06520 E-mail: [email protected]; [email protected]

Abstract Particulate matter emissions negatively affect public health and global climate, yet newer fuel-efficient gasoline direct injection engines tend to produce more soot than their port-fuel injection counterparts. Fortunately, the search for sustainable biomass-based fuel blendstocks provides an opportunity to develop fuels that suppress soot formation in more efficient engine designs. However, as emissions tests are experimentally cumbersome and the search space for potential bio-blendstocks is vast, new techniques are needed to estimate the sooting tendency of a diverse range of compounds. In this study, we develop a quantitative structure-activity relationship (QSAR) model of sooting tendency based on the experimental yield sooting index (YSI), which ranks molecules on a scale from n-hexane, 0, to benzene, 100. The model includes a rigorously defined applicability domain and the predictive performance is checked using both internal and external validation. Model predictions for compounds in the external test set had a median absolute error of ≈ 3 YSI units. An investigation of compounds that are poorly

1

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

predicted by the model lends new insight into the complex mechanisms governing soot formation. Predictive models of soot formation can therefore be expected to play an increasingly important role in the screening and development of next-generation biofuels.

Introduction Particulate matter (PM) emissions from internal combustion engines have been linked to adverse public health effects 1 and global temperature rise. 2 Tightening regulations on PM emissions have required manufacturers to introduce control measures for reducing soot, such as diesel particulate filters, with concomitant decreases in fuel efficiency and engine power. 3,4 Traditional port fuel injected (PFI) spark ignited (SI) engines have historically produced far less soot than their compression ignition (CI) counterparts. 5 However, the recent development of gasoline direct-injection engines has required more careful consideration of soot formation from gasoline-range fuels. 6,7 Soot formation is therefore a factor to be considered in maximizing passenger car engine efficiency, since the majority of light duty vehicles utilize SI engine technology. Under the United States Department of Energy’s Co-Optimization of Fuels and Engines initiative, there is an ongoing effort to develop fuels that can be produced sustainably. 8 This effort therefore presents an opportunity to co-develop high efficiency engines and fuels that are less susceptible to particulate matter formation. The complex chemical mechanisms that underlie soot formation are difficult to predict, leading to a reliance on experimental testing of soot formation and PM emission. 9 Directly measuring the soot emissions from engines is labor- and equipment-intensive and requires large quantities of fuel. As a result, smaller bench-scale measurements have formed the basis for much of the research on the sooting tendencies of pure compounds. One of the oldest sooting indices is the smoke point, a measure of the maximum flame height achievable by a fuel in a test lamp without smoking. The smoke point has been shown to be a reliable indicator of gas turbine soot levels and is used in ensuring quality in aviation fuels. 10 2

ACS Paragon Plus Environment

Page 2 of 28

Page 3 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

The smoke point has also been correlated to emissions from SI engines, 11 indicating that sooting indices measured in simple flames are relevant in emissions control from practical devices. Standardization of smoke point data into apparatus-independent sooting indices began with the Threshold Sooting Index (TSI), which ranks fuels on a 0-100 scale based on defined reference values. 12 More recently, this definition was extended to the oxygen extended sooting index (OESI), which better accounts for the reduced stoichiometric air required by oxygenated fuels. 13 While useful, sooting indices based on smoke point measurements suffer from a number of disadvantages: (1), operator bias is possible in estimating the proper flame shape, especially for compounds with a high smoke point, and (2), require up to 20 mL of sample per measurement. 10 Sooting indices not based on smoke point have been developed, including the micropyrolysis index (MPI) 14 and yield sooting index (YSI). 15 The yield sooting index, developed by McEnally & Pfefferle, accurately measures the sooting tendencies of small quantities of sample and correlates well with TSI. 15–18 In the YSI approach, the sooting tendency is characterized by the maximum soot volume fraction fv,max measured in a flame whose fuel is doped with the test compound of interest. These results are rescaled into an apparatus-independent YSI value with the relation YSI = A fv,max + B, where A and B are scaling constants chosen so that the YSI’s of n-hexane and benzene are 0 and 100, respectively. YSI has been shown to behave linearly for mixtures of dopants with varying molecular weights, such as heptanone and phenanthrene, suggesting a lack of dependence on physical interactions and the potential to scale the method to the prediction of fuel blends. 15 Due to the efficiency of the method, YSI currently represents the largest database of sooting measurements for pure compounds, with over 430 species currently measured. 19 YSI is well-positioned to guide the development of new fuel blends, as measurements include organic functionalities that can be produced from biomass, such as alcohols, esters, and ketones. However, some promising bio-blendstocks are still underrepresented in the database, including furans and oxygenated aromatics. Even with efficient laboratory-scale tests, the high number of components in gasoline

3

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

and the vast search space of potential bio-blendstocks makes a computational estimate of sooting behavior desirable. This type of problem is often addressed with the aid of quantitative structure-activity relationship (QSAR) models, which correlate measured quantities with molecular structure. 20 QSARs encode molecular structure as a series of numerical descriptors, then use these descriptors as an input to learn a suitable regression or classification function. The development of predictive QSARs must follow several important principles, including validation against internal data (cross-validation) and against data not used during the fitting process. 21 Models must also provide a quantitative definition of their applicability domain: the set of molecules for which predictions are deemed trustworthy. QSARs have been used in the wider combustion literature to predict quality indicators such as research octane number (RON) and cetane number. 22,23 In emissions research specifically, correlations between molecular structure and sooting indices are almost as old as the indices themselves. 24 Group contribution methods, developed by Benson & Buss, 25 have been applied to TSI, 26,27 OESI, 13 and YSI 18,28 to determine the effects of different functional groups on sooting behavior. These models demonstrate that sooting tendency can indeed be correlated to molecular structure, and typically yield regressions with R2 > 0.95. While these models are useful for developing an understanding of the factors that influence particulate matter formation, their predictive capability and applicability domain are not rigorously quantified and therefore have a limited capability to be confidently extended to new molecules. In this study, we develop a predictive model of YSI from 297 experimental measurements. By reserving 20% of the dataset for external validation, we demonstrate that the model is able to predict new compounds with a median absolute error of less than 3 YSI units (2% of the total range of the considered data). We further confirm the model’s predictive capability by predicting and measuring the YSI of three compounds not found in the original database. Our examination of outliers from the cross-validation procedure highlights interesting chemistry missed by the model and informs future data collection. We demonstrate the usefulness of the model in virtual screening applications by predicting the sooting ten-

4

ACS Paragon Plus Environment

Page 4 of 28

Page 5 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

dencies of individual compounds with a known RON. The results corroborate the trend that it is possible to reduce the intrinsic sooting tendency of gasoline-range fuels through the addition of oxygenated bio-blendstocks while meeting high-RON requirements.

Experimental Molecular descriptors Molecular structures were specified using the simplified molecular-input line-entry system (SMILES), a method for encoding chemical formulas as simple text strings. 29 Chemical Abstracts Service (CAS) numbers for the YSI database 19 were converted to SMILES via NIH Cactus and the NIST WebBook. RDKit was used to process SMILES and optimize 3D geometry via the MMFF94 force-field. 30 Molecular descriptors were calculated via Dragon 7, which calculates a total of 5270 molecular descriptors classified into 30 distinct logical blocks, including information on both 2D and 3D features. 31 A complete list of the calculated descriptors is available from the Dragon 7 website. 31

Machine learning Pandas was used for data management, 32 and scikit-learn was used to implement preprocessing and feature selection methods. 33 A multi-layer perceptron (MLP) was implemented via Keras 34 and integrated into the data regression pipeline using the scikit-learn wrapper utilities. Hyperparameters, including the choice in scaling methods, feature selection routines, and MLP size and dropout rates were optimized using hyperopt, 35 in which the median absolute error from cross-validation by functional group was minimized. Large-scale searches for optimum hyperparameters were parallelized across a small-scale cluster (approximately 50 compute nodes), while cross-validation and regression procedures were performed on an early 2015 model MacBook Pro (3.1 GHz). During hyperparameter optimization an optimal preprocessing method, feature extraction method, and MLP topology was se5

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

lected. For preprocessing methods, a choice was made between sklearn’s StandardScaler, MaxAbsScaler, Normalizer, and RobustScaler routines, with MaxAbsScaler yielding the lowest cross-validation score. For feature extraction, a choice was made between a principal component analysis, in which between 10 and 190 of the largest principal components were kept; recursive feature elimination, in which features are removed by iteratively fitting a support vector regression and dropping the 10 least important descriptors until between 10 to 500 components remain; or a select K-best routine, where between 10 and 500 descriptors are selected from the most important factors in a simple linear model. A recursive feature elimination strategy that kept 390 descriptors yielded the optimal cross-validation score. The topology of the neural network was also fit during hyperparameter optimization, in which we varied the number of layers (two or three), individual dropout rates for each layer (between 0 and .75), the size of each layer (between 1 and 200 nodes), batch size (between 28 and 128), and the choice of optimizer (keras’s adadelta, adam, rmsprop, or adamax). The optimal hyperparameters for the neural network were a three-layer network with 140, 140, and 161 nodes (dropout rates of .366, .030, and .314), respectively. The adam optimizer was chosen with a batch size of 39. A rectified linear unit activation function and 100-epoch fitting duration were used, but were not optimized by the hyperparameter optimization routine. All models and codes used to generate the results in this manuscript and reproduce the figures are available at (https://gist.github.com/pstjohn/b7e2f05136bf8bfd22f68f375437452d).

Measurement of YSI values The YSIs of the three new compounds in Table 1 were measured with the same procedures as described by McEnally and coworkers, 17 except that (1) a new, smaller burner was used, 36 and (2) maximum soot concentrations in the flames were characterized by line-of-sight spectral radiance at 660 nm measured with a photomultiplier tube. Chemicals were obtained from Sigma-Aldrich and had purities of 98% or higher. Flame conditions at 25◦ C and 1 atm were: total fuel flowrate, 395 cm3 min-1 ; air flowrate, 50,000 cm3 min-1 ; methane mole 6

ACS Paragon Plus Environment

Page 6 of 28

Page 7 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

fraction, 0.715; and dopant mole fraction, 1000 ppm. The uncertainties for the experimental measurements are estimated to be ± 5 YSI units (95% confidence interval). 17

Density functional theory Gaussian 09 37 was used to calculate the energies of all reactants, products, transition states, and intermediates to produce soot precursors and retro-Diels-Alder products with the G4 composite method. All geometries were optimized at the B3LYP/6-31G(2df,p) level as implemented in G4. All the transition state calculations were confirmed by checking for the existence of only one imaginary frequency, which reflects the change in geometry between reactant and product. The intrinsic reaction coordinate (IRC) calculations using B3LYP/631G(2df,p) were also performed to verify corresponding reactants and products.

Results Yield sooting index (YSI) quantifies the tendency of a molecule to form soot when added to a methane-air flame. Two different scales have been used in the experiments, one for low sooting hydrocarbons (primarily aliphatics), and one for high sooting hydrocarbons (primarily aromatics); this study focuses on the low sooting hydrocarbons since they are more representative of potential biofuels. In this scale, YSI is calibrated to 0 (n-hexane) and 100 (benzene), in which higher numbers correspond to increased sooting tendency. Of the 297 unique compounds with a measured YSI in this scale, 20% (59) were randomly reserved 38,39 as an external validation set. We then trained a QSAR model on the remaining 238 compounds to predict YSI as a function of molecular structure. This process begins with the calculation of 5270 unique molecular descriptors that quantify different features of the optimized 3D molecular structure.

7

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Defining an applicability domain A QSAR model’s applicability domain defines the set of molecules similar enough to the training data for which predictions can be trusted. In this study, we use a decision function based on the isolation forest approach. 40,41 A principal component analysis was used to reduce the dimensionality of the descriptor space: prior to calculating the distance, descriptors were projected onto their 5 largest principal components that account for 67% of the overall variance. Following this, an isolation forest was used for outlier classification assuming a training set contamination of 7.5%. The hyperparameters of the applicability domain algorithm were chosen to ensure that molecules sufficiently dissimilar from the training set were excluded: the high YSI database 19 was used as example molecules for this purpose, as shown in Figure S1. The fitted applicability domain method was tested on the test compounds to ensure a similar fraction of molecules were marked as outliers (7.6% vs. 8.4% for the training and test sets, respectively). Outlier molecules, which reside near the edges of the distribution, tended to include molecules at the beginning or end of homologous series (e.g., methanol, n-dodecane) or molecules with underrepresented structural features (e.g., tetramethoxymethane).

Hyperparameter optimization by functional group cross-validation Descriptors for the training set compounds were then passed to a regression pipeline. As is typical with most machine learning methods, our choice of regression method included a number of hyperparameters: parameters that affect the performance of the algorithms without being fit directly from the data. 35 In this study, we refit the model for a range of different normalization types, feature extraction methods, and neural network sizes to find a pipeline that minimizes error during cross-validation. To ensure the model does not over-fit the particular functional groups represented in the training set, we performed cross-validation using a leave-one-group-out approach, where molecules were grouped by molecule type as annotated in the original YSI database. 19 For the training set, group sizes ranged from just 8

ACS Paragon Plus Environment

Page 8 of 28

Page 9 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

3 compounds (alkynes and alkadienes) to 40 compounds (saturated alcohols). This method ensures that predictions are generated without knowledge of similar compounds, and forces the model to generalize well to functional groups not represented in the training data. The optimized regression pipeline consists of the following steps, shown in Figure 1: 1. Missing values in the descriptor matrix were filled using the mean value for that descriptor across other training set compounds, and descriptors that had the same value for all training molecules were dropped. 2. Descriptors were scaled so that all values fell in a similar range. While several types of normalization methods were tested, a scaling method that normalizes the maximum value of each descriptor to 1 yielded the best performance during cross-validation. 3. Normalized descriptors were then passed to a feature selection routine. A recursive feature elimination strategy was selected from the different routines tested, in which features are eliminated 10 at a time until only 390 descriptors remained. At each iteration, the least important variables in a linear support vector machine regression are dropped. 4. Selected descriptors were finally passed to a feed-forward artificial neural network that consists of three densely connected layers of 140, 140, and 161 units with dropout rates of 0.366, 0.030, and 0.314, respectively. A rectified linear unit activation function is applied to each internal layer. The final layer consists of a one-dimensional output, which serves as the model prediction. The results of this cross-validation for the optimized hyperparameters are shown in Figure 2, which demonstrate a median absolute error of 5.47 YSI units for molecules that pass the applicability domain threshold. Cross-validation errors are shown by functional group in Figure 2B, indicating that molecule classes with the lowest representation tend to have the worst cross-validation predictions. These compounds typically include multiple features 9

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(i.e., cyclic ethers, unsaturated esters), indicating complex interactions between functional groups are responsible for determining sooting chemistry. Molecules with simpler functional groups, like alkenes or saturated alcohols, are predicted more accurately even when representative molecules are withheld from the training set. These results underscore the importance of restricting model predictions to an appropriate applicability domain as well as the need for experimental measurements of diverse sets of compounds.

Internal model validation After determining the optimum regression pipeline, we use cross-validation to estimate how well we expect the model to perform on the test set. Since the test set was selected randomly, we shuffled the training set prior to performing k-fold cross-validation (k = 10), which omits 10% of the compounds each time. This process was repeated for 25 independent shufflings to estimate mean and standard deviations for cross-validation predictions. As shown in Figure 3, the resulting predictions had a median absolute error of 3.69 YSI units for the molecules that fell inside the applicability domain. Predictions were largely consistent across independent shufflings, with a mean standard deviation of 2.2 YSI units for predictions on molecules inside the applicability domain. While some molecules inside the applicability domain have large prediction errors, residuals are largely distributed close to zero (Figure 3B). The shape and location of the residual distribution indicates that the model’s structure is not biased towards over- or under-predicting sooting tendency. Since the applicability domain is based on a distance-based metric, we examined whether prediction error was correlated with distance from the training set distribution. In Figure 3C, we plot the mean cross-validated prediction error against the decision function metric of the isolation forest. While outliers have a higher average error than molecules inside the domain of validity, inaccurate predictions can occur even at low scores. The reported model performance is therefore not acutely sensitive to the chosen applicability domain threshold.

10

ACS Paragon Plus Environment

Page 10 of 28

Page 11 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

Analysis of cross-validation outliers To gain a more fundamental understanding of the practical limitations of the model, we looked at cross-validation outliers to understand the types of chemistry that might be missed by the machine learning approach. The main mechanism driving the formation of soot particulates is the production and further aggregation of polycylic aromatic hydrocarbons (PAHs). 42 The rate in which a molecule forms aromatic precursors during combustion is therefore a critical factor in determining their sooting tendency. 43–45 Of particular note is the cyclopentadienyl (CPDyl) radical, which reacts rapidly to form the soot precursors naphthalene, indene, and benzene. 43 The five most over-predicted and under-predicted molecules in the randomized crossvalidation study are shown in Tables 2 and 3; respectively. Cyclic alkenes in particular were poorly predicted by the model. Cyclohexene had a cross-validated prediction of 51.4 (versus a measured YSI of 23.7), likely due to the high experimental values for cyclopentene (61.8) and cycloheptene (71.6). This trend is a substantial deviation from the monotonic YSI increases seen in cycloalkanes, with YSI values of 14, 19, and 30 for 5, 6, and 7 membered cycloalkanes; respectively. One potential explanation for this difference is the retro-DielsAlder mechanism, a reaction resulting in a ring-breaking, soot-suppressing dissociation that is available to cyclohexene but not to cyclopentene, cycloheptene, or any cycloalkane. To further investigate this mechanism, we applied density functional theory (DFT) to calculate simple potential energy surfaces of each cycloalkene at room temperature and 1700 K. We assume that the formation of a diene intermediate via hydrogen abstraction is the first step to PAH precursors, as cyclopentadiene quickly radicalizes to CPDyl while cycloheptadiene can further react via cycloheptatriene to form toluene and tropyl radicals. 46,47 The initial free energy change for the hydrogen abstraction mechanism by oxygen molecules for each cycloalkene is nearly identical at 36.3, 32.3, and 36.3 kcal mol-1 for 5, 6, and 7 membered cycloalkenes, respectively (30.7, 21.5, and 29.4 kcal mol-1 at 1700 K). Abstraction by free radicals can be significantly lower (less than 5 kcal mol-1 ). We compare this energy 11

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 28

barrier to that of the retro-Diels-Alder pathway, which has a free energy change of only -11.5 kcal mol-1 and a barrier of 40.5 kcal mol-1 . To compare these pathways we must consider the relative kinetics, and while a full kinetic analysis is beyond the scope of this study, we can still check the assumption using back of the envelope calculations. The rate constant for the unimolecular retro-Diels-Alder reaction can be estimated using the Eyring equation to be 108 s-1 , while the bimolecular abstraction reaction can be estimated to be 10 s-1 (assuming a barrierless, gas kinetic reaction rate, 10-10 cm3 molecule-1 s-1 , and a radical concentration of 1011 radicals cm-3 ). While these mechanisms and calculations are not all-inclusive, these results suggest that the primary reason for poor prediction of cycloalkenes is the exclusivity of the retro-Diels-Alder mechanism to cyclohexene, which provides a low-energy route to non-radical linear species. While cyclohexene has a non-intuitive pathway towards decomposition, outliers that are under-predicted tend to be closely related to PAH precursors. For example, 1-methylcyclopentene is predicted via cross-validation to have a YSI of 66.7, which, while high compared to the structurally similar methylcyclopentane (30.9) or cyclohexene (23.9), is far less than the experimentally measured 102.7. This error is likely due to the strong similarity between 1methylcyclopentene and intermediates in the CPDyl radical soot formation pathways. 44,45,48 In a similar case, the tendency for methyl cyclopentyl ether to form CPDyl radicals may explain its relatively high sooting index compared to ethers with a similar carbon content. These outliers underscore the effect that small differences in molecular structure can have on reaction mechanisms. By examining where the model predictions fail for measured values we are therefore able to determine where sooting behavior is poorly predicted by simple inspection of a molecule’s structure, and therefore potentially discover new and interesting chemical phenomena.

12

ACS Paragon Plus Environment

Page 13 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

External validation As a final step, we evaluated the QSAR model’s performance on both the 59 reserved test compounds as well as three newly tested molecules not included in the original YSI database 19 (Table 1). In order to estimate confidence intervals on the model’s predictions, we constructed a bootstrap aggregating (bagging) model consisting of 25 independent regression pipelines, each fit to a randomly selected 90% of the training data. We plot the means and 95% confidence intervals for the training and testing dataset in Figure 4A and confirm that the median absolute error in test set predictions (3.08 YSI units) is similar to the median absolute error obtained from randomized cross-validation. The standard deviation in test-set predictions is 5.35, which indicates a 95% confidence interval of ±10 YSI units - about twice that of the experimental measurement itself. Additionally, the model does a reasonable job of predicting the YSI of the previously unmeasured compounds, with a mean error of just over 3 YSI units. Histograms of the training and test prediction error (Figure 4B-C) indicate that the errors are centered around 0 and are similarly distributed to the errors seen during cross-validation. Of the three newly tested compounds, selected due to their promise as potential bio-blendstocks, only 2-methylbut-3-en-2-ol passed the applicability domain threshold. The errors seen in the predictions for these compounds highlights the need for experimental YSI measurements of molecules with similar structural features when extending QSAR models to new regions.

Application of the model for virtual screening Once the QSAR model was constructed and its predictive performance validated, we applied the model to a database of measured research octane numbers (RON) in order to investigate the relationship between octane number, sooting propensity, and molecule type. 49 Higher RON fuels enable the more aggressive pursuit of technologies aimed at reducing fuel consumption in addition to direct-injection, including higher compression ratios, higher power density, turbocharged engines that enable smaller swept displacement volume (downsizing), 13

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

and operation at lower engine speeds (downspeeding). 50 Experimentally determined YSI measurements were used when available (58 of 92 molecules). The resulting plot of RON vs. YSI is shown in Figure 5. While some non-oxygenated molecules have low YSI values, they tend to also have low RON. With the inclusion of oxygenated molecules, however, this trend is broken by a substantial amount of low-sooting, high-RON molecules. Compounds such as ethanol, 2-Propanone, and ethyl acetate are among the best-performing oxygenate molecules by RON and YSI, while some oxygenates, including hexanoic acid or 3-hexanol, do not substantially differ from the YSI and RON values achievable by non-oxygenated compounds. This result is promising for the development of new spark-ignition engine fuel blends that are not as susceptible to sooting even in direct-injection engines. To make YSI predictions accessible for the research community, we have applied the model to the entire ChEBI database 51 and included means and standard deviations for molecules that fall within the applicability domain as supplemental materials.

Discussion Through the development and application of a predictive QSAR model, we have demonstrated that the yield sooting index can be well-predicted by molecular structure. QSAR models also prove useful in cases where the predictions fail, as they reveal new mechanisms and where further experimental inquiry is needed. The model therefore serves as a tool to amplify experimental effort and interpolate sooting measurements for similar uncharacterized compounds. Such a tool is needed in the ongoing effort to predict particulate matter formation rates from the composition of a particular fuel, as sooting tendency for blends seems to be relatively well-described by a linear function of the concentrations of its components. 52,53 While previous efforts have demonstrated success using simple descriptors based on double bond equivalents, 53 these techniques will likely not be able to capture the sooting behavior of more chemically-varied bio-blendstocks. 54 This work demonstrates that more descriptive

14

ACS Paragon Plus Environment

Page 14 of 28

Page 15 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

sooting indices can be efficiently and accurately predicted. The accuracy achieved by the QSAR model presented in this work should therefore be suitable to enable the development of more generalizable engine-level soot prediction models. However, these models will likely need to include additional measurements or empirical predictions, such as vapor pressure or heat of vaporization. 11 Future work will need to expand these structure-property relationships to higher-sooting species, such as aromatic species responsible for the majority of soot formation in traditional gasoline blends. 15,16 While the current YSI database 19 does not support the inclusion of these compounds, this work demonstrates that such a model may be able to find high-boiling point, low-sooting replacements for many of the most problematic sooting species in gasoline-range fuels.

Acknowledgement This research was conducted as part of the Co-Optimization of Fuels & Engines (Co-Optima) project sponsored by the U.S. Department of Energy (DOE) Office of Energy Efficiency and Renewable Energy (EERE), Bioenergy Technologies and Vehicle Technologies Offices. CSM, LDP, DDD appreciate assistance with the experiments from Avery Vella and financial support from the National Science Foundation (Awards CTS-3991122311 and CTS-1604983).

15

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 16 of 28

Page 17 of 28

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 18 of 28

Page 19 of 28

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 20 of 28

Page 21 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

Table 2: YSI over-predicted species by cross-validation. YSI indicates the experimentally determined YSI as reported in McEnally & Pfefferle, 2011, 17 while YSI(cv) indicates the k-fold cross-validation predictions (mean ± 95% confidence interval from 25 randomized replications). Species

CAS

YSI

YSI(cv)

cyclohexene 2-methoxytetrahydropyran 2,4,4-trimethyl-1-pentene 2-methyl-2-butene methylcyclopentane

110-83-8 6581-66-4 107-39-1 513-35-9 96-37-7

23.7 48.81 ± 9.11 -12.9 2.18 ± 9.14 59.2 74.06 ± 6.97 20.4 33.10 ± 4.90 30.9 42.90 ± 10.45

Table 3: YSI under-predicted species by cross-validation. YSI indicates the experimentally determined YSI as reported in McEnally & Pfefferle, 2011, 17 while YSI(cv) indicates the k-fold cross-validation predictions (mean ± 95% confidence interval from 25 randomized replications). Species

CAS

1-methylcyclopentene 693-89-0 methyl cyclopentyl ether 5614-37-9 2,4,4-trimethyl-2-pentene 107-40-4 cycloheptene 628-92-2 2,3,4-trimethyl-2-pentene 565-77-5

21

YSI

YSI(cv)

102.7 49.6 91.4 71.6 87.9

63.66 20.14 64.02 46.24 71.87

ACS Paragon Plus Environment

± 11.13 ± 4.78 ± 8.02 ±8.64 ± 8.47

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

References (1) Valavanidis, A.; Fiotakis, K.; Vlachogianni, T. Airborne Particulate Matter and Human Health: Toxicological Assessment and Importance of Size and Composition of Particles for Oxidative Damage and Carcinogenic Mechanisms. J. Environ. Sci. Heal. C 2008, 26, 339–362. (2) Jacobson, M. Z. Strong radiative heating due to the mixing state of black carbon in atmospheric aerosols. Nature 2001, 409, 695–697. (3) Singh, N.; Rutland, C. J.; Foster, D. E.; Narayanaswamy, K.; He, Y. Investigation into Different DPF Regeneration Strategies Based on Fuel Economy Using Integrated System Simulation. SAE Technical Paper Series 2009, (4) Sappok, A.; Wong, V. W. Ash Effects on Diesel Particulate Filter Pressure Drop Sensitivity to Soot and Implications for Regeneration Frequency and DPF Control. SAE Int. J. Fuels Lubr. 2010, 3, 380–396. (5) Schreiber, D.; Forss, A.-M.; Mohr, M.; Dimopoulos, P. Particle Characterisation of Modern CNG, Gasoline and Diesel Passenger Cars. SAE Technical Paper Series 2007, (6) Piock, W.; Hoffmann, G.; Berndorfer, A.; Salemi, P.; Fusshoeller, B. Strategies Towards Meeting Future Particulate Matter Emission Requirements in Homogeneous Gasoline Direct Injection Engines. SAE Int. J. Engine 2011, 4, 1455–1468. (7) Zimmerman, N.; Wang, J. M.; Jeong, C.-H.; Wallace, J. S.; Evans, G. J. Assessing the Climate Trade-Offs of Gasoline Direct Injection Engines. Environ. Sci. Technol. 2016, 50, 8385–8392. (8) Department of Energy, Co-Optimization of Fuels & Engines. https://energy.gov/ eere/bioenergy/co-optimization-fuels-engines. (9) Bockhorn, H., Ed. Soot Formation in Combustion; Springer Berlin Heidelberg, 1994. 22

ACS Paragon Plus Environment

Page 22 of 28

Page 23 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

(10) D1322, A. Test Method for Smoke Point of Kerosine and Aviation Turbine Fuel; 2015. (11) Barrientos, E. J.; Anderson, J. E.; Maricq, M. M.; Boehman, A. L. Particulate matter indices using fuel smoke point for vehicle emissions with gasoline, ethanol blends, and butanol blends. Combust. Flame 2016, 167, 308âĂŞ319. (12) Calcote, H.; Manos, D. Effect of molecular structure on incipient soot formation. Combust. Flame 1983, 49, 289–304. (13) Barrientos, E. J.; Lapuerta, M.; Boehman, A. L. Group additivity in soot formation for the example of C-5 oxygenated hydrocarbon fuels. Combust. Flame 2013, 160, 1484–1498. (14) Crossley, S. P.; Alvarez, W. E.; Resasco, D. E. Novel Micropyrolyis Index (MPI) to Estimate the Sooting Tendency of Fuels. Energy Fuels 2008, 22, 2455–2464. (15) McEnally, C.; Pfefferle, L. Improved sooting tendency measurements for aromatic hydrocarbons and their implications for naphthalene formation pathways. Combust. Flame 2007, 148, 210–222. (16) McEnally, C. S.; Pfefferle, L. D. Sooting tendencies of nonvolatile aromatic hydrocarbons. P. Combust. Inst. 2009, 32, 673–679. (17) McEnally, C. S.; Pfefferle, L. D. Sooting Tendencies of Oxygenated Hydrocarbons in Laboratory-Scale Flames. Environ. Sci. Technol. 2011, 45, 2498–2503. (18) Das, D. D.; McEnally, C. S.; Pfefferle, L. D. Sooting tendencies of unsaturated esters in nonpremixed flames. Combust. Flame 2015, 162, 1489–1497. (19) Das, D. D.; McEnally, C. S.; Pfefferle, L. D. Yield sooting index database volume 1. Harvard Dataverse 2016,

23

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(20) Dudek, A.; Arodz, T.; Galvez, J. Computational Methods in Developing Quantitative Structure-Activity Relationships (QSAR): A Review. Comb. Chem. High T. Scr. 2006, 9, 213–228. (21) Gramatica, P. Principles of QSAR models validation: internal and external. QSAR Comb. Sci. 2007, 26, 694–701. (22) Al-Fahemi, J. H.; Albis, N. A.; Gad, E. A. M. QSPR Models for Octane Number Prediction. J Theor. Chem. 2014, 2014, 1–6. (23) Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Pidol, L.; Jeuland, N.; Creton, B. Flash Point and Cetane Number Predictions for Fuel Compounds Using Quantitative Structure Property Relationship (QSPR) Methods. Energy Fuels 2011, 25, 3900–3908. (24) Hunt, R. A. Relation of Smoke Point to Molecular Structure. Ind. Eng. Chem. 1953, 45, 602–606. (25) Benson, S. W.; Buss, J. H. Additivity Rules for the Estimation of Molecular Properties. Thermodynamic Properties. J. Chem. Phys. 1958, 29, 546. (26) Pepiot-Desjardins, P.; Pitsch, H.; Malhotra, R.; Kirby, S.; Boehman, A. Structural group analysis for soot reduction tendency of oxygenated fuels. Combust. Flame 2008, 154, 191–205. (27) Yan, S.; Eddings, E. G.; Palotas, A. B.; Pugmire, R. J.; Sarofim, A. F. Prediction of Sooting Tendency for Hydrocarbon Liquids in Diffusion Flames. Energy Fuels 2005, 19, 2408âĂŞ2415. (28) Das, D. D.; Cannella, W. J.; McEnally, C. S.; Mueller, C. J.; Pfefferle, L. D. Twodimensional soot volume fraction measurements in flames doped with large hydrocarbons. Proc. Combust. Inst. 2016,

24

ACS Paragon Plus Environment

Page 24 of 28

Page 25 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

(29) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Modeling 1988, 28, 31âĂŞ36. (30) RDKit: Open-source cheminformatics. http://www.rdkit.org. (31) Kode, Dragon 7.0. https://chm.kode-solutions.net/products_dragon.php. (32) McKinney, W. Data structures for statistical computing in python. P. 9th Python Sci. 2010. (33) Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. (34) Chollet, F. keras. https://github.com/fchollet/keras. (35) Bergstra, J.; Yamins, D.; Cox, D. D. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. P. 12th Python Sci. 2013. (36) McEnally, C.; Schaffer, A.; Long, M.; Pfefferle, L.; Smooke, M.; Colket, M.; Hall, R. Computational and experimental study of soot formation in a coflow, laminar ethylene diffusion flame. Symp. Int. Combust. Proc. 1998, 27, 1497–1505. (37) Frisch, M. J. et al. Gaussian 09 Revision D.01. Gaussian Inc. Wallingford CT 2009. (38) Martin, T. M.; Harten, P.; Young, D. M.; Muratov, E. N.; Golbraikh, A.; Zhu, H.; Tropsha, A. Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling? J. Chem. Inf. Model. 2012, 52, 2570–2578. (39) Schuffenhauer, A.; Brown, N.; Selzer, P.; Ertl, P.; Jacoby, E. Relationships between Molecular Complexity, Biological Activity, and Structural Diversity. J. Chem. Inf. Model. 2006, 46, 525–535.

25

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(40) Jaworska, J.; Nikolova-Jeliazkova, N. QSAR applicability domain estimation by projection of the training set descriptor space: a review. Altern. Lab. Anim. 2005, (41) Liu, F. T.; Ting, K. M.; Zhou, Z.-H. Isolation Forest. IEEE Conf. Data Min. 2008, (42) Richter, H.; Howard, J. Formation of polycyclic aromatic hydrocarbons and their growth to soot—a review of chemical reaction pathways. Prog. Energ. Combust. 2000, 26, 565–608. (43) Frenklach, M. Reaction mechanism of soot formation in flames. Phys. Chem. Chem. Phys. 2002, 4, 2028–2037. (44) Wang, D.; Violi, A.; Kim, D. H.; Mullholland, J. A. Formation of Naphthalene, Indene, and Benzene from Cyclopentadiene Pyrolysis: A DFT Study. J. Phys. Chem. A 2006, 110, 4719–4725. (45) Cavallotti, C.; Polino, D.; Frassoldati, A.; Ranzi, E. Analysis of Some Reaction Pathways Active during Cyclopentadiene Pyrolysis. J. Phys. Chem. A 2012, 116, 3313–3324. (46) Herndon, W. C.; Lowry, L. L. Mechanism of the Pyrolysis of Bicyclo [2.2.1]heptadiene. Kinetics of the Bicyclo [2.2.1]heptadiene to Toluene Isomerization. J. Am. Chem. Soc. 1964, 86, 1922–1926. (47) Buckingham, G. T.; Porterfield, J. P.; Kostko, O.; Troy, T. P.; Ahmed, M.; Robichaud, D. J.; Nimlos, M. R.; Daily, J. W.; Ellison, G. B. The thermal decomposition of the benzyl radical in a heated micro-reactor. II. Pyrolysis of the tropyl radical. J. Chem. Phys. 2016, 145, 014305. (48) Scheer, A. M.; Mukarakate, C.; Robichaud, D. J.; Ellison, G. B.; Nimlos, M. R. Radical Chemistry in the Thermal Decomposition of Anisole and Deuterated Anisoles: An Investigation of Aromatic Growth. J. Phys. Chem. A 2010, 114, 9043–9056.

26

ACS Paragon Plus Environment

Page 26 of 28

Page 27 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Energy & Fuels

(49) Fuel

Property

Database.

https://fuelsdb.nrel.gov/fmi/webd#

FuelEngineCoOptimization. (50) Leone, T. G.; Anderson, J. E.; Davis, R. S.; Iqbal, A.; Reese, R. A.; Shelby, M. H.; Studzinski, W. M. The Effect of Compression Ratio, Fuel Octane Rating, and Ethanol Content on Spark-Ignition Engine Efficiency. Environ. Sci. Technol. 2015, 49, 10778– 10789. (51) Hastings, J.; de Matos, P.; Dekker, A.; Ennis, M.; Harsha, B.; Kale, N.; Muthukrishnan, V.; Owen, G.; Turner, S.; Williams, M.; Steinbeck, C. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 2012, 41, D456–D463. (52) Gill, R. J.; Olson, D. B. Estimation of Soot Thresholds for Fuel Mixtures. Combust. Sci. and Technol. 1984, 40, 307âĂŞ315. (53) Aikawa, K.; Sakurai, T.; Jetter, J. J. Development of a Predictive Model for Gasoline Vehicle Particulate Matter Emissions. SAE Int. J. Fuels Lubr. 2010, 3, 610–622. (54) Ratcliff, M. A.; Burton, J.; Sindler, P.; Christensen, E.; Fouts, L.; Chupka, G. M.; McCormick, R. L. Knock Resistance and Fine Particle Emissions for Several BiomassDerived Oxygenates in a Direct-Injection Spark-Ignition Engine. SAE Int. J. Fuels Lubr. 2016, 9, 59–70.

27

ACS Paragon Plus Environment

Energy & Fuels

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 28 of 28