Predicting the Net Heat of Combustion of Organosilicon Compounds

Article pubs.acs.org/IECR

Predicting the Net Heat of Combustion of Organosilicon Compounds from Molecular Structures

Yong Pan,* Juncheng Jiang, and Yinyan Zhang

College of Urban Construction & Safety Engineering, Nanjing University of Technology, Nanjing 210009, China

Supporting Information

ABSTRACT: The net heat of combustion is one of the most important properties of flammable substances and can be used to estimate the potential fire hazards of chemicals once they ignite and burn. This study proposes a quantitative structure−property relationship model to predict the net heat of combustion of 308 organosilicon compounds from knowledge of their molecular structures alone. Various kinds of molecular descriptors, such as topological, charge, and geometric descriptors, were calculated to represent the molecular structures of the organosilicon compounds. A genetic algorithm combined with multiple linear regression was employed to select the optimal subset of descriptors that contribute significantly to the net heat of combustion. The best resulting model is a three-variable multilinear model, with a root-mean-square error and an average absolute error for the external test set of 176.8 and 111.2 kJ/mol, respectively. Model validation was also performed to check the stability and predictive capability of the presented model. The results showed that the presented model is valid and predictive. This study provides a new way of predicting the net heat of combustion of organosilicon compounds for engineering purposes.

concluded that reliable and accurate ΔHc° data for various chemicals are important and valuable in engineering design (process and plant design) and in risk assessment for loss prevention. There are large amounts of ΔHc° data for organic compounds in the literature. However, ΔHc° data for organosilicon compounds are scarce. As a result, reliable and accurate ΔHc° data for organosilicon compounds are required for practical industrial processes in organosilicon chemistry.
Experimental methods have shortcomings: for example, the measurement of ΔHc° depends strongly on the apparatus and the test methods employed, and the measurement work is expensive and time-consuming. The development of convenient and reliable theoretical methods for predicting the ΔHc° of organosilicon compounds is therefore highly desirable. Few works on predicting the ΔHc° of organosilicon compounds have been reported in the literature, apart from the work of Hshieh.4 In his paper, empirical equations were developed to predict the gross and the net heats of combustion of organosilicon compounds based on the atomic contribution method. The average absolute percent errors of the predictions were 1.4% and 1.5%, respectively. This work was of great importance given the absence of theoretical models for predicting the ΔHc° of organosilicon compounds. Moreover, the presented equations were considered to have satisfactory predictive capability and to be simple to apply. However, these equations also suffer from disadvantages. For example, the

1. INTRODUCTION

Organosilicon compounds are organic compounds containing carbon−silicon bonds as an integral part of the molecule, and organosilicon chemistry is the corresponding science exploring their properties and reactivity.1 Recently, the use of organosilicon compounds in organic chemistry, for example in organic synthesis, has become increasingly important. Meanwhile, in the silicone industry, organosilicon compounds are also produced and used in large quantities as the building blocks of silicone materials. For example, as semiconductor-related industries have made great progress in recent years, more and more organosilicon compounds are synthesized and employed in these industries.2 Some organosilicon compounds are chemically or physically stable; however, some of them are extremely reactive and flammable.3 Consequently, it is important to study the flammability characteristics of these compounds for safety reasons. The heat of combustion of a substance is defined as the heat evolved when that substance is converted to its final oxidation products by means of molecular oxygen.4 The standard net heat of combustion (ΔHc°) is defined as the increase in enthalpy when a substance in its standard state, at a temperature of 298.15 K and a pressure of 1 atm, undergoes oxidation to defined combustion products. ΔHc° is an important physicochemical property that can be used to calculate the heat of reaction in many chemical engineering processes, such as the hydrogenation and dehydrogenation of hydrocarbons. In addition, the ΔHc° values of reactive chemicals can be used to estimate the potential fire hazards of chemicals once they ignite and burn and thus to measure the risk level of chemicals in production, storage, and transportation. Reliable and accurate ΔHc° data are therefore always required, and are considered absolutely necessary, when preparing plant designs. Accordingly, it can be reasonably

© 2012 American Chemical Society

Received: April 11, 2012
Revised: September 2, 2012
Accepted: September 5, 2012
Published: September 5, 2012

dx.doi.org/10.1021/ie300952x | Ind. Eng. Chem. Res. 2012, 51, 13274−13281

Industrial & Engineering Chemistry Research

Article

efficient program for the calculation of molecular descriptors. The calculation is based on the minimum-energy molecular geometries optimized by the HyperChem software (version 7.5, HyperChem is copyrighted by Hypercube, Inc.) using the MM+ molecular mechanics force field and the AM1 semiempirical method, since the values of many descriptors are related to bond lengths and bond angles. In all, a total of 1664 descriptors were calculated for each compound in the data set. A detailed description of the types of descriptors that Dragon can calculate and of the calculation procedure can be found in the Dragon software user’s guide.16 Next, considering that some descriptors cannot encode the structural differences between compounds that account for their different ΔHc° values, the descriptors that stayed constant or near constant for all molecules were removed from the descriptor pool. Moreover, pairwise correlations between descriptors were examined to further reduce the descriptor pool, and only one descriptor was retained from each pair contributing similar information (correlation coefficient >0.98 in this study). These reductions resulted in a reduced pool of 834 descriptors for further study.

2.3. Descriptor Selection and Model Development. One of the most important problems in QSPR studies is selecting the optimal subset of descriptors that contribute significantly to the desired property. The benefit gained from descriptor selection in QSPR is not only the validity and stability of the model but also the interpretability of the relationship between the descriptors and the desired property. The genetic algorithm (GA) is a well-accepted method for solving this problem. The GA is a powerful optimization method for searching for global or near-global optima. It is a stochastic, population-based optimization approach.
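The GA-driven descriptor search of Section 2.3 can be illustrated with a short Python sketch. This is not the authors' MATLAB program: the function names `loo_rmsecv` and `ga_mlr`, the truncation selection, the elitism step, and the crossover and mutation operators are all assumptions chosen for brevity. Only the fitness criterion (cross-validated RMSE of an MLR model) and the default parameter values (population 100, 200 generations, Pc = 0.5, Pm = 0.01) follow the text.

```python
import numpy as np

def loo_rmsecv(X, y):
    """Leave-one-out RMSE of an intercept MLR model, via the hat-matrix identity."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                  # projection ("hat") matrix
    resid = y - H @ y                          # ordinary residuals
    h = np.clip(np.diag(H), None, 1 - 1e-12)   # guard against division by zero
    return float(np.sqrt(np.mean((resid / (1.0 - h)) ** 2)))

def ga_mlr(X, y, k, pop_size=100, generations=200, pc=0.5, pm=0.01, seed=0):
    """Toy GA: search for the k-descriptor subset with minimal LOO RMSECV."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    cache = {}

    def fitness(chrom):
        # A chromosome is a sorted tuple of k distinct descriptor column indices.
        if chrom not in cache:
            cache[chrom] = loo_rmsecv(X[:, list(chrom)], y)
        return cache[chrom]

    def random_chrom():
        picks = rng.choice(n_desc, size=k, replace=False)
        return tuple(sorted(int(g) for g in picks))

    pop = [random_chrom() for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        ranked = sorted(pop, key=fitness)
        parents = ranked[: max(2, pop_size // 2)]    # truncation selection (assumed)
        children = [best]                            # elitism (assumed)
        while len(children) < pop_size:
            ia, ib = rng.choice(len(parents), size=2, replace=False)
            genes = list(parents[ia])
            if rng.random() < pc:                    # crossover: resample from the union
                pool = sorted(set(parents[ia]) | set(parents[ib]))
                genes = [int(g) for g in rng.choice(pool, size=k, replace=False)]
            for j in range(k):                       # mutation: swap in an unused descriptor
                if rng.random() < pm:
                    unused = [d for d in range(n_desc) if d not in genes]
                    if unused:
                        genes[j] = int(rng.choice(unused))
            children.append(tuple(sorted(genes)))
        pop = children
        best = min(pop, key=fitness)
    return list(best), fitness(best)
```

Calling `ga_mlr` for k = 1, 2, 3, ... and stopping once the returned RMSECV no longer improves appreciably mirrors the stepwise increase in model size described in the text.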
A detailed description of the GA can be found in ref 17. As a powerful optimization approach, the GA has been successfully applied to feature selection in QSPR studies.18−24 In this study, the GA, together with the multiple linear regression (MLR) method (GA-MLR), was employed to find the optimal subset of descriptors that accurately represents the relationship between molecular structure and ΔHc° for organosilicon compounds. GA-MLR is a hybrid approach that combines the GA, as a powerful optimization and variable selection method, with MLR, a popular statistical modeling method. This algorithm was first presented by Leardi et al.18 In this study, the program required to perform GA-MLR was written as a MATLAB M-file in our laboratory. The chromosome and its fitness function correspond to a set of descriptors and the root-mean-square error of cross-validation (RMSECV), respectively. To obtain the best QSPR model, the descriptor space is thoroughly explored by GA-MLR analysis. Models with varying numbers of descriptors are examined. Because the minimum possible number of descriptors must be tested first, the program is started with one descriptor. After the program is run, the best one-parameter regression model with the minimal RMSECV value is obtained. Then, the number of desired variables is increased to two, three, four, and so on to find the best multiparameter regression models with the desired number of descriptors. All calculations must be repeated for each of them. The primary criterion for deciding whether a descriptor is selected is the fitness function (RMSECV). When an increase in the number of variables did not significantly improve the RMSECV of the best-obtained model, it was determined that

developed equations have not been externally validated to evaluate their true predictive ability for organosilicon compounds that were not used in model development. Moreover, the applicability range of these equations is closely tied to the studied data set: new chemicals containing atoms not included in those used for model development fall outside the applicability range of the model and thus cannot be predicted. Quantitative structure−property relationship (QSPR) studies, which relate the properties of interest to the molecular structures of compounds as represented by a variety of molecular descriptors, have been demonstrated to be an effective approach for predicting various physicochemical properties, such as boiling point, melting point, flash point, vapor pressure, water solubility, and critical properties, and have been extensively reviewed elsewhere.5−12 The basic strategy of QSPR analysis is to find optimum quantitative relationships between molecular structure and physicochemical properties, which can then be used to predict physicochemical properties from molecular structures alone. The advantage of this approach over other methods is that it requires only knowledge of the chemical structure and does not depend on any experimental properties. Moreover, the molecular descriptors used in QSPR models, which are calculated solely from molecular structures, have definite physical meanings, which is useful for probing the physicochemical information that contributes significantly to the targeted properties. In this work, in order to explore a new way of predicting the ΔHc° of organosilicon compounds, which are of industrial importance, QSPR research is presented to study the optimum quantitative relationships between molecular structures and the ΔHc° of a number of organosilicon compounds.
The main purpose is to develop a reliable QSPR model for predicting the ΔHc° of various organosilicon compounds from their molecular structures alone and also to identify and provide some insight into what structural features are most related to the potential fire hazards of organosilicon compounds.

2. MATERIALS AND METHODS

2.1. Data Set. The data set for this study consisted of a diverse set of 308 organosilicon compounds, all of which were taken from the work of Hshieh.13 The ΔHc° values of these compounds ranged from 383 to 26181.7 kJ/mol. A complete list of the compounds and their corresponding reported ΔHc° values is presented in the Supporting Information. These intermediates cover most of the important organosilicon compounds that are manufactured or used in the silicone industry.

2.2. Determination of Molecular Descriptors. Molecular descriptors are molecular-based theoretical parameters that are calculated solely from molecular structures using known mathematical algorithms. An important step in a QSPR study is the definition of the molecular structure. To obtain a QSPR model, compounds must be represented by a variety of molecular descriptors, such as topological, geometrical, electrostatic, and quantum chemical descriptors, which have been extensively reviewed elsewhere.14,15 Each type of descriptor is related to specific types of interaction between chemical groups in a molecule. In this work, the molecular descriptors used to search for the best model for ΔHc° prediction are calculated by the Dragon program (version 5.4, DRAGON is copyrighted by TALETE srl),16 which is an


number of generations is 200. Moreover, leave-one-out (LOO) CV is employed. Models with varying numbers of descriptors are examined. First, the program is started with one descriptor, and the best one-parameter regression model with the minimal RMSECV value is obtained. Then, the number of desired variables is increased to two, three, four, and so on to find the best multiparameter regression models with the desired number of descriptors. When an increase in the number of variables did not significantly improve the RMSECV of the best-obtained model, it was determined that the optimum subset of descriptors had been achieved. Finally, an optimum subset of three descriptors is obtained. The types and definitions of these

the optimum subset of descriptors that yielded the best MLR model had been achieved.

2.4. Model Validation. For QSPR studies, model validation is of crucial importance in model development. The calibration and predictive capability of developed QSPR models should be tested through model validation. The most widely used statistic, the squared correlation coefficient (R2), provides a reliable indication of the goodness of fit of the model; thus, it was employed to validate the calibration capability of the QSPR model in this study. For the validation of the predictive capability of QSPR models, two basic approaches (internal validation and external validation) exist. Cross-validation (CV) is one of the most often used methods for internal validation. A good CV result (Q2) often indicates good robustness and high internal predictive ability of a QSPR model. In this work, leave-many-out (LMO, 20% out) CV (Q2LMO) is employed. Moreover, the bootstrap resampling method is another efficient and stable approach to internal validation. The basic premise of bootstrap resampling is that the data set should be representative of the population from which it was drawn. Since there is only one data set, bootstrapping simulates what would happen if the samples were selected randomly. Bootstrapping can be seen as a smoother version of CV, and as in LMO−CV validation, a high average Q2 in the bootstrap validation (Q2BOOT) demonstrates the robustness of the model. Thus, bootstrapping is also employed for internal validation. External validation is a significant and necessary validation method used to determine both the generalizability and the true predictive capability of QSPR models for new chemicals by splitting the available data set into a training set and an external test set. The training set is used for descriptor selection and model development, and the test set is used for model validation.
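As an illustration of the internal validation described above, the following Python sketch estimates Q2LMO by repeatedly leaving out 20% of the training compounds. The helper names (`fit_mlr`, `predict_mlr`, `q2_lmo`), the number of repeats, and the exact Q2 formula (1 − PRESS/SS, with SS taken about the mean of the retained compounds) are assumptions; the source does not state which variant was used.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least-squares MLR with intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]

def q2_lmo(X, y, frac_out=0.2, repeats=100, seed=0):
    """Average Q2 over random leave-many-out splits (assumed definition)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_out = max(1, int(round(frac_out * n)))
    q2_values = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        out, keep = idx[:n_out], idx[n_out:]
        coef = fit_mlr(X[keep], y[keep])
        press = np.sum((y[out] - predict_mlr(coef, X[out])) ** 2)
        ss = np.sum((y[out] - y[keep].mean()) ** 2)
        q2_values.append(1.0 - press / ss)
    return float(np.mean(q2_values))
```

A bootstrap estimate (Q2BOOT) would follow the same pattern, drawing the retained compounds with replacement and predicting those left out of each draw.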
Moreover, as we know, a QSPR model cannot be verified for its predictivity by checking only a few compounds; as in such cases the results could be obtained by chance, and it is impossible to obtain general conclusions.25 That is to say, the model must be tested on a sufficiently large number of compounds not used in the model development.25 So in this work, the whole data set is randomly divided into a training set with 247 compounds (80% of the data set) and a test set with 61 compounds (20% of the data set). Two types of validation criteria are usually used: those based on variations of the Q2 form, and those based on the difference between the experimental and the predicted data (by the model) of the test set. In this study, the squared correlation coefficient for external validation (Q2EXT) is employed. Recently, a conceptually simple statistical parameter, the concordance correlation coefficient (CCC), was proposed as a criterion for the external validation of a QSPR model for predictivity on new chemicals.26 This CCC parameter is considered to be more intuitive and efficient and should suffice to evaluate the model performance in terms of relative measurement of errors. Moreover, because no training set information is involved, CCC can be considered a true external validation measure, independent of the sampled chemical space.26 So, in this study, the concordance correlation coefficient (CCC) is also employed for the external validation of QSPR model.
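The external validation criteria named above can be computed directly from the observed and predicted test-set values. The sketch below is an illustration only: the function name `external_metrics` is hypothetical, Q2EXT is taken in the common form that references the training-set mean (an assumption, since the section does not give the formula), and CCC follows the usual concordance correlation coefficient definition.

```python
from math import sqrt

def external_metrics(y_obs, y_pred, y_train_mean):
    """AAE, RMSE, Q2_EXT and CCC for an external test set (plain Python lists)."""
    n = len(y_obs)
    resid = [o - p for o, p in zip(y_obs, y_pred)]
    aae = sum(abs(r) for r in resid) / n
    rmse = sqrt(sum(r * r for r in resid) / n)
    # Assumed Q2_EXT form: residual sum of squares relative to the
    # spread of the test observations about the training-set mean.
    q2_ext = 1.0 - sum(r * r for r in resid) / sum((o - y_train_mean) ** 2 for o in y_obs)
    mean_o = sum(y_obs) / n
    mean_p = sum(y_pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred))
    ss_o = sum((o - mean_o) ** 2 for o in y_obs)
    ss_p = sum((p - mean_p) ** 2 for p in y_pred)
    # CCC uses only test-set observations and predictions.
    ccc = 2.0 * cov / (ss_o + ss_p + n * (mean_o - mean_p) ** 2)
    return {"AAE": aae, "RMSE": rmse, "Q2_EXT": q2_ext, "CCC": ccc}
```

For example, predictions that are uniformly 1 unit too high for observations 1, 2, 3, 4 give AAE = RMSE = 1 and CCC = 10/14, showing how CCC penalizes a systematic offset even when the correlation is perfect.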

Table 1. Molecular Descriptors Used in the Obtained Model and Their Definitions

molecular descriptor | type | definition | ME value
Sv | constitutional descriptor | sum of atomic van der Waals volumes (scaled on carbon atom) | −3.321
nHM | constitutional descriptor | no. heavy atoms | 20.397
Seige | topological descriptor | eigenvalue sum from electronegativity weighted distance matrix | 0.118

descriptors are presented in Table 1. The corresponding best MLR model is presented as follows:

ΔHc° = 403.431(±1.530)Sv − 426.375(±9.384)nHM − 2186.751(±20.071)Seige − 163.876(±13.676)    (1)

383 kJ/mol ≤ ΔHc° ≤ 26181.7 kJ/mol

R2 = 0.999, Q2LMO = 0.999, Q2BOOT = 0.999, Q2EXT = 0.998, s = 140.9, n = 246, F = 71656, CCC = 0.999

where n is the number of organosilicon compounds used in the model, s is the standard deviation of the model, F is the F value of the model, and CCC is the CCC value of the model. The standardized regression coefficients indicate the significance of the individual descriptors in the MLR model: the greater the absolute value of a coefficient, the greater the weight of the descriptor in the model. A detailed description of the theory of MLR can be found in the literature.27,28 Equation 1 shows that the ΔHc° of organosilicon compounds can be predicted using three molecular descriptors. As can be seen from Table 1, all three descriptors selected in the model are 2D descriptors, which can also be calculated from the simplified molecular input line entry system (SMILES). Moreover, these descriptors are freely accessible from the Milano Chemometrics and QSAR research group Web site (http://michem.disat.unimib.it/mol_db/). The physical meanings of these descriptors are interpreted as follows: “Sv” is a measure of the molecular volume. “nHM” is the number of heavy atoms (silicon atoms in this study) in a molecule. For the organosilicon compounds studied here, this is empirically the most important parameter for evaluating the available heat of combustion. “Seige” is a measure of the strength of bonds in a molecule.14 All three

3. RESULTS AND DISCUSSION

3.1. Results of Prediction. The GA-MLR procedure was performed on the training set. The number of individuals in the population is 100; the crossover probability (Pc) is 0.5; the mutation probability (Pm) is 0.01, and the maximum


descriptors are mainly correlated with simple features of the compounds, such as molecular size and volume, and indicate the presence of a certain atom (silicon). Therefore, the overall ΔHc° property of organosilicon compounds can be reasonably explained by their steric effects. Next, to investigate which of the three descriptors appearing in the MLR model are the most important, the relative significance and contribution of each descriptor in the model were determined by calculating the mean effect (ME)29 of each descriptor using the following equation:

$$ME_j = \frac{\beta_j \sum_{i=1}^{n} d_{ij}}{\sum_{j}^{m} \beta_j \sum_{i}^{n} d_{ij}} \qquad (2)$$

where MEj represents the mean effect of descriptor j, βj is the coefficient of descriptor j, dij is the value of descriptor j for molecule i, n is the number of molecules, and m is the number of descriptors in the model. The mean effect value of a descriptor is the product of its mean and its regression coefficient in the MLR model, which shows the relative importance of each descriptor in comparison to the other descriptors. The sign (positive or negative) of ME shows the direction of the effect of each descriptor on the ΔHc° property. The ME values of the three descriptors are also shown in Table 1 and indicate that, among the selected descriptors, the most important one is nHM, as it has the highest mean effect value and the largest effect on the ΔHc° of the organosilicon compounds. Moreover, the positive sign of its ME indicates that as the number of heavy atoms (silicon) increases, the ΔHc° of the organosilicon compounds increases. The positive effect of this descriptor is in agreement with experiment. The negative sign of the ME of descriptor Sv indicates that as the molecular volume increases, the ΔHc° of the organosilicon compounds decreases, while the positive sign of the ME of descriptor Seige indicates that as the strength of bonds in a molecule increases, the ΔHc° of the organosilicon compounds increases. In conclusion, the relative importance and contribution of each descriptor in the model to the ΔHc° of the organosilicon compounds were determined and ranked as follows based on the ME values: nHM > Sv > Seige. Subsequently, the developed model is used to predict the ΔHc° values of the compounds in the test set for external validation. As a result, the predicted ΔHc° values for the 61 organosilicon compounds in the test set are obtained and presented in the Supporting Information. The main statistical parameters of the model are shown in Table 2.
A plot of the calibrated and predicted ΔHc° values vs the observed ones for both the training and test sets is shown in Figure 1.
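The mean effect of eq 2 is straightforward to compute once the regression coefficients and the descriptor matrix are available. The following Python sketch is illustrative only: the function name `mean_effects` and the toy numbers in the usage note are hypothetical and are not taken from the paper's data set.

```python
def mean_effects(beta, D):
    """Mean effect of each descriptor following eq 2.

    beta: list of m regression coefficients (intercept excluded).
    D:    list of n rows, each holding the m descriptor values of one molecule.
    """
    m = len(beta)
    # Column sums: sum over molecules i of d_ij for each descriptor j.
    col_sums = [sum(row[j] for row in D) for j in range(m)]
    # Denominator of eq 2: sum over descriptors of beta_j * column sum.
    total = sum(beta[j] * col_sums[j] for j in range(m))
    return [beta[j] * col_sums[j] / total for j in range(m)]
```

With the made-up inputs `beta = [2, -1]` and `D = [[1, 1], [3, 2]]`, the column sums are 4 and 3, the denominator is 5, and the function returns mean effects of 1.6 and −0.6.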

Figure 1. Correlation between the predicted (calibrated) and observed ΔHc° values for both the training and test sets.

As can be seen from Figure 1, the calibrated and predicted ΔHc° values agree satisfactorily with the observed ones for both the training and test sets. Moreover, the results presented in Table 2 show that the prediction errors of the model are low. It is also noteworthy that both the average absolute error (AAE) and RMSE values are not only low but also similar for the training and external test sets, which suggests that the newly proposed model has both predictive ability (low values) and generalization performance (similar values).25 Moreover, the percentage prediction error was calculated for all 308 organosilicon compounds. The average absolute percentage error obtained for these compounds was 2.4%. The results are shown in detail in Figure 2. As can be seen from

Figure 2. Percent errors of predicted ΔHc° and the number of compounds in each range.

Figure 2, the model predictions for the majority of compounds are very accurate, with percentage errors lower than 4%, and very few compounds are above the 10% error range.

3.2. Model Stability Validation and Results Analysis. The developed model was tested for chance correlation to further analyze the model stability. The Y-randomization test was employed, which is a widely used technique for ensuring the robustness of a QSPR model. In this test, the dependent-variable vector (Y vector) is randomly scrambled, and a new QSPR model is developed using the original independent-variable matrix. This process is repeated 50−100 times. It is expected that the resulting QSPR models should

Table 2. Main Statistical Parameters of the Obtained Model

statistical parameter | training set | test set
squared correlation coefficient for fitting (R2) | 0.999 | 0.998
squared correlation coefficient for leave-many-out cross-validation (Q2LMO) | 0.999 |
squared correlation coefficient for external validation (Q2EXT) | | 0.998
average absolute error (AAE), kJ/mol | 106.2 | 111.2
root-mean-square error (RMSE), kJ/mol | 140.6 | 176.8
number of compounds (n) | 247 | 61


generally have low R2 and Q2LMO values. Occasionally, high R2 and Q2LMO values may be obtained due to chance correlation. If all QSPR models obtained in the Y-randomization test have relatively high R2 and Q2LMO values, it implies that an acceptable QSPR model cannot be obtained for the given data set by the current modeling method.30 In this study, the Y-randomization test was performed on the training set 100 times. As expected, all the models generated produced low R2 and low Q2LMO values. The maximum R2 and Q2LOO values of the generated models on the training set were 0.135 and 0.133, respectively, which are much lower than those calculated when the dependent variables were not scrambled. It can thus be concluded that only the correct dependent variables can be used to generate reasonable models and that chance correlation had little or no effect on the presented model. The residuals of the predicted values vs the observed ones for the developed model are shown in Figure 3. As most of the calculated residuals are distributed on both sides of the zero line, one may conclude that there is no systematic error in the development of the presented model.
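The Y-randomization procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a fixed descriptor matrix and ordinary least squares; the function names `r2_mlr` and `y_randomization` are hypothetical, and only R2 is tracked here (a cross-validated Q2 could be recorded in the same loop).

```python
import numpy as np

def r2_mlr(X, y):
    """R2 of an intercept MLR model fitted by ordinary least squares."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(1.0 - resid @ resid / np.sum((y - y.mean()) ** 2))

def y_randomization(X, y, n_trials=100, seed=0):
    """R2 values obtained after scrambling y while keeping X unchanged."""
    rng = np.random.default_rng(seed)
    return np.array([r2_mlr(X, rng.permutation(y)) for _ in range(n_trials)])
```

If the maximum of the scrambled R2 values stays far below the R2 of the real model, as reported in the text (0.135 vs 0.999), chance correlation is an unlikely explanation for the fit.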

sets are, in a sense, similar to the real-life situation of unknown new chemicals. As a result, the whole data set was randomly divided into a training set and a test set in the present study. However, it must be noted that similarity analysis methods such as the self-organizing map (SOM) are also of great importance in studies of splitting methodologies. In a comparative study of these methods, it was demonstrated that the best QSPR models were built when similarity analysis was used rather than random selection.31 Consequently, in this study, the SOM splitting methodology was also employed to check the modeling results. As a result, the same three descriptors were selected in the new QSPR model, and the main statistical parameters of the model are shown in Table 3.

Table 3. Main Statistical Parameters of the Newly Obtained Model (by SOM Splitting)

statistical parameter | training set | test set
squared correlation coefficient for fitting (R2) | 0.999 | 0.998
squared correlation coefficient for leave-many-out cross-validation (Q2LMO) | 0.999 |
squared correlation coefficient for external validation (Q2EXT) | | 0.998
average absolute error (AAE), kJ/mol | 108.4 | 106.2
root-mean-square error (RMSE), kJ/mol | 151.6 | 164.4

As can be seen from Table 3, the main statistical parameters of the model newly obtained by SOM splitting are very close to those obtained by random selection through property sampling. The fact that the same descriptors were selected and that close modeling results were obtained with these two different splitting methodologies shows that variable selection under random splitting is unbiased with respect to “structure” and “response value” and that the previously obtained modeling results were not obtained by chance. All the results discussed above show that the presented model is valid and can be effectively used to predict the ΔHc° of organosilicon compounds using only theoretical descriptors derived solely from molecular structures. In consideration of the limited number of experimental data available and the complex nature of the ΔHc° of organosilicon compounds, it was not possible to improve the model predictions beyond the current results. However, from Figures 1 and 3, it must also be noted that there may be some outliers in the developed model. An analysis of residuals showed that octaphenylcyclotetrasiloxane, silane, and 5-hexenyldimethylchlorosilane had higher prediction errors, greater than three times the standard deviation. Such possible outliers would greatly reduce the prediction performance of the proposed model. The source of error for this type of outlier can be attributed either to the observed ΔHc° data or to structural features of the molecule that are not properly encoded in the model but have a large influence on the observed ΔHc°. A further analysis of the outliers is given below.

3.3. Definition of the Applicability Domain of the Model. Attention should also be paid to the applicability range of the developed QSPR model. Once a QSPR model is obtained, another crucial problem is the definition of its applicability domain (AD).
For any QSPR model, only the predictions for chemicals falling within its AD can be considered reliable and not model extrapolations.
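A common way to operationalize an AD is through leverage values. The Python sketch below assumes the usual hat-matrix form, h = x^T (X^T X)^(-1) x computed on the training descriptor matrix without an intercept column, and the widely used warning threshold h* = 3(p + 1)/n, where p is the number of descriptors and n the number of training compounds; both the threshold formula and the function names are assumptions rather than details given in this section.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage of each query compound relative to the training descriptor matrix."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    # h_i = x_i^T (X^T X)^-1 x_i, evaluated row by row.
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def warning_leverage(X_train):
    """Assumed warning threshold h* = 3(p + 1)/n."""
    n, p = X_train.shape
    return 3.0 * (p + 1) / n
```

A compound whose leverage exceeds `warning_leverage(X_train)` lies far from the centroid of the training descriptors, so its prediction would be treated as an extrapolation; plotting standardized residuals against these leverages gives a Williams plot.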

Figure 3. Plot of the residuals vs the observed ΔHc° values for the presented model.

Moreover, as discussed previously, in order to obtain a reliable (validated) QSPR model, the studied data set should be divided into training and test sets. Ideally, the best splitting must guarantee that the training and test sets are scattered over the whole area occupied by representative points in the descriptor space (representativity) and that the training set is distributed over the entire area occupied by representative points for the whole data set (diversity). Two different splitting methodologies are commonly applied, namely, similarity analysis (for instance, D-optimal distance, Kohonen Map-Artificial Neural Network (K-ANN), or Self Organizing Map) and straightforward random selection through property sampling. These methods help achieve desirable statistical characteristics of the training and test sets to varying degrees. As one of the most widely used methods for dividing a data set into training and test sets, random selection has its own advantages. First, random selection is an economical way of splitting. Second, it is useful if applied iteratively in splitting for internal validation. Moreover, with this way of selection, the randomly selected data in each of the test


the model well; thus they can stabilize the model and make it more precise, which implies that they should be considered not outliers but influential compounds. Also, it can be concluded from the two compounds in the test set that the developed model has good generalizability and predictivity for compounds with descriptor values significantly far from the centroid of the descriptor space. Moreover, two compounds (silane and n-octadecylmethyldichlorosilane) in the training set and one compound (5-hexenyldimethylchlorosilane) in the test set are wrongly predicted (>3 s) but have lower leverage values (h < h*). These erroneous predictions could probably be attributed to incorrect observed data rather than to molecular structures.25

3.4. Comparison with Previous Works. To the best of our knowledge, there is no QSPR study available in the literature for predicting the ΔHc° of organosilicon compounds from molecular structures alone. In 1999, Hshieh4 developed an empirical correlation model to predict the ΔHc° of organosilicon compounds based on the atomic contribution method. The resulting model was reported to predict the ΔHc° of organosilicon compounds with satisfactory accuracy. In this paper, to verify the validity of the presented QSPR approach, a general comparison between the present work and the work of Hshieh4 is performed. Because the two works were based on different data sets and different methods, and each method has its own advantages and disadvantages, not only the prediction results but also other important characteristics of the prediction models, such as applicability efficiency and applicability range, should be taken into account and analyzed. Consequently, a detailed comparison between the two works is presented as follows.
First, regarding the input parameters used in the prediction models, the model of Hshieh4 employed the numbers of nine different individual atoms in the empirical formula of an organosilicon compound, while the presented model employs three molecular descriptors that are calculated directly from the molecular structures. Of course, counting atoms is easier than calculating molecular descriptors. However, the theoretical molecular descriptors have definite physical meanings, which are useful for probing the structural characteristics that contribute significantly to the ΔHc° property of organosilicon compounds.

As for the statistical parameters of the prediction models, those of the presented QSPR model were slightly inferior to those of Hshieh's model4 in terms of the average absolute percent error (2.4 vs 1.5%). However, it must be noted that, compared with the work of Hshieh,4 our model was developed from a larger number of compounds in the data set (247 vs 105) for the ΔHc° predictions. Moreover, in the present work, model validation has been systematically performed to check the reliability and validity of the developed model. For example, our model has been externally validated with compounds not used in model development, which makes it less likely to suffer from chance correlation. In contrast, no model validation was carried out in Hshieh's work.

Finally, regarding the applicability efficiency and applicability range, both models are conceptually simple and easy to apply. However, it must be noted that the model of Hshieh4 was developed from fewer compounds in the data set for the ΔHc° predictions. Consequently, the regression range of Hshieh's empirical model (between 1436 and 26694 kJ/mol)

There are several methods for defining the AD of QSPR models,32 but the most common one is to determine the leverage value of each compound.25 To visualize the AD of a QSPR model, the plot of standardized residuals vs leverage values (h) (the Williams plot) was exploited in this study, which played a double role. First, it described the influence of each compound on the model through its leverage value. Leverage indicates a compound's distance from the centroid of X. The leverage of a compound in the original variable space is defined as33

$$h_i = x_i^{T}\,(X^{T}X)^{-1}\,x_i \tag{3}$$

where xi is the descriptor row-vector of the considered compound and X is the descriptor matrix derived from the training set descriptor values. The warning leverage (h*) is defined as33

$$h^{*} = 3p/n \tag{4}$$
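As an illustration of eqs 3 and 4, the following minimal Python sketch computes leverage values and the warning leverage, where n is the number of training compounds and p is the number of model variables plus one. It is not the authors' implementation. The descriptor values are synthetic random numbers, the helper names (`with_intercept`, `leverages`, `warning_leverage`) are hypothetical, and the inclusion of a column of ones in X is an assumption made so that the matrix has p columns. The training-set size of 247 is taken from the data-set size quoted in the comparison with Hshieh's work and is likewise an assumption; with three descriptors it gives a warning leverage that rounds to the 0.049 reported for Figure 4.

```python
import numpy as np


def with_intercept(descriptors):
    """Prepend a column of ones (assumed here so that X has p = variables + 1 columns)."""
    return np.column_stack([np.ones(len(descriptors)), descriptors])


def leverages(X_train, X_query):
    """Eq 3: h_i = x_i^T (X^T X)^{-1} x_i for every row x_i of X_query."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def warning_leverage(n_train, n_variables):
    """Eq 4: h* = 3p/n, with p the number of model variables plus one."""
    p = n_variables + 1
    return 3.0 * p / n_train


# Synthetic stand-in for a training set of 247 compounds and 3 descriptors.
rng = np.random.default_rng(0)
X_train = with_intercept(rng.normal(size=(247, 3)))

h_train = leverages(X_train, X_train)
h_star = warning_leverage(n_train=247, n_variables=3)

# A hypothetical query compound far from the centroid of the descriptor space.
X_far = with_intercept(np.array([[10.0, 10.0, 10.0]]))
h_far = leverages(X_train, X_far)

print(f"h* = {h_star:.3f}")
print("training compounds with h > h*:", int((h_train > h_star).sum()))
print("far query compound exceeds h*:", bool(h_far[0] > h_star))
```

A useful sanity check on such a calculation is that the training-set leverages sum to p, so their mean is p/n and the warning leverage is three times the average leverage.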

where n is the number of training compounds and p is the number of model variables plus one. A leverage (h) greater than the warning leverage (h*) suggested that the compound was very influential on the model. Second, it presented the Euclidean distances of the compounds to the model, measured by the cross-validated standardized residuals. A cross-validated standardized residual greater than three standard deviation (s) units classified the compound as a response outlier. The Williams plot for the presented model is shown in Figure 4. From this plot, the applicability domain is established

Figure 4. The Williams plot describing the applicability domain of the presented model (h* = 0.049).

inside a square area within 3 standard deviations and a leverage threshold h* of 0.049. Predicted ΔHc° data should be considered reliable only for compounds that fall within this AD, on which the model was constructed. It can be seen from Figure 4 that the majority of compounds in the data set are inside this area. However, one compound in the training set (octaphenylcyclotetrasiloxane) has h > h* and a standardized residual >3 s; it is both a response outlier and a high-leverage chemical. Meanwhile, four compounds in the training set are consistent with h > h* and the standardized residuals