Research Article pubs.acs.org/journal/ascecg
Biodegradability Prediction of Fragrant Molecules by Molecular Topology Vincent Blay,† Jesús Gullón-Soleto,‡ María Gálvez-Llompart,§ Jorge Gálvez,§ and Ramón García-Domenech*,§ †
Departamento de Ingeniería Química, Universitat de València, c/Doctor Moliner, 50, 46100 Burjassot, Spain Instituto de Tecnología Química (UPV-CSIC), Universidad Politécnica de Valencia, Camino de Vera, s/n, 46022 Valencia, Spain § Molecular Topology & Drug Design Research Unit, Departamento de Química Física, Universitat de València, Av. V. A. Estellés, s/n, 46100 Burjassot, Spain ‡
S Supporting Information *
ABSTRACT: Biodegradability is a key property in the development of safer fragrances. In this work we present a green methodology for its preliminary assessment. The structure of various fragrant molecules is characterized by computing a large set of topological indices. Those relevant to biodegradability are selected by means of a hybrid stepwise selection method to build a linear classifier. This model is compared with a more complex artificial neural network trained with the indices previously found. After validation, the models show promise for time and cost reduction in the development of new, safer fragrances. The methodology presented could easily be adapted to many quasi-big data problems in R&D environments. KEYWORDS: Fragrances, Biodegradability, Environmental hazard assessment, Topological descriptors, Molecular screening, Mathematical chemistry, Statistical learning, in silico
■
INTRODUCTION The fragrance in a cosmetic product is typically a mixture of many organic chemical compounds. These fragrances are present in almost any personal care product, being essential ingredients in colognes, perfumes or deodorants. Fragrances are also found in numerous other consumer products, such as laundry detergents, fabric softeners or room fresheners. Its formulation may comprise both pure synthetic as well as complex natural raw materials that are harmonized by the perfumer to compose his story.1 There are thousands of compounds comprising very different functional groups with potential use in fragrances (aldehydes, alcohols, acids, esters, epoxides, sulfides, amines, etc.). On top of that, small molecular changes may result in surprisingly different odor impressions or, contrarily, molecules whose structures look very different may still evoke similar odors.2 The truth is that our limited knowledge of odorant receptors often limits current olfactophore models (i.e., models relating chemical structure to sensory character) to be based on molecular similarity to known odorants.3 On the business side, the total market for fragrances is estimated at €6.5 billion,4 being a research intensive sector (around 7−8% of total sales are reinvested in R&D). Although new aroma chemicals enter the market every year, after 60 years of organic synthesis the organoleptic novelty is smaller. Therefore, success is no longer reliant only on new molecules but also importantly on the finished-product level and on discerning consumer trends ahead of time. This has to be © XXXX American Chemical Society
aligned with the business corporate social responsibility program, which defines how the company interacts with society and the environment to contribute to sustainable development. Among the activities in sustainable development programs, it could be of great interest to include the development of readily biodegradable formulations. At present, the fragrance industry worldwide is “self-regulating”, meaning that legislation is mostly provided by two fragrance organizations: the Research Institute for Fragrance Materials (RIFM) and the International Fragrance Association (IFRA). Given the low concentrations accessible to the respiratory system (namely in the ppm range), safety tests are often limited to skin irritation and sensitization, and phototoxicity. 5 However, the unregulated use of some synthetic essences could pose a serious environmental problem. Indeed, several of these compounds can be persistent in the environment, bioaccumulative, or even toxic. An example was the widespread use of diethyl phthalate in the formulation of fragrances, which is now listed as Category 1 endocrine disruptor by the European Commission on Endocrine Disruptors.6 In practice, however, a proper risk assessment provides a real view of the levels these substances can reach in the environment and their associated potential problems. For this assessment, information on biodegradability is necessary. Received: April 8, 2016 Revised: June 22, 2016
A
DOI: 10.1021/acssuschemeng.6b00717 ACS Sustainable Chem. Eng. XXXX, XXX, XXX−XXX
Research Article
ACS Sustainable Chemistry & Engineering
molecular structures, and thus could be adapted to proprietary data banks in industry. The group of molecules used in this work appears structurally heterogeneous, containing between 3 and 21 carbon atoms, and frequently between 1 and 4 oxygen atoms, mainly in carbonyl groups. Aromatic rings are also common, up to 2 in a given molecule, as well as the presence of several double bonds. Figure 1
In addition to being a step forward toward sustainability, the development of biodegradable formulations may also increase the value perceived by the final customer4 and advance to regulatory requirements. As a matter of fact, the European Union has already begun to take measures: by now, 26 allergens must be identifiable in those products containing them and the use of many other ingredients is already limited.7 In the last decades, mathematical chemistry and more particularly molecular topology have been applied to the design and search of new molecules useful in different areas.8−14 Molecular topology is a major area in mathematical chemistry that focuses on the description of the topological characteristics of molecules, that is, how different paths connect atoms in molecules without taking into consideration physical or geometrical magnitudes (energies, moments, Euclidean lengths, etc.). At present, three main methodologies are used: QSPR (Quantitative Structure−Property Relationships, based on molecular descriptors), molecular mechanics (based on classical mechanics) and quantum mechanics, which are regarded as techniques with complementary possibilities. In QSPR methods, one way to describe molecular structures is by means of graphs. Molecular graphs can be summarized by numerical topological indices or descriptors.15,16 This methodology presents some interesting advantages, such as affordable computational costs, the possibility to design without detailed knowledge of the mechanism or receptor that interacts with the molecule, and the possibility to identify new compounds based on suitable indices. In this paper, we present an application of QSPR methods to the classification of organic compounds with potential use in perfumery according to their biodegradability. To this end, we make use of two different statistical learning methods: linear discriminant analysis (LDA) and artificial neural networks (ANNs), built upon hybrid stepwise variable selection results as data mining tool. The methodology proposed could be adapted as a prescreening method to a wide variety of problems in industrial research programs.
■
Figure 1. Sample of readily biodegradable and nonreadily biodegradable molecules used in the study along with their calculated values for two topological indices. shows a few examples illustrative of the biodegradable and nonbiodegradable groups of molecules in the study. Some molecules also contain other heteroatoms, such as nitrogen or silicon. All this information is captured more deeply than a simple eye-level appreciation by the molecular indices computed. First, it is necessary to label the molecules in a way that can be interpreted unambiguously by the software that will be used later. One possibility is to use SMILES coding.20 These codes can be read directly by the software Dragon,21 which was used to compute the topological indices (TIs). These indices are calculated by applying well-defined algorithms on matrices that encode the molecular topology, such as the topological distance matrix. Figure 2 shows, as an example, the
MATERIALS AND METHODS
The collection of molecules and biodegradability used in this study was conducted recently by Boethling17 according to the OECD criterion by which biological oxygen demand should exceed 60% of its stoichiometric maximum and the dissolved organic carbon content should have decreased by at least 70% in a 28 day standardized test. Alternatively, CO2 formation can be monitored, like in Sturm tests (OECD TG 301 B), which also consider a 60% of the theoretical maximum as the pass criterion.18 In this regard, Wammer and Peters have published a careful experimental study on the biodegradation of polycyclic aromatic hydrocarbons19 from two to four rings, emphasizing the avoidance of physical and chemical processes that might interfere with the true biodegradation rates. In fact, their work reveals that intrinsic biodegradation rates do not vary as much with molecular size as had been reported for biodegradation in the environment, suggesting that bioavailability can become rate-limiting in practice. This leads us to point out that minor changes in the experimental procedure can easily translate into major differences in the biodegradation results, thus the importance of standardized tests and careful experimentation. Notice also that uncertainty associated with the experimental data, limits the quality of the predictions by any model and, therefore, a certain extent of misclassification is justified to avoid overfitting, as will be addressed below. From the data accessed, 130 known molecules were retrievable. Many more fragrant molecules have been notified to environmental agencies, although some of these are confidential and their structures are undisclosed. The method proposed requires knowledge of the
Figure 2. Calculation example of the Wiener Index, W, and Mean Wiener Index, WA, from the molecular graph and its topological distance matrix, D.
calculation of the mean Wiener index, WA, for compound No. 75 (ethyl acetate). In this work, 575 topological indices were computed for each molecule (see Supporting Information Tables S1 and S2). For the prediction of biodegradability, a linear discriminant model is proposed as a first approximation. A discriminant model is a classifier: it aims to predict a qualitative response from a given observation, in our case, whether the molecule presented, described by several topological indices, is biodegradable or not. A linear discriminant B
DOI: 10.1021/acssuschemeng.6b00717 ACS Sustainable Chem. Eng. XXXX, XXX, XXX−XXX
Research Article
ACS Sustainable Chemistry & Engineering model can be written as a discriminant function that is linear on the molecular indices: DF(λj) = a0 + a1·λ1 + a 2 · λ 2 + ... + an · λn
learning methods. Most ANNs comprise three processing layers: input, hidden and output layer, as illustrated in Figure 3. The input
(1)
where λi are the topological indices characteristic of the molecule j, and ai are the adjustable model parameters. The molecule j will be classified as biodegradable (B) if DF is positive and as nonbiodegradable (NB) if DF is negative. As indicated above, one can calculate many topological indices describing the molecules, but not all of them will actually be helpful to understand the property we are trying to model. Thus, it is necessary to select a subset of indices from which the model will be built. Here, one option would be to evaluate exhaustively all possible linear discriminant models. However, this is computationally impractical, as it would involve too many models. Starting from n molecular indices, one can train 2n linear discriminant models considering the different possible combinations of molecular indices. Interestingly, even if it were possible to train all these models, this approach does not guarantee the best performance, since the selected model would probably exhibit a high variance outside the training set of molecules. In practice, exhaustive search algorithms are seldom applied with more than 20 variables. Alternatively, one can use heuristics algorithms for variable selection, which are much faster although they do not ensure a global optimization. We used the hybrid stepwise selection algorithm22 as implemented in STATISTICA.23 The algorithm proceeds iteratively. At every step, it may either add a variable to the model or remove a variable already present in the model, according to the corresponding F-tests under predefined confidence levels. We chose 0.05 as the critical p-value for both variable addition and removal tests. In our case, the algorithm was stopped when the model contained 14 topological indices, which were enough to minimize the bias of the model while still allowing for exhaustive search. It was then possible to carry out an exhaustive search among all combinations in this reduced set of indices. This exhaustive search was performed using as the selection criterion the minimization of Wilks’ λ,24 keeping the best model of each size. Wilks’ λ applied this way measures whether every parameter is contributing significantly to reducing variance and not just the final classification results. Optimizing this statistic yields some confidence that the selected indices for a given model size are statistically significant, hopefully allowing us to draw a trend valid for the whole set of molecules more easily than by just optimizing for the correctness rate in the training. Notice that this approach may yield models containing variables not necessarily in the order in which they were selected in the previous hybrid stepwise selection algorithm, as will be seen later. Two points of interest are, on the one hand, to estimate the correctness rate achievable with new molecules not previously used when training the model. Likewise, it would be interesting to know the optimal model complexity, i.e. the most appropriate number of molecular indices to consider for making predictions without overfitting. Both cases are addressed by means of validation. Models in this paper were subjected both to internal and external validation.25 To use external validation, a random set of molecules was held out before the variable selection step. In our case, this hold-out set contained a 20% of the original data set of 130 molecules. Additionally, internal validation in LDA was applied as 5-fold cross validation. This consists in dividing the set of molecules (after hold-out of the external test set) into 5 groups, 4 of which are used for training the model and the remaining one is used for cross-validation (see Table S4). This is repeated 5 times, each time using a different group for cross-validation. The resulting correctness percentages are arithmetically averaged to yield a single figure. As an alternative to the previous methodology, we proposed a more complex model to describe the relation between topological indices and biodegradability. This model is based on an artificial neural network (ANN) trained with the descriptors found in the previous variable selection step. ANNs are one type of nonlinear statistical
Figure 3. Layout of the neural network used in this research: input layer with the 14 TIs selected, hidden layer with 12 neurons determined empirically, output layer corresponding to the dependent variable value (biodegradable, B/nonbiodegradable, NB).
layer receives data from input files. Intermediary features are derived by applying nonlinear activation functions (usually sigmoid-like) to linear combinations of the original variables in the hidden layer. Results from individual neurons in this hidden layer are linearly combined to yield the results in the output layer. In supervised ANNs, a training set of known outputs is used to train the network by assigning weights to each neuron. In the case of classification problems, the vector of outputs may be finally transformed to produce estimates that are positive and add up to unity, for instance, by a softmax function. Commercial software such as STATISTICA allows exploration of multiple combinations of functions and linear combinations of variables for a given network topology. ANN analysis was performed on 60% of the data set (training set), and was validated against the test set (20%), and an additional external validation using the remaining 20% of the original set of molecules selected at random. Results from a satisfactory network are presented below.
■
RESULTS AND DISCUSSION Linear Discriminant Analysis. Results of variable selection using the hybrid stepwise algorithm are shown in Table 1. The algorithm was stopped when 14 topological indices were selected. The indices’ names follow conventional nomenclature (please refer to the Nomenclature section) and their definition can be found in the literature.26 Notice that hundreds of such indices have been proposed and are nowadays included in software such as Dragon or PaDEL.27 As shown previously in Figure 2, indices are computed by algebraic operations on matrices derived from the molecular graphs. Nevertheless, in Table 1 one can recognize different classes of topological descriptors. Some of them are autocorrelation indices, which distinguish similarity of a given property along the molecule, such as the van der Waals volume or the molecular polarizability, as in the Broto-Moreau ATS5v or BEHp4 indices, respectively. There are other purely structural indices, such as PW3 (length 3 path count), and also charge transfer indices, which evaluate intramolecular charge transferences, such as JGI5 and JGI9. In a second step, we carried out an exhaustive search on possible linear discriminant models resulting from combining these 14 preselected indices. For every number of indices from 1 to 14, the linear discriminant model maximizing Wilks’ lambda was retained (Table 1). Notice that this approach yields models that do not necessarily keep the variables present in C
DOI: 10.1021/acssuschemeng.6b00717 ACS Sustainable Chem. Eng. XXXX, XXX, XXX−XXX
Research Article
ACS Sustainable Chemistry & Engineering
Table 1. Variable Selection for LDA Resulting from Hybrid Stepwise Selection Combined with Exhaustive Search over the Training Seta # indices Wilks’ λ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 a
0.626 0.484 0.402 0.363 0.328 0.297 0.273 0.246 0.219 0.204 0.193 0.180 0.168 0.159
correct (%)
Φ
78.8 90.4 87.5 91.3 91.3 93.3 95.2 96.2 96.2 96.2 97.1 97.1 98.1 98.1
0.556 0.791 0.729 0.813 0.813 0.856 0.897 0.917 0.917 0.918 0.939 0.939 0.959 0.959
EEig15x
ATS5v nDB
BEHv6
JGI5
WA
BEHp4 BEHe1
EEig08x
JGI9
PW3
PJI2 RBF
GATS8m
x
x x x x x
x x x x x x x x x x x x x
x
x x x x x x x x x x
x
x x x x x x
x x x x x x x x
x x x x x x x x x x x x
x x x x x x x
x x x x x x x x x x
x x x x x x x
x x x x x x x x
x x x x x
x x
x
x x x x
x x x
x
p < 0.05 for entry and removal.
that the correctness rate does not monotonically increase with the model size in the training set. This is because the linear models were built from the 14 selected variables with the aim of maximizing Wilks’ λ, which is not necessarily equivalent to maximizing the correctness rate. On the other hand, a nonmonotonic behavior in the validation set could be expected, as it is possible to overfit the classifier to the training set. Fortunately, it is observed that small models already yield reasonable predictions. This is observed both in the training and in the validation set. In particular, a model featuring 2 descriptors seems particularly appropriate in the light of these results. For instance, in Table 1 it can be seen that this model yields relatively high values of Wilks’ λ statistic and Pearson correlation coefficient. The discriminant function, DF, takes the form:
smaller models. This can be seen, for instance, when comparing models with 1 to 3 indices in the table. Table 1 also contains the Pearson correlation coefficient (also known as Matthews correlation coefficient28), Φ, evaluated for the different linear discriminant models. This statistic has the advantage to compensate for differences in group sizes and takes the following form:26 Φ=
TP ·TN − FP ·FN (TP + FN)·(TP + FP) ·(TN + FN) ·(TN + FP) (2)
In eq 2, TP, TN, FP and FN represent the number of true positive, true negative, false positive and false negative cases in the classification problem, respectively. More specifically, TP and TN are the number of biodegradable and nonbiodegradable molecules correctly identified by the model, whereas FP and FN are the number of nonbiodegradable and biodegradable molecules that were misclassified by the model, respectively. Because Φ is a correlation coefficient, values close to unity are desired. Figure 4 represents the correctness rate for the linear discriminant models found as a function of model size, both for
DF = −5.7619 + 5.2548·WA − 5.4646·ATS5v N = 104 Wilks′λ = 0.484 Φ = 0.791 p < 0.05
(3)
It is important to reflect on the prior probabilities considered for training the classifier. In our opinion, although the sizes of groups used are clearly unequal (conditioned by the data available), we see no reason to believe that a new fragrant molecule should, or should not, have a priori a higher probability of being biodegradable. Accordingly, in the models presented the prior probability considered for a new molecule to be biodegradable is 0.5. Assuming a different proportion in the population may certainly lead to different predictions. Similarly, a probability threshold for the classification of 0.5 has been used. These considerations also apply to the ANN presented below. The first point that stands out from the discriminant function selected, eq 3, is the quality of discrimination, because with only two variables it is able to predict reasonably well the biodegradability of a large group of compounds (see Table 2). Notice that a classification by hazard should not do better than 50% correctness rate. The two indices used in the model are the mean Wiener index, WA, and the Broto-Moreau autocorrelation index weighted by van der Waals volume, ATS5v. Supporting Information Table S3 includes biodegradation probabilities and values of the discriminant function computed for each compound.
Figure 4. Percent correct vs number of topological indices obtained in the LDA (blackheads, training set; whiteheads, test set).
molecules in the training set and in the external validation set. As could be expected, predictions are more accurate in the training set than in the external validation set, evidencing that training results generally are not indicative of the real behavior of the model with new cases. It may seem surprising, however, D
DOI: 10.1021/acssuschemeng.6b00717 ACS Sustainable Chem. Eng. XXXX, XXX, XXX−XXX
Research Article
ACS Sustainable Chemistry & Engineering Table 2. Classification Matrix Obtained with LDA and ANNa set
a
B
NB
TP
training avg. 5-CV validation total
66 13.4 19 85
38 7.6 7 45
63 12.6 16 79
training test validation total
54 14 17 85
24 12 9 45
52 14 16 82
FP
TN
LDA eq 3 7 31 1.6 6 2 5 9 36 ANN (MLP-14-12-2) 1 23 1 11 1 8 3 42
FN
sensitivity (%)
specificity (%)
3 0.8 3 6
95.5 94.0 84.2 92.9
81.6 78.9 71.4 80.0
2 0 1 3
96.3 100 94.1 96.5
95.8 91.7 88.9 93.3
Number of biodegradable (B), non-biodegradable (NB), true positive (TP), false positive (FP), true negative (TN) and false negative (FN) cases.
The model was subjected to both internal and external validation. As for internal validation, 5-fold cross-validation was applied to the training set (external validation molecules excluded), randomly choosing the validation group among the whole set of molecules. This led to an average correctness rate of 86.5% in the validation fold. Additionally, the model was externally validated with a set randomly chosen to represent a 20% of the total number of molecules (results included in Table S3). Results from this external validation are summarized in Table 2 (test set results), with an 80.8% correctness rate in this subset, which corresponds to the white point in Figure 4 for 2 topological indices. We would attribute the slightly better predictions in the 5-fold cross-validation to be due to some spillover of information from the variable selection step, which had included molecules later interrogated in the crossvalidation step. Therefore, on the basis of the procedure followed, we would consider external validation figures as more representative of real behavior with new molecules. Alternatively, one might have chosen to split the 5-fold for crossvalidation before the variable selection step, which would have led to figures closer to those of external validation, although the researcher would likely have to compare models containing different variables. Still, both internal and external validation are seen as complementary approaches25 and both results agree on the model proposed. Regarding the interpretation of the model, it is known that the Wiener index correlates very well with alkane boiling temperatures. This index does not take into account the nature of the constituent atoms but only topological distances. Moreover, in this case the index is averaged by size, so we can conclude that it evaluates molecular shape or, more precisely, topological molecular compactness. By looking at the values of the WA index for molecules in Figure 1 one can actually observe that the more compact the molecule, the lower the value of this index. Thus, small cyclic compounds (such as molecules No. 4, 28 or 82) show low WA values (2.7, 2.2 and 2.5, respectively). By contrast, big molecules (such as No. 114) or elongated ones (such as No. 67 and 95) show larger values of this index (4.6, 3.0 and 2.8, respectively). This observation holds for any molecule in Table S1. On the other hand, the ATS5v index assesses the presence of differing van der Waals atomic volumes at a topological distance of 5 units. Among the molecules studied, it is somewhat related to the presence of oxygen atoms in a certain amount, whose van der Waals volume is much greater than that of carbon. In general, though, it is observed that compounds with high WA values also exhibit high values of ATS5v and vice versa, thus existing a weak correlation between both indices. Moreover, all
the molecules studied show ratios WA/ATS5v greater than one (all biodegradable compounds in Figure 1 except No. 114 show WA/ATS5v ratios above 2), which suggests that molecular shape has a somewhat prevalent role over the nature and location of the oxygen atoms. Interestingly, the two indices selected, WA and ATS5v, contribute to the model with opposite signs and have similar coefficients (eq 3), which are also similar to the independent term. This implies that biodegradability depends on the difference (WA-ATS8v) rather than on the values of each index independently. It also enables us to simplify the model such that, as a rule of thumb, if the molecule’s WA index exceeds ATS5v by at least one unit, then it can be expected to be biodegradable. In fact, the four biodegradable compounds in Figure 1 have values of (WA-ATS8v) above 1.5, whereas the nonbiodegradable ones show values below that threshold. On the other hand, the descriptors selected point out that the shape of the molecule, the amount of oxygen, the type of functional group and its location in the molecule, are important with regards to biodegradability. From a biochemical point of view, groups such as hydroxyl or ester, contribute positively to biodegradability because they offer sites of enzymatic attack. How this is encoded in the topological descriptors intervening in the model, however, is a different story. We interpret that groups such as hydroxyl or ester contribute positively to the predicted biodegradability because they introduce a very limited branching in the molecule and thus tend to contribute to a higher topological compactness as measured by WA. By contrast, introduction of an ether groupwhich would not affect topological compactnessis predicted to have a generally negative impact on biodegradability. Likewise, degradation rates often decrease with increasing number of alkyl substituents negatively affecting topological compactness. Both descriptors are thus consistent with the fact that molecules with moderate hydrophilicity are more likely to be biodegradable, in general, than highly hydrophobic ones. This could be related to the underlying mechanisms of bacterial degradation, such as the ease of attack by bacterial enzymes and/or the molecule’s ability to dissolve in the bacterial cytosol after diffusion across lipophilic cellular membranes. Artificial Neural Network. A linear discriminant analysis can yield good results when boundaries between groups are close to linear in the descriptor space. However, if these boundaries have nonlinear dependencies, the model may be improved by considering interactions between variables. Examples of usual supervised classification tools which can handle nonlinear boundaries include quadratic discriminant analysis, k-nearest neighbors, random forests, support vector E
DOI: 10.1021/acssuschemeng.6b00717 ACS Sustainable Chem. Eng. XXXX, XXX, XXX−XXX
Research Article
ACS Sustainable Chemistry & Engineering machines, or artificial neural networks, among others.21 We have decided to apply an artificial neural network because of its high flexibility, which would be a good case to compare with our previous linear discriminant model. As stated before, STATISTICA allows comparing combinations of functions and variables given a network architecture, which in our case we defined with the 14 indices found in the variable selection step for the input layer, 12 units in a single hidden layer, and 2 output variables (MLP 14-12-2). The structure of this network is illustrated in Figure 3. Reasonable results were found employing a tangential activation function for the hidden layer and maintaining the identity function as the output layer activation function. A BFGS (BroydenFletcher-Goldfarb-Shanno) algorithm29 was used to train this network. The results from the ANN are compared in Table 2 with those from the LDA proposed before. Thanks to its flexibility, the ANN can fit the data tightly although it becomes less accurate when applied to new molecules in the test or external validation sets. Still, considering the potential uncertainty in the experimental data some overfitting might be anticipated for the ANN upon more extensive external validation, particularly in regions covered less densely by compounds in the training set. Details on the classification are included in Supporting Information Table S5. Comparison of Models. It has been shown that both models yield reasonable classification rates, with the more flexible artificial neural network showing greater accuracy at the expense of greater complexity. Not only the neural network is using 14 indices (as opposed to 2 in the linear discriminant model) but also the functional form of the neural network is far more complex than that of the linear discriminant model. Thus, one should judge whether the improvement in accuracy compensates the increase in complexity for their particular application requirements. Our feeling is, however, that the classifier does not benefit much from nonlinear boundaries in this problem, but, instead, that misclassifications arise mainly as a result of capturing a continuous variable (biodegradation percent) in a binary and conventional outcome (biodegradable/nonbiodegradable). This is illustrated in the activity distribution diagram30 in Figure 5, where it can be seen that
the DF bin considered. Similarly, the expectancy that a molecule is not biodegradable given its discriminant function value is estimated as E(NB) = i/(a + 1). In addition, the diagram also indicates the range of values of the discriminant function for which a prediction may be sensible, in this case predictions for compounds for which the DF evaluates outside the range [−6,8] should be distrusted. More stringent criteria on the applicability domain of QSPR methods can be found elsewhere.25 The modeling of biodegradation of molecules by other statistical learning methods has been addressed recently by Ceriani et al.31 They explored classification and regression trees (CART), k-nearest neighbors (kNN) and BIOWIN predictions based on group contribution methods. It is interesting their rejection of outlier data points prior to training, which is attributed to the uncertainty in the experimental results, as well as the use of several external validation groups. This allows them to identify some instability in the predictions with the more nonlinear kNN model. Compared to the present work, we have advocated for combining internal and external validation approaches to optimize the data available and at the same time we have explored more extreme learning methods, one being linear, the other possibly being highly nonlinear. Both works agree that the simpler methods may lead to more stable predictions, which could be promoted by the uncertainty of the experimental data in biodegradation tests, as mentioned above. Our models show somewhat better predictions (average correctness rates over 90% and 80% compared to 85% and 75% in the training and external validation sets, respectively), although this would be influenced by the different number and variety of chemical structures that have been covered. In the case of biodegradability it is unlikely that even higher correctness rates were more predictive, since that is limited by the uncertainty of the data on which the models are built. This point would be increasingly evidenced by using more comprehensive validation sets. Preceding results indicate that both the LDA and ANN models seem to identify better those compounds that are biodegradable than those that are not, i.e., they show high sensitivity. Sensitivity, or true positive rate, is defined as the percent of biodegradable molecules correctly classified by the model. Analogously, specificity, or true negative rate, is the percent of nonbiodegradable molecules correctly classified by the model: sensitivity =
TP TP + FN
(4)
specificity =
TN TN + FP
(5)
At this point, it is necessary to indicate that the probability threshold used when deciding whether a molecule belongs to the group of biodegradable or nonbiodegradable compounds is 0.5. Alternatively, one may use a different cutoff value, thus potentially increasing the model specificity at the expense of sensitivity, or vice versa. This information is captured in the socalled ROC (Receiver Operating Characteristic) curve of the model,32 as shown in Figure 6 for both the LDA and ANN models when applied to the training set. The y-axis represents the sensitivity of the model as the discrimination threshold is varied, which simultaneously affects the specificity of the model. For convenience, the x-axis represents 1-specificity, so that both magnitudes change in the same direction as the discrimination
Figure 5. Activity distribution diagram for biodegradable (filled bars) and nonbiodegradable (empty bars) fragrant compounds obtained using eq 3.
most misclassifications take place for values of the discriminant function near the zero threshold, that is, most misclassified biodegradable molecules evaluate to small negative DF values, whereas most misclassified nonbiodegradable compounds evaluate to small positive DF values. In this histogram, the expectancy of a molecule to be biodegradable given its DF value is estimated as E(B) = a/(i + 1), where a and i are the fractions of biodegradable and nonbiodegradable molecules in F
DOI: 10.1021/acssuschemeng.6b00717 ACS Sustainable Chem. Eng. XXXX, XXX, XXX−XXX
ACS Sustainable Chemistry & Engineering
■
Research Article
ASSOCIATED CONTENT
S Supporting Information *
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acssuschemeng.6b00717. List of molecules and experimental biodegradability classification (Table S1), topological indices used in this study (Table S2), classification results from the LDA on training and external validation sets (Table S3) and on cross-validation folds (Table S4), and from the ANN on training, test and validation sets (Table S5) (PDF).
■
Figure 6. ROC curve for the LDA (dashed line; AUC = 0.912) and ANN (solid line; AUC = 0.989) models.
AUTHOR INFORMATION
Corresponding Author
́ *R. Garcia-Domenech. E-mail:
[email protected]. Notes
The authors declare no competing financial interest.
■
threshold is varied. In accordance with eq 5, 1-specificity is also referred to as the false positive rate. In this context, the area under the ROC curve (AUC) is often regarded as an indicator of the performance of the classifier. A value of AUC = 1 would be obtained for a perfect classifier, whereas the diagonal line would represent a model with no classification power in predicting binary outcomes. These representations suggest again that both models are statistically significant and may have potential for fast and cheap preliminary assessment of biodegradability of new fragrance ingredients.
ACKNOWLEDGMENTS V. Blay thanks the support from the Valencian Ministry of Education. M. Gálvez-Llompart acknowledges the “Atracció de talents” Fellowship by the Universitat de València.
■
NOMENCLATURE
Topological Indices
ATS5v
Broto-Moreau autocorrelation of a topological structure of lag 5 weighted by atomic van der Waals volumes BEHe1 Highest eigenvalue n. One of Burden matrix weighted by atomic Sanderson electronegativities BEHp4 Highest eigenvalue n. Four of Burden matrix weighted by atomic polarizabilities BEHv6 Highest eigenvalue n. Six of Burden matrix weighted by atomic van der Waals volumes EEig08x eigenvalue n. Eight from edge adj. matrix weighted by edge degrees EEig15x eigenvalue n. Fifteen from edge adj. matrix weighted by edge degrees GATS8m Geary autocorrelation of lag 8 weighted by atomic masses JGI5 Mean topological charge index of order 5 JGI9 Mean topological charge index of order 9 nDB Number of double bonds PJI2 2D Petitjean shape index PW3 Path/walk 3 count index − Randić shape index RBF Rotatable bond fraction WA Mean Wiener index
■
CONCLUSIONS Legislation in the field of fragrances, although traditionally lenient to preserve trade secret, is starting to toughen when it comes to environmental and human health protection, especially upon adoption of REACH regulation in the European Union. In particular, the use of persistent, bioaccumulative or toxic compounds should be limited in favor of biodegradable compounds. Traditionally this requirement would require costly experiments given the thousands of molecules that are usable in fragrances, some in very small amounts. Alternatively, the use of low-cost methods based on mathematical chemistry can be the first step toward a safer and more sustainable perfumery. This work illustrates how the molecular structural characteristics relevant to a given property can possibly be identified by means of topological indices. With a reduced number of descriptors it has been possible to model with fair accuracy a complex property like the biodegradability of a varied set of compounds. The application of linear discriminant analysis has proved competitive with more sophisticated statistical techniques usable to this end. In particular, two topological indices were found especially relevant to the biodegradability of this class of molecules, which were related to topological compactness and hydrophilicity. Although both properties are interrelated, it was concluded that compact molecules with intermediate hydrophilicity are expected to be more biodegradable, in general, which could be related to the mechanisms of biodegradation. The methodology used in this work can also be adapted to other problems, potentially contributing to the optimization of resources in academia and R&D departments in industry, to the benefit of the environment.
■
REFERENCES
(1) Ellena, C. Perfume formulation: words and chats. Chem. Biodiversity 2008, 5, 1147−1153. (2) Ohloff, G.; Pickenhagen, W.; Kraft, P. Scent and Chemistry; Wiley, 2011. (3) Kraft, P.; Bajgrowicz, J. A.; Denis, C.; Frater, G. Odds and Trends: Recent Developments in the Chemistry of Odorants. Angew. Chem., Int. Ed. 2000, 39, 2980−3010. (4) Flavours and Fragrances: Chemistry, Bioprocessing and Sustainability; Berger, R. G., Ed.; Springer, 2007. (5) The Chemistry of Fragrances: From Perfumer to Consumer, 2nd ed.; Pybus, D., Sell, C., Eds.; RSC Publishing, 2006. (6) European Union. Directive 2003/15/EC (7th amendment to the Cosmetics Directive). G
DOI: 10.1021/acssuschemeng.6b00717 ACS Sustainable Chem. Eng. XXXX, XXX, XXX−XXX
Research Article
ACS Sustainable Chemistry & Engineering (7) RPS BKH Consulting Engineers, 2002. (8) Jasinski, P.; Welsh, B.; Galvez, J.; Land, D.; Zwolak, P.; Ghandi, L.; Terai, K.; Dudek, A. Z. A novel quinoline, MT477: suppresses cell signaling through Ras molecular pathway, inhibits PKC activity, and demonstrates in vivo anti-tumor activity against human carcinoma cell lines. Invest. New Drugs 2008, 26, 223−232. (9) Garcia-Domenech, R.; Galvez, J.; de Julian-Ortiz, J. V.; Pogliani, L. Some new trends in chemical graph theory. Chem. Rev. 2008, 108, 1127−1169. (10) Garcia-Domenech, R.; de Julian-Ortiz, J. V.; Duart, M. J.; GarciaTorrecillas, J. M.; Anton-Fos, G. M.; Rios-Santamarina, I.; De Gregorio-Alapont, C.; Galvez, J. Search of a topological pattern to evaluate toxicity of heterogeneous compounds. SAR QSAR Environ. Res. 2001, 12, 237−254. (11) García-Domenech, R.; Domingo-Puig, C.; Esteve-Martínez, M. A.; Scmitt, J.; Vera-Martínez, J.; Chindemi, A. L.; Gálvez, J. Aplicación de la Topologiá Molecular en la búsqueda de nuevos agentes activos frente a Leishmania. An. R. Acad. Nac. Farm. 2008, 74, 345−367. (12) Gálvez, J.; Gálvez-Llompart, M.; García-Domenech, R. Application of molecular topology for the prediction of the reaction times and yields under solvent-free conditions. Green Chem. 2010, 12, 1056−1061. (13) Gálvez-Llompart, M.; Gálvez, J.; García-Domenech, R.; Kier, L. B. Modeling Drug-Induced Anorexia by Molecular Topology. J. Chem. Inf. Model. 2012, 52, 1337−1344. (14) García-Domenech, R.; Aguilera, J.; El Moncef, A.; Pocovi, S.; Gálvez, J. Application of molecular topology to the prediction of mosquito repellents of a group of terpenoid compounds. Mol. Diversity 2010, 14, 321−329. (15) Gálvez, J.; Gálvez-Llompart, M.; García-Domenech, R. Introduction to Molecular Topology: Basic Concepts and Application to Drug Design. Curr. Comput.-Aided Drug Des. 2012, 8, 196−223. (16) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; RSP-Wiley, 1986. (17) Boethling, R. Comparison of ready biodegradation estimation methods for fragrance materials. Sci. Total Environ. 2014, 497−498, 60−67. (18) OECD. ENV/JM/TG(2005)5/REV1 (Guidelines for testing of chemicals, section 3, Part 1). (19) Wammer, K. H.; Peters, C. A. Polycyclic Aromatic Hydrocarbon Biodegradation Rates: A Structure-Based Study. Environ. Sci. Technol. 2005, 39, 2571−2578. (20) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 1988, 28, 31−36. (21) Todeschini, R.; Consonni, V. DRAGON, version 5.4; TALETE, 2006. (22) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second ed.; Springer, 2009. (23) StatSoft Inc. STATISTICA, version 9.0; StatSoft Inc., 2009. (24) Everitt, B. S.; Dunn, G. Applied Multivariate Data Analysis, 2nd ed.; Wiley, 2001. (25) Gramatica, P. Principles of QSAR models validation: internal and external. QSAR Comb. Sci. 2007, 26, 694−701. (26) Todeschini, R.; Consonni, V.; Mannhold, R.; Kubinyi, H.; Folkers, G. Molecular Descriptors for Chemoinformatics, Vol. 41 (2 Vol. Set); Wiley, 2009. (27) Milano Chemometrics & QSAR Research Group. Molecular Descriptors: the free online resource, http://www.moleculardescriptors. eu/softwares/softwares.htm. (28) Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, Protein Struct. 1975, 405, 442−451. (29) Bishop, C. M. Neural Networks for Pattern Recognition; Oxford University Press, 1995. (30) Roy, K.; Kar, S.; Das, R. N. Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment; Academic Press, 2015.
(31) Ceriani, L.; Papa, E.; Kovarich, S.; Boethling, R.; Gramatica, P. Modeling Ready Biodegradability of Fragrance Materials. Environ. Toxicol. Chem. 2015, 34, 1224−1231. (32) Fawcett, T. An introduction to ROC analysis. Pattern Recogn. Lett. 2006, 27, 861−874.
H
DOI: 10.1021/acssuschemeng.6b00717 ACS Sustainable Chem. Eng. XXXX, XXX, XXX−XXX