
Computational Chemistry

Multi-Objective Genetic Algorithm (MOGA) as a feature selecting strategy in ionic liquids' Quantitative Toxicity-Toxicity Relationship models' development Maciej Barycki, Anita Sosnowska, Karolina Jagiello, and Tomasz Puzyn J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00378 • Publication Date (Web): 03 Dec 2018 Downloaded from http://pubs.acs.org on December 4, 2018


Journal of Chemical Information and Modeling

Multi-Objective Genetic Algorithm (MOGA) as a feature selecting strategy in ionic liquids’ Quantitative Toxicity-Toxicity Relationship models’ development

Maciej Barycki, Anita Sosnowska, Karolina Jagiello, Tomasz Puzyn*

University of Gdansk, Faculty of Chemistry, Department of Environmental Chemistry and Radiochemistry, Laboratory of Environmental Chemometrics, ul. Wita Stwosza 63, 80-308, Gdansk, Poland


ABSTRACT

Quantitative Toxicity-Toxicity Relationship (QTTR) models have great potential for increasing the value of toxicological tests conducted on simple organisms. These models predict the toxicological effect of a chemical from its known effect in a different toxicity test, even one performed against a different organism. This opens the possibility of predicting the toxicity of chemicals against higher organisms from results obtained for lower ones. However, the development of such models is often restricted by the low availability of data. We present a case study of developing a QTTR model for ionic liquids in different toxicological tests against the same species when the experimental data are insufficient (an additional confirmation for a different species is provided in the supporting information). In the presented case, we use a series of Quantitative Structure-Activity Relationship (QSAR) models developed to deliver the data concerning the toxicity of ionic liquids against human HeLa and MCF-7 cancer cell lines. We use these data to develop a QTTR model with R² as high as 0.8. The benefit of applying the Multi-Objective Genetic Algorithm (MOGA, a genetic algorithm that selects the best set of explanatory features for several dependent variables at the same time) as a QSAR feature selection strategy is presented and discussed.


INTRODUCTION

According to the REACH Regulation currently applicable in Europe, a risk assessment, including a toxicity assessment, is required for every chemical substance produced in or imported to Europe in quantities greater than one ton per year.1 The registration process is very expensive because, in addition to demanding administrative work, it requires several experiments to be performed. Moreover, the extensive use of laboratory animals is morally questionable.2 Therefore, alternative (less expensive and more ethical) ways of providing information about the potential hazard are now being promoted.1-4 Among these alternatives are computational methods such as predictive Quantitative Structure-Activity Relationship models (QSARs),5-10 molecular docking,11 knowledge-based SAR (structure-activity relationship) systems,12-13 read-across14-16 and physiology-based pharmacokinetic (PBPK) models.17-18 These techniques can support toxicity testing at multiple levels.19-20

Finding the relationships between toxicity and structural features is one of the most basic processes.7-10, 21 Computational chemistry is useful here in two ways. On the one hand, it can be used to translate the chemical structure into multiple indices, called molecular descriptors, that carry specific information about molecular structure and/or properties. On the other hand, many mathematical algorithms used in chemoinformatics (various stepwise procedures, swarm optimization algorithms, LASSO, etc.)22-24 allow searching through large sets of features (in this case molecular descriptors) to find the combinations of features responsible for a particular activity, e.g., the toxicity, of the studied chemicals. Genetic algorithms (GA) are among the most popular techniques used for this purpose and will be discussed in more detail in this work. This step is mostly qualitative and identifies the structurally related chemical properties that will most likely induce a toxicological response. As a consequence, mathematical models are determined. Such models, called Quantitative Structure-Activity Relationship (QSAR) models, allow the toxicity of chemicals to be predicted based on their structure.

Recently, a novel approach was proposed for predicting the hazard of chemicals. Quantitative Toxicity-Toxicity Relationship (QTTR) models20,25 and interspecies Quantitative Structure-Toxicity Relationship (i-QSTR) models26 are similar to QSAR models in the general idea but use the toxicity data measured or predicted against one organism to determine the chemical's toxicity against another organism. The main advantage of this approach is that it allows the toxicity of chemicals against species of higher levels to be inferred from toxicity tests performed on organisms of lower levels. The limitation is that data for both toxicity tests must be available for at least several compounds. All the described applications of computational chemistry are closely related and can be used to deliver new, valuable information about the toxicity of chemicals with a reduced number of experimental tests.

In our work, we present a methodology that can be used to develop QTTR models when the experimental toxicity data are available for two nonoverlapping sets of chemical compounds that belong to the same chemical domain. In such a case, the missing data for QTTR modeling can be predicted using classical QSAR models. We argue, however, that the quality of these predictions heavily depends on the approach used for feature selection. In this work, we present the consequences of using two approaches for feature selection: the classical GA and its modified version, the Multi-Objective Genetic Algorithm (MOGA), which allows the set of features for models of multiple responses to be selected simultaneously.
As a case study we use data available in the literature for two sets of ionic liquids, for which two series of experiments were performed, measuring their toxicity against HeLa human cancer cells27 and MCF-7 human cancer cells.28 In addition to the study based on toxicological information for the same species, we demonstrate the method's applicability to cross-species toxicology testing (see the supporting material).

METHODS

Data origin

We used two data sets for QSAR model development. Each set was obtained from a single scientific publication, one describing the toxicity of 40 ionic liquids (ILs) against HeLa cells (set A),27 the other describing the toxicity of 26 ILs against MCF-7 cells (set B).28 Only two ionic liquids are common to both data series. Experimental data were obtained with the use of the MTT assay for a 48 h incubation time at 37 °C. We used the concentration of IL that reduces cell viability by 50% as the modeled variable (endpoint). Data for both sets are provided in the supplementary materials.

Computational steps

The chemical structures of the ionic liquids were built as molecular models using ChemSketch software29 (cations and anions separately). Energy minimization was performed to ensure the repeatability of the descriptor calculation step; it was carried out with MOPAC software30 using the PM7 method31 for each ion separately. Optimized structures were used to calculate molecular descriptors with DRAGON 7 software.32 We obtained 2891 descriptors for each cation and 1085 for each anion. Subsequently, both data subsets (A and B) were divided into training and test sets using the '2:1' algorithm: the list of compounds is ordered by descending value of the modeled property, and every third compound is assigned to the test set. For the purpose of linearization, we transformed each modeled property (toxicity against HeLa and MCF-7) logarithmically (log10).
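The '2:1' splitting rule and the logarithmic transformation can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `split_two_to_one`, the compound labels and the EC50 values are hypothetical, and we assume that counting starts at the top of the sorted list, so that the 3rd, 6th, 9th, ... compounds form the test set (the text does not specify the starting offset).

```python
import math


def split_two_to_one(compounds, values):
    """Return (training, test) lists of compound labels.

    Compounds are ordered by descending endpoint value and every third
    one is assigned to the test set.
    """
    order = sorted(range(len(compounds)), key=lambda i: values[i], reverse=True)
    training, test = [], []
    for position, idx in enumerate(order, start=1):
        # Assumption: positions 3, 6, 9, ... of the sorted list go to the test set.
        if position % 3 == 0:
            test.append(compounds[idx])
        else:
            training.append(compounds[idx])
    return training, test


# Hypothetical labels and EC50 values, used for illustration only.
names = ["IL-a", "IL-b", "IL-c", "IL-d", "IL-e", "IL-f"]
ec50 = [1000.0, 10.0, 100.0, 1.0, 10000.0, 0.1]

# Logarithmic (log10) transformation of the modeled property.
log_ec50 = [math.log10(v) for v in ec50]

training, test = split_two_to_one(names, log_ec50)
```

Because the list is sorted before splitting, the test compounds are spread over the whole response range rather than clustered at one end.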


Genetic algorithms

The selection of the descriptors was performed using two types of genetic algorithms: (i) the Multi-Objective Genetic Algorithm (MOGA) and (ii) the classical genetic algorithm (GA). The classical genetic algorithm is based on the concept in which a small set of features (small compared to the total number of features available) is evaluated by the quality of the model it yields. In each iteration of the algorithm, a population of such small sets is tested, and model quality is quantified by a user-chosen measure called the fitting function. Sets that yielded the best models in one iteration are mixed with each other (random features are exchanged between sets, similar to the process of gene exchange, hence the name of the algorithm) and tested in the next iteration. This leads to a (locally) best set of features that can be used to create a high-quality model.

The Multi-Objective Genetic Algorithm (MOGA) is a modification of the classical method and is based on the same idea. However, in each iteration of the MOGA, sets of features are tested against multiple modeled responses.33-34 In this way, the MOGA makes it possible to develop several high-quality models using the same set of features. The MOGA has already been proven useful in the feature selection process.35-36 For this study, we used the classical GA implemented in QSARINS software.37 The MOGA algorithm was written in MATLAB. A simplified algorithm of the method can be expressed as:

##################################################################################
# Set four parameters:
X = number of descriptors in a single set (chromosome)
P = population size
I = number of iterations
M = % of mutations
descriptors – repository of descriptors
endpoints – vectors of modeled variables

# Algorithm:
define fitting_function    # in our case - Q2Loo
generate randomly: P of descriptors_set(X)    # each descriptors_set (chromosome) contains X descriptors
for each I:
    using both endpoints:
        for each descriptors_set(X):
            assess the value of fitting_function
    # at this point each descriptors_set (chromosome) has two values of fitting function – one for each endpoint
    choose smaller value of fitting_function for each descriptors_set(X)
    sort descriptors_set(X) by descending value of fitting_function
    substitute Z descriptors_set(X) with lowest fitting function values with random descriptors
    mix descriptors_set(X) with each other randomly
##################################################################################
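The simplified algorithm can also be sketched in Python. This is not the MATLAB implementation used in the study but a minimal illustration under assumptions: the function names (`q2_loo`, `mix`, `moga`), the default parameter values, the synthetic data, the reading of Z as the mutation rate times the population size, and the exchange of a single descriptor between randomly paired sets are all illustrative choices. Q²LOO is computed in closed form from the hat matrix of an ordinary least-squares model.

```python
import numpy as np


def q2_loo(X, y):
    """Leave-one-out Q2 of an ordinary least-squares model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                      # hat matrix
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))   # closed-form LOO residuals
    return 1.0 - np.sum(loo_resid**2) / np.sum((y - y.mean())**2)


def mix(population, rng):
    """Exchange one randomly chosen descriptor between random pairs of sets."""
    order = rng.permutation(len(population))
    for i, j in zip(order[0::2], order[1::2]):
        a, b = population[i].copy(), population[j].copy()
        pa, pb = rng.integers(len(a)), rng.integers(len(b))
        # Skip exchanges that would duplicate a descriptor inside a set.
        if b[pb] not in a and a[pa] not in b:
            a[pa], b[pb] = b[pb], a[pa]
        population[i], population[j] = a, b


def moga(desc_a, y_a, desc_b, y_b, n_desc=3, pop_size=40, n_iter=25,
         mutation=0.2, seed=0):
    """Select one descriptor set that models both endpoints reasonably well."""
    rng = np.random.default_rng(seed)
    n_total = desc_a.shape[1]

    def random_set():
        return rng.choice(n_total, size=n_desc, replace=False)

    def fitness(cols):
        # Multi-objective step: keep the smaller of the two Q2LOO values.
        return min(q2_loo(desc_a[:, cols], y_a), q2_loo(desc_b[:, cols], y_b))

    population = [random_set() for _ in range(pop_size)]
    n_replace = int(round(mutation * pop_size))    # assumed meaning of Z
    best_cols, best_fit = None, -np.inf
    for _ in range(n_iter):
        scored = sorted(((fitness(c), c) for c in population),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_fit:                # remember the best set seen
            best_fit, best_cols = scored[0][0], np.sort(scored[0][1])
        survivors = [c for _, c in scored[: pop_size - n_replace]]
        population = survivors + [random_set() for _ in range(n_replace)]
        mix(population, rng)
    return best_cols, best_fit


# Synthetic demonstration data (not data from the study).
rng = np.random.default_rng(1)
desc_a, desc_b = rng.normal(size=(30, 8)), rng.normal(size=(20, 8))
y_a = desc_a[:, 0] + 2.0 * desc_a[:, 1] + 0.01 * rng.normal(size=30)
y_b = 3.0 * desc_b[:, 0] - desc_b[:, 1] + 0.01 * rng.normal(size=20)
best_cols, best_fit = moga(desc_a, y_a, desc_b, y_b, n_desc=2,
                           pop_size=20, n_iter=10)
```

Taking the minimum of the two Q²LOO values means that a descriptor set is ranked by its weaker model, so a set scores highly only if it performs well for both endpoints. Replacing `fitness` with the Q²LOO of a single endpoint turns the sketch into the classical single-objective GA.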

The parameters used in the GA setup were as follows: iterations = 500; population size = 1000; mutation rate = 20%. The same parameters were used in the MOGA setup: iterations = 500; population size = 1000; mutation rate = 20%. The leave-one-out cross-validated coefficient (Q²LOO) was used as the fitting function in both cases.

Model development technique

Each of the QSAR models in this work was developed using the Multiple Linear Regression (MLR) technique. The models were developed in QSARINS software.37 Internal and external validation steps were performed to verify the credibility of the developed models. All measures recommended for evaluating the goodness-of-fit, stability, robustness and predictivity of each developed model were calculated.38-40 According to these recommendations, the validation metrics R², Q²CV, Q²EXT, and CCC should be high and close to 1, whereas the error values RMSEC, RMSECV, RMSEEXT and MAE should be low and close to 0. Because the error values depend on the response range and it is difficult to provide a well-defined threshold for them, the RMSE values are evaluated arbitrarily in comparison to the training set range. Additionally, to evaluate the quality of the predictions based on the mean absolute error (MAE), the "Xternal Validation Plus" tool was applied.41 The analysis was performed on the test set. To avoid the influence of the highest prediction errors on the quality assessment for the entire set, the data points with the highest 5% of residuals were omitted. The prediction is classified as 'bad' when the MAE is more than 15% of the training set range or when the MAE + 3 standard deviations is more than 25% of the training set range.41 The external validation was performed using ILs not used for the calibration of the model (training set). Additionally, because predictions of an endpoint for sets A and B were made without experimental response data, we applied the "Prediction Reliability Indicator" tool to indicate/categorize the quality of the predictions.42

RESULTS AND DISCUSSION

In this study, we compare two approaches for developing the QTTR model based on QSAR predictions (Figure 1). In approach 1 (the MOGA approach), the missing data describing the toxicity of the two IL subsets against different cell lines were generated using two QSAR models. The descriptors were selected using the MOGA; therefore, both QSAR models were developed based on the same set of structurally related chemical features. This approach follows the assumption that the studied chemicals induce toxicity against cells of the same species (in this case: human) by a similar mechanism, which justifies the application of the MOGA. We developed two QSAR models (1 and 2) and then used each model to predict the missing data for the other subset (subsets A and B – Figure 1). Finally, we developed the QTTR model based on the enlarged data sets. In the second approach (classical GA), we repeated the same procedure but used the conventional GA. We did this to determine how the use of another algorithm affects the quality of the QSAR models (3 and 4) and the QTTR model. To make the comparison more reliable, we used the same GA parameters in both cases. For this study we used 64 measured values of ionic liquid toxicity against two cell lines (human cervical cancer cells – HeLa and human breast cancer cells – MCF-7). The availability of the data used here is presented in Figure 2.
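The data flow shared by both approaches (two QSAR models built on separate subsets, predictions for all compounds, then a QTTR regression) can be sketched in Python as follows. The sketch rests on assumptions: the helper names `fit_mlr` and `predict_mlr` are hypothetical, the descriptor matrices and responses are random synthetic numbers rather than data from the study, and the two ILs common to both subsets are ignored, so the synthetic set has 66 rows instead of 64.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares MLR coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Predictions of an MLR model returned by fit_mlr."""
    return coef[0] + X @ coef[1:]


# Synthetic stand-ins: three shared descriptors for subsets A (40 ILs)
# and B (26 ILs); each subset has measurements for one cell line only.
rng = np.random.default_rng(0)
desc_a = rng.normal(size=(40, 3))
desc_b = rng.normal(size=(26, 3))
tox_hela_a = desc_a @ np.array([1.0, 0.5, -0.3]) + 0.1 * rng.normal(size=40)
tox_mcf7_b = desc_b @ np.array([2.0, 1.1, -0.5]) + 0.1 * rng.normal(size=26)

# Step 1: one QSAR model per endpoint, both using the same descriptors.
qsar_hela = fit_mlr(desc_a, tox_hela_a)
qsar_mcf7 = fit_mlr(desc_b, tox_mcf7_b)

# Step 2: predict both endpoints for every compound.
all_desc = np.vstack([desc_a, desc_b])
hela_all = predict_mlr(qsar_hela, all_desc)
mcf7_all = predict_mlr(qsar_mcf7, all_desc)

# Step 3: QTTR model relating one predicted endpoint to the other.
qttr = fit_mlr(mcf7_all.reshape(-1, 1), hela_all)
r2 = np.corrcoef(mcf7_all, hela_all)[0, 1] ** 2
```

In approach 1 the three descriptor columns would come from the MOGA, whereas in approach 2 each QSAR model would use its own GA-selected descriptors, so `desc_a` and `desc_b` would no longer share columns.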


Figure 1. Scheme of two analyzed approaches

Figure 2. Simplified scheme presenting the considered scientific problem.

Approach 1 – QSAR models built with the MOGA approach

Assuming that the same structural features are responsible for the toxicity of the chemicals against cells derived from the same organism (i.e., the same mechanism of toxicity occurs), we used the MOGA for the feature selection step. This way, structural features that provide reasonable QSAR models in both cases (toxicity of ILs against HeLa or MCF-7 cells) were selected simultaneously. Applying the MOGA also allowed for selecting features responsible for the toxicological effect of ILs that are more global and less dependent on the particular set of ILs. Three molecular descriptors selected with the MOGA were used to develop two separate QSAR models (1 and 2), which were applied to predict the two toxicological endpoints for the entire set of 64 ILs. In the next step, we developed the QTTR model for the obtained data.

Two QSAR models were developed by applying the MOGA approach: QSAR 1, a model for predicting the toxicity of ionic liquids against HeLa cells, and QSAR 2, a model for predicting the toxicity of ILs against MCF-7 cells. They are presented in Table 1 and Table S3 in Supplementary Information 1. Almost all validation parameters of the models (please refer to Table 2) are within the recommended limits; therefore, the models can be considered well fitted, robust and characterized by accurate predictive abilities. Similar conclusions can be drawn from the scatter plots of observed versus predicted toxicity values, because all points are distributed near the dotted line that indicates the exact match of x and y values (Figures 3 A and D). However, two parameters calculated for the QSAR 2 model, (r² – r'0²)/r² and k', do not have the recommended values. This result suggests that most predicted values are in fact lower than the observed values for the test set (this is apparent in Figure 4 – A, as most of the orange dots are under the 1:1 line). This calls for caution, but for such a small test set the situation is acceptable (such a distribution of errors may simply occur by chance for smaller sets of data). The response range is crucial for determining the limits of allowable errors. There are no strong recommendations regarding the threshold of the RMSE value because it depends on the size and composition of the analyzed data.
The RMSEs for the QSAR 1 and QSAR 2 models are low (Table 2) compared to the range of response values of the training set, which is equal to 2.629 and 3.734 for QSAR 1 and QSAR 2, respectively. Additionally, the "Xternal Validation Plus" tool was applied to estimate the quality of the predictions based on the MAE with the highest 5% of the errors omitted.41 The predictions were found to be of 'good quality' for model 1, which means that the prediction errors meet the criteria MAE ≤ 0.1 × training set range and MAE + 3 standard deviations ≤ 0.2 × training set range. For model 2 this approach cannot be applied due to the size of the test set (the criteria proposed by Roy et al. can be applied for estimating the quality of the test set prediction only if the number of data points is at least 10).

Additionally, objects from the training and test sets are within both models' applicability domain (AD), a theoretical space defined over a set of objects to which a model can be reliably applied (Figures 3 B and 3 E). The right panel in Figure 3 (Figure 3 C – Insubria plot) presents the relation of the ILs from the opposite subset (subset B) to the AD of the model. Every ionic liquid from the opposite subset is within the AD of the developed model; therefore, predictions obtained by applying this model to subset B can be treated as reliable. The AD of the QSAR 2 model is much narrower than that of the QSAR 1 model. Thus, the ionic liquids that belong to subset A are more structurally diverse. In fact, their structural variability covers the entire variability of subset B. This is why subset B is entirely within the AD of the QSAR 1 model, whereas subset A is not entirely within the AD of the QSAR 2 model. This result is apparent in Figure 3 F. The reliability of the QSAR 2 model predictions for subset A is still considerably high. A total of 32 ionic liquids are in the region with a leverage value lower than the critical value. This means that they are structurally similar to the compounds used for model development. The fact that they are above or below the dotted lines representing the minimal and maximal predictions of the model indicates that the predicted values come from the model's extrapolation.
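The MAE-based criteria quoted in this subsection and in the Methods can be written down as a small Python sketch. It is an illustration under assumptions rather than the "Xternal Validation Plus" tool itself: the function name `mae_based_quality` is hypothetical, the number of omitted points is taken as 5% of the test set rounded down, the standard deviation is computed with `ddof=1`, and everything that is neither 'good' nor 'bad' is labeled 'moderate'.

```python
import numpy as np


def mae_based_quality(observed, predicted, training_range, trim=0.05):
    """Classify test-set predictions using the MAE-based criteria."""
    abs_err = np.sort(np.abs(np.asarray(observed) - np.asarray(predicted)))
    if abs_err.size < 10:
        # The criteria require at least 10 test-set data points.
        return "not applicable"
    # Omit the highest 5% of absolute errors (assumed rounding: floor).
    n_keep = abs_err.size - int(trim * abs_err.size)
    kept = abs_err[:n_keep]
    mae = kept.mean()
    sd = kept.std(ddof=1)  # assumption: sample standard deviation
    if mae <= 0.10 * training_range and mae + 3 * sd <= 0.20 * training_range:
        return "good"
    if mae > 0.15 * training_range or mae + 3 * sd > 0.25 * training_range:
        return "bad"
    return "moderate"
```

For example, a test set of nine compounds returns "not applicable", which corresponds to the situation described for model 2.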
This is worth comparing with the situation in the previous case (QSAR 1). One can observe that subset B covers a wider range of logEC50 values against MCF-7 cells (3.734) than subset A against HeLa cells (2.629). Nonetheless, the predicted toxicity of the ILs from subset B for HeLa cells stayed within the borders marked by predictions for the ILs from subset A (see Figure 3 – C). This result indicates that the HeLa cells may be more resistant to the toxic properties of the ILs. Following this reasoning, knowing that the toxicity of the ILs from subset A was of a wider range against HeLa cells, a similar situation would be expected here – the ILs from subset A should have a wider range of toxicity. This is apparent in Figure 4 – C. The ILs from subset A have a range of toxicity that is wider than that of the ILs from subset B. Although the predictions come from model extrapolation, they meet the theoretical expectations, which supports their reliability. There are also 5 ionic liquids that exceed the critical leverage value but remain within the minimum – maximum range of predictions. These are also relatively reliable predictions that in fact support the model's accuracy for a wide range of compounds. Only the three ILs that are both below the minimal predicted toxicity and above the critical leverage value are truly unreliable and should be treated with great caution.

Because the developed models QSAR 1 and QSAR 2 were applied to a true external set of data (QSAR 1 to set B and QSAR 2 to set A), we have indicated the quality of these predictions based on three criteria: i) the mean absolute error of leave-one-out predictions for the 10 closest training compounds for each query IL;43 ii) the applicability domain in terms of similarity based on the standardization approach;44 iii) the proximity of the predicted value of the query compound to the mean training response.45 For this purpose, we have used the Prediction Reliability Indicator recently developed by Roy et al.42 The estimated prediction quality (Table S2) shows that almost all external predictions can be considered reliable (only in one case is the prediction moderate instead of good).

Table 1. QSAR model equations derived from the MOGA approach. Values in brackets represent the 95% confidence intervals of the coefficients.

QSAR 1: log(EC50) = –2.587(±0.765) + 10.08(±2.139) VE2_B(i)_C + 0.184(±0.145) GGI1_C – 0.128(±0.061) SP11_A
QSAR 2: log(IC50) = –7.740(±1.897) + 21.953(±7.760) VE2_B(i)_C + 1.112(±0.594) GGI1_C – 0.174(±0.147) SP11_A

VE2_B(i)_C – average coefficient of the last eigenvector (absolute values) from the Burden matrix weighted by ionization potential; descriptor of cation
GGI1_C – topological charge index of order 1; descriptor of cation
SP11_A – Randic molecular shape profile no. 11; descriptor of anion
For more details on the descriptors, please refer to Todeschini et al.46
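As a worked illustration, the two equations of Table 1 can be evaluated for a single ionic liquid with a few lines of Python. The coefficients are the point estimates from Table 1; the dictionary keys, the function name `predict_log_toxicity` and the descriptor values in the example are hypothetical and do not describe any compound from the study.

```python
# Point estimates from Table 1: intercept, VE2_B(i)_C, GGI1_C, SP11_A.
QSAR_MOGA = {
    "QSAR1_logEC50_HeLa": (-2.587, 10.08, 0.184, -0.128),
    "QSAR2_logIC50_MCF7": (-7.740, 21.953, 1.112, -0.174),
}


def predict_log_toxicity(model, ve2_b_i_c, ggi1_c, sp11_a):
    """Evaluate one of the MOGA-based MLR equations for one ionic liquid."""
    b0, b1, b2, b3 = QSAR_MOGA[model]
    return b0 + b1 * ve2_b_i_c + b2 * ggi1_c + b3 * sp11_a


# Hypothetical descriptor values (two cation descriptors, one anion descriptor).
example = {"ve2_b_i_c": 0.30, "ggi1_c": 1.5, "sp11_a": 2.0}
log_ec50_hela = predict_log_toxicity("QSAR1_logEC50_HeLa", **example)
log_ic50_mcf7 = predict_log_toxicity("QSAR2_logIC50_MCF7", **example)
```

Both models take the same three descriptor values as input, which is the practical consequence of selecting the features with the MOGA.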

Table 2. Parameters for internal and external validation of QSAR 1 (toxicity of the ILs against HeLa cells) and QSAR 2 (toxicity of the ILs against MCF-7 cells) obtained using the MOGA approach.

Measure37-40,47     | Calibration             | Cross-validation (LOO)  | External validation     | Criteria38-41
                    | QSAR1      | QSAR2      | QSAR1      | QSAR2      | QSAR1      | QSAR2      |
F                   | 40.7       | 24.9       | ---        | ---        | ---        | ---        | dependent(a)
R²                  | 0.842      | 0.842      | ---        | ---        | ---        | ---        | > 0.7
Q²                  | ---        | ---        | 0.775      | 0.772      | 0.911(b)   | 0.811(b)   | > 0.6
CCC                 | 0.914      | 0.915      | 0.882      | 0.877      | 0.943      | 0.869      | > 0.85
R²SCR               | ---        | ---        | ---        | ---        | ---        | ---        | dependent(a)
RMSE                | 0.261      | 0.464      | 0.311      | 0.558      | 0.196      | 0.508      | acceptable(c)
MAE                 | 0.219      | 0.356      | 0.261      | 0.437      | 0.161      | 0.459      | good(d)
average rm²         | ---        | ---        | ---        | ---        | 0.746      | 0.596      | > 0.5
Δrm²                | ---        | ---        | ---        | ---        | 0.052      | 0.080      | < 0.2
(r² – r0²)/r²       | ---        | ---        | ---        | ---        | 0.078      | 0.077      | < 0.1
(r² – r'0²)/r²      | ---        | ---        | ---        | ---        | 0.051      | 0.141      | < 0.1
k                   | ---        | ---        | ---        | ---        | 1.050      | 0.986      | 0.85 ≤ k ≤ 1.15
k'                  | ---        | ---        | ---        | ---        | 0.883      | 0.801      | 0.85 ≤ k' ≤ 1.15
|r0² – r'0²|        | ---        | ---        | ---        | ---        | 0.023      |            | < 0.3

[---] – measure does not refer to the procedure; a – the F criterion depends on the number of compounds in the set and should be higher than the tabulated value of F at the chosen significance level; R²SCR depends on the value of R² of calibration and should be significantly lower than R²; b – the value of the Q² parameter in the external validation procedure refers to Q²F3; c – RMSE should be as low as possible and is assessed against the range of response values of the training set, equal to 2.639 for QSAR 1 and 3.734 for QSAR 2; d – MAE, if possible (enough data points are available), is assessed with the MAE-based metrics recommended by Roy et al.41
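Some of the measures listed in Table 2 can be computed with the commonly used definitions sketched below in Python. The function names are hypothetical, and the formulas (Lin's concordance correlation coefficient for CCC and the Q²F3 form of the external Q²) are the standard literature definitions; the text does not confirm that QSARINS evaluates them in exactly this way.

```python
import numpy as np


def rmse(obs, pred):
    """Root mean square error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))


def mae(obs, pred):
    """Mean absolute error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(obs - pred)))


def ccc(obs, pred):
    """Concordance correlation coefficient (Lin)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mo, mp = obs.mean(), pred.mean()
    num = 2.0 * np.sum((obs - mo) * (pred - mp))
    den = (np.sum((obs - mo) ** 2) + np.sum((pred - mp) ** 2)
           + len(obs) * (mo - mp) ** 2)
    return float(num / den)


def q2_f3(obs_ext, pred_ext, obs_train):
    """External Q2 (F3 form): test-set error scaled by training-set variance."""
    obs_ext, pred_ext = np.asarray(obs_ext, float), np.asarray(pred_ext, float)
    obs_train = np.asarray(obs_train, float)
    press = np.sum((obs_ext - pred_ext) ** 2) / len(obs_ext)
    tss = np.sum((obs_train - obs_train.mean()) ** 2) / len(obs_train)
    return float(1.0 - press / tss)
```

Unlike the squared correlation coefficient, CCC is reduced by a systematic shift between observed and predicted values, which is why it is reported alongside the slopes k and k'.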


Figure 3. QSAR models developed using the MOGA approach: left panel – scatter plot of observed versus predicted values of EC50 (A) and IC50 (D); middle panel – Williams plot for the toxicity of the ILs against HeLa cells using the QSAR 1 model (B) and for the toxicity of the ILs against MCF-7 cells using the QSAR 2 model (E), blue points represent the compounds in the training set, orange – compounds in the validation set; right panel Insubria plot, presenting the compounds of the opposite subset in the space of the model’s AD (C) and (F).
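The leverage values shown on the horizontal axis of the Williams and Insubria plots can be obtained as sketched below in Python. The function names are hypothetical, and the critical value h* = 3(p + 1)/n (p descriptors, n training compounds) is the threshold commonly used in QSAR practice; it is an assumption here, because the text does not state which threshold was applied.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage of each query compound in the descriptor space of the model."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    core = np.linalg.inv(A.T @ A)
    # h_i = q_i (A^T A)^-1 q_i^T for every query row q_i
    return np.einsum("ij,jk,ik->i", Q, core, Q)


def critical_leverage(n_train, n_descriptors):
    """Commonly used warning leverage h* = 3(p + 1)/n (assumed threshold)."""
    return 3.0 * (n_descriptors + 1) / n_train


# Hypothetical one-descriptor example.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
h_train = leverages(X_train, X_train)
h_far = leverages(X_train, np.array([[10.0]]))
```

A compound whose leverage exceeds the critical value is structurally distant from the training compounds, so its prediction is an extrapolation in descriptor space.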

In the next step, we attempted to build the QTTR model for HeLa – MCF-7 toxicity predictions for the ILs. We present the mutual relation of the predictions in Figure 4. The coefficient of determination (R²) between these data series is equal to 0.80.

Figure 4. Toxicity of the ILs against HeLa and MCF-7 cells predicted for the entire set of 64 ionic liquids with the QSAR models (the MOGA approach).

After confirming the high correlation between both data sets, we used the predicted data to develop the QTTR model (Eq. 5, Table S3 in SI1). The set of 64 ionic liquids was divided into training and test sets with the use of the '2:1' algorithm. The model parameters are presented in Table 3. Only one parameter (k') has a value slightly lower than recommended. As with the QSAR 2 model, this may suggest that the distribution of the objects above and below the fitting line is unequal. However, the values of all the other parameters meet the expected criteria, and the overall analysis of the plots in Figure 5 confirms the goodness of fit, robustness and good predictive abilities of the developed QTTR model.

log(EC50)_HeLa = 0.15(±0.07) + 0.41(±0.06) log(IC50)_MCF-7     (1)

Table 3. Parameters for internal and external validation of the QTTR model.

Measure37-40,47     | Calibration | Cross-validation (LOO) | External validation | Criteria38-41
F                   | 197.7       | ---                    | ---                 | dependent(a)
R²                  | 0.825       | ---                    | ---                 | > 0.7
Q²                  | ---         | 0.806                  | 0.802(b)            | > 0.6
CCC                 | 0.904       | 0.894                  | 0.866               | > 0.85
R²SCR               | ---         | ---                    | ---                 | dependent(a)
RMSE                | 0.237       | 0.249                  | 0.252               | acceptable(c)
MAE                 | 0.174       | 0.183                  | 0.215               | moderate(d)
average rm²         | ---         | ---                    | 0.703               | > 0.5
Δrm²                | ---         | ---                    | 0.075               | < 0.2
(r² – r0²)/r²       | ---         | ---                    | 0.023               | < 0.1
(r² – r'0²)/r²      | ---         | ---                    | 0.028               | < 0.1
k                   | ---         | ---                    | 0.989               | 0.85 ≤ k ≤ 1.15
k'                  | ---         | ---                    | 0.833               | 0.85 ≤ k' ≤ 1.15
|r0² – r'0²|        | ---         | ---                    | 0.004               | < 0.3

[---] – measure does not refer to the procedure; a – the F criterion depends on the number of compounds in the set and should be higher than the tabulated value of F at the chosen significance level; R²SCR depends on the value of R² of calibration and should be significantly lower than R²; b – the value of the Q² parameter in the external validation procedure refers to Q²F3; c – RMSE should be as low as possible and is assessed against the range of response values of the training set, equal to 2.639; d – MAE is assessed with the MAE-based metrics recommended by Roy et al.41

Figure 5. (A) – Scatter plot of observed versus predicted values of EC50, (B) – Williams plot.

Approach 2 – QSAR models built with the classical GA approach

After developing the QTTR model with the MOGA approach, we attempted to build a second QTTR model, changing the feature selection algorithm to the classical GA. We kept the same algorithm parameters, the same number of variables and the same distribution of the compounds between the training and validation sets to make the approaches more comparable. We started the QTTR model development by developing single QSAR models according to the scheme shown in Figure 1. Table 4 and Table S3 in SI1 show the QSAR models developed by applying the classical GA to select the features correlated with the studied response: (i) the QSAR 3 model for predicting the toxicity of the ILs against HeLa cells and (ii) the QSAR 4 model for predicting the toxicity of the ILs against MCF-7 cells. All validation parameters (internal and external) for both models are presented in Table 5. The models are well fitted (predictions are highly correlated with the observed values of toxicity), which is shown by the calibration parameters and is apparent in the scatter plots of predicted versus observed values (Figure 6 A and D). The validation parameters also indicate that both models are robust (their structure depends similarly on every object, i.e., every IL) and have good predictive abilities (their predictions are accurate).

Table 4. QSAR model equations derived from the classical GA approach. The values in brackets represent the 95% confidence intervals of the coefficients.

QSAR3: log(EC50) = 1.8865 (±0.464) – 0.8209 (±0.5454) TDB03m_C – 0.6483 (±0.0792) TDB09v_C – 0.0029 (±0.0007) MW_A
QSAR4: log(IC50) = –40.1691 (±18.8437) + 22.6952 (±8.447) TDB02e_C – 12.9964 (±4.6102) E1p_C – 0.0044 (±0.0015) HTs_A

TDB03m_C – 3D Topological distance based descriptors - lag 3 weighted by mass; descriptor of cation
TDB09v_C – 3D Topological distance based descriptors - lag 9 weighted by van der Waals volume; descriptor of cation
MW_A – molecular weight of anion
TDB02e_C – 3D Topological distance based descriptors - lag 2 weighted by Sanderson electronegativity; descriptor of cation
E1p_C – 1st component accessibility directional WHIM index / weighted by polarizability; descriptor of cation
HTs_A – GETAWAY class descriptor, H total index / weighted by I-state; descriptor of anion
For more details of the descriptors please refer to Todeschini et al.46
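The two MLR equations in Table 4 can be evaluated for any ionic liquid once its descriptor values are known. The following minimal sketch stores the coefficients from Table 4 in dictionaries and computes the point prediction; the `predict` helper and the descriptor values in the usage example are hypothetical and serve only to show the calculation, and the confidence intervals of the coefficients are not used.

```python
# Coefficients copied from Table 4 (confidence intervals omitted).
QSAR3 = {  # log(EC50), HeLa cells
    "intercept": 1.8865,
    "TDB03m_C": -0.8209,
    "TDB09v_C": -0.6483,
    "MW_A": -0.0029,
}
QSAR4 = {  # log(IC50), MCF-7 cells
    "intercept": -40.1691,
    "TDB02e_C": 22.6952,
    "E1p_C": -12.9964,
    "HTs_A": -0.0044,
}


def predict(model: dict, descriptors: dict) -> float:
    """Evaluate an MLR equation: intercept + sum(coefficient * descriptor value)."""
    return model["intercept"] + sum(
        coef * descriptors[name]
        for name, coef in model.items()
        if name != "intercept"
    )


if __name__ == "__main__":
    # Hypothetical descriptor values, not taken from the supporting information.
    example_il = {"TDB03m_C": 1.0, "TDB09v_C": 1.0, "MW_A": 100.0}
    print(f"QSAR3 log(EC50) = {predict(QSAR3, example_il):.4f}")
```

All three QSAR3 coefficients are negative, so larger values of the cation distance descriptors or of the anion molecular weight lower the predicted log(EC50).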

Table 5. Parameters for internal and external validation of QSAR 3 (toxicity of the ILs against HeLa cells) and QSAR 4 (toxicity of the ILs against MCF-7 cells) obtained using the GA approach.

Measure 37-40,47     Calibration        Cross-validation (loo)   External validation     Criteria 38-40,41
                     QSAR3    QSAR4     QSAR3    QSAR4           QSAR3       QSAR4
F                    123.2    72.5      ---      ---             ---         ---         dependent (a)
R2                   0.941    0.940     ---      ---             ---         ---         > 0.7
Q2                   ---      ---       0.913    0.910           0.870 (b)   0.969 (b)   > 0.6
CCC                  0.970    0.969     0.956    0.954           0.942       0.980       > 0.85
R2SCR                ---      ---       ---      ---             ---         ---         dependent (a)
RMSE                 0.160    0.287     0.193    0.351           0.237       0.208       acceptable (c)
MAE                  0.121    0.216     0.145    0.276           0.193       0.180       moderate (d)
average rm2          ---      ---       ---      ---             0.798       0.962       > 0.5
Δrm2                 ---      ---       ---      ---             0.037       0.007       < 0.2
(r2 – r02) / r2      ---      ---       ---      ---             0.050       0.031       < 0.1
(r2 – r’02) / r2     ---      ---       ---      ---             0.038       0.031       < 0.1
k                    ---      ---       ---      ---             0.853       1.084       0.85 ≤ k ≤ 1.15
k’                   ---      ---       ---      ---             1.074       0.895       0.85 ≤ k’ ≤ 1.15
|r02 – r’02|         ---      ---       ---      ---             0.011       0.001       < 0.3

[---] – measure does not refer to the procedure; (a) – the F criterion depends on the number of compounds in the set and should be higher than the tabulated value of F at the chosen significance level; R2SCR depends on the value of R2 of calibration and should be significantly lower than R2; (b) – the value of the Q2 parameter in the validation procedure refers to Q2F3; (c) – RMSE should be as low as possible and is estimated based on the range of response values of the training set, which is equal to 2.639 for QSAR 3 and 3.734 for QSAR 4; (d) – MAE, if possible (enough data points are available), is based on the MAE-based metrics recommended by Roy et al.41

Figure 6. QSAR models developed using the GA approach: left panel – scatter plots of observed versus predicted values of EC50 (A) and IC50 (D); middle panel – Williams plots for the toxicity of the ILs against HeLa cells using the QSAR 3 model (B) and for the toxicity of the ILs against MCF-7 cells using the QSAR 4 model (E); blue points represent the compounds in the training set, orange points the compounds in the validation set; right panel – Insubria plots, presenting the compounds of the opposite subset in the space of the model’s AD (C) and (F).
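A Williams plot such as those in Figure 6 places each compound by its leverage and its standardized residual. The sketch below shows one way these quantities can be obtained for an MLR model. It is a minimal illustration under stated assumptions: the critical leverage is taken as h* = 3(p + 1)/n, a commonly used convention that is assumed here rather than taken from this section, the residual limit is the ±3 standard deviations quoted in the text, and all function names and data are hypothetical.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage h = x^T (X^T X)^-1 x of each query row, intercept column included."""
    X_train = np.asarray(X_train, float)
    X_query = np.asarray(X_query, float)
    Xt = np.column_stack([np.ones(len(X_train)), X_train])
    Xq = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.pinv(Xt.T @ Xt)
    # Row-wise quadratic form x_i^T (X^T X)^-1 x_i.
    return np.einsum("ij,jk,ik->i", Xq, xtx_inv, Xq)


def critical_leverage(n_train: int, n_descriptors: int) -> float:
    """Assumed convention: h* = 3 (p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_train


def williams_flags(h, std_residuals, h_star, residual_limit=3.0):
    """Boolean masks for structural outliers and response outliers."""
    h = np.asarray(h, float)
    std_residuals = np.asarray(std_residuals, float)
    return h > h_star, np.abs(std_residuals) > residual_limit


if __name__ == "__main__":
    X_train = [[0.0], [1.0], [2.0], [3.0]]  # hypothetical single-descriptor training set
    X_new = [[1.5], [10.0]]                 # one central and one distant compound
    h_star = critical_leverage(n_train=4, n_descriptors=1)
    print(leverages(X_train, X_new), h_star)
```

A compound with a leverage above h* is structurally distant from the training set even when its residual is small, whereas a compound with a residual beyond the limit is poorly predicted even when it is structurally typical.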


Notably, for the QSAR 3 model, one of the objects from the test set (1-octyl-3-methylimidazolium bis(trifluoromethylsulfonyl)imide) is located on the borderline of the applicability domain (the standardized error of prediction for this IL is 3 standard deviations; Figure 6 B). This result suggests that the developed model might be close to overfitting. Overfitting means that the structural features selected to develop the model describe not only the variance in toxicity but also a case-specific random variance that is unique to the selected set. This may cause problems with predicting the toxicity of ionic liquids from outside subset A. Nevertheless, the external validation parameters for this model meet the requirements, so we decided to proceed with this model to the subsequent steps.

The applicability domain of the second model developed with the classical GA approach, QSAR 4, also indicates that one of the ionic liquids goes beyond its border. However, in this case (1-methyl-1-[4,5-bis(methylsulfide)pentyl]piperidinium bis(trifluoromethylsulfonyl)imide) the IL has a leverage value higher than the critical value (Figure 6 E), whereas its standardized prediction error is within ±3 standard deviations. This result indicates that the considered IL is structurally different from the ILs in the training set. Although this may be alarming, suggesting that the applicability domain of the model may be too narrow, the predicted value for this IL was accurate.

The Insubria plot (Figure 6 C) of the QSAR 3 model shows that some of the ILs from subset B are outside the AD of the model. Aside from two ILs that are structurally similar to the compounds used to develop the QSAR model but have predicted values slightly lower than the minimal value for the training set, there are also two ILs for which the predictions cannot be considered reliable. These are the ILs with leverage values higher than the critical value and a predicted toxicity value lower than the minimal value for the training set. This result suggests that the model may be overfitted to the training set, especially when we compare this result with the Insubria plot for the QSAR 1 model (Figure 3 C). In the scenario for the QSAR 1 model, subset B was entirely within the model’s AD. For QSAR 4, the Insubria plot (Figure 6 F) shows that 17 out of 40 ILs from subset A are located outside the AD (a result of the model’s extrapolation for ILs structurally distant from the model’s training set compounds) and 11 ILs are in the caution region (being structurally distant or being a result of the model’s extrapolation). This result suggests that, in this case, the selected descriptors refer to structural features characteristic of this particular training set, narrowing the AD significantly. The same result is observed for set A using the QSAR 4 model (Table S2), where the predictions for 17 ionic liquids are ‘moderate’. For the QSAR 2 model (the MCF-7 model developed with the MOGA), only three ILs are in the critical zone.

According to the main assumption of this work, the toxicity of the ILs against both types of cells should be identical. However, the two QSAR models developed at this stage of the work were obtained using different sets of molecular descriptors. To support the hypothesis of mechanism similarity, we analyzed the mutual relation of the two sets of descriptors. Table 6 presents the correlations between the features selected for each model; the correlations between corresponding features lie on the diagonal. The high correlation coefficients on the diagonal (always higher than 0.65) suggest that the descriptors used in both models, although different, describe the same structural features of the molecules. Analyzing the results from both QSAR models, we found that, in general, the length of the alkyl side chains on the cation influenced the toxicity (an increase in the alkyl chain length results in increasing toxicity). Hence, we conclude that the mechanism of toxicity in both cases is similar, as expected. This is a strong confirmation of the thesis that the QTTR model can be developed for this dataset.


Table 6. Correlation coefficient between molecular descriptors used in the developed QSAR models (classical approach).

                     HeLa descriptors
MCF-7 descriptors    TDB03m_C    TDB09v_C    MW_A
TDB02e_C             0.773       -0.500      -0.007
E1p_C                0.016       0.761       0.028
HTs_A                0.074       0.310       0.654
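A cross-correlation table of this kind can be produced by correlating every descriptor column of one model with every descriptor column of the other over the same set of compounds. The sketch below is a minimal illustration that assumes Pearson correlation coefficients; the function name and the toy data are hypothetical and are not the descriptor values from the supporting information.

```python
import numpy as np


def cross_correlation(A, B):
    """Pearson r between each column of A (rows of the result) and each column of B (columns)."""
    A = np.asarray(A, float)
    B = np.asarray(B, float)
    # Standardize every column, then average the products of z-scores.
    A_z = (A - A.mean(axis=0)) / A.std(axis=0)
    B_z = (B - B.mean(axis=0)) / B.std(axis=0)
    return A_z.T @ B_z / len(A)


if __name__ == "__main__":
    # Hypothetical descriptor values for four compounds.
    mcf7_descriptors = np.array([[1.0], [2.0], [3.0], [4.0]])
    hela_descriptors = np.array([[2.0, 4.0], [4.0, 3.0], [6.0, 2.0], [8.0, 1.0]])
    print(cross_correlation(mcf7_descriptors, hela_descriptors))
```

With the MCF-7 descriptors passed as `A` and the HeLa descriptors as `B`, the result has the same orientation as Table 6, so the diagonal pairs the descriptors in the order in which they are listed.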

After developing two separate, classical GA-based models for predicting the toxicity of the ILs against HeLa and MCF-7 cells, we used them to predict the values of EC50 / IC50 for the entire set of ionic liquids (both subsets combined). The toxicity predicted for the 64 ionic liquids against HeLa and MCF-7 cells is plotted in Figure 7. The determination coefficient (R2) between these data series is 0.52. A QTTR model developed from these results would not be sufficient: the correlation between the two sets is too low to quantitatively predict the toxicity against one cell line from the predictions for the other cell line.

Figure 7. Toxicity of the ILs against HeLa and MCF-7 cells predicted for the entire set of 64 ionic liquids with the QSAR 3 and QSAR 4 models (classical approach).

To verify that the MOGA leads to satisfactory results, we applied this approach to additional sets of data. We collected data on the toxicity of ionic liquids against two species: Vibrio fischeri (ref 7) and Scenedesmus vacuolatus (ref 48). We chose these datasets because 25 toxicity data points were present in both sets. Therefore, we could select 9 ILs (with experimental data available for both toxicities) as an external set for validating the QTTR2 model, and in this way validate the presented methodology. The details of the analysis are presented in the SI2 (.docx and .xlsx files). The QTTR2 model developed in this way, relating the toxicity of the ILs against V. fischeri and S. vacuolatus, is presented below (Eq. 2). All validation parameters of the models are within the boundary of the requirements; therefore, the models (particularly the MOGA models: QSAR C, QSAR D and QTTR2) can be considered well fitted, robust and characterized by accurate predictive abilities (please refer to Table ES2). Similar conclusions can be drawn from the scatter plot of observed versus predicted toxicity values, because all points are distributed near the dotted line that indicates an exact match of the x and y values (Figure 8 A). The Williams plot is presented in Figure 8 B.

log(EC50)_V. fischeri = 5.5015 – 0.5272 pEC50_S. vacuolatus    (2)

Figure 8. (A) – Scatter plot of observed versus predicted values of EC50, (B) – Williams plot of the QTTR2 model.

In the next step, the developed QTTR2 model was applied to predict the external set of data (9 compounds; Table S in ES2, Excel file), which were not used to develop the previous models (QSAR C, QSAR D or QTTR2). The results showed that the predicted values of toxicity are close to the experimental ones. For example, for IL3 (1-decyl-3-methylimidazolium chloride) the experimental value of toxicity against V. fischeri was 0.50, whereas the value predicted by the QTTR2 model was 0.459. For IL28 (1-butylpyridinium chloride) the experimental value was 3.18 and the value from QTTR2 was 3.407 (for more details please refer to the SI2). This result indicates that the applied MOGA methodology can be used in QTTR modeling, even for cross-species predictions, where the mechanisms of toxicity are similar.

SUMMARY

According to the “same-mechanism” assumption, ionic liquids should act similarly in both toxicity tests and the EC50 / IC50 values should be highly correlated. We deliberately chose toxicity values obtained under the same experimental conditions, so that the variance in toxicity could be attributed to the structure of the ionic liquids. Following two paths of analysis (Figure 1), in which we changed the feature selection algorithm, we obtained slightly different results. However, this is easily explainable. In our study, both classical GA-based QSAR models were too strongly dependent on the set of ionic liquids used for their development (training sets). Each time, the structural features selected for the purpose of QSAR model development were too accurately fitted to the modeled variable (toxicity). However, this was not clearly apparent until the selected model was applied to the test set. Even then, all the parameters of external validation met the desired conditions, suggesting that both models accurately predicted the values of toxicity. Only the analysis of the applicability domain of both models against the opposite subset of compounds suggested a need for caution. Our study showed that QTTR modeling is possible, with high quality results, even when the experimental data for the two subsets of chemicals are insufficient.
We proved that the QSAR model predictions can be treated as a fine source of data for the purpose of QTTR modeling, but a wide applicability domain should be obtained for these models. In this study, we achieved that by applying the MOGA as the QSAR model feature selection method. However, its limitation is that the mechanism of toxicity induction should be identical against both target organisms. Only then can the same set of features be expected for multiple QSAR models. The MOGA played a major role in widening the AD in our study, but any other way of granting the ADs a wide range should be a valid way to apply QSAR-based QTTR modeling. On the other hand, QSAR models developed in accordance with the classical scheme, in which explanatory features are selected for one modeled variable, can easily be subject to overfitting, especially when developed for a small set of compounds. Widening the AD can be achieved by enlarging the training set, but insufficient data is often a problem in QSAR modeling. Therefore, applying the MOGA seems to be a very good way to obtain a wide AD for a model, and one that can most likely be used in applications unrelated to QTTR.

CONCLUSIONS

In this paper, we propose a practical way to build a toxicity model (QTTR) when experimental data are insufficient. We presented a case study of developing the QTTR model for predicting the toxicity of ionic liquids against two human cancer cell lines (HeLa and MCF-7). We showed that the data derived from the QSAR models can serve as a good substitute for the experimental data in such a case. Here, we verified two methods of feature selection for QSAR modeling: the classical approach, with a genetic algorithm applied separately to each of the two endpoints, and the Multi-Objective Genetic Algorithm, which selects the best set of explanatory features for several dependent variables and was applied simultaneously to both endpoints.
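The difference between the two feature selection strategies can be made concrete with a small sketch of how a multi-objective search scores a candidate descriptor subset. This is not the implementation used in this work: it assumes that each endpoint contributes one objective, that the objective is the calibration R2 of an MLR model built on the shared subset, and that candidates are compared by Pareto dominance. All function names and data are hypothetical.

```python
import numpy as np


def r2_mlr(X, y):
    """Calibration R2 of an ordinary least-squares fit with an intercept."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ coef
    return float(1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2))


def objectives(mask, X_a, y_a, X_b, y_b):
    """One score per endpoint for the same descriptor subset (mask of 0/1 flags)."""
    cols = np.flatnonzero(mask)
    return r2_mlr(X_a[:, cols], y_a), r2_mlr(X_b[:, cols], y_b)


def dominates(f, g):
    """True if f is no worse than g on every objective and better on at least one."""
    return all(a >= b for a, b in zip(f, g)) and any(a > b for a, b in zip(f, g))


def pareto_front(scores):
    """Indices of candidates that no other candidate dominates."""
    return [
        i for i, f in enumerate(scores)
        if not any(dominates(g, f) for j, g in enumerate(scores) if j != i)
    ]


if __name__ == "__main__":
    # Hypothetical scores (endpoint 1, endpoint 2) for three candidate subsets.
    print(pareto_front([(0.9, 0.7), (0.8, 0.8), (0.7, 0.7)]))
```

A single-objective GA would rank candidates by one of the two scores only, so a subset that fits one endpoint very well but the other poorly could win. Under Pareto dominance, a subset survives only if no other subset is at least as good for both endpoints.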

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Although several elements of the QTTR model development were identical in the classical and the MOGA approaches, i.e., (1) the training/test set split, (2) the modeled variables, (3) the set of structural features and (4) the model development technique (MLR), applying the two approaches to variable selection led to different results. QSAR models that were built with the classical approach (the QSAR 3 and QSAR 4 models) were most likely overfitted, which could not be noticed until they were applied to predict the toxicity values of the test set. However, it was neither the way of distributing the ILs among the training and test sets nor the selection of too many descriptive variables for the models that caused this overfitting. The problem was in using descriptive variables that by chance explained some of the random variance unique to the training sets used. This caused the predictions of both classically derived QSAR models to be inaccurate when the models were applied to another subset of ILs. Using the MOGA as the variable selection tool slightly reduced the fit of the QSAR models (QSAR 1 and QSAR 2), but the features chosen by this genetic algorithm were more global. The information they contained was related to the toxicity of the ILs, not to the random variance of the training sets. Therefore, the final predictions of the MOGA-derived QSAR models were highly correlated, as expected from the nature of the problem we described. To the best of our knowledge, the model we provide is the first QTTR model for ionic liquids involving HeLa and MCF-7 human cancer cells. This model allows the toxicity of ionic liquids against one cell line to be predicted from the experimental result for the other cell line. The presented approach is very useful when experimental data are available for only a few chemicals.


ASSOCIATED CONTENT

Supporting Information

The following files are available free of charge.
SI1 – List of descriptors used in modeling and predicted values from QSAR models. (.xlsx)
SI2 – List of descriptors used in modeling and predicted values from QSAR models for verifying MOGA approach. (.xlsx)
SI3 – Description of applied methodology for verifying MOGA approach. (.docx)

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected]

Author Contributions
The manuscript was written through the contributions of all authors. All authors have given approval to the final version of the manuscript.

Funding Sources
National Science Center (Poland) Grant No. UMO-2012/05/E/NZ7/01148

Conflicts of interest
There are no conflicts to declare.

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENT


This material is based on the research funded by the National Science Center (Poland) (Grant No. UMO-2012/05/E/NZ7/01148).

ABBREVIATIONS

AD, applicability domain; GA, Genetic Algorithm; ILs, ionic liquids; MOGA, Multi-Objective Genetic Algorithm; MLR, Multiple Linear Regression; QSAR, Quantitative Structure-Activity Relationship; QTTR, Quantitative Toxicity-Toxicity Relationship

REFERENCES

1. Regulation (EC) No 1907/2006 of the European Parliament and of the Council of 18 December 2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), establishing a European Chemicals Agency, and amending Directive 1999/45/EC and repealing Council Regulation (EEC) No 793/93 and Commission Regulation (EC) No 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93/67/EEC, 93/105/EC and 2000/21/EC. Official Journal of the European Union, L 396/1, 2006.
2. Gozalbes, R.; de Julian-Ortiz, J. V. Applications of Chemoinformatics in Predictive Toxicology for Regulatory Purposes, Especially in the Context of the EU REACH Legislation. Int. J. Quant. Struct.-Prop. Relat. 2018, 3, 1-24.
3. Benigni, R.; Giuliani, A. Putting the Predictive Toxicology Challenge into perspective: reflections on the results. Bioinformatics 2003, 19, 1194-1200.
4. Williams, E. S.; Panko, J.; Paustenbach, D. J. The European Union's REACH regulation: a review of its history and requirements. Crit. Rev. Toxicol. 2009, 39, 553-575.
5. Merlot, C. Computational toxicology – a tool for early safety evaluation. Drug Discovery Today 2010, 15, 16-22.
6. Modi, S. H., M.; Garrow, A.; White, A. The value of in silico chemistry in the safety assessment of chemicals in the consumer goods and pharmaceutical industries. Drug Discovery Today 2012, 17 (2-4), 135-142.
7. Grzonkowska, M.; Sosnowska, A.; Barycki, M.; Rybinska, A.; Puzyn, T. How the structure of ionic liquid affects its toxicity to Vibrio fischeri? Chemosphere 2016, 159, 199-207.
8. Das, R. N.; Roy, K. Advances in QSPR/QSTR models of ionic liquids for the design of greener solvents of the future. Mol. Diversity 2013, 17, 151-196.
9. Jagiello, K.; Makurat, S.; Pereć, S.; Rak, J.; Puzyn, T. Molecular features of thymidine analogues governing the activity of human thymidine kinase. Struct. Chem. 2018, 29, 1367-1374.
10. Judycka, U.; Jagiello, K.; Bober, L.; Blazejowski, J.; Puzyn, T. Assessing therapeutic relevance of biologically interesting, ampholytic substances based on their physicochemical and spectral characteristics with chemometric tools. Chem. Phys. Lett. 2018, 701, 58-64.
11. Jagiello, K.; Sosnowska, A.; Kar, S.; Demkowicz, S.; Dasko, M.; Leszczynski, J.; Rachon, J.; Puzyn, T. Geometry optimization of steroid sulfatase inhibitors - the influence on the free binding energy with STS. Struct. Chem. 2017, 28, 1017-1032.
12. Sosnowska, A.; Barycki, M.; Zaborowska, M.; Rybinska, A.; Puzyn, T. Towards designing environmentally safe ionic liquids: the influence of the cation structure. Green Chem. 2014, 16, 4749-4757.
13. Kazius, J.; McGuire, R.; Bursi, R. Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem. 2005, 48, 312-320.
14. Patlewicz, G.; Ball, N.; Booth, E. D.; Hulzebos, E.; Zvinavashe, E.; Hennes, C. Use of category approaches, read-across and (Q)SAR: General considerations. Regul. Toxicol. Pharmacol. 2013, 67, 1-12.
15. Vink, S. R.; Mikkers, J.; Bouwman, T.; Marquart, H.; Kroese, E. D. Use of read-across and tiered exposure assessment in risk assessment under REACH, a case study on a phase-in substance. Regul. Toxicol. Pharmacol. 2010, 58, 64-71.
16. Gajewicz, A.; Jagiello, K.; Cronin, M. T. D.; Leszczynski, J.; Puzyn, T. Less data do not have to create barriers for regulate nanomaterials: Applying quantitative read-across (Nano-QRA). Environ. Sci.: Nano 2017, 4, 346-358.
17. Bartels, M.; Rick, D.; Lowe, E.; Loizou, G.; Price, P.; Spendiff, M.; Arnold, S.; Cocker, J.; Ball, N. Development of PK- and PBPK-based modeling tools for derivation of biomonitoring guidance values. Comput. Methods Programs Biomed. 2016, 108, 773-788.
18. Chen, A.; Yarmush, M. L.; Maguire, T. Physiologically based pharmacokinetic models: Integration of in silico approaches with micro cell culture analogues. Curr. Drug Metab. 2012, 13, 863-880.
19. Kar, S.; Gajewicz, A.; Roy, K.; Leszczynski, J.; Puzyn, T. Extrapolating between toxicity endpoints of metal oxide nanoparticles: Predicting toxicity to Escherichia coli and human keratinocyte cell line (HaCaT) with Nano-QTTR. Ecotoxicol. Environ. Saf. 2016, 126, 238-244.
20. Roy, K.; Das, R. N.; Popelier, P. L. Predictive QSAR modelling of algal toxicity of ionic liquids and its interspecies correlation with Daphnia toxicity. Environ. Sci. Pollut. Res. 2015, 22, 6634-6641.
21. Jagiello, K.; Grzonkowska, M.; Swirog, M.; Ahmed, L.; Rasulev, B.; Avramopoulos, A.; Papadopoulos, M. G.; Leszczynski, J.; Puzyn, T. Advantages and limitations of classic and 3D QSAR approaches in nano-QSAR studies based on biological activity of fullerene derivatives. J. Nanopart. Res. 2016, 18, 256.
22. Xu, L.; Zhang, W. J. Comparison of different methods for variable selection. Anal. Chim. Acta 2001, 446, 477-483.
23. Eberhart, R. C.; Shi, Y. H. Particle swarm optimization: Developments, applications and resources. Ieee. C. Evol. Computat. 2001, 81-86.
24. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. B Met. 1996, 58, 267-288.
25. Das, R. N.; Roy, K.; Popelier, P. L. Interspecies quantitative structure-toxicity-toxicity (QSTTR) relationship modeling of ionic liquids. Toxicity of ionic liquids to V. fischeri, D. magna and S. vacuolatus. Ecotoxicol. Environ. Saf. 2015, 122, 497-520.
26. Kar, S.; Das, R. N.; Roy, K.; Leszczynski, J. Can Toxicity for Different Species be Correlated?: The Concept and Emerging Applications of Interspecies Quantitative Structure-Toxicity Relationship (i-QSTR) Modeling. Int. J. Quant. Struct.-Prop. Relat. 2016, 1, 23-51.
27. Wang, X.; Ohlin, C. A.; Lu, Q.; Fei, Z.; Dyson, P. J. Cytotoxicity of ionic liquids and precursor compounds towards human cell line HeLa. Green Chem. 2007, 9, 1191-1197.
28. Kumar, R. A.; Papaïconomou, N.; Lee, J. M.; Salminen, J.; Clark, D. S.; Prausnitz, J. M. In vitro cytotoxicities of ionic liquids: effect of cation rings, functional groups, and anions. Environ. Toxicol. Chem. 2008, 24, 388-395.
29. ACD/ChemSketch, v., Advanced Chemistry Development, Inc., Toronto, ON, Canada, http://www.acdlabs.com, 2008.
30. Stewart, J. Stewart Computational Chemistry, Colorado Springs, CO, USA, http://OpenMOPAC.net, 2012.
31. Stewart, J. J. Optimization of parameters for semiempirical methods V: modification of NDDO approximations and application to 70 elements. J. Mol. Model. 2007, 13, 1173-213.
32. Dragon Software for Molecular Descriptor Calculation, http://www.talete.mi.it/, Milano, 2014.
33. Konak, A.; Coit, D. W.; Smith, A. E. Multi-objective optimization using genetic algorithms: A tutorial. Reliab. Eng. Syst. Safe. 2006, 91, 992-1007.
34. Zitzler, E.; Deb, K.; Thiele, L. Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evol. Comput. 2000, 8, 173-195.
35. Ribeiro, L. D.; Soares, A. D.; de Lima, T. W.; Jorge, C. A. C.; da Costa, R. M.; Salvini, R. L.; Coelho, C. J.; Federson, F. M.; Gabriel, P. H. R. Multi-Objective Genetic Algorithm for Variable Selection in Multivariate Classification Problems: A Case Study in Verification of Biodiesel Adulteration. Procedia. Comput. Sci. 2015, 51, 346-355.
36. Spolaôr, N.; Lorena, A. C.; Lee, H. D. Multi-objective Genetic Algorithm Evaluation in Feature Selection. Evolutionary multi-criterion optimization. 6th international conference, EMO 2011, 462-476.
37. Gramatica, P.; Chirico, N.; Papa, E.; Cassani, S.; Kovarich, S. QSARINS: A new software for the development, analysis, and validation of QSAR MLR models. J. Comput. Chem. 2013, 34, 2121-2132.
38. Chirico, N.; Gramatica, P. Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient. J. Chem. Inf. Model. 2011, 51, 2320-35.
39. Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation. Mol. Inf. 2010, 29, 476-488.
40. Ojha, P. K.; Mitra, I.; Das, R. N.; Roy, K. Further exploring r(m)(2) metrics for validation of QSPR models. Chemom. Intell. Lab. Syst. 2011, 107, 194-205.
41. Roy, K.; Das, R. N.; Ambure, P.; Aher, R. B. Be aware of error measures. Further studies on validation of predictive QSAR models. Chemom. Intell. Lab. Syst. 2016, 152, 18-33.
42. Roy, K.; Ambure, P.; Kar, S. How precise are predictions from our QSAR models for new query compounds? ACS Omega 2018, 3, 11392-11406.
43. Roy, K.; Ambure, P.; Kar, S.; Ojha, P. Is it possible to improve the quality of predictions from an “intelligent” use of multiple QSAR/QSPR/QSTR models? J Cheminf. 2018, 32, e2992.
44. Roy, K.; Kar, S.; Ambure, P. On a simple approach for determining applicability domain of QSAR models. Chemom. Intell. Lab. Syst. 2015, 145, 22-29.
45. Roy, K.; Ambure, P.; Aher, R. How important is to detect systematic error in predictions and understand statistical applicability domain of QSAR models? Chemom. Intell. Lab. Syst. 2017, 162, 4-54.
46. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors, Methods and Principles in Medicinal Chemistry, 2000, Wiley-VCH Verlag GmbH.
47. Chirico, N.; Gramatica, P. Real External Predictivity of QSAR Models. Part 2. New Intercomparable Thresholds for Different Validation Criteria and the Need for Scatter Plot Inspection. J. Chem. Inf. Model. 2012, 52, 2044-2058.
48. Roy, K.; Das, R. N.; Popelier, P. L. Predictive QSAR modelling of algal toxicity of ionic liquids and its interspecies correlation with Daphnia toxicity. Environ. Sci. Pollut. Res. 2015, 22, 6634-41.


TABLE OF CONTENTS
