

Application Note

QSAR-Co: An Open Source Software for Developing Robust Multi-tasking or Multi-target Classification-Based QSAR Models PRAVIN AMBURE, Amit Kumar Halder, Humbert González-Díaz, and Maria Natália D.S. Dias Soeiro Cordeiro J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.9b00295 • Publication Date (Web): 14 May 2019 Downloaded from http://pubs.acs.org on May 15, 2019



Journal of Chemical Information and Modeling

QSAR-Co: An Open Source Software for Developing Robust Multi-tasking or Multi-target Classification-Based QSAR Models

Pravin Ambure1, Amit Kumar Halder1, Humbert González-Díaz2, M. Natália D. S. Cordeiro1*

1LAQV@REQUIMTE, Department of Chemistry and Biochemistry, University of Porto, 4169-007 Porto, Portugal

2Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Spain


ABSTRACT

Quantitative structure-activity relationship (QSAR) modeling is a well-known computational technique with wide applications in fields such as drug design, toxicity prediction, and nanomaterials. However, QSAR researchers still face problems in developing robust classification-based QSAR models, especially when handling response data obtained under diverse experimental and/or theoretical conditions. In the present work, we have developed an open source standalone software, ‘QSAR-Co’ (available for download at https://sites.google.com/view/qsar-co), to set up classification-based QSAR models that allow mining response data coming from multiple conditions. The software comprises two modules: 1) the Model development module and 2) the Screen/Predict module. This user-friendly software provides several functionalities required for developing a robust multi-tasking or multi-target classification-based QSAR model using linear discriminant analysis or random forest techniques, with appropriate validation, following the principles set by the Organisation for Economic Co-operation and Development (OECD) for applying QSAR models in regulatory assessments.


INTRODUCTION

Quantitative structure-activity relationship (QSAR) modeling is a well-known technique that has proven extremely helpful in several research fields, including pharmaceuticals, the eco-toxicity of industrial chemicals, and materials science.1, 2 This technique not only helps in screening desirable lead chemicals, but also provides hints for improving the physical and (bio)chemical properties of interest. Several advanced approaches, such as machine learning algorithms (e.g., random forest, neural networks, and deep learning)3 and the Monte Carlo method,4-6 have been successfully implemented for QSAR modeling. Though the QSAR technique is well established, there is still room for improvement.7 In this work, we have developed a software tool that tackles issues in the general practice of setting up a classification-based QSAR model and suggests a logical solution to them.

Let us first briefly outline the QSAR technique, so that the reader can understand the motivation for developing such software. The core objective of any QSAR modeling is to set up a relationship between the response variable(s) (e.g., activity, toxicity, or property) and the structural features, employing enough chemicals with known response(s) of interest. The approach depends on being able to represent the chemical structures in quantitative or numerical terms, the so-called descriptors, and then to find relationships between the descriptor values and the targeted response variable(s).
QSAR models can be divided into classification-based models, in which a relationship is sought between the descriptors and the categorical values of the response variable(s), and regression-based models, in which a relationship is sought between the descriptors and the quantitative values of the response variable(s).8 Since the software reported here is entirely devoted to setting up classification-based QSAR models, our discussion will be limited to their development, along with several pertinent coupled functionalities.

Problem statement and motive of software development

The main motivation for developing the ‘QSAR-Co’ software version 1.0.0 is to tackle some critical issues usually neglected during the development of conventional classification-based QSAR models. The commonly observed protocol for developing these models (see, for example, the studies reported in the literature9-11) involves combining all the compounds with a known response variable to form a dataset, irrespective of the experimental (or theoretical) conditions followed to determine that response variable. Then, based on a pre-fixed threshold value, the response variable is divided into two or more classes (active/inactive, etc.), and the classification-based QSAR model is developed using such categorical variables. Merging the data does provide an opportunity to ensure enough diversity in the dataset, so that the applicability domain of the resultant classification model becomes sufficiently broad. However, response values may vary significantly when determined using different experimental protocols, when determined using the same experimental protocol under different laboratory, environmental, or time conditions, or according to the biological measurement (e.g., IC50, Ki) or theoretical calculation employed.12, 13 Thus, combining such response values into a dataset without considering the above-mentioned conditions can clearly hamper the development of a robust QSAR model.

There is, however, a solution to this problem: applying the Box-Jenkins moving average approach.14 In fact, a significant number of studies15-23 have been reported in recent years in which such merged datasets were successfully employed to develop robust classification models by applying the Box-Jenkins approach. With this approach, one can not only merge compounds whose response variable was measured under different conditions, but also derive a single classification-based QSAR model by jointly employing, as the response variable, the biological activity values against multiple relevant biological targets. Such models are known as multi-target (mt) or multi-tasking (mtk) QSAR models,15, 16, 24 and they have proven extremely beneficial for identifying, optimizing, and screening lead compounds with significant activity against multiple biological targets of interest. Such lead compounds are known as multi-target-directed ligands (MTDLs), and they are highly promising candidates for the treatment of complex diseases such as Alzheimer’s disease and cancer.

However, in all these reported studies,15-23 the calculations related to the Box-Jenkins moving average approach were performed manually in Microsoft Excel, and all the other required steps were mostly carried out with commercial software packages. Therefore, there is still no freely available software completely dedicated to performing classification-based mtk-QSAR studies, which are otherwise highly time-consuming.


For this reason, we have developed an open source standalone software, ‘QSAR-Co’ version 1.0.0, able to perform classification-based mtk-QSAR studies. QSAR-Co is a short form of “QSAR with conditions”, which highlights the fact that this software can deal with response data obtained under multiple experimental/theoretical conditions. This feature helps users develop an mtk-QSAR model that can predict multiple responses (under different experimental/theoretical conditions or against different biological targets) simultaneously using a single QSAR model. Another motivation for establishing this software is to provide a distinct platform for setting up classification-based QSAR models while keeping in mind all the guidelines required by the OECD25 for their successful application for regulatory purposes. However, merely following all the steps implemented in the software will not always lead to a robust QSAR model, since model quality also depends on the modelability26-28 of the dataset as well as on the quality of the structural and biological data. Thus, it is always advisable to check dataset modelability and perform data curation11, 29-33 (i.e., structural and biological data curation) before starting QSAR model development.

SOFTWARE AND ITS FUNCTIONALITIES

The ‘QSAR-Co’ version 1.0.0 software is an open source standalone tool developed using Java and JavaFX, and is freely available for download at https://sites.google.com/view/qsar-co. The software is user-friendly and saves considerable time in model building. Two modules are available in the software: 1) the Model development module and 2) the Screen/Predict module. Here we briefly discuss all the steps and associated functionalities of each module.

MODULE 1: Model development

The QSAR model development steps provided by the ‘Model development’ module of the QSAR-Co software are as follows (see Figure 1).


Figure 1. Workflow illustrating the steps involved in the Model development (Module 1).

Step 1. Normal approach and Box-Jenkins approach

The user should opt for the ‘Normal approach’ to run a simple classification-based QSAR study when the response data have been determined under the same experimental/theoretical conditions for all the chemicals in the dataset. Conversely, the user should opt for the ‘Box-Jenkins approach’ to perform a classification-based QSAR study on response data determined under different experimental/theoretical conditions. In the ‘Normal approach’ option, the input descriptor set is directly employed by the program in the subsequent model development steps, since no treatment of the input data is required. In the ‘Box-Jenkins approach’, by contrast, a modified descriptor set is computed using the Box-Jenkins moving average approach, considering the different experimental/theoretical condition(s), and these ‘modified descriptors’ are then employed in the subsequent model development steps. The details of this approach have been extensively reported in the past,15-23 so we limit ourselves here to a brief description underlining only its most important aspects. Initially, the arithmetic average of each descriptor for a specific condition is calculated as follows:

$$\mathrm{avg}(D_i)_{c_j} = \frac{1}{n(c_j)} \sum_{i=1}^{n(c_j)} D_i \qquad (1)$$

Here, Di is the calculated descriptor of the individual compound ‘i’, whereas n(cj) is the number of active compounds in the modeling dataset assayed under the same element of the experimental condition (ontology) cj. Therefore, avg(Di)cj is the arithmetic mean of the descriptor Di for a specific condition cj. After obtaining the avg(Di)cj values, the following formula is used in the subsequent step to calculate the modified descriptors, Δ(Di)cj:

$$\Delta(D_i)_{c_j} = D_i - \mathrm{avg}(D_i)_{c_j} \qquad (2)$$

Also, note that if a confidence variable (Cv, denoting the confidence in the assay information) is available as input, then the modified descriptors Δ(Di)cj are calculated using the following formula:

$$\Delta(D_i)_{c_j} = C_v \times \left( D_i - \mathrm{avg}(D_i)_{c_j} \right) \qquad (3)$$
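Equations 1-3 can be illustrated with a short sketch. The code below is not taken from QSAR-Co (which is written in Java); it is a minimal Python illustration that assumes, following the definition of n(cj) given in the text, that the condition averages are computed over the active compounds of the modeling set only. The function name `box_jenkins_descriptors` and its argument layout are hypothetical.

```python
from collections import defaultdict

def box_jenkins_descriptors(descriptors, conditions, is_active, confidence=None):
    """Return the deviation descriptors D_i - avg(D_i)_cj for every compound.

    descriptors: list of rows, one list of descriptor values per compound
    conditions:  list of condition labels c_j, one per compound
    is_active:   list of bools; only actives enter the averages (assumption
                 based on the definition of n(c_j) in the text)
    confidence:  optional list of C_v weights (Eq. 3); None gives Eq. 2
    """
    n_desc = len(descriptors[0])
    sums = defaultdict(lambda: [0.0] * n_desc)
    counts = defaultdict(int)
    # Eq. 1: accumulate per-condition sums over the active compounds.
    for row, cond, active in zip(descriptors, conditions, is_active):
        if active:
            counts[cond] += 1
            for k, value in enumerate(row):
                sums[cond][k] += value
    averages = {c: [s / counts[c] for s in sums[c]] for c in counts}

    modified = []
    for idx, (row, cond) in enumerate(zip(descriptors, conditions)):
        if cond not in averages:
            raise ValueError(f"no active compound for condition {cond!r}")
        weight = 1.0 if confidence is None else confidence[idx]
        # Eq. 2 (weight = 1) or Eq. 3 (weight = C_v).
        modified.append([weight * (v - a) for v, a in zip(row, averages[cond])])
    return modified

# Toy data: three compounds assayed under condition "A", one under "B".
rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0], [2.0, 8.0]]
print(box_jenkins_descriptors(rows, ["A", "A", "A", "B"], [True, True, False, True]))
# [[-1.0, -10.0], [1.0, 10.0], [3.0, 0.0], [0.0, 0.0]]
```

In this toy case the average for condition "A" is [2.0, 20.0] (from its two actives), so the inactive third compound is expressed as a deviation from the actives measured under the same condition.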

The term Δ(Di)cj is the so-called Box-Jenkins operator. These modified descriptors capture information about both the chemical structures and the specific elements of the experimental/theoretical conditions (cj) under which the samples were evaluated.

Step 2. Data Pre-Treatment

Optionally, one can perform a data pre-treatment to remove non-informative (i.e., constant or inter-correlated) descriptors that may not contribute significantly to building the model.
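The pre-treatment of Step 2 can be sketched as a two-stage filter. This is only an illustration of the general idea, not the QSAR-Co implementation: the function name `pretreat`, the variance cutoff, the correlation cutoff, and the rule of keeping the first descriptor of each correlated pair are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pretreat(columns, variance_cutoff=1e-4, correlation_cutoff=0.99):
    """Return the names of the descriptors that survive pre-treatment.

    columns: dict mapping descriptor name -> list of values (one per compound).
    Both cutoffs are illustrative defaults, not values taken from the paper.
    """
    kept = []
    for name, values in columns.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance < variance_cutoff:
            continue  # constant or near-constant descriptor
        if any(abs(pearson(values, columns[k])) > correlation_cutoff for k in kept):
            continue  # inter-correlated with a descriptor already kept
        kept.append(name)
    return kept

table = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [5.0, 5.0, 5.0, 5.0],   # constant
    "d3": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * d1
    "d4": [4.0, 1.0, 3.0, 2.0],
}
print(pretreat(table))  # ['d1', 'd4']
```

Because each descriptor is compared only with those already kept, the result depends on column order; a real tool may use a different tie-breaking rule.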


Step 3. Dataset division

The software provides three dataset division techniques, based on ‘random’ and ‘rational’ approaches. In the random approach, the chemicals are randomly assigned to the training and test sets. For the rational approach, the software provides two techniques: the Kennard–Stone algorithm and the Euclidean distance-based division method. The Kennard–Stone algorithm34, 35 selects the most diverse compounds for the training set. The Euclidean distance-based algorithm11 tries to capture chemicals for the test set that are representative of the training set compounds.

Step 4. Removal of less-discriminating descriptors

There is an option to remove ‘less-discriminating’ descriptors, i.e., descriptors that may merely add noise to the data while contributing little or nothing to the model. Here, the ‘less-discriminating’ descriptors are identified using the molecular spectrum analysis approach.11

Step 5. Variable selection technique

The software provides the ‘Genetic Algorithm’36 as a variable selection technique for setting up ‘Linear Discriminant Analysis’ (LDA) models. The genetic algorithm (GA) is a well-known technique that is often utilized for developing regression-based QSAR models37-41 as well as classification-based QSAR models.42, 43
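For readers unfamiliar with GA-based variable selection, the generic loop can be sketched as below. This is a simplified illustration and not the algorithm as implemented in QSAR-Co: the function name `ga_select`, the truncation selection, the union-based crossover, the single-swap mutation, and all default parameters are assumptions, and the `fitness` argument is a user-supplied callable standing in for whatever score ranks the candidate models.

```python
import random

def ga_select(n_descriptors, subset_size, fitness, generations=50,
              population_size=20, mutation_rate=0.2, seed=0):
    """Search for a descriptor subset of fixed size that maximizes `fitness`.

    `fitness` receives a sorted tuple of descriptor indices and returns a
    number (higher is better). Assumes population_size >= 4.
    """
    rng = random.Random(seed)
    all_idx = range(n_descriptors)
    population = [tuple(sorted(rng.sample(all_idx, subset_size)))
                  for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:population_size // 2]   # keep the better half (elitist)
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))        # crossover: draw from both parents
            child = set(rng.sample(pool, subset_size))
            if rng.random() < mutation_rate:      # mutation: swap one descriptor
                outside = [i for i in all_idx if i not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        population = parents + children
    return max(population, key=fitness)

# Toy fitness: reward subsets that overlap a hidden "informative" set.
informative = {1, 4, 7}
best = ga_select(10, 3, lambda subset: len(informative.intersection(subset)))
print(best)
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one iteration to the next.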

The fitness function for selecting the top models in each iteration is defined as follows:

$$\text{Fitness Score}_i = \frac{0.5 - \text{Wilks' } \lambda_i}{0.5}$$

where Fitness Score_i is the fitness score for the training model ‘i’, and Wilks’ λi stands for the Wilks’ lambda calculated for the respective training model ‘i’.

Step 6. Model development techniques

At present, the software provides two machine learning techniques to develop robust classification-based QSAR models, namely two-class Linear Discriminant Analysis44 (LDA) and Random Forest45 (RF). The software uses the Weka version 3-9-3 Java library46 to perform RF.

Step 7. Model selection and model validation

The software reports both internal (including cross-validation) and external validation metrics, computed using the training set and the test set, respectively. The validation metrics computed for both the training and test sets comprise Wilks’ λ47 and metrics based on the confusion matrix,48 such as accuracy, sensitivity, specificity, precision, F-measure, and the Matthews correlation coefficient (MCC). QSAR-Co also provides receiver operating characteristic48 (ROC) plots along with the area under the ROC curve (AUC). The software uses the Weka version 3-9-3 Java library46 to perform the ROC and AUC calculations using the k-fold cross-validation method. The software also performs the Y-randomization test,49 in which randomly generated training sets (i.e., with shuffled response values) are compared with the original training set using the Wilks’ lambda (λ) values.

Step 8. Defining Applicability Domain

Two techniques are provided for defining the applicability domain (AD) of the developed classification model: i) a simple approach based on the standardization technique,50 which identifies compounds that are structural outliers (training set) or outside the applicability domain (test set); and ii) a confidence estimation approach,11, 51 which is based on the a posteriori probability values computed for each compound in the training set as well as in the test set, and which can be employed to check the reliability of predictions.

MODULE 2: Screening/Prediction

In the software, Module 2 is available in a separate tab, and Figure 2 shows the workflow of the ‘Predict/Screen’ module. The steps performed for screening query chemicals using this module are as follows.


Step 1. Provide the required input information for the prediction of query compounds, which depends on the chosen QSAR approach, i.e., the ‘Normal approach’ or the ‘Box-Jenkins approach’, as indicated in the workflow (see Figure 2).

Step 2. Select the appropriate model development technique. The screening or prediction of query compounds works with both two-class LDA and Random Forest models.

Step 3. Screen. On screening, the software not only provides the predicted class for the query compounds using the input training model, but also reports the applicability domain status of every query compound.
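To give an idea of how an applicability domain status might be assigned to a query compound, the sketch below follows one common formulation of a standardization-based check: descriptors are standardized with the training-set mean and standard deviation, and a compound is flagged when its standardized values are too extreme. The text does not spell out the rule used by QSAR-Co, so the decision logic, the limit of 3, the factor 1.28, and the function name `ad_status` are all assumptions made for illustration.

```python
import statistics

def ad_status(train_rows, query_rows, limit=3.0, z=1.28):
    """Label each query compound 'inside' or 'outside' the applicability domain.

    train_rows / query_rows: lists of descriptor rows (same column order).
    Assumes every training descriptor is non-constant (e.g. after pre-treatment).
    """
    columns = list(zip(*train_rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]  # sample standard deviation
    statuses = []
    for row in query_rows:
        # Absolute standardized value of every descriptor of this compound.
        s = [abs(v - m) / sd for v, m, sd in zip(row, means, stdevs)]
        if max(s) <= limit:
            inside = True            # no descriptor is extreme
        elif min(s) > limit:
            inside = False           # every descriptor is extreme
        else:
            # Mixed case: compare an upper bound of the s-values with the limit.
            inside = statistics.mean(s) + z * statistics.stdev(s) <= limit
        statuses.append("inside" if inside else "outside")
    return statuses

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
print(ad_status(train, [[3.0, 30.0], [30.0, 300.0]]))  # ['inside', 'outside']
```

A compound at the training-set centroid is trivially inside the domain, while one lying many standard deviations away on every descriptor is flagged as outside; predictions for the latter should be treated with caution.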

Figure 2. Workflow illustrating the steps involved in the prediction/screening of query chemicals/database (Module 2).


CASE STUDIES

To demonstrate the functionalities of the developed software, we examined three case studies. These case studies involve the development of mtk-QSAR models employing our earlier published datasets, namely: (i) the first dataset16 comprises 14,854 compounds with inhibitory activity against different breast cancer-related proteins; (ii) the second dataset22 comprises 31,894 compounds with antibacterial activity against Gram-negative pathogens and in vitro safety profiles related to absorption, distribution, metabolism, elimination, and toxicity (ADMET); and (iii) the third dataset17 includes 2,123 peptides (containing from 4 to 119 amino acids) with antibacterial activities against multiple Gram-negative bacterial strains and cytotoxicity against different cell types. The methodology employed in all three case studies is described in the Supporting Information. In each case study, we developed classification-based mtk-QSAR models using both available techniques, i.e., GA-LDA and Random Forest. The final models were found to be significantly predictive, and the relevant details, including the models’ equations, validation metrics (internal, external, and true external), domain of applicability, ROC analysis, and Y-randomization test results, are also reported in the Supporting Information.

CONCLUSIONS

To sum up, we have presented a user-friendly standalone software, ‘QSAR-Co’ version 1.0.0, which is freely available for download at https://sites.google.com/view/qsar-co. This software aims to develop robust classification-based QSAR models for all kinds of datasets, even when the response values pertain to different experimental/theoretical conditions or to multiple biological targets. The software provides all the essential functionalities to develop classification-based QSAR models employing two widely known techniques, LDA and RF. It also includes tools to compute the models’ applicability domain and the reliability of predictions for query chemicals, along with proper techniques (i.e., the Y-randomization test and ROC analysis) to assess the robustness of the developed models. The software is user-friendly and efficient, as it works like a workflow system: once all the parameters and techniques are set, all the steps required in the development of classification-based QSAR models are executed by simply clicking a button. Along with the model development module, the software provides a separate module for predicting the response class of query chemicals or for screening a database using the developed LDA or RF models. Further, to demonstrate the functionalities and stability of the ‘QSAR-Co’ software, we tested it with three case studies, in which we developed multi-tasking classification-based QSAR models using both techniques available in the software, i.e., LDA and RF. We believe the present software will contribute to a widespread application of multi-target (multi-tasking) QSAR analysis to datasets under multiple conditions.

Finally, we intend to keep adding new features to the QSAR-Co software in the near future. For instance, it will be interesting to explore other machine learning techniques, such as libSVM,52 sequential minimal optimization,53 and the multilayer perceptron,54 and to include them in the software. Moreover, we are interested in adding a module dedicated to data curation, especially designed for the curation of big datasets downloaded from online chemical databases. Lastly, another important issue will be handling deviations or perturbations pertaining to small variations of the different experimental/theoretical conditions (cj). Indeed, this has already been proposed and widely applied by combining the Box-Jenkins approach with perturbation theory, leading to the development of the so-called Perturbation Theory (PT) QSAR-based models.55-57

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim. .

Modeling and external set information (for the 3 datasets), including the initial set of calculated descriptors (XLSX)

Attribute importance information for the RF model (XLSX)

Detailed description of all three case studies (DOCX)


AUTHOR INFORMATION Corresponding Author *E-mail: [email protected]

ORCID M. Natália D. S. Cordeiro: 0000-0003-3375-8670

Notes The authors declare no competing financial interest.

ACKNOWLEDGEMENTS

This work was supported by UID/QUI/50006/2019 and project PTDC/QUI-QIN/30649/2017 with funding from FCT/MCTES through national funds.


REFERENCES

1. Roy, K. Advances in QSAR Modeling: Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences. In Challenges and Advances in Computational Chemistry and Physics; Leszczynski, J., Ed.; 1st ed.; Springer, 2017; Vol. 24.
2. Puzyn, T.; Gajewicz, A.; Leszczynska, D.; Leszczynski, J. Nanomaterials – the Next Great Challenge for QSAR Modelers. In Recent Advances in QSAR Studies. Challenges and Advances in Computational Chemistry and Physics; Puzyn, T., Leszczynski, J., Cronin, M., Eds.; Springer: Dordrecht, 2010; Vol. 8, pp 383–409.
3. Lo, Y.-C.; Rensi, S. E.; Torng, W.; Altman, R. B. Machine Learning in Chemoinformatics and Drug Discovery. Drug Discov. Today 2018, 23, 1538–1546.
4. Amata, E.; Marrazzo, A.; Dichiara, M.; Modica, M. N.; Salerno, L.; Prezzavento, O.; Nastasi, G.; Rescifina, A.; Romeo, G.; Pittalà, V. Comprehensive Data on a 2D-QSAR Model for Heme Oxygenase Isoform 1 Inhibitors. Data Brief 2017, 15, 281–299.
5. Kumar, A.; Chauhan, S. QSAR Differential Model for Prediction of SIRT1 Modulation Using Monte Carlo Method. Drug Res. 2017, 67, 156–162.
6. Aranda, J. F.; Bacelo, D. E.; Aparicio, M. S. L.; Ocsachoque, M.; Castro, E.; Duchowicz, P. R. Predicting the Bioconcentration Factor through a Conformation-Independent QSPR Study. SAR QSAR Environ. Res. 2017, 28, 749–763.
7. Cherkasov, A.; Muratov, E. N.; Fourches, D.; Varnek, A.; Baskin, I. I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y. C.; Todeschini, R. QSAR Modeling: Where Have You Been? Where Are You Going To? J. Med. Chem. 2014, 57, 4977–5010.
8. Roy, K.; Kar, S.; Das, R. N. A Primer on QSAR/QSPR Modeling: Fundamental Concepts. In SpringerBriefs in Molecular Science; Springer: London, 2015.
9. Fang, J.; Li, Y.; Liu, R.; Pang, X.; Li, C.; Yang, R.; He, Y.; Lian, W.; Liu, A.-L.; Du, G.-H. Discovery of Multitarget-Directed Ligands against Alzheimer’s Disease through Systematic Prediction of Chemical–Protein Interactions. J. Chem. Inf. Model. 2015, 55, 149–164.
10. Nidhi; Glick, M.; Davies, J. W.; Jenkins, J. L. Prediction of Biological Targets for Compounds Using Multiple-Category Bayesian Models Trained on Chemogenomics Databases. J. Chem. Inf. Model. 2006, 46, 1124–1133.
11. Ambure, P.; Bhat, J.; Puzyn, T.; Roy, K. Identifying Natural Compounds as Multi-Target-Directed Ligands against Alzheimer’s Disease: An In Silico Approach. J. Biomol. Struct. Dyn. 2018, 37, 1282–1306.


12. Kalliokoski, T.; Kramer, C.; Vulpetti, A.; Gedeck, P. Comparability of Mixed IC50 Data – A Statistical Analysis. PLoS One 2013, 8, e61007.
13. Eriksson, L.; Jaworska, J.; Worth, A. P.; Cronin, M. T.; McDowell, R. M.; Gramatica, P. Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs. Environ. Health Perspect. 2003, 111, 1361–1375.
14. Hill, T.; Lewicki, P.; Lewicki, P. Statistics: Methods and Applications: A Comprehensive Reference for Science, Industry, and Data Mining; StatSoft, Inc., 2006.
15. Casañola-Martin, G. M.; Le-Thi-Thu, H.; Pérez-Giménez, F.; Marrero-, Y.; Merino-Sanjuán, M.; Abad, C.; González-Díaz, H. Multi-Output Model with Box–Jenkins Operators of Linear Indices to Predict Multi-Target Inhibitors of Ubiquitin–Proteasome Pathway. Mol. Divers. 2015, 19, 347–356.
16. Speck-Planche, A.; Cordeiro, M. N. D. S. Fragment-Based In Silico Modeling of Multi-Target Inhibitors against Breast Cancer-Related Proteins. Mol. Divers. 2017, 21, 511–523.
17. Kleandrova, V. V.; Ruso, J. M.; Speck-Planche, A.; Cordeiro, M. N. D. S. Enabling the Discovery and Virtual Screening of Potent and Safe Antimicrobial Peptides. Simultaneous Prediction of Antibacterial Activity and Cytotoxicity. ACS Comb. Sci. 2016, 18, 490–498.
18. Speck-Planche, A.; Kleandrova, V. V.; Ruso, J. M.; Cordeiro, M. N. D. S. First Multitarget Chemo-Bioinformatic Model to Enable the Discovery of Antibacterial Peptides against Multiple Gram-Positive Pathogens. J. Chem. Inf. Model. 2016, 56, 588–598.
19. Alonso, N.; Caamaño, O.; Romero-Duran, F. J.; Luan, F.; Cordeiro, M. N. D. S.; Yañez, M.; González-Díaz, H.; García-Mera, X. Model for High-Throughput Screening of Multitarget Drugs in Chemical Neurosciences: Synthesis, Assay, and Theoretic Study of Rasagiline Carbamates. ACS Chem. Neurosci. 2013, 4, 1393–1403.
20. Romero-Durán, F. J.; Alonso, N.; Yañez, M.; Caamaño, O.; García-Mera, X.; González-Díaz, H. Brain-Inspired Cheminformatics of Drug-Target Brain Interactome, Synthesis, and Assay of Tvp1022 Derivatives. Neuropharmacology 2016, 103, 270–278.
21. Speck-Planche, A.; Cordeiro, M. N. D. S. Simultaneous Modeling of Antimycobacterial Activities and ADMET Profiles: A Chemoinformatic Approach to Medicinal Chemistry. Curr. Top. Med. Chem. 2013, 13, 1656–1665.
22. Speck-Planche, A.; Cordeiro, M. N. D. S. De Novo Computational Design of Compounds Virtually Displaying Potent Antibacterial Activity and Desirable In Vitro ADMET Profiles. Med. Chem. Res. 2017, 26, 2345–2356.


23. Speck-Planche, A.; Cordeiro, M. N. D. S. Chemoinformatics for Medicinal Chemistry: In Silico Model to Enable the Discovery of Potent and Safer Anti-Cocci Agents. Future Med. Chem. 2014, 6, 2013–2028.
24. Speck-Planche, A.; Cordeiro, M. N. D. S. Multitasking Models for Quantitative Structure–Biological Effect Relationships: Current Status and Future Perspectives to Speed up Drug Discovery. Expert Opin. Drug Discov. 2015, 10, 245–256.
25. Organization for Economic Co-Operation and Development (OECD). Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship ((Q)SAR) Models. OECD Series on Testing and Assessment. 69. OECD Document ENV/JM/MONO 2007, pp 55–65.
26. Golbraikh, A.; Muratov, E.; Fourches, D.; Tropsha, A. Data Set Modelability by QSAR. J. Chem. Inf. Model. 2014, 54, 1–4.
27. Ruiz, I. L.; Gómez-Nieto, M. Á. Study of Data Set Modelability: Modelability, Rivality, and Weighted Modelability Indexes. J. Chem. Inf. Model. 2018, 58, 1798–1814.
28. Ruiz, I. L.; Gómez-Nieto, M. Á. Prediction of the Datasets Modelability for the Building of QSAR Classification Models by Means of the Centroid Based Rivality Index. J. Math. Chem. 2018, 1–20.
29. Fourches, D.; Muratov, E.; Tropsha, A. Trust, but Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Model. 2010, 50, 1189–1204.
30. Kramer, C.; Kalliokoski, T.; Gedeck, P.; Vulpetti, A. The Experimental Uncertainty of Heterogeneous Public Ki Data. J. Med. Chem. 2012, 55, 5165–5173.
31. Fourches, D.; Muratov, E.; Tropsha, A. Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation. J. Chem. Inf. Model. 2016, 56, 1243–1252.
32. Mansouri, K.; Grulke, C.; Richard, A.; Judson, R.; Williams, A. An Automated Curation Procedure for Addressing Chemical Errors and Inconsistencies in Public Datasets Used in QSAR Modelling. SAR QSAR Environ. Res. 2016, 27, 911–937.
33. Gadaleta, D.; Lombardo, A.; Toma, C.; Benfenati, E. A New Semi-Automated Workflow for Chemical Data Retrieval and Quality Checking for Modeling Applications. J. Cheminform. 2018, 10, 1–13.
34. Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11, 137–148.
35. Ambure, P.; Aher, R. B.; Gajewicz, A.; Puzyn, T.; Roy, K. “NANOBRIDGES” Software: Open Access Tools to Perform QSAR and Nano-QSAR Modeling. Chemom. Intell. Lab. Syst. 2015, 147, 1–13.
36. Venkatasubramanian, V.; Sundaram, A. Genetic Algorithms: Introduction and Applications. In Encyclopedia of Computational Chemistry; von Ragué Schleyer, P.; Allinger, N. L.; Clark, T.; Gasteiger, J.; Kollman, P. A.; Schaefer, H. F.; Schreiner, P. R., Eds.; John Wiley & Sons, Ltd., 2002; Vol. 2.
37. Rogers, D.; Hopfinger, A. J. Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships. J. Chem. Inf. Comput. Sci. 1994, 34, 854–866.
38. Hemmateenejad, B.; Akhond, M.; Miri, R.; Shamsipur, M. Genetic Algorithm Applied to the Selection of Factors in Principal Component-Artificial Neural Networks: Application to QSAR Study of Calcium Channel Antagonist Activity of 1,4-Dihydropyridines (Nifedipine Analogous). J. Chem. Inf. Comput. Sci. 2003, 43, 1328–1334.
39. Hasegawa, K.; Miyashita, Y.; Funatsu, K. GA Strategy for Variable Selection in QSAR Studies: GA-Based PLS Analysis of Calcium Channel Antagonists. J. Chem. Inf. Comput. Sci. 1997, 37, 306–310.
40. Ambure, P.; Roy, K. Understanding the Structural Requirements of Cyclic Sulfone Hydroxyethylamines as hBACE1 Inhibitors against Aβ Plaques in Alzheimer's Disease: A Predictive QSAR Approach. RSC Adv. 2016, 6, 28171–28186.
41. Gramatica, P.; Chirico, N.; Papa, E.; Cassani, S.; Kovarich, S. QSARINS: A New Software for the Development, Analysis, and Validation of QSAR MLR Models. J. Comput. Chem. 2013, 34, 2121–2132.
42. Gao, H. Application of BCUT Metrics and Genetic Algorithm in Binary QSAR Analysis. J. Chem. Inf. Comput. Sci. 2001, 41, 402–407.
43. Sutherland, J. J.; O'Brien, L. A.; Weaver, D. F. Spline-Fitting with a Genetic Algorithm: A Method for Developing Classification Structure–Activity Relationships. J. Chem. Inf. Comput. Sci. 2003, 43, 1906–1915.
44. Snedecor, G.; Cochran, W. Statistical Methods, 8th ed.; Iowa State University Press: Iowa, 1989; pp 254–272.
45. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
46. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H. The Weka Data Mining Software: An Update. SIGKDD Explor. 2009, 11, 10–18.
47. Wilks, S. S. Certain Generalizations in the Analysis of Variance. Biometrika 1932, 24, 471–494.
48. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
49. Fisher, R. A. The Design of Experiments; Oliver and Boyd, Ltd.: Edinburgh, 1937.
50. Roy, K.; Kar, S.; Ambure, P. On a Simple Approach for Determining Applicability Domain of QSAR Models. Chemom. Intell. Lab. Syst. 2015, 145, 22–29.
51. Mathea, M.; Klingspohn, W.; Baumann, K. Chemoinformatic Classification Methods and Their Applicability Domain. Mol. Inform. 2016, 35, 160–180.
52. Chang, C.-C.; Lin, C.-J. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
53. Platt, J. C. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Advances in Kernel Methods – Support Vector Learning; Schölkopf, B.; Burges, C. J. C.; Smola, A. J., Eds.; MIT Press: Cambridge, 1999; pp 185–208.
54. Gardner, M. W.; Dorling, S. Artificial Neural Networks (the Multilayer Perceptron)–a Review of Applications in the Atmospheric Sciences. Atmos. Environ. 1998, 32, 2627–2636.
55. Gonzalez-Diaz, H.; Arrasate, S.; Gomez-SanJuan, A.; Sotomayor, N.; Lete, E.; Besada-Porto, L.; Ruso, J. M. General Theory for Multiple Input-Output Perturbations in Complex Molecular Systems. 1. Linear QSPR Electronegativity Models in Physical, Organic, and Medicinal Chemistry. Curr. Top. Med. Chem. 2013, 13, 1713–1741.
56. Concu, R.; Kleandrova, V. V.; Speck-Planche, A.; Cordeiro, M. N. D. S. Probing the Toxicity of Nanoparticles: A Unified in Silico Machine Learning Model Based on Perturbation Theory. Nanotoxicology 2017, 11, 891–906.
57. Simon-Vidal, L.; García-Calvo, O.; Oteo, U.; Arrasate, S.; Lete, E.; Sotomayor, N.; González-Díaz, H. PTML: Perturbation-Theory and Machine Learning Model for High-Throughput Screening of Parham Reactions. Experimental and Theoretical Studies. J. Chem. Inf. Model. 2018, 58, 1384–1396.

TOC Graphics

Figure 1. Workflow illustrating the steps involved in model development (Module 1).

Figure 2. Workflow illustrating the steps involved in the prediction/screening of query chemicals/database (Module 2).
