QSAR-Co - ACS Publications - American Chemical Society

Application Note Cite This: J. Chem. Inf. Model. 2019, 59, 2538−2544

pubs.acs.org/jcim

QSAR-Co: An Open Source Software for Developing Robust Multitasking or Multitarget Classification-Based QSAR Models

Pravin Ambure,† Amit Kumar Halder,† Humbert González Díaz,‡ and M. Natália D. S. Cordeiro*,†

†LAQV@REQUIMTE, Department of Chemistry and Biochemistry, University of Porto, 4169-007 Porto, Portugal
‡Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Spain




ABSTRACT: Quantitative structure−activity relationship (QSAR) modeling is a well-known computational technique with wide applications in fields such as drug design, toxicity prediction, and nanomaterials. However, QSAR researchers still face problems in developing robust classification-based QSAR models, especially when handling response data pertaining to diverse experimental and/or theoretical conditions. In the present work, we have developed an open source standalone software, “QSAR-Co” (available to download at https://sites.google.com/view/qsar-co), to set up classification-based QSAR models that allow mining response data coming from multiple conditions. The software comprises two modules: (1) the Model development module and (2) the Screen/Predict module. This user-friendly software provides several functionalities required for developing a robust multitasking or multitarget classification-based QSAR model using linear discriminant analysis or random forest techniques, with appropriate validation, following the principles set by the Organisation for Economic Co-operation and Development (OECD) for applying QSAR models in regulatory assessments.



INTRODUCTION

Quantitative structure−activity relationship (QSAR) modeling is a well-known technique that has proven extremely helpful in several research fields, including pharmaceuticals, ecotoxicity of industrial chemicals, and materials science.1,2 This technique not only helps in screening desirable lead chemicals but also provides hints to improve the physical and (bio)chemical properties of interest. Several advanced approaches, such as machine learning algorithms (like random forest, neural networks, and deep learning),3 the Monte Carlo method,4−6 etc., have been successfully implemented for performing QSAR modeling. Though the QSAR technique is well established, there is still room for upgrades.7 In this work, we have developed software for tackling issues related to the general practice of setting up a classification-based QSAR model and further suggest a logical solution.

Let us first briefly outline the QSAR technique so that the reader can understand the motive for developing such software. The core objective of any QSAR modeling is to set up a relationship between the response (e.g., activity/toxicity/property) variable(s) and the structural features, employing enough chemicals with known response(s) of interest. The approach depends on being able to represent the chemical structures in quantitative or numerical terms, the so-called descriptors, and then to find relationships between the descriptor values and the targeted response variable(s). QSAR models can be divided into classification-based, when a relationship is to be established between the descriptors and the categorical values of the response variable(s), or regression-based, when the aim is to find a relationship between the descriptors and the quantitative values of the response variable(s).8 Here, since the software reported is entirely devoted to setting up classification-based QSAR models, our discussion is limited to their development along with several pertinent coupled functionalities.

Problem Statement and Motive of Software Development. The main motivation for developing “QSAR-Co” software version 1.0.0 is to tackle some critical issues usually neglected during the development of conventional classification-based QSAR models. The commonly observed protocol (see, for example, studies reported in the literature9−11) employed for developing these models involves combining all the compounds with a known response variable to form a data set, irrespective of the experimental (or theoretical) conditions followed to determine that response variable. Further, based on a prefixed threshold value, the response variable is divided into two or more classes, such as active/inactive, and the classification-based QSAR model is then developed using such categorical variables. Here, it should be noticed that the merging of data provides an opportunity to ensure enough diversity in the data set, and thus, the applicability domain of the resultant classification model becomes sufficiently broad. But one should understand that the response values may vary significantly when determined using different experimental protocols or when determined using the same experimental protocol but in different laboratory/environmental/time conditions as well as according to the biological measurements (e.g., IC50, Ki, etc.) or theoretical calculations employed.12,13 Thus, combining such response values to form a data set without considering the above-mentioned conditions can clearly affect the development of a robust QSAR model.

However, there is a solution to this problem, that is, to apply the Box−Jenkins moving average approach.14 In fact, in recent years, a significant number of studies15−23 have been reported where such merged data sets were successfully employed to develop robust classification models applying the Box−Jenkins approach. Using the latter, not only can one merge compounds with response variables following different conditions but one can also derive a single classification-based QSAR model by employing the biological activity values jointly (as response variable) against multiple relevant biological targets. Such models are known as multitarget (mt) or multitasking (mtk) QSAR models,15,16,24 and these have proven to be extremely beneficial to identify, optimize, and screen lead compounds with significant activity against multiple biological targets (of interest). Such lead compounds are known as multitarget-directed ligands (MTDLs), and these are highly promising candidates in the treatment of complex diseases like Alzheimer’s disease, cancer, etc.

However, in all these reported studies,15−23 the calculations related to the Box−Jenkins moving average approach have been performed manually using Microsoft Excel, and all the other required steps have been mostly carried out employing commercial software packages. Therefore, no freely available software is yet completely dedicated to performing classification-based mtk-QSAR studies, which are otherwise highly time-consuming. For this reason, we have developed an open source standalone software, “QSAR-Co” version 1.0.0, able to perform classification-based mtk-QSAR studies. It is noteworthy that QSAR-Co is a short form for “QSAR with conditions”, which highlights the fact that this software can deal with response data with multiple experimental/theoretical conditions. This feature helps users to develop an mtk-QSAR model that can predict multiple responses (with different experimental/theoretical conditions or against different biological targets) simultaneously using a single QSAR model.

Another motivation for establishing this software is to provide a distinct platform for setting up classification-based QSAR models keeping in mind all the guidelines required by the OECD25 for their successful application for regulatory purposes. However, it is noteworthy that merely following all the steps implemented in the software will not always lead to a robust QSAR model, since the QSAR model quality also depends on the modelability26−28 of a data set as well as the quality of structural and biological data. Thus, it is always advisable to check data set modelability and perform data curation11,29−33 (i.e., structural and biological data curation) prior to starting with QSAR model development.

Received: April 8, 2019. Published: May 14, 2019. DOI: 10.1021/acs.jcim.9b00295. © 2019 American Chemical Society

Figure 1. Workflow illustrating the steps involved in Model development (Module 1).

SOFTWARE AND ITS FUNCTIONALITIES

The “QSAR-Co” version 1.0.0 software is an open source standalone tool developed using the Java programming language and JavaFX and is freely available to download at https://sites.google.com/view/qsar-co. The software is user-friendly and saves considerable time in model building. There are two modules available in the software: (1) the Model development module and (2) the Screen/Predict module. Here, we briefly discuss all the steps and associated functionalities in each module.

Module 1: Model Development. In the Model development module, the QSAR model development steps provided by the QSAR-Co software are as follows (Figure 1):

Step 1. Normal Approach and Box−Jenkins Approach. In this software, the user should opt for the “Normal approach” to run a simple classification-based QSAR study when the employed response data has been determined using the same experimental/theoretical conditions for all the chemicals in the data set. Conversely, the user should opt for the “Box−Jenkins approach” to perform a classification-based QSAR study for modeling response data determined under different experimental/theoretical conditions. In the “Normal approach” option, the input descriptor set is directly employed by the program in the subsequent model development steps since no treatment of the input data is required. With the “Box−Jenkins approach”, a modified descriptor set is computed using the Box−Jenkins moving average approach considering the different experimental/theoretical condition(s), and these modified descriptors are then employed in the subsequent model development steps. The details of this approach have been largely reported in the past,15−23 so we limit ourselves here to a brief description underlining only its most important aspects. Initially, the arithmetic average of the descriptors for a specific condition is calculated as follows:

$$\mathrm{avg}(D_i)_{c_j} = \frac{1}{n(c_j)} \sum_{i=1}^{n(c_j)} D_i \qquad (1)$$

Here, Di is the calculated descriptor of individual compound i, whereas n(cj) is the number of actives among the modeling data set compounds assayed under the same element of the experimental condition (ontology) cj. Therefore, avg(Di)cj is the arithmetic mean of the descriptors (Di) for a specific condition cj. After obtaining the avg(Di)cj values, in the subsequent step, the following formula is used to calculate the modified descriptors (Δ(Di)cj):

$$\Delta(D_i)_{c_j} = D_i - \mathrm{avg}(D_i)_{c_j} \qquad (2)$$

Also, note that if a confidence variable (Cv, denoting the confidence in assay information) is also available as input, then the modified descriptors (Δ(Di)cj) are calculated using the following formula:

$$\Delta(D_i)_{c_j} = C_v \times \left(D_i - \mathrm{avg}(D_i)_{c_j}\right) \qquad (3)$$

The term Δ(Di)cj is the so-called Box−Jenkins operator, and these modified descriptors capture the information about both chemical structures and specific elements of the experimental/theoretical conditions (cj) under which the samples were evaluated.

Step 2. Data Pretreatment. Optionally, one can perform a data pretreatment to remove noninformative (i.e., constant and intercorrelated) descriptors that may not contribute significantly to building the model.

Step 3. Data Set Division. The software provides three data set division techniques that arise from random and rational approaches to performing data set divisions. In the random approach, the chemicals are randomly assigned to the training and test sets. In the rational approaches, two techniques are provided by the software, i.e., the Kennard−Stone algorithm and the Euclidean distance-based division method. The Kennard−Stone algorithm34,35 leads to the selection of the most diverse compounds in the training set. The Euclidean distance-based algorithm11 tries to capture chemicals for the test set that are representative of the training set compounds.

Step 4. Removal of Less-Discriminating Descriptors. There is an option to remove “less-discriminating” descriptors because it is sometimes helpful to remove descriptors that may just add noise to the data while contributing little or nothing. Here, the “less-discriminating” descriptors are identified using the molecular spectrum analysis approach.11

Step 5. Variable Selection Technique. This software provides the genetic algorithm36 (GA) as a variable selection technique for setting up linear discriminant analysis (LDA) models. The GA is a well-known technique that is often utilized for developing regression-based QSAR models37−41 as well as classification-based QSAR models.42,43 The fitness function for selecting the top models in each iteration is defined as follows:

$$\mathrm{FitnessScore}_i = \frac{0.5 - \text{Wilks' } \lambda_i}{0.5}$$

where FitnessScorei is the fitness score for the training model i, and λi stands for the Wilks’ lambda calculated for the respective training model i.

Step 6. Model Development Techniques. At present, the software provides two machine learning techniques to develop a robust classification-based QSAR model: two-class linear discriminant analysis44 (LDA) and Random Forest45 (RF). In this software, we have used the Weka version 3-9-3 Java library46 to perform RF.

Step 7. Model Selection and Model Validation. The software reports both internal (including cross-validation) and external validation metrics computed using the training set and the test set, respectively. The validation metrics that are computed for both training and test sets comprise Wilks’ λ47 and metrics computed based on the confusion matrix48 such as accuracy, sensitivity, specificity, precision, F-measure, and the Matthews correlation coefficient (MCC). The QSAR-Co software also provides receiver operating characteristic48 (ROC) plots along with the area under the ROC curve (AUC) values. The present software uses the Weka version 3-9-3 Java library46 to perform ROC and AUC calculations using the k-fold cross-validation method. The software also performs the Y-randomization test,49 where the randomly generated training sets (i.e., shuffled response values) are compared with the original training set using the Wilks’ lambda (λ) values.

Step 8. Defining Applicability Domain. Two techniques are provided for defining the applicability domain (AD) of the developed classification model:

(i) Using a simple approach based on the standardization technique50 to identify the compounds that are structural outliers (training set) or outside the applicability domain (test set).
(ii) Using a confidence estimation approach,11,51 which is simply based on the a posteriori probability values computed for each compound in the training set as well as in the test set, and which can be successfully employed to check the reliability of predictions.

Module 2: Screening/Prediction. In the software, module 2 is available in a separate tab, and Figure 2 demonstrates the workflow of the “Predict/Screen” module.

Figure 2. Workflow illustrating the steps involved in the prediction/screening of query chemicals/database (Module 2).

The simple steps performed for screening query chemicals using this module are as follows:

Step 1. To Provide the Required Input. The input information required for the prediction of query compounds depends on the chosen QSAR approach, i.e., the Normal approach or the Box−Jenkins approach, as mentioned in the workflow (Figure 2).

Step 2. To Select Appropriate Model Development Techniques. The screening or prediction of query compounds works with two-class LDA and Random Forest models.

Step 3. Screen. On screening, the software not only provides the predicted class for query compounds using the input training model but also provides the applicability domain status for every query compound.

CASE STUDIES

In the present work, to demonstrate the functionalities of the developed software, we examined three case studies. These case studies involve the development of mtk-QSAR models employing our earlier published data sets: (i) The first data set16 comprises 14,854 compounds with inhibitory activity against different breast cancer related proteins. (ii) The second data set22 comprises 31,894 compounds with antibacterial activity against Gram-negative pathogens and in vitro safety profiles related to absorption, distribution, metabolism, elimination, and toxicity (ADMET). (iii) The third data set17 includes 2123 peptides (containing from 4 to 119 amino acids) with antibacterial activities against multiple Gram-negative bacterial strains and cytotoxicity against different cell types. The methodology employed in all three case studies is described in the Supporting Information. In each case study, we have developed classification-based mtk-QSAR models using both the available techniques, i.e., GA-LDA and Random Forest. The final models were found to be significantly predictive, and the relevant details including the models’ equations, validation metrics (internal, external, and true external), domain of applicability, ROC analysis, and Y-randomization test results are also reported in the Supporting Information.

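The Box−Jenkins moving average step of Module 1 (eqs 1−3) can be illustrated with a short sketch. This is not the QSAR-Co source code (which is written in Java); the function name `box_jenkins_deviations` and its parameters are hypothetical, and restricting the average to active compounds is our reading of the definition of n(cj) given in the text.

```python
from collections import defaultdict


def box_jenkins_deviations(descriptors, conditions, actives=None, confidence=None):
    """Return modified descriptors Cv * (D_i - avg(D_i)c_j) for every compound.

    descriptors: list of rows, one list of floats per compound
    conditions:  list of condition labels c_j, one per compound
    actives:     optional list of booleans; when given, only active compounds
                 contribute to avg(D_i)c_j (assumption based on n(c_j) in eq 1)
    confidence:  optional list of Cv values (eq 3); defaults to 1.0, i.e. eq 2
    """
    n = len(descriptors)
    if actives is None:
        actives = [True] * n
    if confidence is None:
        confidence = [1.0] * n

    # Collect the rows that define the average of each condition.
    rows_by_condition = defaultdict(list)
    for row, cond, is_active in zip(descriptors, conditions, actives):
        if is_active:
            rows_by_condition[cond].append(row)

    # Eq 1: arithmetic mean of each descriptor within a condition.
    averages = {
        cond: [sum(col) / len(rows) for col in zip(*rows)]
        for cond, rows in rows_by_condition.items()
    }

    # Eqs 2 and 3: deviation from the condition average, scaled by Cv.
    modified = []
    for row, cond, cv in zip(descriptors, conditions, confidence):
        avg = averages[cond]  # KeyError if a condition has no contributing compound
        modified.append([cv * (d - a) for d, a in zip(row, avg)])
    return modified


# Toy data: two descriptors, two assay conditions (labels are illustrative).
X = [[1.0, 10.0], [3.0, 14.0], [2.0, 20.0], [6.0, 24.0]]
cj = ["IC50", "IC50", "Ki", "Ki"]
print(box_jenkins_deviations(X, cj))
```

Because each descriptor is centred on the average of its own condition, two compounds with identical structures but different assay conditions generally receive different modified descriptor values, which is what lets a single model handle several conditions.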

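The GA fitness function of Step 5 can be sketched as follows. The formula for the fitness score is taken from the text; how QSAR-Co computes Wilks' λ internally is not described here, so the sketch assumes the standard single-variable form (within-group sum of squares divided by total sum of squares), applied for instance to the discriminant scores of a candidate two-class model. The function names are hypothetical.

```python
def wilks_lambda_univariate(values, labels):
    """Wilks' lambda for one variable: within-group SS / total SS.

    Values near 0 indicate well-separated groups; values near 1 indicate
    no separation. Raises ZeroDivisionError for a constant variable.
    """
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = 0.0
    for group in set(labels):
        members = [v for v, g in zip(values, labels) if g == group]
        group_mean = sum(members) / len(members)
        within_ss += sum((v - group_mean) ** 2 for v in members)
    return within_ss / total_ss


def fitness_score(wilks_lambda):
    """Fitness score from the text: (0.5 - lambda) / 0.5."""
    return (0.5 - wilks_lambda) / 0.5


# Illustrative discriminant scores for two classes.
scores = [1.0, 2.0, 5.0, 6.0]
classes = ["inactive", "inactive", "active", "active"]
lam = wilks_lambda_univariate(scores, classes)
print(lam, fitness_score(lam))
```

Under this definition the score equals 1 for a perfectly discriminating model (λ = 0), 0 when λ = 0.5, and becomes negative for λ > 0.5, so candidate models with weak class separation are penalized during selection.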

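The confusion-matrix metrics listed in Step 7 follow standard definitions, sketched below for a two-class problem. The function name and the returned dictionary keys are illustrative choices, not part of the QSAR-Co interface.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard two-class metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),      # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),      # true negative rate
        "precision": ratio(tp, tp + fp),
        "f_measure": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Illustrative counts: 50 actives and 50 inactives in a test set.
for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
    print(f"{name}: {value:.4f}")
```

MCC is worth reporting alongside accuracy because it stays informative when the active and inactive classes are imbalanced, which is common in merged multicondition data sets.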

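The Y-randomization test mentioned in Step 7 can be sketched as a simple permutation loop: the class labels are shuffled, Wilks' λ is recomputed, and the shuffled values are compared with the original one. The number of rounds, the seed, and the summary statistic returned here are assumptions made for illustration; the text does not specify how QSAR-Co configures or reports the test.

```python
import random


def wilks_lambda(values, labels):
    """Single-variable Wilks' lambda: within-group SS / total SS."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = 0.0
    for group in set(labels):
        members = [v for v, g in zip(values, labels) if g == group]
        group_mean = sum(members) / len(members)
        within_ss += sum((v - group_mean) ** 2 for v in members)
    return within_ss / total_ss


def y_randomization(values, labels, n_rounds=100, seed=0):
    """Compare the original lambda with lambdas obtained from shuffled labels.

    Returns (original lambda, list of shuffled lambdas, fraction of shuffled
    runs that separate the classes at least as well as the original labels).
    """
    rng = random.Random(seed)
    original = wilks_lambda(values, labels)
    shuffled_lambdas = []
    for _ in range(n_rounds):
        permuted = list(labels)
        rng.shuffle(permuted)  # breaks the link between structure and response
        shuffled_lambdas.append(wilks_lambda(values, permuted))
    as_good = sum(1 for lam in shuffled_lambdas if lam <= original)
    return original, shuffled_lambdas, as_good / n_rounds


# Illustrative discriminant scores: 10 inactives near 0.3, 10 actives near 5.3.
low = [0.1, 0.3, 0.2, 0.4, 0.5, 0.2, 0.3, 0.1, 0.4, 0.6]
scores = low + [v + 5.0 for v in low]
classes = [0] * 10 + [1] * 10
original, shuffled, fraction = y_randomization(scores, classes)
print(original, sum(shuffled) / len(shuffled), fraction)
```

A model that has learned a genuine structure−response relationship should give a much lower λ with the original labels than with shuffled ones; similar values would suggest a chance correlation.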


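The first applicability domain option of Step 8 is described only as "a simple approach based on the standardization technique". The sketch below follows the decision rule of that approach as we recall it from the cited literature: descriptors are standardized with the training set mean and standard deviation, and a compound is flagged when its standardized values exceed a threshold. The threshold of 3, the factor 1.28, the use of the sample standard deviation, and all names are assumptions here and may differ from what QSAR-Co implements.

```python
import statistics


def standardization_ad(train, query, threshold=3.0, factor=1.28):
    """Return one boolean per query compound: True if inside the domain.

    train, query: lists of rows (one list of descriptor values per compound).
    Assumes at least two descriptors and no constant training descriptor.
    """
    columns = list(zip(*train))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]  # sample std (assumed)

    inside = []
    for row in query:
        # Absolute standardized value of every descriptor for this compound.
        s = [abs(x - m) / sd for x, m, sd in zip(row, means, stdevs)]
        if max(s) <= threshold:
            inside.append(True)       # every descriptor is within range
        elif min(s) > threshold:
            inside.append(False)      # every descriptor is out of range
        else:
            # Mixed case: summarize the compound by mean + factor * std.
            s_new = statistics.mean(s) + factor * statistics.stdev(s)
            inside.append(s_new <= threshold)
    return inside


# Toy training set with two descriptors, followed by three query compounds.
train = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
query = [[3, 30], [100, 1000], [3, 90]]
print(standardization_ad(train, query))
```

Applied to the training set itself, the same rule flags structural outliers; applied to test or query compounds, it gives the inside/outside domain status reported by the Screen/Predict module.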
CONCLUSIONS

To sum up, we have presented a user-friendly standalone software, “QSAR-Co” version 1.0.0, which is freely available to download at https://sites.google.com/view/qsar-co. This software aims to develop robust classification-based QSAR models for all kinds of data sets, even when the response values pertain to different experimental/theoretical conditions or to multiple biological targets. The software provides all the essential functionalities to develop classification-based QSAR models employing two widely known techniques, that is, LDA and RF. It also includes tools to compute the models’ applicability domain and the reliability of predictions for query chemicals, along with proper techniques (i.e., Y-randomization test and ROC analysis) to assess the robustness of the developed models. The presented software is user-friendly and efficient, as it works like a workflow system. Once all the parameters and techniques are set, all the steps required in the development of classification-based QSAR models are executed by simply clicking a button. Along with the model development module, the software also provides a separate module for predicting the response class of query chemicals or for screening a database using the developed LDA or RF models. Further, to demonstrate the functionalities and stability of the “QSAR-Co” software, we have tested it with three case studies, in which we developed multitasking classification-based QSAR models using both techniques that are available with the software, i.e., LDA and RF. We believe the present software will contribute to a widespread application of multitarget (multitasking) QSAR analysis to data sets under multiple conditions.

Finally, we intend to keep adding new features to the QSAR-Co software in the near future. For instance, it will be interesting to explore other machine learning techniques such as libSVM,52 sequential minimal optimization,53 and multilayer perceptron54 and to include them in the present software. Moreover, we are also interested in adding a module dedicated to data curation, especially designed for the curation of big data sets downloaded from online chemical databases. Lastly, another important issue will be to handle deviations or perturbations pertaining to small variations of the different experimental/theoretical conditions (cj). Indeed, this has already been proposed and widely applied by combining the Box−Jenkins approach with perturbation theory, leading to the development of the so-called Perturbation Theory (PT) QSAR-based models.55−57



Humbert González Díaz: 0000-0002-9392-2797
M. Natália D. S. Cordeiro: 0000-0003-3375-8670

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work was supported by UID/QUI/50006/2019 and project PTDC/QUI-QIN/30649/2017 with funding from FCT/MCTES through national funds.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.9b00295.



REFERENCES

(1) Roy, K. Advances in QSAR Modeling: Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences. In Challenges and Advances in Computational Chemistry and Physics; Leszczynski, J., Ed.; 1st ed.; Springer, 2017; Vol. 24. (2) Puzyn, T.; Gajewicz, A.; Leszczynska, D.; Leszczynski, J. Nanomaterials-the Next Great Challenge for QSAR Modelers. In Recent Advances in QSAR Studies. Challenges and Advances in Computational Chemistry and Physics; Puzyn, T., Leszczynski, J., Cronin, M., Eds.; Springer: Dordrecht, 2010; Vol. 8, pp 383−409. (3) Lo, Y.-C.; Rensi, S. E.; Torng, W.; Altman, R. B. Machine Learning in Chemoinformatics and Drug Discovery. Drug Discovery Today 2018, 23, 1538−1546. (4) Amata, E.; Marrazzo, A.; Dichiara, M.; Modica, M. N.; Salerno, L.; Prezzavento, O.; Nastasi, G.; Rescifina, A.; Romeo, G.; Pittalà, V. Comprehensive Data on a 2D-QSAR Model for Heme Oxygenase Isoform 1 Inhibitors. Data Brief 2017, 15, 281−299. (5) Kumar, A.; Chauhan, S. QSAR Differential Model for Prediction of SIRT1 Modulation Using Monte Carlo Method. Drug Res. 2017, 67, 156−162. (6) Aranda, J. F.; Bacelo, D. E.; Aparicio, M. S. L.; Ocsachoque, M.; Castro, E.; Duchowicz, P. R. Predicting the Bioconcentration Factor through a Conformation-Independent QSPR Study. SAR QSAR Environ. Res. 2017, 28, 749−763. (7) Cherkasov, A.; Muratov, E. N.; Fourches, D.; Varnek, A.; Baskin, I. I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y. C.; Todeschini, R.; et al. QSAR Modeling: Where Have You Been? Where Are You Going To? J. Med. Chem. 2014, 57, 4977−5010. (8) Roy, K.; Kar, S.; Das, R. N. A Primer on QSAR/QSPR Modeling: Fundamental Concepts. In Springer Briefs in Molecular Science; Springer: London, 2015. (9) Fang, J.; Li, Y.; Liu, R.; Pang, X.; Li, C.; Yang, R.; He, Y.; Lian, W.; Liu, A.-L.; Du, G.-H. Discovery of Multitarget-Directed Ligands against Alzheimer’s Disease through Systematic Prediction of ChemicalProtein Interactions. J. Chem. Inf. Model. 
2015, 55, 149−164. (10) Nidhi; Glick, M.; Davies, J. W.; Jenkins, J. L. Prediction of Biological Targets for Compounds Using Multiple-Category Bayesian Models Trained on Chemogenomics Databases. J. Chem. Inf. Model. 2006, 46, 1124−1133. (11) Ambure, P.; Bhat, J.; Puzyn, T.; Roy, K. Identifying Natural Compounds as Multi-Target-Directed Ligands against Alzheimer’s Disease: An In Silico Approach. J. Biomol. Struct. Dyn. 2019, 37, 1282− 1306. (12) Kalliokoski, T.; Kramer, C.; Vulpetti, A.; Gedeck, P. Comparability of Mixed IC50 Data-a Statistical Analysis. PLoS One 2013, 8, No. e61007. (13) Eriksson, L.; Jaworska, J.; Worth, A. P.; Cronin, M. T.; McDowell, R. M.; Gramatica, P. Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification-and Regression-Based QSARs. Environ. Health Perspect. 2003, 111, 1361−1375. (14) Hill, T.; Lewicki, P.; Lewicki, P. Statistics: Methods and Applications: A Comprehensive Reference for Science, Industry, and Data Mining; StatSoft, Inc., 2006. (15) Casañola-Martin, G. M.; Le-Thi-Thu, H.; Pérez-Giménez, F.; Marrero-Ponce, Y.; Merino-Sanjuán, M.; Abad, C.; González-Díaz, H. Multi-Output Model with Box-Jenkins Operators of Linear Indices to

Detailed description of all three case studies (PDF)
Modeling and external set (for three data sets) information, including the initial set of calculated descriptors and attribute importance information for the RF model (ZIP)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Pravin Ambure: 0000-0001-7244-7117
Amit Kumar Halder: 0000-0002-4818-9047


Journal of Chemical Information and Modeling Predict Multi-Target Inhibitors of Ubiquitin-Proteasome Pathway. Mol. Divers. 2015, 19, 347−356. (16) Speck-Planche, A.; Cordeiro, M. N. D. S. Fragment-Based In Silico Modeling of Multi-Target Inhibitors against Breast CancerRelated Proteins. Mol. Divers. 2017, 21, 511−523. (17) Kleandrova, V. V.; Ruso, J. M.; Speck-Planche, A.; Cordeiro, M. N. D. S. Enabling the Discovery and Virtual Screening of Potent and Safe Antimicrobial Peptides. Simultaneous Prediction of Antibacterial Activity and Cytotoxicity. ACS Comb. Sci. 2016, 18, 490−498. (18) Speck-Planche, A.; Kleandrova, V. V.; Ruso, J. M.; Cordeiro, M. N. D. S. First Multitarget Chemo-Bioinformatic Model to Enable the Discovery of Antibacterial Peptides against Multiple Gram-Positive Pathogens. J. Chem. Inf. Model. 2016, 56, 588−598. (19) Alonso, N.; Caamaño, O.; Romero-Duran, F. J.; Luan, F.; Cordeiro, M. N. D. S.; Yañez, M.; González-Díaz, H.; García-Mera, X. Model for High-Throughput Screening of Multitarget Drugs in Chemical Neurosciences: Synthesis, Assay, and Theoretic Study of Rasagiline Carbamates. ACS Chem. Neurosci. 2013, 4, 1393−1403. (20) Romero-Durán, F. J.; Alonso, N.; Yañez, M.; Caamaño, O.; García-Mera, X.; González-Díaz, H. Brain-Inspired Cheminformatics of Drug-Target Brain Interactome, Synthesis, and Assay of Tvp1022 Derivatives. Neuropharmacology 2016, 103, 270−278. (21) Speck-Planche, A.; Cordeiro, M. N. D. S. Simultaneous Modeling of Antimycobacterial Activities and ADMET Profiles: A Chemoinformatic Approach to Medicinal Chemistry. Curr. Top. Med. Chem. 2013, 13, 1656−1665. (22) Speck-Planche, A.; Cordeiro, M. N. D. S. De Novo Computational Design of Compounds Virtually Displaying Potent Antibacterial Activity and Desirable In Vitro ADMET Profiles. Med. Chem. Res. 2017, 26, 2345−2356. (23) Speck-Planche, A.; Cordeiro, M. N. D. S. 
Chemoinformatics for Medicinal Chemistry: In Silico Model to Enable the Discovery of Potent and Safer Anti-Cocci Agents. Future Med. Chem. 2014, 6, 2013− 2028. (24) Speck-Planche, A.; Cordeiro, M. N. D. S. Multitasking Models for Quantitative Structure-Biological Effect Relationships: Current Status and Future Perspectives to Speed up Drug Discovery. Expert Opin. Drug Discovery 2015, 10, 245−256. (25) Organization for Economic Co-Operation and Development (OECD). Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship ((Q)SAR) Models; OECD Series on Testing and Assessment 69; OECD Document ENV/JM/ MONO2007, pp 55−65. (26) Golbraikh, A.; Muratov, E.; Fourches, D.; Tropsha, A. Data Set Modelability by QSAR. J. Chem. Inf. Model. 2014, 54, 1−4. (27) Ruiz, I. L.; Gómez-Nieto, M. Á . Study of Data Set Modelability: Modelability, Rivality, and Weighted Modelability Indexes. J. Chem. Inf. Model. 2018, 58, 1798−1814. (28) Ruiz, I. L.; Gómez-Nieto, M. Á . Prediction of the Datasets Modelability for the Building of QSAR Classification Models by Means of the Centroid Based Rivality Index. J. Math. Chem. 2019, 57, 1374− 1393. (29) Fourches, D.; Muratov, E.; Tropsha, A. Trust, but Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Model. 2010, 50, 1189−1204. (30) Kramer, C.; Kalliokoski, T.; Gedeck, P.; Vulpetti, A. The Experimental Uncertainty of Heterogeneous Public Ki Data. J. Med. Chem. 2012, 55, 5165−5173. (31) Fourches, D.; Muratov, E.; Tropsha, A. Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation. J. Chem. Inf. Model. 2016, 56, 1243−1252. (32) Mansouri, K.; Grulke, C.; Richard, A.; Judson, R.; Williams, A. An Automated Curation Procedure for Addressing Chemical Errors and Inconsistencies in Public Datasets Used in QSAR Modelling. SAR QSAR Environ. Res. 2016, 27, 911−937. (33) Gadaleta, D.; Lombardo, A.; Toma, C.; Benfenati, E. 
A New Semi-Automated Workflow for Chemical Data Retrieval and Quality Checking for Modeling Applications. J. Cheminform. 2018, 10, 1−13.

(34) Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11, 137−148. (35) Ambure, P.; Aher, R. B.; Gajewicz, A.; Puzyn, T.; Roy, K. "NANOBRIDGES" Software: Open Access Tools to Perform QSAR and Nano-QSAR Modeling. Chemom. Intell. Lab. Syst. 2015, 147, 1−13. (36) Venkatasubramanian, V.; Sundaram, A. Genetic Algorithms: Introduction and Applications. In Encyclopedia of Computational Chemistry; P. von Ragué Schleyer, P.; Allinger, N. L.; Clark, T.; Gasteiger, J.; Kollman, P. A.; Schaefer, H. F.; Schreiner, P. R., Eds.; John Wiley & Sons, Ltd., 2002; Vol 2. (37) Rogers, D.; Hopfinger, A. J. Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships. J. Chem. Inf. Comput. Sci. 1994, 34, 854−866. (38) Hemmateenejad, B.; Akhond, M.; Miri, R.; Shamsipur, M. Genetic Algorithm Applied to the Selection of Factors in Principal Component-Artificial Neural Networks: Application to QSAR Study of Calcium Channel Antagonist Activity of 1, 4-Dihydropyridines (Nifedipine Analogous). J. Chem. Inf. Comput. Sci. 2003, 43, 1328− 1334. (39) Hasegawa, K.; Miyashita, Y.; Funatsu, K. GA Strategy for Variable Selection in QSAR Studies: GA-Based PLS Analysis of Calcium Channel Antagonists. J. Chem. Inf. Comput. Sci. 1997, 37, 306−310. (40) Ambure, P.; Roy, K. Understanding the Structural Requirements of Cyclic Sulfone Hydroxyethylamines as hBACE1 Inhibitors against Aβ Plaques in Alzheimer’s Disease: A Predictive QSAR Approach. RSC Adv. 2016, 6, 28171−28186. (41) Gramatica, P.; Chirico, N.; Papa, E.; Cassani, S.; Kovarich, S. QSARINS: A New Software for the Development, Analysis, and Validation of QSAR MLR Models. J. Comput. Chem. 2013, 34, 2121− 2132. (42) Gao, H. Application of BCUT Metrics and Genetic Algorithm in Binary QSAR Analysis. J. Chem. Inf. Comput. Sci. 2001, 41, 402−407. (43) Sutherland, J. J.; O’Brien, L. A.; Weaver, D. F. 
Spline-Fitting with a Genetic Algorithm: A Method for Developing Classification Structure-Activity Relationships. J. Chem. Inf. Comput. Sci. 2003, 43, 1906−1915. (44) Snedecor, G.; Cochran, W. Statistical Methods, 8th ed.; Iowa State University Press: IA, 1989; pp 254−272. (45) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5−32. (46) Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H. The Weka Data Mining Software: An Update. SIGKDD Explor. 2009, 11, 10−18. (47) Wilks, S. S. Certain Generalizations in the Analysis of Variance. Biometrika 1932, 24, 471−494. (48) Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861−874. (49) Fisher, R. A. The Design of Experiments; Oliver And Boyd, Ltd: Edinburgh, 1937. (50) Roy, K.; Kar, S.; Ambure, P. On a Simple Approach for Determining Applicability Domain of QSAR Models. Chemom. Intell. Lab. Syst. 2015, 145, 22−29. (51) Mathea, M.; Klingspohn, W.; Baumann, K. Chemoinformatic Classification Methods and Their Applicability Domain. Mol. Inf. 2016, 35, 160−180. (52) Chang, C.-C.; Lin, C.-J. Libsvm: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1−27. (53) Platt, J. C. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Advances in Kernel Methods Support Vector Learning; Schölkopf, B.; Burges, C. J. C.; Smola, A. J., Eds.; MIT Press: Cambridge, 1999; pp 185−208. (54) Gardner, M. W.; Dorling, S. Artificial Neural Networks (the Multilayer Perceptron)a Review of Applications in the Atmospheric Sciences. Atmos. Environ. 1998, 32, 2627−2636. (55) González-Díaz, H.; Arrasate, S.; Gomez-SanJuan, A.; Sotomayor, N.; Lete, E.; Besada-Porto, L.; Ruso, J. M. General Theory for Multiple Input-Output Perturbations in Complex Molecular Systems. 1. Linear QSPR Electronegativity Models in Physical, Organic, and Medicinal Chemistry. Curr. Top. Med. Chem. 2013, 13, 1713−1741. 2543


(56) Concu, R.; Kleandrova, V. V.; Speck-Planche, A.; Cordeiro, M. N. D. S. Probing the Toxicity of Nanoparticles: A Unified in Silico Machine Learning Model Based on Perturbation Theory. Nanotoxicology 2017, 11, 891−906. (57) Simon-Vidal, L.; García-Calvo, O.; Oteo, U.; Arrasate, S.; Lete, E.; Sotomayor, N.; González-Díaz, H. PTML: Perturbation-Theory and Machine Learning Model for High-Throughput Screening of Parham Reactions. Experimental and Theoretical Studies. J. Chem. Inf. Model. 2018, 58, 1384−1396.
