Enhancing acute oral toxicity predictions by using consensus

2 days ago - In addition, a retrospective study on 261 withdrawn drugs due to their toxic/side effects was carried out, in order to assess the usefuln...
3 downloads 0 Views 1MB Size
Subscriber access provided by UNIV OF SOUTHERN QUEENSLAND

Article

Enhancing acute oral toxicity predictions by using consensus modeling and algebraic form-based 0D-to-2D molecular encodes Cesar Garcia-Jacas, yovani Marrero-Ponce, Fernando Cortes-Guzman, José Suárez-Lezcano, Felix O. Martinez-Rios, Luis A. García-González, Mario Pupo-Meriño, and Karina Martinez-Mayorga Chem. Res. Toxicol., Just Accepted Manuscript • DOI: 10.1021/acs.chemrestox.9b00011 • Publication Date (Web): 08 May 2019 Downloaded from http://pubs.acs.org on May 8, 2019

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

1

Enhancing acute oral toxicity predictions

2

by using consensus modeling and algebraic

3

form-based 0D-to-2D molecular encodes

4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

César R. García-Jacas ( ),1 Yovani Marrero-Ponce,2-3 Fernando Cortés-Guzmán,4 José Suárez-Lezcano,5 Felix O. Martinez-Rios,6 Luis A. García-González,7 Mario Pupo-Meriño7 and Karina Martinez-Mayorga.4

1Departamento

de Ciencia de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Ensenada, Baja California, México. 2Universidad San Francisco de Quito (USFQ), Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas, Quito, Pichincha, Ecuador. 3Grupo de Investigación Ambiental (GIA), Programas Ambientales, Facultad de Ingenierías, Fundacion Universitaria Tecnologico Comfenalco – Cartagena, Cr44 DN 30 A, 91, Cartagena, Bolívar, Colombia. 4Instituto de Química, Universidad Nacional Autónoma de México (UNAM), Ciudad de México, México. 5Pontificia Universidad Católica del Ecuador Sede Esmeraldas (PUCESE), Esmeraldas, Ecuador. 6Facultad de Ingeniería, Universidad Panamericana (UP), Ciudad de México, México. 7Grupo de Investigación de Bioinformática, Universidad de las Ciencias Informáticas (UCI), La Habana, Cuba.

Corresponding author (

):

33

César R. García-Jacas

34

[email protected], [email protected], [email protected]

1 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

35

Page 2 of 45

Table of Contents (TOC) graphic

36

2 ACS Paragon Plus Environment

Page 3 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80

Chemical Research in Toxicology

Abstract Quantitative structure-activity relationships (QSAR) are introduced to predict acute oral toxicity (AOT), by using the QuBiLS-MAS framework for the molecular encoding. Three training sets were employed to build the models: EPA training set (5931 compounds), EPA-full training set (7413 compounds) and Zhu training set (10152 compounds). Additionally, the EPA test set (1482 compounds) was used for the validation of the QSAR models built on the EPA training set, while the ProTox (425 compounds) and T3DB (284 compounds) external sets were employed for the assessment of all the models. The k-nearest neighbor, multilayer perceptron, random forest and support vector machine procedures were employed to build several base (individual) models. The base models with 𝑅𝐸𝑃𝐴 ― 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 ≥ 0.75 and 𝑀𝐴𝐸𝐸𝑃𝐴 ― 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 ≤ 0.5 were retained to build consensus models. As a result, two consensus models based on the minimum operator and denoted as M19 and M22, as well as a consensus model based on the weighted average operator and denoted as M24, were selected as the best ones for each training set considered. According to the applicability domain (AD) analysis performed, model M19 (built on the EPA training set) has 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 ― 𝐴𝐷 = 0.4044, 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.4067 and 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2586 on the EPA test set, ProTox external set and T3DB external set, respectively; whereas model M22 (built on the EPA-full set) and model M24 (built on the Zhu set) present 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.3992 and 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2286, and 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.3773 and 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2471 on the two external sets accounted for, respectively. These outcomes were compared and statistically validated with respect to 14 QSAR methods (e.g., admetSAR, ProTox-II) from the literature. As a result, model M22 presents the best overall performance. In addition, a retrospective study on 261 withdrawn drugs due to their toxic/side effects was carried out, in order to assess the usefulness of prospectively using the QSAR models proposed in the labeling of chemicals. A comparison with regard to the methods from the literature was also made. As a result, model M22 has the best ability of labelling a compound as toxic according to the Globally Harmonized System of Classification and Labelling of Chemicals (GHS). Therefore, it can be concluded that the models proposed, especially model M22, constitute prominent tools for studying AOT, at providing the best results among all the methods examined. A freely available software was also developed in order to be used in virtual screening tasks (http://tomocomd.com/apps/ptoxra).

Keywords: quantitative structure-activity relationships (QSAR), acute oral toxicity (AOT), machine learning, consensus model, ensemble model, applicability domain, algebraic forms, QuBiLS-MAS molecular descriptors. Running Head: Prediction of acute oral toxicity using the QuBiLS-MAS approach…

3 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

81

Page 4 of 45

1. Introduction

82

Chemical toxicity is the ability of a substance to cause hazardous biological damages in

83

vulnerable sites (e.g., in organs as the liver – hepatoxicity) or cells (cytotoxicity)1 in the human

84

body, or in other living systems such as plants,2 animals3 or ecosystems.4 Acute oral toxicity (AOT)

85

constitutes one of the most important toxicological endpoints to be assessed. LD50, indicator used

86

to measure AOT, is the dose of a chemical which kills 50% of the animals treated, either

87

immediately or in short time, after administration of a single dose or multiple doses within 24

88

hours.5 In vivo assays of animal tests are required to obtain an exact measure of AOT. However,

89

these are complex, expensive and time-consuming, especially when a large number of chemicals

90

are studied. In addition, because of ethical reasons and animal rights, LD50 experiments are quite

91

controversial. Thus, QSAR (Quantitative Structure-Activity Relationship) models constitute one

92

of the alternatives to be used to predict AOT.

93

Since the last quarter of the past century, several QSAR models have been introduced to

94

estimate this endpoint. In this sense, Enslein et al.6, 7 built models based on the multiple linear

95

regression (MLR) technique using non-congeneric chemicals, but they lack of predictive power.

96

To tackle this drawback, Eldred et al.8 and Guo et al.9 developed models using congeneric

97

compounds. Not surprisingly these have limited application ranges. Posteriorly, Zhu et al.10, 11

98

proposed models based on k-NN, random forest, hierarchical clustering and other methods, as well

99

as consensus models, showing the latter better performances than the formers. Nonetheless, the

100

predictive ability achieved by the consensus models is from low to moderate (see Table 3 in ref.10).

101

In general, these results highlight not only the complexity of the biological mechanisms involved

102

in AOT, but also the diversity of modes of action and the lack of chemical space coverage. It is

103

also important to remark that LD50 measurements, as any similarly complex in vivo endpoint, are

4 ACS Paragon Plus Environment

Page 5 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

104

expected to contain rather high quantity of noise, which reflects biological diversity as well as

105

testing conditions.

106

More recently, with the purpose of obtaining better results in the AOT prediction, the QSAR

107

models for the admetSAR12, 13 and ProTox-II14, 15 web tools were developed by using training sets

108

comprised of 10207 and 38515 compounds, respectively, in order to consider a wider chemical

109

space in the learning process. Also, other modeling strategies have been applied, for instance: (1)

110

Lagunin et al.16 presented models based on the combination of QNA (Quantitative Neighborhoods

111

of Atoms) descriptors, PASS (Prediction of Activity Spectra for Substances) predictions and self-

112

consistent regression; (2) Lu et al.17 proposed local lazy learning (LLL)-based methods, which use

113

linear regression, arithmetic mean or weighted mean to determine the predictions from k nearest

114

neighbors to a chemical structure given; and (3) Xu et al.18 introduced models based on a deep

115

learning architecture using improved molecular graph encoding convolutional neural networks.

116

As it can be seen, all these models were built with classic (e.g., k-NN) and modern (e.g., deep

117

learning) learning methods. The majority of them were created using similarity-based strategies10-

118

17

119

similar activities/toxicities. In addition, as a common characteristic, these QSAR models perform

120

the molecular encoding using ‘traditional’ molecular descriptors (MDs) (e.g., DRAGON MDs)

121

and/or fingerprints (e.g., MACCS keys). In general, all these approaches may seem an indicative

122

of that all the efforts have been made to build good models to predict AOT. However, the QSAR

123

models mentioned above present poor predictive power when are used in external sets (see Section

124

3.3). In our opinion, this behavior could be due to the MDs used. Indeed, as it has been analyzed

125

elsewhere,19-21 it is more important to use the set of “right” predictors than the choice of the “right”

126

learning algorithm.

(e.g., FDA-method), under the assumption of that structurally similar compounds may have

5 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 45

127

The traditional MDs (and fingerprints) included in the QSAR models reported to date to

128

predict AOT have as a characteristic that they are mostly chemically interpretable. However, an

129

easily interpretable descriptor is not necessarily a good descriptor to predict a determined endpoint.

130

In fact, a property of the MDs is that they present the ability of highlighting molecular features

131

that are not highlighted by a chemical interpretation. Consequently, the objective is often to

132

determine the ‘best’ predictive model, while the interpretation is an add-on. The latter is stated in

133

Principle 5 (Chapter 6) of the OECD Guidance Document on the Validation of QSAR Models:22

134

“a (Q)SAR should be associated with a mechanistic interpretation, if possible”. Therefore, the use

135

of more recent definitions of MDs, whose interpretation could be more difficult, may contribute

136

to develop better QSAR models to predict AOT.

137

Moreover, according to the complexity theory through the No Free Launch Theorem,23 it can

138

be said that no single QSAR model to predict AOT achieves superior results to all the others when

139

its performance is averaged over all possible chemicals. The reasoning for this opinion is that one

140

MDs family (e.g., DRAGON 0D-to-3D MDs) may not suffice to encode all the relevant structural

141

information for distinct compounds. In other words, the significance of the MDs depends on the

142

nature of the chemical structures under analysis. Thus, following the same line of thought, it is

143

necessary to use alternative approaches to encode new and orthogonal chemical information in

144

order to build more robust QSAR models. In this sense, several MDs both topological (2D) and

145

geometrical (3D) have recently been presented,24-29 being the QuBiLS-MAS (acronym for

146

Quadratic, Bilinear and N-Linear mapS based on graph-theoretic electronic-density Matrices and

147

Atomic weightingS) 2D-MDs29, 30 one of these novel definitions.

148

The QuBiLS-MAS 2D-MDs are the current definition of the former ToMoCoMD-CARDD

149

2D-MDs,31-36 at presenting novel generalizations and extensions.29 These MDs have been applied

6 ACS Paragon Plus Environment

Page 7 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

150

in a wide variety of applications.37-45 Indeed, the significant predictive power demonstrated by this

151

ligand-based method suggests that these 2D-MDs encode relevant chemical information of the

152

molecules. Therefore, all in all, this manuscript introduces a novel topological approach based on

153

the QuBiLS-MAS 2D-MDs to predict AOT. This manuscript is organized as follows. Section 2

154

describes the chemical datasets and the modeling methodology used. Section 3 analyzes the results

155

obtained and compares them with respect to several tools established in the literature. A statistical

156

assessment of the predictions obtained is also presented, as well as the outcomes achieved in a

157

retrospective virtual screening. Finally, Section 4 establishes the findings, conclusions and future

158

outlooks.

159 160

2. Material and Methods

161

2.1. Chemical datasets

162

The chemical dataset contained in the Toxicity Estimation Software Tool (T.E.S.T), provided

163

by the U.S. Environmental Protection Agency (EPA), was used to build the QSAR models.11 This

164

dataset is comprised of 5931 training compounds (hereafter denoted as EPA training set) and 1482

165

test compounds (hereafter denoted as EPA test set) annotated with oral rat LD50 values. Two other

166

training sets comprised of 7413 and 10207 compounds, respectively, were also accounted for. The

167

first one, denoted as EPA-full training set, was created by joining the EPA training/test sets

168

aforementioned; whereas the second one, widely used in the literature12-18 and denoted as Zhu

169

training set, was collected from Zhu et al.10 With these three training sets, comparability of results

170

among the QSAR models stated in the literature to those to be developed is guaranteed. It is

171

important to highlight that the EPA training set ⊆ EPA-full training set ⊆ Zhu training set (⊆

172

means subset).

7 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 45

173

Moreover, two sets comprised of 500 and 428 chemical structures annotated with oral rat

174

LD50 values were downloaded from the ProTox-II website15 and Toxin and Toxin Target Database

175

(T3DB – also well-known as Toxic Exposome Database),46 respectively, in order to be considered

176

as external validation sets. Additionally, a dataset composed of 292 withdrawn drugs because of

177

their toxic/side effects was manually obtained from the ‘Toxicities’ section in the WITHDRAWN

178

database.47 In this database, it is considered as a withdrawn drug the one recalled from market in

179

at least one country. The WITHDRAWN dataset was used to perform a simulation of prospective

180

virtual screening with the best QSAR models built. In this way, according to the AOT values to be

181

predicted, the relevance of employing the novel predictive models in the labelling of compounds

182

was analyzed (see Sections 2.6 and 3.5).

183

All these datasets were obtained as Simplified Molecular Input Line Entries (SMILES) or

184

Structure Data Format (SDF), and then converted to 2D-representations without H-depleted by

185

using the OpenBabel software.48 Filters to remove inorganic and organometallic compounds, salts,

186

compound mixtures and repeated compounds with respect to the Zhu training set were applied.

187

After that, the EPA training/test sets and EPA-full training set kept the same number of structures;

188

whereas the Zhu training set, ProTox external set, T3DB external set and WITHDRAWN set

189

contain 10152, 425, 284 and 267 compounds, respectively. The LD50 values for the chemicals

190

considered were expressed as mol/kg, and then these were stated as log[1/(mol/kg)] according to

191

standard QSAR practices. Supplementary Information 1 shows the SDF files of the chemical sets

192

explained above.

193

2.2. QuBiLS-MAS topological molecular descriptors

194

The atom- and bond-based QuBiLS-MAS topological molecular descriptors (2D-MDs) were

195

recently presented as a novel framework to characterize compounds.29, 30 This framework is the

8 ACS Paragon Plus Environment

Page 9 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

196

current definition of the former ToMoCoMD-CARDD 2D-MDs,31-36 at including novel matrix

197

approaches (e.g., double-stochastic), novel molecular cutoffs to take into account particular inter-

198

atomic relations (see Eq. 7 in ref.29), novel local-fragments to consider chemical regions of interest,

199

diversity of operators to generalize the calculation from LOVIs (Local Vertex Invariants) or LOEIs

200

(Local Edge Invariants), among other features.29 These 2D-MDs have been used in several

201

applications, such as: antimalarials,37 trichomonacidals,38-40 paramphistomicides,41 tyrosinase

202

inhibitors42, 43 and others.44, 45 The QuBiLS-MAS 2D-MDs calculation is based on the bilinear,

203

quadratic and linear algebraic forms. Thus, the original approach to compute the atom- [𝑎𝑏𝑘(𝑥,𝑦)]

204

and bond-based [𝑒𝑏𝑘(𝑥,𝑦)] bilinear QuBiLS-MAS MDs is a bilinear mapping as shown below: 𝑛

205

𝑎𝑏

𝑘

(𝑥,𝑦) =

𝑛

∑∑𝑚 𝑥 𝑦 = [𝑋] ℳ [𝑌] 𝑘 𝑖 𝑗 𝑖𝑗

𝑇

𝑘

(𝟏)

𝑖 = 1𝑗 = 1 𝑚

206

𝑒𝑏

𝑘

(𝑥,𝑦) =

𝑚

∑∑𝑒 𝑥 𝑦 = [𝑋] ℰ [𝑌] 𝑘 𝑖 𝑗 𝑖𝑗

𝑇 𝑘

(𝟐)

𝑖 = 1𝑗 = 1

207

where n (or m) is the total number of atoms (or bonds); k  1,  2,  ,  15 is the power of the

208

matrix approaches; 𝑚𝑘𝑖𝑗 (or 𝑒𝑘𝑖𝑗) denotes the coefficients of the k-th atom-based electronic-density

209

matrix ℳ𝑘 – see ref.33 (or bond-based matrix ℰ𝑘 – see ref.35); and xi and yi are the components of

210

the x and y atom-based (or bond-based) property vectors, respectively. Note that if the x and y

211

vectors encode the same property x  y , then Equations 1 and 2 define atom-

212

bond-based [𝑒𝑞𝑘(𝑥,𝑥)] quadratic QuBiLS-MAS 2D-MDs (quadratic mapping), respectively; and

213

if all the entries of the x vector are equal to 1 and the y vector encodes a property, then Equations

214

1 and 2 define atom- [𝑎𝑓𝑘(𝑥,𝑥)] and bond-based [𝑒𝑓𝑘(𝑥,𝑥)] linear QuBiLS-MAS 2D-MDs (linear

215

mapping), respectively.





[𝑎𝑞𝑘(𝑥,𝑥)] and

9 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 45

216

The main contribution of the QuBiLS-MAS framework (formerly ToMoCoMD-CARDD) is

217

the generalization of the vector-matrix-vector procedure described in Equations 1 and 2. The ℳ𝑘

218

and ℰ𝑘 matrices are divided to atom-level (ℳ𝑎,𝑘) and bond-level (ℰ𝔢,𝑘), respectively, by using the

219

procedure to determine local-fragment matrices, but considering as fragment each atom “a” (see

220

Equation 6 in ref.31) or bond “𝔢” (see Equation 7 in ref.36) of the molecule. In this way, bilinear

221

(quadratic, linear) QuBiLS-MAS 2D-MDs for each atom [LOVI –

222

𝑏(𝑞,𝑓)ℒ𝑒]

223

tendency statistics) can be used on the LOVIs (or LOEIs) calculated to obtain diversity of global

224

molecular characterizations, just as confirmed elsewhere.49, 50 This generalization is unique in the

225

2D-MDs computation to date. Remark that if the summation operator is applied on the LOVIs (or

226

LOEIs) calculated, then the results exactly coincide with the original approach (Equations 1-2).

227

Schema 1 depicts a flowchart regarding the QuBiLS-MAS 2D-MDs calculation. 𝑏(𝑞,𝑓)ℒ𝑎

or bond [LOEI –

can be computed (see Equations 3-4). So, different aggregation operators (e.g., central

𝑛

228

𝑏(𝑞,𝑓)ℒ𝑎]

𝑎,𝑘

= 𝑏(𝑞,𝑓) (𝑥,𝑦) =

𝑛

∑ ∑𝑚

𝑎,𝑘 𝑖 𝑗 𝑖𝑗 𝑥 𝑦

= [𝑋]𝑇ℳ𝑎,𝑘[𝑌]

∀ 𝑎 = 1, 2, …, 𝑛

(𝟑)

𝑖 = 1𝑗 = 1 𝑚

229

𝑏(𝑞,𝑓)ℒ𝑒

= 𝑏(𝑞,𝑓)𝔢,𝑘(𝑥,𝑦) =

𝑚

∑ ∑𝑒

𝔢,𝑘 𝑖 𝑗 𝑖𝑗 𝑥 𝑦

= [𝑋]𝑇ℰ𝔢,𝑘[𝑌]

∀ 𝔢 = 1, 2, …, 𝑚

(𝟒)

𝑖 = 1𝑗 = 1

230 231

Schema 1 comes about here 2.3. Feature selection and modeling methodology

232

Firstly, 335 configurations of the bilinear, quadratic and linear QuBiLS-MAS 2D-MDs were

233

computed on the EPA training set. These configurations were created according to the outcomes

234

obtained in several cheminformatics studies (see Section 4 in ref.29), and they are available in the

235

software of the same name of the descriptors.30 Since numerous MDs are calculated with this

236

program, then the following workflow for data reduction was carried out on each descriptors

10 ACS Paragon Plus Environment

Page 11 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

237

matrix obtained: (1) the MDs with NaN (Not-a-Number) values and/or with values represented as

238

power of 10 were dropped; and (2) a supervised feature selection with regard to the LD50 values

239

corresponding to the EPA training compounds was performed using an in-house program. This

240

program retains the best 2D-MDs according to the Correlation Subset51 and Relief-F52 methods

241

available in the Weka software (v3.8).53 This supervised selection was repeated until obtaining an

242

only matrix of until 5000 QuBiLS-MAS 2D-MDs.

243

Once the previous workflow concluded, the two selection methods aforementioned and a

244

Shannon Entropy (SE)-based unsupervised method54 were applied on the MDs matrix obtained.

245

Thus, three subsets with the most suitable MDs were obtained, so that: (1) with the Correlation

246

Subset method51 a subset of features highly correlated with the activity and lowly correlated among

247

them was retained; (2) with the Relief-F method52 the features that best distinguish among

248

instances that are near each other were retained; and (3) with the SE-based method54 the features

249

that best yield distinct descriptions for structurally different molecules were retained. The SE-

250

based selection was performed with the IMMAN software,55 following a discretization scheme

251

equal to 5931 bins (size of the EPA training set). These three subsets were also shuffled among

252

themselves (ensemble feature selection56), obtaining four additional subsets. A wrapper-type

253

selection57 was also applied on all the previous subsets, by using the Best-First method as search

254

strategy, the Root Mean Square Error (RMSE) as evaluation measure, and the k-Nearest Neighbor

255

(k-NN),58 Multilayer Perceptron (MLP),59 Random Forest (RF)60 and Support Vector Machine

256

(SVM)61 learning methods.

257

Subsequently, several individual regression models (QSAR) were built using the learning

258

methods aforementioned, which are available in the Weka software (v3.8).53 In detail, the k-NN

259

method was configured with k-values (number of neighbors) between 3 and 6 (determined by cross

11 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 45

260

validation), and the ratio 1/distance was used as distance weighting scheme. Moreover, the Puk

261

function62 was used as kernel in the SVM. The remaining learning methods were utilized with the

262

respective default configurations. These configurations were also employed in the wrapper-type

263

selection explained above. It is important to highlight that on the subsets of MDs obtained from

264

the wrapper-type selection, only the learning method used in the wrapper was employed to build

265

the model. The performance of the models was measured using the Correlation Coefficient (𝑅),

266

Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) statistical parameters. The

267

training assessment was performed through 10-fold cross-validation. The base (individual) models

268

with 𝑅𝐸𝑃𝐴 ― 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 ≥ 0.75 and 𝑀𝐴𝐸𝐸𝑃𝐴 ― 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 ≤ 0.5 were retained to build ensemble and

269

consensus models.

270

The ensemble models were built using the Bagging,63 Additive Regression64 and Stacking65

271

procedures; while the consensus models were developed using the Maximum (Max), Minimum

272

(Min), Average (AVE) and Weighted Average (WA) aggregation functions. The weightings for

273

the WA function were calculated according to Equation 9 described in ref.49, with   1 . The

274

ensemble models with a training performance better than the one achieved by the best individual

275

model, as well as all the consensus models built, were assessed on the EPA test set. In this way,

276

the best models based on the EPA training set were selected according to their training and test

277

results. Finally, the same QuBiLS-MAS 2D-MDs and learning method used in the best models

278

built on the EPA training set were employed to build the models on the EPA-full and Zhu training

279

sets, respectively. All these models were assessed on the external sets considered to measure their

280

predictive abilities. Lastly, a comparison with respect to several QSAR methodologies established

281

in the literature (e.g., admetSAR,12 ProTox14) was performed. Schema 2 depicts the workflow

282

described above.

12 ACS Paragon Plus Environment

Page 13 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Schema 2 comes about here

283 284

2.4. Applicability domain

285

An applicability domain (AD) analysis was carried out to know the behavior of the models

286

to perform reliable predictions.66-68 This analysis is often performed using only one method to

287

determine the structures that fall within the AD of a model (e.g., Liew et al.69 only used an approach

288

based on range, and Aranda et al.70 only used the leverage method). However, there are several

289

AD methods that characterize of distinct way the interpolation space stated by the MDs used.67 So,

290

inspired on the idea of consensus-based decision, five approaches (i.e., City-block, Euclidean,

291

Mahalanobis, Range and Density – see refs.71, 72 for a detailed description) were considered in a

292

recent study to determine the reliability of the predictions.73 More specifically, if the prediction for

293

a molecule lies outside of the bounds in at least two AD methods, then it is considered as unreliable.

294

This approach will be the one used in this work. These AD methods are available in the Ambit

295

Discovery software74 and they were used with their respective default configurations.

296

2.5. Statistical analysis of the external predictive ability

297

A statistical analysis of the predictive ability achieved on the ProTox and T3DB external sets

298

was carried out, with the purpose of knowing if the results obtained by the best QSAR models

299

proposed are significantly different to the ones achieved by the QSAR methods from the literature

300

(e.g., admetSAR,12 ProTox,14 DL-AOT18). To this end, the predictions within the AD of the best

301

models built and QSAR methods from the literature were accounted for. Then, the absolute error

302

module was computed to be used in this study. An analysis to examine the data normality was

303

made, by using the Kolmogorov–Smirnov test corrected by Lilliefors, and the Shapiro-Wilk test.75

304

If no normality was indicated  pvalue  0.05 , then non-parametric tests were applied. Thus, the

305

Friedman test76 was used to determine global differences, and if these exist, then the post-hoc

13 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 45

306

Holm method77 was used for multiple comparisons. In this report, a significance level   0.05

307

was considered.

308

2.6. Retrospective virtual screening in a withdrawn drug dataset

309

With this retrospective screening, a study to know the ability of the best models proposed to

310

generate toxicological warnings was made. In this way, the usefulness of prospectively using the

311

QSAR models proposed in the early prediction of toxicity or in the labeling of chemicals was

312

assessed. An analysis with regard to the QSAR methods most commonly employed in the literature

313

for predicting AOT (i.e., TEST Consensus,10 admetSAR,12 ProTox14 and DL-AOT18) was also

314

performed. To this end, the WITHDRAWN dataset, composed of 267 drugs recalled from the

315

market in at least one country due to their toxic/side effects, was used (see Section 2.1). The

316

predictions obtained on the WITHDRAWN set were converted to mg/kg. Then, these predictions

317

were labeled (classified) according to the Globally Harmonized System of Classification and

318

Labelling of Chemicals (GHS):78 Class I (fatal if swallowed – LD50 ≤ 5), Class II (fatal if

319

swallowed – 5 < LD50 ≤ 50), Class III (toxic if swallowed – 50 < LD50 ≤ 300), Class IV (harmful

320

if swallowed – 300 < LD50 ≤ 2000), Class V (may be harmful if swallowed – 2000 < LD50 ≤ 5000)

321

and Class VI (non-toxic – LD50 > 5000). A labelling between Class I and Class V is considered

322

as toxicological warning.

323 324

3. Results and discussion

325

3.1. Performance of the best individual, ensemble and consensus models

326

From the data matrices built, 84 individual (or base) models (42 k-NN, 14 MLP, 14 RF and

327

14 SVM) were developed and examined. Only 12 of these regression models achieved the cutoff

328

of 𝑅𝐸𝑃𝐴 ― 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 ≥ 0.75 and 𝑀𝐴𝐸𝐸𝑃𝐴 ― 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 ≤ 0.5 in a 10-fold cross validation. Therefore, they

14 ACS Paragon Plus Environment

Page 15 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

329

were included in the pool of base models for the ensemble and consensus models building. Table

330

1 shows the performance of these models on the EPA training/test sets. As it can be observed, the

331

number of QuBiLS-MAS 2D-MDs included in the models range from 37 to 228, for a ratio of

332

cases-to-independent variable of 160:1 and 26:1 in the smallest and largest models, respectively.

333

It can also be observed that all the models present a good behavior on the test set, at presenting

334

values of 0.7671 ≤ 𝑅𝑡𝑒𝑠𝑡 ≤ 0.7991 and 0.4329 ≤ 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 ≤ 0.4620. The best performances

335

belong to models M02 (𝑅𝑡𝑒𝑠𝑡 = 0.7991, 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 = 0.4329), M01 (𝑅𝑡𝑒𝑠𝑡 = 0.7952, 𝑀𝐴𝐸𝑡𝑒𝑠𝑡

336

= 0.4355) and M03 (𝑅𝑡𝑒𝑠𝑡 = 0.7942 and 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 = 0.4416), at being the only ones with 𝑅𝑡𝑒𝑠𝑡

337

≥ 0.79. Lastly, it can also be observed that the RF and SVM learning methods allow to achieve

338

the best correlations, except in models M11 and M12 that are based on the k-NN (𝑘 = 6) method.

339

Table 1 comes about here

340

Posteriorly, from the best individual (or base) models (see Table 1), ensemble models were

341

built using the Bagging,63 Additive Regression64 and Stacking65 procedures, as well as consensus

342

models based on the Max, Min, AVE and WA functions. Table 2 shows the statistics of the best

343

ensemble and consensus models built for the EPA training/test sets. As it can be observed, the

344

ensemble models present 0.7801 ≤ 𝑅𝐸𝑃𝐴 ― 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔 ≤ 0.7885 and 0.4224 ≤ 𝑀𝐴𝐸𝐸𝑃𝐴 ― 𝑡𝑟𝑎𝑖𝑛𝑖𝑛𝑔

345

≤ 0.4312, being both statistics better than the ones obtained by the best base model (see Table 1).

346

It indicates that these models may have a better behavior on the test set. In this sense, it can be

347

noted that all the ensemble and consensus models achieve 0.8024 ≤ 𝑅𝑡𝑒𝑠𝑡 ≤ 0.8103 and 0.4106

348

≤ 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 ≤ 0.4267. In detail, consensus models M17 (𝑅𝑡𝑒𝑠𝑡 = 0.8103 and 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 = 0.4136),

349

M18 (𝑅𝑡𝑒𝑠𝑡 = 0.8102 and 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 = 0.4138) and M19 (𝑅𝑡𝑒𝑠𝑡 = 0.8095 and 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 = 0.4106)

350

are those with the best performance. Model M19 presents the lowest 𝑀𝐴𝐸𝑡𝑒𝑠𝑡.

351

Table 2 comes about here 15 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 45

352

Moreover, as it was explained above, regression models to predict AOT were also built on

353

both the EPA-full (7413 structures) and Zhu (10152 structures) training sets, in order to consider

354

a wider chemical space in the learning process and, in addition, to guarantee comparability of

355

results with regard to several models reported in the literature.12-17 Specifically, the QSAR models

356

developed are based on the same QuBiLS-MAS 2D-MDs and learning method used to build

357

models M13 and M15, which make up the consensus models from M17 to M19, that are the ones

358

with the best performances so far. Therefore, only consensus models based on the AVE, WA and

359

Min functions will be analyzed for the two training sets mentioned above. It is expected that the

360

new consensus models yield better predictions on external sets than the consensus models created

361

on the EPA training set (5931 structures).

362

In this sense, Table 3 shows the results obtained by the consensus models on the ProTox and

363

T3DB external sets. On one hand, if the performance of the models built on the EPA training set

364

(from M17 to M19), EPA-full training set (from M20 to M22) and Zhu training set (from M23 to

365

M25) is analyzed, it can be seen that, for the first two training sets, the best result is achieved by

366

the Min function-based consensus models M19 (𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 = 0.4073, 𝑀𝐴𝐸𝑇3𝐷𝐵 = 0.2968) and

367

M22 (𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 = 0.3971, 𝑀𝐴𝐸𝑇3𝐷𝐵 = 0.2796), respectively; whereas for the latter training set

368

mentioned, the AVE function-based model M23 (𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 = 0.3807, 𝑀𝐴𝐸𝑇3𝐷𝐵 = 0.3016) and

369

the WA function-based model M24 (𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 = 0.3805, 𝑀𝐴𝐸𝑇3𝐷𝐵 = 0.3049) are the best ones.

370

On the other hand, it can be observed that the performance of the models on the ProTox set

371

improves as long as the number of training compounds increases. A similar behavior can be noted

372

in the T3DB external set. However, in this external set, the models built on the Zhu set present a

373

slight decrease in its performance as compared to the ones built on the EPA-full training set. See

374

Supplementary Information 2 for the values predicted.

16 ACS Paragon Plus Environment

Page 17 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

375 376

Chemical Research in Toxicology

Table 3 comes about here 3.2. Applicability domain of the best models built

377

An applicability domain (AD) analysis (see Section 2.4) for each consensus model built was

378

done, with the purpose of determining how trustworthy are the predictions performed by them.

379

Table 4 shows the outcomes obtained in this study for the EPA test set, ProTox external set and

380

T3DB external set. If the coverage statistical parameter (i.e., ratio of compounds that is expected

381

to fall within the AD) is examined, it can be observed that for the EPA test set, the models present

382

a coverage equal to 98.25%; while for the ProTox and T3DB external sets, the models present a

383

coverage between 93.58% and 96.89% and between 93.66% and 95.42% of the predictions

384

performed, respectively. Therefore, in a general sense, it can be concluded that all the consensus

385

models built perform highly reliable predictions according to the corresponding AD. Nonetheless,

386

it is important to remark that the models built on the EPA-full training set (from M20 to M22) are

387

those with the best coverage, followed by the models built on the EPA training set (from M17 to

388

M19) and Zhu training set (from M23 to M25), respectively.

389

Table 4 comes about here

390

Moreover, if the statistics of the predictions performed are analyzed (see Table 4), it can be

391

observed that all the consensus models present comparable-to-superior results with respect to the

392

ones obtained when the AD is not accounted for (see Tables 2-3). Note that the best results were

393

achieved on the T3DB external set, where the models present 0.2286 ≤ 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 ≤ 0.2635

394

(without AD: 0.2796 ≤ 𝑀𝐴𝐸𝑇3𝐷𝐵 ≤ 0.3057) and 0.3859 ≤ 𝑅𝑀𝑆𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 ≤ 0.4405 (without

395

AD: 0.5091 ≤ 𝑅𝑀𝑆𝐸𝑇3𝐷𝐵 ≤ 0.5272). Lastly, it can be noted that models M19 (𝑀𝐴𝐸𝑡𝑒𝑠𝑡 ― 𝐴𝐷

396

= 0.4044, 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.4067, 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2586), M22 (𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.3992

397

, 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2286), M23 (𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.3774, 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2434) and M24

17 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 45

398

(𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.3773, 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2471), just as in the previous analysis without AD,

399

are the ones with the best performances.

400

3.3. Comparison with respect to other QSAR models reported in the literature

401

So far, a description of the best models built, and an analysis of the outcomes achieved by

402

them have been carried out. However, it is important to perform a comparison with respect to other

403

procedures established in the literature. To this end, from the results previously described, it can

404

be stated that consensus models M19, M22 and M24 (see Supplementary Information 3 to choose

405

between models M23 and M24) are the best ones for each training set accounted for. Thus, they

406

will be used in the comparison. Several freely accessible or commercial QSAR/similarity-based

407

tools are also used for this study. These tools include the T.E.S.T software (version 4.2.1),10, 11

408

GUSAR software,16 admetSAR web server (v1.0),12, 13 DL-AOT web server18 and ProTox-II web

409

server.14, 15 Hereafter, the combinatorial use of the QNA (Quantitative Neighborhoods of Atoms)

410

and MNA (Multilevel Neighborhoods of Atoms) MDs is referred to as GUSAR consensus method

411

(see Supplementary Information 4 for the training results with this approach).

412

Study Case 1: Performance on the EPA test set

413

Figure 1 shows a comparative graphic of the performance (MAE) achieved by seven QSAR

414

methods reported in the literature with regard to the performance achieved by consensus model

415

M19 on the EPA test set (1482 compounds). All these methods were built on the EPA training set

416

(5931 compounds). Therefore, comparability of results is guaranteed. In detail, the methods from

417

the literature to be compared are the GUSAR consensus method considering nearest neighbors

418

(GUSAR wNN),16 the GUSAR consensus method without considering nearest neighbors (GUSAR

419

nNN),16 the GUSAR method using only QNA MDs,16 the GUSAR method using only MNA

420

MDs,16 the T.E.S.T FDA method (TEST FDA),10 the T.E.S.T Nearest Neighbors method (TEST

18 ACS Paragon Plus Environment

Page 19 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

421

NN)10 and the T.E.S.T Consensus method (TEST Consensus).10 The comparisons were performed

422

according to the AD of the GUSAR wNN method, the TEST Consensus method and model M19.

423

T.E.S.T Hierarchical Clustering method10 was not included in this comparison due to its poor

424

performance. Figure 1 comes about here

425 426

As it can be observed in Figure 1, consensus model M19, built with the QuBiLS-MAS 2D-

427

MDs, presents the best performance irrespectively of the AD considered, being the only model in

428

always achieving 𝑀𝐴𝐸𝑡𝑒𝑠𝑡 < 0.41. It can also be noted that the AD of model M19 is the second

429

best (𝑐𝑜𝑣𝑒𝑟𝑎𝑔𝑒 = 98.25%), which is only slightly overcome by the AD of the TEST Consensus

430

method (𝑐𝑜𝑣𝑒𝑟𝑎𝑔𝑒 = 98.38%). Moreover, in a general sense, the GUSAR wNN method achieves

431

the second-best results, only surpassed by the TEST Consensus method if the AD of this latter is

432

analyzed. Finally, it can be observed that the GUSAR nNN and TEST NN methods present similar

433

performances, and that the TEST FDA, GUSAR MNA and GUSAR QNA methods are the ones

434

with the worst outcomes. Supplementary Information 5 shows the statistical parameters obtained

435

according to all the ADs.

436

Study Case 2: Performance on the ProTox external set

437

Figure 2 depicts the performance achieved by the GUSAR wNN,16 GUSAR nNN,16 DL-

438

AOT18 and admetSAR12 methods, as well as the one obtained by consensus models M19, M22 and

439

M24, according to their ADs on the ProTox external set (425 compounds) (for more details see

440

Supplementary Information 6). The T.E.S.T software10 fails during the prediction (error in the

441

MDs computation) on the ProTox set and, thus, none of its methods were taken into account in the

442

current comparison. ProTox method14, 15 was not considered in this study because the external set

443

used presents a similar distribution to the training set (38515 compounds) of this method. Note

19 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 45

444

that, excepting consensus models M19 and M22, all the remaining QSAR methods were built on

445

the Zhu training set (10152 compounds). Figure 2 comes about here

446 447

As a result, consensus model M24 is the one with the best performance for all the ADs

448

analyzed. This model is the only one in always achieving 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 < 0.38, whereas the other

449

QSAR methods yield 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 > 0.39. It can also be observed that consensus models M19

450

(0.3965 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.4054) and M22 (0.3907 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.3967), as well as the

451

admetSAR

452

(0.3915 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.4029) methods, present comparable performances, even when models

453

M19 and M22 were built on smaller training sets. Note that the GUSAR nNN

454

(0.4164 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.4269) and DL-AOT (0.4602 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.4654) methods are

455

those with the poorest outcomes. DL-AOT method is based on a convolutional neural network,

456

which is a more complex procedure than the one used by the models proposed here; even then,

457

models M24, M22 and M19 achieve better results for the external set under study.

458

Study Case 3: Performance on the T3DB external set

(0.3905 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.3999)

and

GUSAR

wNN

459

Finally, Figure 3 represents the performance obtained by the GUSAR wNN,16 GUSAR

460

nNN,16 TEST Consensus,10 DL-AOT,18 admetSAR12 and ProTox14, 15 methods, as well as the one

461

achieved by the best QSAR models proposed in this report (i.e., consensus models M19, M22 and

462

M24), according to their ADs on the T3DB external set (284 compounds). Other QSAR models

463

were also analyzed (e.g., TEST NN, GUSAR MNA), but as they yielded poor outcomes, then these

464

were not graphically represented. Supplementary Information 7 contains the results corresponding

465

to the 14 methods examined. Remark that the ProTox method was built on 38515 training

466

compounds, that the T.E.S.T software-based methods and model M19 were created on the EPA 20 ACS Paragon Plus Environment

Page 21 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

467

training set (5931 compounds), that model M22 was developed on the EPA-full training set (7413

468

compounds), and that the other QSAR methods were built on the Zhu training set (10152

469

compounds). Figure 3 comes about here

470

As it can be noted, consensus model M22 (0.2165 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.2634), followed by

471

DL-AOT

method

(0.2374 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.2752) and consensus models M19

472

the

473

(0.2368 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.2798) and M24 (0.2451 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.2870), are the ones with

474

the best outcomes for all the ADs analyzed. Specifically, consensus model M22 is the only one in

475

obtaining 𝑀𝐴𝐸𝑇3𝐷𝐵 < 0.23 (𝑀𝐴𝐸𝑇3𝐷𝐵 = 0.2165 and 𝑀𝐴𝐸𝑇3𝐷𝐵 = 0.2226 within the AD of the

476

TEST Consensus method and model M24, respectively); whereas the remaining procedures

477

achieve results superior to the threshold aforementioned. Moreover, it can also be seen that the

478

GUSAR

479

(0.3080 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.3544) methods have comparable performances. Curiously, the

480

ProTox, admetSAR and TEST Consensus methods, that are those most commonly used in the

481

literature, obtain the poorest behaviors, at presenting 𝑀𝐴𝐸𝑇3𝐷𝐵 > 0.6.

nNN

(0.2754 ≤ 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ≤ 0.3113)

and

GUSAR

wNN

482

All in all, it can be concluded that the consensus models (M19, M22 and M24) proposed

483

constitute prominent alternatives to be used in the study of AOT, due to the fact that they achieve

484

results superior to the ones obtained by all the QSAR methods analyzed from the literature. In this

485

way, the hypothesis assumed in the present work is confirmed, which pose that the use of current

486

definitions of MDs may contribute to develop better QSAR models to predict AOT. It is important

487

to remark that the models built always presented the best performance regardless of the chemical

488

dataset used to assess them, to difference of the QSAR methods from the literature, which showed

489

different performances in the sets considered. For instance: DL-AOT method was the worst one in 21 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 45

490

the ProTox external set, whereas it was second-best one in the T3DB external set. A statistical

491

assessment to confirm the results obtained is presented below.

492

3.4. Statistical analysis of the external predictive ability achieved by the best models built

493

This analysis was made to determine if the outcomes obtained by the best consensus models

494

built (i.e., M19, M22 and M24) are statistically different to the ones achieved by the QSAR

495

methods from the literature. To this end (see Section 2.5), the absolute error module was computed

496

for the 360 (out of the 425) and 250 (out of the 284) predictions within the AD of all the QSAR

497

methods (models) in the ProTox and T3DB external sets, respectively (see Supplementary

498

Information 8). For the analysis in the ProTox set, the admetSAR12 and DL-AOT18 methods, as

499

well as the GUSAR software-based methods16 (i.e., GUSAR wNN, GUSAR nNN, GUSAR MNA,

500

GUSAR QNA) were considered; whereas for the T3DB set, in addition of the QSAR methods

501

aforementioned, the ProTox14, 15 method and the T.E.S.T software-based methods10, 11 (i.e., TEST

502

HC, TEST FDA, TEST NN and TEST Consensus) were also accounted for.

503

According to the normality test made, the null hypothesis was rejected for all the methods

504

under study in both chemical datasets (Supplementary Information 9). Therefore, non-parametric

505

tests were applied. In this sense, global differences were determined according to the results

506

obtained from the Friedman test (Supplementary Information 10). According to this test, model

507

M24 has the rank of 1 (best performing model) in the ProTox set, followed by models GUSAR

508

wNN, M19, M22, admetSAR, DL-AOT, GUSAR nNN, GUSAR MNA and GUSAR QNA, in that

509

order; whereas in the T3DB set, model M22 is the one with the best performance, followed by

510

models M19, DL-AOT, M24, GUSAR MNA, GUSAR nNN, TEST FDA, GUSAR wNN, GUSAR

511

QNA, TEST NN, TEST Consensus, ProTox, admetSAR and TEST HC, in that order. These

512

outcomes confirm the descriptive assessment performed in the previous section, where was

22 ACS Paragon Plus Environment

Page 23 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

513

indicated that the QSAR models based on the QuBiLS-MAS 2D-MDs obtain the best overall

514

performances.

515

Finally, it can be observed that, according to the results obtained from the Holm post-hoc

516

procedure (Supplementary Information 10), model M24 is statistically superior to the DL-AOT,

517

GUSAR nNN, GUSAR MNA and GUSAR QNA methods in the ProTox set. In this dataset, model

518

M24 is not statistically better than the GUSAR wNN and admetSAR methods, neither with respect

519

to models M19 and M22. Moreover, in the T3DB set, model M22 presents a statistically superior

520

performance with regard to all the QSAR methods from the literature (Supplementary Information

521

10), excepting the DL-AOT method, as well as models M19 and M24. Therefore, it can be

522

statistically stated that the QuBiLS-MAS 2D-MDs allowed to build the best QSAR models to

523

predict AOT to date. Lastly, it can also be statistically concluded that model M22 is the best of all

524

by a short difference, followed by models M24 and M19 (Supplementary Information 11),

525

respectively.

526

3.5. Results of the retrospective virtual screening in a withdrawn drug dataset

527

This study was carried out to assess the usefulness of prospectively using the best QSAR

528

models created in the labeling of chemicals. To this end, the WIHTDRAWN dataset was used

529

because it contains drugs recalled from the market in at least one country, due to their toxic/side

530

effects. Figure 4A shows the labeling yielded by models M19, M22 and M24 according to the

531

GHS standard on the set aforementioned (see Supplementary Information 12). On one hand, it is

532

expected that the majority of these withdrawn drugs is not labeled as highly toxic (Class I-II – fatal

533

if swallowed), due to the fact that they were marketable drugs and, thus, they passed several pre-

534

clinical toxicological assays. Thereby, it can be analyzed that the models considered only label

535

four drugs as highly toxics: indomethacin (gastrointestinal), amphetamine (cardiovascular),

23 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 45

536

aprobarbital (respiratory) and bithionol (dermatological). This labeling is in correspondence with

537

the toxic/side effects evidenced by these drugs.

538

Figure 4 comes about here

539

On the other hand, it is also expected that only a few of these drugs be in a Class VI (non-

540

toxic) labeling, due to the fact that, as it was aforementioned, they present toxic/side effects that

541

caused them to be recalled from the market. In this sense, it can be noted that only 7 (0.026%), 9

542

(0.033%) and 17 (0.063%) drugs were labeled in this class (e.g., gentamicin) by models M19, M22

543

and M24, respectively. Lastly, it can be observed that the majority of these compounds were

544

labeled between Class III (e.g., oxyphenisatin) and Class V (e.g., practolol), mainly Class IV (e.g.,

545

alosetron). That is, the majority of these withdrawn drugs were identified as toxic or harmful if

546

swallowed. In general, models M19, M22 and M24 yielded toxicological warnings between Class

547

I and Class IV for 226 (84.64%), 221 (82.77%) and 201 (75.28%) drugs, and between Class I and

548

Class V for 260 (97.38%), 258 (96.63%) and 250 (93.63%) drugs, respectively.

549

Moreover, Figure 4B shows the labeling obtained by model M22 and the TEST Consensus,10

550

DL-AOT,18 admetSAR12 and ProTox14 methods. Model M22 was selected because it presents the

551

best external predictive power (see Section 3.4). As it can be noted, similar results were obtained

552

by the methods from the literature, at labelling the majority of the withdrawn drugs between Class

553

III and Class V. In summary, model M22, TEST Consensus method, DL-AOT method, admetSAR

554

method and ProTox method generate toxicological warnings between Class I and Class IV for 221

555

(82.77%), 207 (77.53%), 197 (73.78%), 192 (71.91%) and 200 (74.91%) drugs, and between Class

556

I and Class V for 258 (96.63%), 250 (93.63%), 253 (94.76%), 248 (92.88) and 250 (93.63%) drugs,

557

respectively. As it can be analyzed, for both warning groups, model M22 is the one with the highest

558

ability of labelling a compound as toxic (or harmful). Note that model M19 also presents better

24 ACS Paragon Plus Environment

Page 25 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

559

ability than the methods from the literature to generate toxicological warnings, while model M24

560

has a comparable behavior.

561

To conclude, Table 5 shows the Pearson correlation-based similarity coefficients among the

562

consensus models proposed and the QSAR methods from the literature considered in this study.

563

The similarity coefficients were determined from the toxicity classes obtained according to the

564

GHS labelling. As it can be observed, models M19, M22 and M24 majorly present moderate-to-

565

low similarity coefficients   0.7  with respect to the TEST Consensus, DL-AOT, admetSAR and

566

ProTox methods. Only consensus models M22 and M24 have high similarities with regard to the

567

admetSAR method, with coefficients equal to 0.723 and 0.848, respectively. It can also be noted

568

that no similarity coefficient among the models proposed is greater than 0.9. All in all, it can be

569

stated that models M19, M22 and M24 yield non-redundant and better results than the other QSAR

570

methods studied in the labelling of chemicals, justifying in this way their use in prospective studies.

571

Tables 5 comes about here

572

3.6. Chemometric analysis of the QuBiLS-MAS molecular descriptors included in the best

573

QSAR models built

574

This section is devoted to assessing the usefulness of using the QuBiLS-MAS 2D-MDs in

575

the AOT modeling. A Shannon Entropy (SE)-based variability analysis (VA) was firstly carried

576

out in order to quantify the information content codified by MDs.54 In this way, relevant MDs for

577

cheminformatics studies can be identified according to the principle that high-entropy values

578

correspond to those MDs with good ability to discriminate among structurally distinct compounds,

579

while low-entropy values are indicative of the opposite.79 To this end, a dataset comprising by all

580

the compounds belonging to the Zhu training set and ProTox/T3DB external sets was used (see

581

Supplementary Information 13). The discretization scheme adopted is equal to 10861 bins, which

25 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 45

582

is the number of compounds within the dataset mentioned above. The IMMAN software55 was

583

employed to perform this VA.

584

Figure 5 depicts the SE distribution corresponding to the 200 QuBiLS-MAS MDs integrating

585

the best QSAR models built, as well as the SE distributions corresponding to the best 200 0D-to-

586

2D MDs, according to the VA-criterion, computed by the DRAGON (v5.5) and PaDEL software,

587

respectively. If the most relevant 50 MDs for each distribution are analyzed, it can be stated that

588

the QuBiLS-MAS framework allows to compute MDs with superior information content than all

589

the universe of 0D-to-2D MDs obtained by the two other software considered. In this way, it is

590

demonstrated that the QSAR models proposed contain MDs that better characterize the compounds

591

considered for the AOT modeling in this report. Thus, the use of the QuBiLS-MAS 2D-MDs is

592

justified. It is important to highlight that in a previous study (see Figure 5 in ref.29) was proved that

593

the best QuBiLS-MAS 2D-MDs, according to the VA-criterion, always present a SE behavior

594

superior than all the 0D-to-2D MDs computed by several tools established in the literature (e.g.,

595

DRAGON, MOL2, CDK, BLUECAL).

596

Figure 5 comes about here

597

Moreover, a linear independence analysis was performed through the Principal Component

598

Analysis (PCA) method, by using the same dataset and the QuBiLS-MAS, DRAGON and PaDEL

599

MDs employed in the previous VA. Orthogonal MDs are loaded in different projections, while

600

correlated MDs are loaded in a same component. Supplementary Information 13 shows the

601

eigenvalues and the percentages of the explained variance by 10 principal components (PC)

602

obtained, which explain approximately 69.38% of the cumulative variance. As a result, the MDs

603

analyzed show collinearity in PC1 (34.16%), PC3 (6.41%), PC4 (4.36%), PC5 (3.98%) and PC7

604

(2.55%). Also note that the QuBiLS-MAS 2D-MDs present exclusive loadings in PC2 (9.29%)

26 ACS Paragon Plus Environment

Page 27 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

605

and PC8 (2.34%). That is, they contain chemical information not codified by the DRAGON and

606

PaDEL MDs considered, justifying in this way the use of the QuBiLS-MAS approach in the AOT

607

modeling.

608

Nonetheless, this PCA study evidences that exclusive loadings are also presented by the

609

DRAGON and PaDEL MDs in PC6 (2.97%) and PC9 (1.74%), and PC10 (1.56%), respectively.

610

This result suggests two alternatives: (1) that better QSAR models could be obtained by combining

611

the QuBiLS-MAS, DRAGON and PaDEL MDs; or (2) that better strategies for the filtering and

612

selection of QuBiLS-MAS 2D-MDs can be applied in order to use MDs encoding all the chemical

613

information captured by the DRAGON and PaDEL MDs. Finally, it is important to highlight that

614

according to the collinearity shown among the QuBiLS-MAS, DRAGON and PaDEL MDs, it can

615

be stated that the former codifies chemical information related to the molecular weight (MW), the

616

sum of Sanderson electronegativities (Se) and the sum of atomic polarizabilities (Sp) in PC1; to

617

the number of aromatic atoms (naAromAtom), the number of double bonds (nBondsD) and the

618

number of doubly bound carbon bounds to two other carbons (C2SP2) in PC3; to the sum of E-

619

States for (strong) hydrogen bond donors (SHBd) in PC4; to the sum of path lengths starting from

620

nitrogen atoms (WTPT-5) in PC5; and to the minimum atom-type E-State: >C< (minssssC) in

621

PC7.

622 623

4. Conclusions

624

Several QSAR models have been developed to predict AOT, by using the QuBiLS-MAS

625

framework for the molecular encoding. Three training sets comprised of 5931 (EPA training set),

626

7413 (EPA-full training set) and 10152 (Zhu training set) compounds were used for the models

627

building (EPA training set ⊆ EPA-full set ⊆ Zhu set). The models created on the EPA training set

27 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 45

628

were validated on the EPA test set (1482 compounds), as well as on the ProTox (425 compounds)

629

and T3DB (284 compounds) external sets; whereas the models built on the other two training sets

630

were only assessed on the external sets mentioned above. The models developed ranging from 37

631

to 228 QuBiLS-MAS 2D-MDs, and they present good predictive ability and high coverage percent

632

according to the applicability domain (AD) analysis performed.

633

Specifically, two Min operator-based consensus models denoted as M19 and M22, as well

634

as a WA operator-based consensus model denoted as M24, were selected as the best ones for each

635

training set accounted for. On one hand, model M19, created on the EPA training set, presents

636

𝑀𝐴𝐸𝑡𝑒𝑠𝑡 ― 𝐴𝐷 = 0.4044, 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.4067 and 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2586 on the 1456, 402

637

and 271 compounds included within the AD for the EPA test set, ProTox external set and T3DB

638

external set, respectively. On the other hand, model M22, built on the EPA-full training set, has

639

𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.3992 and 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷 = 0.2286 on the 414 and 271 compounds inside its

640

AD for the ProTox and T3DB external sets, respectively. In addition, for these two external sets,

641

model M24, built on the Zhu training set, presents 𝑀𝐴𝐸𝑃𝑟𝑜𝑇𝑜𝑥 ― 𝐴𝐷 = 0.3773 and 𝑀𝐴𝐸𝑇3𝐷𝐵 ― 𝐴𝐷

642

= 0.2471 on the 400 and 266 compounds within its AD, respectively.

643

These outcomes were compared and statistically validated with respect to the ones obtained

644

by 14 QSAR methods from the literature, which include the T.E.S.T software-based methods,10

645

the GUSAR software-based methods,16 and the admetSAR,12 ProTox14 and DL-AOT18 methods.

646

As a result, model M22 presents the best performance. It is important to remark that model M19

647

is the one with the best performance on the EPA test set among all the QSAR methods built on the

648

EPA training set (e.g., T.E.S.T software-based methods). Moreover, a retrospective study on 261

649

withdrawn drugs due to their toxic/side effects was performed, in order to assess the ability of the

650

models proposed in generating toxicological warnings. A comparison with regard to the methods

28 ACS Paragon Plus Environment

Page 29 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

651

from the literature was also made. As a result, model M22 has the highest ability of labelling a

652

compound as toxic (or harmful) according to the GHS scheme.78 All in all, it can be concluded

653

that the models proposed here, especially model M22, constitute prominent tools to predict AOT,

654

at providing the best results among all the methods examined. To this end, a software was built in

655

order to be used in virtual screening tasks (http://tomocomd.com/apps/ptoxra).

656 657

Acknowledgement: CRGJ acknowledges to the program “Cátedras CONACYT” from “Consejo

658

Nacional de Ciencia y Tecnología (CONACYT), México” by the support to the endowed chair

659

501/2018 at “Centro de Investigación Científica y de Educación Superior de Ensenada

660

(CICESE)”. YMP thanks to the program “Profesor Invitado” for a post-doctoral fellowship to

661

work at Valencia University in 2018-2019. YMP also acknowledges the support from USFQ

662

“Chancellor Grant 2017-2018 (Project ID11192)

663 664

Conflict of Interest: The authors declare no conflict of interest.

665 666

5. References

667 668 669 670 671 672 673 674 675 676 677 678 679 680 681

1. Boyerinas, B.; Jochems, C.; Fantini, M.; Heery, C. R.; Gulley, J. L.; Tsang, K. Y.; Schlom, J., AntibodyDependent Cellular Cytotoxicity Activity of a Novel Anti–PD-L1 Antibody Avelumab (MSB0010718C) on Human Tumor Cells. Cancer Immunol. Res. 2015, 3, 1148-1157. 2. Adrees, M.; Ali, S.; Rizwan, M.; Zia-ur-Rehman, M.; Ibrahim, M.; Abbas, F.; Farid, M.; Qayyum, M. F.; Irshad, M. K., Mechanisms of silicon-mediated alleviation of heavy metal toxicity in plants: A review. Ecotoxicol. Environ. Saf. 2015, 119, 186-197. 3. Mebs, D., Toxicity in animals. Trends in evolution? Toxicon 2001, 39, 87-96. 4. Saha, N. C.; Bhunia, F.; Kaviraj, A., Toxicity of Phenol to Fish and Aquatic Ecosystems. Bull. Environ. Contam. Toxicol. 1999, 63, 195-202. 5. Walum, E., Acute oral toxicity. Environ. Health Perspect. 1998, 106, 497-503. 6. Enslein, K., A toxicity estimation model. J. Environ. Pathol. Toxicol. 1978, 2, 115-121. 7. Enslein, K.; Lander, T. R.; Tomb, M. E.; Craig, P. N., A Predictive Model for Estimating Rat Oral Ld50 Values. Toxicol. Ind. Health 1989, 5, 265-265. 8. Eldred, D. V.; Jurs, P. C., Prediction of Acute Mammalian Toxicity of Organophosphorus Pesticide Compounds from Molecular Structure. SAR QSAR Environ. Res. 1999, 10, 75-99.

29 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731

Page 30 of 45

9. Guo, J.-X.; Wu, J. J. Q.; Wright, J. B.; Lushington, G. H., Mechanistic Insight into Acetylcholinesterase Inhibition and Acute Toxicity of Organophosphorus Compounds:  A Molecular Modeling Study. Chem. Res. Toxicol. 2006, 19, 209-216. 10. Zhu, H.; Martin, T. M.; Ye, L.; Sedykh, A.; Young, D. M.; Tropsha, A., Quantitative Structure−Activity Relationship Modeling of Rat Acute Toxicity by Oral Exposure. Chem. Res. Toxicol. 2009, 22, 19131921. 11. T.E.S.T software. https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test (April 2, 2019), 12. Cheng, F.; Li, W.; Zhou, Y.; Shen, J.; Wu, Z.; Liu, G.; Lee, P. W.; Tang, Y., admetSAR: A Comprehensive Source and Free Tool for Assessment of Chemical ADMET Properties. J. Chem. Inf. Model. 2012, 52, 3099-3105. 13. admetSAR software. http://lmmd.ecust.edu.cn/admetsar1/home/ (April 2, 2019), 14. Drwal, M. N.; Banerjee, P.; Dunkel, M.; Wettig, M. R.; Preissner, R., ProTox: a web server for the in silico prediction of rodent oral toxicity. Nucleic Acids Res. 2014, 42, W53-W58. 15. ProTox-II software. http://tox.charite.de/protox_II/ (April 2, 2019), 16. Alexey, L.; Alexey, Z.; Dmitry, F.; Vladimir, P., QSAR Modelling of Rat Acute Toxicity on the Basis of PASS Prediction. Mol. Inf. 2011, 30, 241-250. 17. Lu, J.; Peng, J.; Wang, J.; Shen, Q.; Bi, Y.; Gong, L.; Zheng, M.; Luo, X.; Zhu, W.; Jiang, H.; Chen, K., Estimation of acute oral toxicity in rat using local lazy learning. J. Cheminf. 2014, 6, 26. 18. Xu, Y.; Pei, J.; Lai, L., Deep Learning Based Regression and Multiclass Models for Acute Oral Toxicity Prediction with Automatic Chemical Feature Extraction. J. Chem. Inf. Model. 2017, 57, 2672-2685. 19. Hand, D. J., Classifier Technology and the Illusion of Progress. Stat. Sci. 2006, 21, 1-14. 20. Stanley, Y. S.; Fei, Y.; Mu, Z., Chemical Descriptors Are More Important Than Learning Algorithms for Modelling. Mol. Inf. 2012, 31, 707-710. 21. Skoraczyński, G.; Dittwald, P.; Miasojedow, B.; Szymkuć, S.; Gajewska, E. P.; Grzybowski, B. A.; Gambin, A., Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient? Sci. Rep. 2017, 7, 3582. 22. Validation of (Q)SAR Models. http://www.oecd.org/chemicalsafety/riskassessment/validationofqsarmodels.htm (April 2, 2019), 23. Whitley, D.; Watson, J. Complexity Theory and the No Free Lunch Theorem. In Search Methodologies, Burke, E.; Kendall, G., Eds.; Springer US: Boston, MA, 2005; Chapter 11, pp 317-339. 24. Barigye, S. J.; Marrero-Ponce, Y.; López, Y. M.; Santiago, O. M.; Torrens, F.; Domenech, R. G.; Galvez, J., Event-based criteria in GT-STAF information indices: theory, exploratory diversity analysis and QSPR applications. SAR QSAR Environ. Res. 2013, 24, 3-34. 25. Martínez-Santiago, O.; Millán-Cabrera, R.; Marrero-Ponce, Y.; Barigye, S. J.; Martínez-López, Y.; Torrens, F.; Pérez-Giménez, F., Discrete Derivatives for Atom-Pairs as a Novel Graph-Theoretical Invariant for Generating New Molecular Descriptors: Orthogonality, Interpretation and QSARs/QSPRs on Benchmark Databases. Mol. Inf. 2014, 33, 343-368. 26. Cubillán, N.; Marrero-Ponce, Y.; Ariza-Rico, H.; Barigye, S. J.; García-Jacas, C. R.; Valdes-Martini, J. R.; Alvarado, Y. J., Novel global and local 3D atom-based linear descriptors of the Minkowski distance matrix: theory, diversity–variability analysis and QSPR applications. J. Math. Chem. 2015, 53, 2028-2064. 27. Marrero-Ponce, Y.; García-Jacas, C. R.; Barigye, S. J.; Valdés-Martiní, J. R.; Rivera-Borroto, O. M.; Pino-Urias, R. W.; Cubillán, N.; Alvarado, Y. J., Optimum Search Strategies or Novel 3D Molecular Descriptors: is there a Stalemate? Curr. Bioinf. 2015, 10, 533-564. 28. García-Jacas, C. R.; Marrero-Ponce, Y.; Barigye, S. J.; Valdés-Martiní, J. R.; Rivera-Borroto, O. M.; Verbel, J. O., N-Linear Algebraic Maps to Codify Chemical Structures: is a suitable generalization to the atom-pairs approaches? Curr. Drug. Metab. 2014, 15, 441-469. 29. Valdés-Martiní, J. R.; Marrero-Ponce, Y.; García-Jacas, C. R.; Martinez-Mayorga, K.; Barigye, S. J.; Vaz d‘Almeida, Y. S.; Pham-The, H.; Pérez-Giménez, F.; Morell, C. A., QuBiLS-MAS, open source 30 ACS Paragon Plus Environment

Page 31 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782

Chemical Research in Toxicology

30. 31. 32. 33. 34. 35.

36. 37.

38.

39.

40.

41.

42.

43.

multi-platform software for atom- and bond-based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations. J. Cheminf. 2017, 9, 1-26. QuBiLS-MAS software. http://tomocomd.com/qubils-mas (April 2, 2019), Marrero-Ponce, Y., Total and Local Quadratic Indices of the Molecular Pseudograph's Atom Adjacency Matrix: Applications to the Prediction of Physical Properties of Organic Compounds. Molecules 2003, 8, 687-726. Marrero-Ponce, Y., Total and local (atom and atom type) molecular quadratic indices: significance interpretation, comparison to other molecular descriptors, and QSPR/QSAR applications. Bioorg. Med. Chem. 2004, 12, 6351-6369. Castillo-Garit, J. A.; Martinez-Santiago, O.; Marrero-Ponce, Y.; Casañola-Martín, G. M.; Torrens, F., Atom-based non-stochastic and stochastic bilinear indices: Application to QSPR/QSAR studies of organic compounds. Chem. Phys. Lett. 2008, 464, 107-112. Marrero-Ponce, Y.; Torrens, F.; García-Domenech, R.; Ortega-Broche, S. E.; Zaldivar, V. R., Novel 2D TOMOCOMD-CARDD molecular descriptors: atom-based stochastic and non-stochastic bilinear indices and their QSPR applications. J. Math. Chem. 2008, 44, 650-673. Marrero-Ponce, Y.; Martínez-Albelo, E. R.; Casañola-Martín, G. M.; Castillo-Garit, J. A.; EcheveríaDíaz, Y.; Zaldivar, V. R.; Tygat, J.; Borges, J. E. R.; García-Domenech, R.; Torrens, F.; PérezGiménez, F., Bond-based linear indices of the non-stochastic and stochastic edge-adjacency matrix. 1. Theory and modeling of ChemPhys properties of organic molecules. Mol. Diversity 2010, 14, 731-753. Marrero-Ponce, Y.; Torrens, F.; Alvarado, Y. J.; Rotondo, R., Bond-based global and local (bond, group and bond-type) quadratic indices and their applications to computer-aided molecular design. 1. QSPR studies of diverse sets of organic chemicals. J Comput Aided Mol Des 2006, 20, 685-701. Marrero-Ponce, Y.; Iyarreta-Veitía, M.; Montero-Torres, A.; Romero-Zaldivar, C.; Brandt, C. A.; Ávila, P. E.; Kirchgatter, K.; Machado, Y., Ligand-Based Virtual Screening and in Silico Design of New Antimalarial Compounds Using Nonstochastic and Stochastic Total and Atom-Type Quadratic Maps. J. Chem. Inf. Model. 2005, 45, 1082-1100. Meneses-Marcel, A.; Marrero-Ponce, Y.; Machado-Tugores, Y.; Montero-Torres, A.; Pereira, D. M.; Escario, J. A.; Nogal-Ruiz, J. J.; Ochoa, C.; Arán, V. J.; Martínez-Fernández, A. R.; García Sánchez, R. N., A linear discrimination analysis based virtual screening of trichomonacidal lead-like compounds: Outcomes of in silico studies supported by experimental results. Bioorg. Med. Chem. Lett. 2005, 15, 3838-3843. Marrero-Ponce, Y.; Meneses-Marcel, A.; Castillo-Garit, J. A.; Machado-Tugores, Y.; Escario, J. A.; Barrio, A. G.; Pereira, D. M.; Nogal-Ruiz, J. J.; Arán, V. J.; Martínez-Fernández, A. R.; Torrens, F.; Rotondo, R.; Ibarra-Velarde, F.; Alvarado, Y. J., Predicting antitrichomonal activity: A computational screening using atom-based bilinear indices and experimental proofs. Bioorg. Med. Chem. 2006, 14, 6502-6524. Meneses-Marcel, A.; Marrero-Ponce, Y.; Ibáñez-Escribano, A.; Gómez-Barrio, A.; Escario, J. A.; Barigye, S. J.; Terán, E.; García-Jacas, C. R.; Machado-Tugores, Y.; Nogal-Ruiz, J. J.; Arán-Redó, V. J., Drug repositioning for novel antitrichomonas from known antiprotozoan drugs using hierarchical screening. Future Med. Chem. 2018, 10, 863-878. Marrero-Ponce, Y.; Huesca-Guillén, A.; Ibarra-Velarde, F., Quadratic indices of the ‘molecular pseudograph's atom adjacency matrix’ and their stochastic forms: a novel approach for virtual screening and in silico discovery of new lead paramphistomicide drugs-like compounds. J. Mol. Struct.: THEOCHEM 2005, 717, 67-79. Casañola-Martín, G. M.; Marrero-Ponce, Y.; Khan, M. T. H.; Ather, A.; Sultan, S.; Torrens, F.; Rotondo, R., TOMOCOMD-CARDD descriptors-based virtual screening of tyrosinase inhibitors: Evaluation of different classification model combinations using bond-based linear indices. Bioorg. Med. Chem. 2007, 15, 1483-1503. Casañola-Martín, G. M.; Khan, M. T. H.; Marrero-Ponce, Y.; Ather, A.; Sultankhodzhaev, M. N.; Torrens, F., New tyrosinase inhibitors selected by atomic linear indices-based classification models. Bioorg. Med. Chem. Lett. 2006, 16, 324-330. 31 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833

Page 32 of 45

44. A., C. G. J.; Yovani, M. P.; Francisco, T.; Ramón, G. D., Estimation of ADME properties in drug discovery: Predicting Caco‐2 cell permeability using atom‐based stochastic and non‐stochastic linear indices. J. Pharm. Sci. 2008, 97, 1946-1976. 45. Medina Marrero, R.; Marrero-Ponce, Y.; Barigye, S. J.; Echeverría Díaz, Y.; Acevedo-Barrios, R.; Casañola-Martín, G. M.; García Bernal, M.; Torrens, F.; Pérez-Giménez, F., QuBiLs-MAS method in early drug discovery and rational drug identification of antifungal agents. SAR QSAR Environ. Res. 2015, 26, 943-958. 46. Toxin and Toxin Target Database (T3DB). http://www.t3db.ca/ (April 2, 2019), 47. WITHDRAWN: A Resource for Withdrawn and Discontinued Drugs. http://cheminfo.charite.de/withdrawn/index.html (April 2, 2019), 48. O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R., Open Babel: An open chemical toolbox. J. Cheminf. 2011, 3, 1-14. 49. García-Jacas, C. R.; Cabrera-Leyva, L.; Marrero-Ponce, Y.; Suárez-Lezcano, J.; Cortés-Guzmán, F.; García-González, L. A., GOWAWA Aggregation Operator-based Global Molecular Characterizations: Weighting Atom/bond Contributions (LOVIs/LOEIs) According to their Influence in the Molecular Encoding. Mol. Inf. 2018, 37, 1-14. 50. García-Jacas, C. R.; Cabrera-Leyva, L.; Marrero-Ponce, Y.; Suárez-Lezcano, J.; Cortés-Guzmán, F.; Pupo-Meriño, M.; Vivas-Reyes, R., Choquet integral-based fuzzy molecular characterizations: when global definitions are computed from the dependency among atom/bond contributions (LOVIs/LOEIs). J. Cheminf. 2018, 10, 1-17. 51. Hall, M. A. Correlation-based Feature Selection for Machine Learning. University of Waikato, Hamilton, NewZealand, 1999. 52. Robnik-Šikonja, M.; Kononenko, I., Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23-69. 53. WEKA software. http://www.cs.waikato.ac.nz/ml/weka/ (April 2, 2019), 54. Godden, J. W.; Stahura, F. L.; Bajorath, J., Variability of Molecular Descriptors in Compound Databases Revealed by Shannon Entropy Calculations. J. Chem. Inf. Comput. Sci. 2000, 40, 796-800. 55. Urias, R. W. P.; Barigye, S. J.; Marrero-Ponce, Y.; García-Jacas, C. R.; Valdes-Martiní, J. R.; PerezGimenez, F., IMMAN: free software for information theory-based chemometric analysis. Mol. Divers. 2015, 19, 305-319. 56. Kuncheva, L. I. Ensemble Feature Selection. In Combining Pattern Classifiers: Methods and Algorithms; John Wiley & Sons: New Jersey, 2014. 57. Kohavi, R.; John, G. H., Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273-324. 58. Altman, N. S., An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Amer. Statist. 1992, 46, 175-185. 59. Murtagh, F., Multilayer perceptrons for classification and regression. Neurocomputing 1991, 2, 183197. 60. Breiman, L., Random Forests. Mach. Learn. 2001, 45, 5-32. 61. Shevade, S. K.; Keerthi, S. S.; Bhattacharyya, C.; Murthy, K. R. K., Improvements to the SMO algorithm for SVM regression. IEEE Trans. Neural Netw. Learn. Syst. 2000, 11, 1188-1193. 62. Üstün, B.; Melssen, W. J.; Buydens, L. M. C., Facilitating the application of Support Vector Regression by using a universal Pearson VII function based kernel. Chemom. Intell. Lab. Syst. 2006, 81, 29-40. 63. Breiman, L., Bagging predictors. Mach. Learn. 1996, 24, 123-140. 64. Friedman, J. H., Stochastic gradient boosting. Comput. Stat. Data. Anal. 2002, 38, 367-378. 65. Wolpert, D. H., Stacked generalization. Neural Netw. 1992, 5, 241-259. 66. Dragos, H.; Gilles, M.; Alexandre, V., Predicting the Predictability: A Unified Approach to the Applicability Domain Problem of QSAR Models. J. Chem. Inf. Model. 2009, 49, 1762-1776. 67. Sahigara, F.; Mansouri, K.; Ballabio, D.; Mauri, A.; Consonni, V.; Todeschini, R., Comparison of Different Approaches to Define the Applicability Domain of QSAR Models. Molecules 2012, 17, 4791. 68. Roy, K.; Kar, S.; Ambure, P., On a simple approach for determining applicability domain of QSAR models. Chemom. Intell. Lab. Syst. 2015, 145, 22-29. 32 ACS Paragon Plus Environment

Page 33 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858

Chemical Research in Toxicology

69. Liew, C. Y.; Lim, Y. C.; Yap, C. W., Mixed learning algorithms and features ensemble in hepatotoxicity prediction. J Comput Aided Mol Des 2011, 25, 855-871. 70. Aranda, J.; Garro Martinez, J.; Castro, E.; Duchowicz, P., Conformation-Independent QSPR Approach for the Soil Sorption Coefficient of Heterogeneous Compounds. Int. J. Mol. Sci. 2016, 17, 1247. 71. J, J.; N, N.-J.; T, A., QSAR applicabilty domain estimation by projection of the training set descriptor space: a review. Altern. Lab. Anim. 2005, 33, 445-459. 72. Netzeva, T. I.; Worth, A.; Aldenberg, T.; Benigni, R.; Cronin, M. T.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.; Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G. Y.; Perkins, R.; Roberts, D.; Schultz, T.; Stanton, D. W.; van de Sandt, J. J.; Tong, W.; Veith, G.; Yang, C., Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. The report and recommendations of ECVAM Workshop 52. Altern. Lab. Anim. 2005, 33, 155-73. 73. García-Jacas, C. R.; Martinez-Mayorga, K.; Marrero-Ponce, Y.; Medina-Franco, J. L., Conformationdependent QSAR approach for the prediction of inhibitory activity of bromodomain modulators. SAR QSAR Environ. Res. 2017, 28, 41-58. 74. Jaworska, J.; Nikolova-Jeliazkova, N. Ambit Discovery v0.0.4. http://ambit.sourceforge.net/download_ambitdiscovery.html (April 2, 2019), 75. Yap, B. W.; Sim, C. H., Comparisons of various types of normality tests. J. Stat. Comput. Sim. 2011, 81, 2141-2155. 76. Friedman, M., A Comparison of Alternative Tests of Significance for the Problem of $m$ Rankings. Ann. Math. Statist. 1940, 11, 86-92. 77. Holm, S., A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat 1979, 6, 65–70. 78. Winder, C.; Azzi, R.; Wagner, D., The development of the globally harmonized system (GHS) of classification and labelling of hazardous chemicals. J. Hazard. Mater. 2005, 125, 29-44. 79. Randić, M., Generalized molecular descriptors. J. Math. Chem. 1991, 7, 155-168.

33 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 34 of 45

ANNEXES (Tables, Figures and Schemas to be inserted in the Main Text) Table 1. Statistical parameters of the best individual models built on the EPA training set (5931 structures). The performance achieved by these models on the EPA test set (1482 structures) is reported between parentheses. ID Size Model* Rtraining (test) MAEtraining (test) RMSEtraining (test) M01 166 (ReF-CFS)_RF 0.7790 (0.7952) 0.4382 (0.4355) 0.5986 (0.5899) M02 159 (ReF-CFSB)_RF 0.7770 (0.7991) 0.4411 (0.4329) 0.6013 (0.5865) M03 228 (ReF-CFS-SE)_RF 0.7753 (0.7942) 0.4420 (0.4416) 0.6034 (0.5927) M04 78 CFS_RF 0.7689 (0.7865) 0.4493 (0.4478) 0.6096 (0.6010) M05 95 ReF_RF 0.7681 (0.7869) 0.4473 (0.4400) 0.6109 (0.5997) M06 103 SE_SVM 0.7594 (0.7807) 0.4455 (0.4401) 0.6173 (0.6033) M07 37 SE_Wrapper(RF) 0.7552 (0.7687) 0.4602 (0.4620) 0.6254 (0.6223) M08 78 CFS_SVM 0.7543 (0.7671) 0.4582 (0.4529) 0.6246 (0.6203) M09 228 (ReF-CFS-SE)_SVM 0.7536 (0.7773) 0.4568 (0.4479) 0.6259 (0.6097) M10 166 (ReF-CFS)_SVM 0.7521 (0.7818) 0.4568 (0.4427) 0.6255 (0.6032) M11 166 (ReF-CFS)_kNN(6) 0.7510 (0.7789) 0.4617 (0.4475) 0.6347 (0.6054) M12 228 (ReF-CFS-SE)_kNN(6) 0.7500 (0.7724) 0.4580 (0.4532) 0.6305 (0.6140) Average performance 0.7620 (0.7824) 0.4513 (0.4453) 0.6173 (0.6040) * The name of the models is denoted as ‘base name’_‘learning method’, where ‘base name’ is denoted by the feature selection methods used: ReF (Relief-F), CFS (Correlation subset forward), CFSB (Correlation subset bidirectional) and SE (Shannon entropy-based method).

34 ACS Paragon Plus Environment

Page 35 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Chemical Research in Toxicology

Table 2. Statistical parameters achieved by the best ensemble and consensus models built for the EPA training set (5931 structures) and the EPA test set (1482 structures), respectively. Model ID Size Rtraining MAEtraining RMSEtraining Rtest MAEtest RMSEtest Ensemble method Base models M13 166 Additive regression M01 0.7885 0.4224 0.5843 0.8059 0.4160 0.5725 M14 228 Additive regression M03 0.7864 0.4256 0.5875 0.8092 0.4179 0.5694 M15 159 Additive regression M02 0.7841 0.4266 0.5898 0.8093 0.4157 0.5690 M16 166 Stacking (linear regression) M01, M10, M11 0.7801 0.4312 0.5933 0.8024 0.4267 0.5755 M17 200 Consensus (average) M13, M15 0.8103 0.4136 0.5676 M18 200 Consensus (weighted average) M13, M15 0.8102 0.4138 0.5677 M19 200 Consensus (minimum) M13, M15 0.8095 0.4106 0.5693

35 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Page 36 of 45

Table 3. Statistical parameters achieved by the consensus models from M17 to M25. The models from M17 to M19, from M20 to M22 and from M23 to M25 were built on the EPA training set (5931 structures), EPA-full training set (7413 structures) and Zhu training set (10152 structures), respectively. ProTox external set T3DB external set (425 compounds) (284 compounds) ID Size Model MAEProTox RMSEProTox MAET3DB RMSET3DB M17 200 Consensus (average) 0.4106 0.5901 0.3004 0.5157 M18 200 Consensus (weighted average) 0.4108 0.5902 0.3006 0.5155 M19 200 Consensus (minimum) 0.4073 0.5890 0.2968 0.5221 M20 200 Consensus (average) 0.3975 0.5784 0.3019 0.5098 M21 200 Consensus (weighted average) 0.3976 0.5785 0.3029 0.5097 M22 200 Consensus (minimum) 0.3971 0.5781 0.2796 0.5161 M23 200 Consensus (average) 0.3807 0.5663 0.3016 0.5091 M24 200 Consensus (weighted average) 0.3805 0.5656 0.3049 0.5095 M25 200 Consensus (minimum) 0.3848 0.5756 0.3057 0.5272

36 ACS Paragon Plus Environment

Page 37 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Chemical Research in Toxicology

Table 4. Performance of the best consensus models built on the EPA training set (from M17 to M19), EPA-full training set (from M20 to M22) and Zhu training set (from M23 to M25), respectively, according to their applicability domains. EPA test set (1482 compounds) ProTox external set (425 compounds) T3DB external set (284 compounds) ID % % % MAEtest-AD RMSEtest-AD MAEProTox-AD RMSEProTox-AD MAET3DB-AD RMSET3DB-AD coverage coverage coverage M17 98.25 0.4084 0.5611 94.59 0.4096 0.5925 95.42 0.2632 0.4402 M18 98.25 0.4087 0.5613 94.59 0.4098 0.5926 95.42 0.2635 0.4403 M19 98.25 0.4044 0.5613 94.59 0.4067 0.5916 95.42 0.2586 0.4405 M20 97.41 0.3992 0.5813 95.42 0.2547 0.4023 M21 97.41 0.3993 0.5814 95.42 0.2559 0.4030 M22 97.41 0.3992 0.5809 95.42 0.2286 0.3957 M23 94.12 0.3774 0.5688 93.66 0.2434 0.3859 M24 94.12 0.3773 0.5681 93.66 0.2471 0.3871 M25 94.12 0.3811 0.5779 93.66 0.2444 0.3979

37 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 38 of 45

Table 5. Pearson correlation-based similarity coefficients among the models proposed and QSAR methods most commonly used in the literature to predict AOT. The similarities were calculated from the toxicity classes obtained by each model in the WITHDRAWN dataset according to the GHS labelling. TEST DL-AOT admetSAR ProTox M19 M22 M24 Consensus TEST Consensus 1.000 DL-AOT .496 1.000 admetSAR .433 .658 1.000 ProTox .351 .557 .612 1.000 M19 .447 .531 .648 .506 1.000 M22 .465 .599 .723 .548 .890 1.000 M24 .456 .666 .848 .648 .735 .814 1.000

38 ACS Paragon Plus Environment

Page 39 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Schema 1. General workflow to compute the QuBiLS-MAS molecular descriptors. (1) Computation of the molecular vectors according to the properties selected; (2) Computation of the non-stochastic matrices for 𝐤 = 𝟏; (3) To consider atom-types or local-fragments (optional); (4) To apply molecular cutoffs (optional); (5) Computation of the simple-stochastic, double-stochastic and mutual probability matrices; (6) To perform the division in atom-level (or bond-level) matrices; (7) Computation of the atom-level (or bondlevel) indices (descriptors) using the property vectors calculated in the step (1); and (8) To apply one or several aggregation operators on the atom-level (or bond-level) descriptors vector.

39 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 40 of 45

Schema 2. Workflow corresponding to the feature selection and modelling process.

40 ACS Paragon Plus Environment

Page 41 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Figure 1. Comparative graphic of the performance achieved by consensus model M19 with respect to seven methods reported in the literature on the EPA test set (1482 compounds), according to the applicability domain (AD) of the GUSAR wNN method, TEST Consensus method and model M19.

41 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 42 of 45

Figure 2. Comparative graphic of the performance achieved by consensus models M19, M22 and M24 with respect to four methods reported in the literature on the ProTox external set (425 compounds), according to the applicability domain (AD) of the GUSAR wNN method, GUSAR nNN method and model M24.

42 ACS Paragon Plus Environment

Page 43 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Figure 3. Comparative graphic of the performance achieved by consensus models M19, M22 and M24 with respect to six methods reported in the literature on the T3DB external set (284 compounds), according to the applicability domain (AD) of the GUSAR wNN method, GUSAR nNN method, TEST Consensus method and model M24.

43 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Page 44 of 45

Figure 4. Graphics corresponding to the GHS labelling obtained in the WITHDRAWN set by: (A) Consensus models M19, M22 and M24 proposed in this work; and (B) Consensus model M22 (the best one in external prediction) and the TEST Consensus, DL-AOT, admetSAR and ProTox methods established in the literature.

44 ACS Paragon Plus Environment

Page 45 of 45 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Figure 5. Shannon entropy distribution corresponding to the 200 QuBiLS-MAS 2D-MDs included in the best QSAR models proposed, as well as of the best 200 0D-to-2D MDs computed by the DRAGON and PaDEL software, respectively.

45 ACS Paragon Plus Environment