Multioutput Perturbation-Theory Machine Learning (PTML) Model of ChEMBL Data for Antiretroviral Compounds


Article

Multi-output Perturbation-Theory Machine Learning (PTML) Model of ChEMBL Data for Antiretroviral Compounds
Emilia Vásquez-Domínguez, Vinicio Danilo Armijos-Jaramillo, Eduardo Tejera, and Humbert González-Díaz
Mol. Pharmaceutics, Just Accepted Manuscript • DOI: 10.1021/acs.molpharmaceut.9b00538 • Publication Date (Web): 19 Aug 2019



Molecular Pharmaceutics

Multi-output Perturbation-Theory Machine Learning (PTML) Model of ChEMBL Data for Antiretroviral Compounds

Emilia Vásquez-Domínguez a,b*, Vinicio Danilo Armijos-Jaramillo b,c, Eduardo Tejera b,c and Humbert González-Díaz a,d*

a Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940, Leioa, Spain.

b Faculty of Engineering and Applied Sciences-Biotechnology, Universidad de Las Américas (UDLA), 170125, Quito, Ecuador.

c Bio-chemioinformatics group, Universidad de Las Américas (UDLA), 170125, Quito, Ecuador.

d IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Spain.

ABSTRACT. Retroviral infections such as HIV remain diseases with no cure. Defining the target proteins of new antiretroviral compounds is a major goal for medicine and pharmaceutical chemistry. ChEMBL holds Big Data in complex datasets that are hard to organize. The large number of characteristics described for each assay makes this information difficult to analyze when predicting new drug candidates for retroviral infections. For this reason, we propose a new predictive model that combines Perturbation Theory (PT) principles with Machine Learning (ML) modelling to create a tool that can take advantage of all the available information. The PTML model proposed in this work for the ChEMBL dataset of preclinical experimental assays of antiretroviral compounds consists of a linear equation with four variables. The PT operators used are based on multi-condition moving averages, which combine different features and simplify the management of all the data. More than 140,000 preclinical assays for 56,105 compounds with different characteristics or experimental conditions have been carried out and can be found in the ChEMBL database, covering combinations of 359 biological activity parameters (c0), 55 protein accessions (c1), 83 cell lines (c2), 64 organisms of assay (c3), and 773 subtypes or strains. We have included 150,148 preclinical experimental assays for HIV virus, 1,188 for HTLV virus, 84 for Simian immunodeficiency virus, 370 for Murine Leukemia virus, 119 for Rous sarcoma virus, 1,581 for MMTV, etc. We also included 5,277 assays for Hepatitis B virus. The developed PTML model reached the following values of sensitivity (73.05% for training and 73.10% for validation), specificity (86.61% for training and 87.17% for validation), and accuracy (75.84% for training and 75.98% for validation). We also compared alternative PTML models with different PT operators such as covariance, moments, and exponential terms. Finally, we compared our PTML model with ML models from the literature and with nonlinear ANN models. We conclude that this PTML model is the first to consider multiple characteristics of preclinical experimental antiretroviral assays in combination, yielding a simple, useful, and adaptable instrument that could reduce time and costs in antiretroviral drug research.


Keywords: ChEMBL; antiretroviral compounds; Perturbation Theory; Machine Learning; Big data

■ INTRODUCTION

The Joint United Nations Programme on HIV/AIDS (UNAIDS) reports that nearly 37 million people were living with HIV in 2017, and only 59% had access to antiretroviral (ARV) therapy.1 HIV is the most representative retrovirus, a type of ssRNA virus that can cause important diseases in humans and other organisms. Viral infections can be controlled by ARV therapy, which blocks different steps of the viral life cycle.2 No cure for HIV infection has been discovered, so ARV treatment is an efficient way to keep the viral cycle under control and prevent transmission.3 Four different mechanisms of ARV action have been described. All of them inhibit different processes or molecules: nucleoside or nucleotide analog reverse transcriptase inhibitors, non-nucleoside reverse transcriptase inhibitors, protease inhibitors, and fusion inhibitors.4 Because ARV drugs have diverse mechanisms of action, it is important to search for and apply new, more efficient treatments to control retroviral diseases.

Hepatitis B virus (HBV) is a dsDNA virus whose replication system uses an RNA intermediate for reverse transcription, just like retroviruses, with which it shares some features. For this reason, it is classified as a Hepadnavirus.5; 6 Moreover, the HBV genome has shown homology to corresponding regions of retroviruses or retrovirus-like endogenous human DNA elements, such as long terminal repeats (LTR). Studies suggest that HBV and retroviruses share a common evolutionary origin, and that HBV arose through deletions from a retrovirus or a progenitor.7 HBV has eight different and characteristic genotypes (A through H), and some of them are related to liver diseases and cancer.8 The World Health Organization (WHO) reported in 2017 that more than 255 million people were HBV positive and that each year more than 600 thousand people die from this viral infection.9

WHO guidelines point out that the choice of ARV drugs must take into account a significant number of factors: the possibility of coinfections, comorbidities, drugs, and other concomitant health conditions. Drug resistance and toxicity must also be considered.10 For this reason, it is necessary to increase the number of effective and safe drugs for the treatment of retroviral infections. To validate the effectiveness and selectivity of new compounds, several experimental assays must be carried out under different conditions and must demonstrate the inhibition potential in most steps of the viral life cycle. Computational models are accessible and effective tools that provide important predictive information and can shorten the experimental phase of drug development. The existence of important databases of experimental drug trials, such as ChEMBL, has provided the opportunity to search for new relevant information, thanks to the huge amounts of data they deliver.11 The application of cheminformatics and other computational methods improves drug design and the selection of the best candidates for experimental trials,12 providing insights into how structural molecular information contributes to biological activity.13 The prediction of new drugs against different targets can be done by computational methods.
For example, Machine Learning (ML) techniques14-16 use Big Data sets of preclinical assays, such as the ChEMBL database, to identify the chemical structures of potentially interesting compounds by calculating different molecular descriptors.17-34 The large number of assays, together with the heterogeneity and complexity of the data, makes this database difficult to work with.35; 36 Many researchers have developed computational models to discover ARV candidate compounds. However, these models are specific to a single set of characteristics and cannot incorporate data on other characteristics of the compounds, such as the target, the organism of assay, or others.37-39 Conversely, ChEMBL Big Data sets include important information on biological activity parameters (EC50, Ki, Activity, Potency, etc.), protein target organism, cell lines used, organisms of assay, etc., for ARV compounds as well as for other drugs.40-42

ML methods can, in turn, be combined with Perturbation Theory (ML + PT = PTML) to solve these problems by combining information on compounds in drug discovery.43; 44 The PTML big data analysis method handles large data sets of ChEMBL reports from different preclinical assays, taking into account a high diversity of information such as target proteins, assay organisms, cell lines, and more.45 In fact, PTML models have been used to model large data sets in a variety of areas, such as medicinal chemistry, proteomics, and nanotechnology.46-51 Recently, several groups have designed multitasking chemoinformatic models for viruses, antiviral biological activities, and HIV studies.52-54 However, to the best of our knowledge, no other PTML studies have modeled different retrovirus types together in order to find new antiretroviral compounds. The objective of this work was to generate a predictive model to select antiretroviral candidates by establishing, through statistical methods and PTML modeling, the relationships between biological activity and experimental and calculated properties (descriptors). We developed a PTML model with multiple assay conditions to predict antiretroviral compounds against different retroviruses. The general workflow used to develop the PTML model for antiretroviral compounds is shown in Figure 1.

Figure 1. Workflow for the development of the PTML model for antiretroviral compounds

■ EXPERIMENTAL SECTION

ChEMBL Data curation. The dataset of preclinical assay outcomes was obtained from ChEMBL using the following keywords: Hepatitis B virus, HTLV, HIV, XMRV, FeLV, SIV, MuLV, RSV, MMTV. More than 158,000 preclinical assays with different characteristics or experimental conditions have been carried out and can be found in the ChEMBL database, covering combinations of 359 biological activity parameters (c0), 55 protein accessions (c1), 83 cell lines (c2), 64 organisms of assay (c3), and 773 subtypes or strains. We have included 5,277 preclinical experimental assays for Hepatitis B virus, 1,188 for HTLV virus, 150,148 for HIV virus, 84 for Simian immunodeficiency virus, 370 for Murine Leukemia virus, 119 for Rous sarcoma virus, 1,581 for MMTV, etc. We merged the datasets obtained with each keyword into a single raw dataset. We then curated this dataset by eliminating all duplicated cases using an MS Excel function. We also eliminated all cases with missing values of vij, LogP, and/or PSA. In addition, we checked whether two separate columns of the ChEMBL dataset, one for the organism and one for the organism of assay, were needed. After comparison, we verified that for our specific data these columns are redundant and consequently eliminated one of them. After data curation, we retained a working dataset with a total of 140,644 assays to be analyzed. We sorted the working dataset alphabetically (A to Z) by column c0 and also in ascending order of nj (the number of cases for each kind of property c0). After that, we assigned one out of every four cases to the validation series and the remaining cases to the training series. This ensures a random, stratified, and representative sampling of training vs. validation series.55 Table 1 shows the distribution of cases obtained for both series with respect to virus type.

Table 1. Training and validation assays according to virus type

| Virus a | Assays | Train | Val | % val |
|---|---|---|---|---|
| HIV | 97863 | 73370 | 24493 | 25.03 |
| Hep B | 4526 | 3367 | 1159 | 25.61 |
| HTLV | 304 | 222 | 82 | 26.97 |
| MLV | 263 | 203 | 60 | 22.81 |
| RSV | 108 | 78 | 30 | 27.78 |
| SIV | 56 | 44 | 12 | 21.43 |
| FeLV | 2 | 2 | 0 | 0.00 |
| Various | 37522 | 28197 | 9325 | 24.85 |
| Total | 140644 | 105483 | 35161 | 25.00 |

a Assays were classified according to assay organism c3. If the organism was found in more than one virus database, it was classified in the "Various" category.
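The sampling step described above can be illustrated with a minimal sketch. It assumes one plausible reading of the procedure: cases are ordered by the size of their c0 group and then alphabetically by c0, and every fourth case goes to the validation series. The function name `split_one_in_four`, the record layout, and the toy data are hypothetical and are not taken from the original work.

```python
from collections import Counter

def split_one_in_four(records, key="c0"):
    """Order cases by activity type, then send every 4th case to validation."""
    counts = Counter(r[key] for r in records)
    # Assumed ordering: group size n_j (ascending), then activity type A to Z.
    ordered = sorted(records, key=lambda r: (counts[r[key]], r[key]))
    train, val = [], []
    for position, record in enumerate(ordered, start=1):
        (val if position % 4 == 0 else train).append(record)
    return train, val

# Toy data (hypothetical): 8 IC50 cases and 4 Ki cases.
records = [{"c0": "IC50", "id": i} for i in range(8)]
records += [{"c0": "Ki", "id": i} for i in range(4)]

train, val = split_one_in_four(records)
val_counts = Counter(r["c0"] for r in val)
print(len(train), len(val), dict(val_counts))
```

Because the cases are ordered by c0 before sampling, each activity type contributes about one quarter of its cases to validation, which is in line with the % val column of Table 1.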

ChEMBL Data pre-processing. All experimental results are expressed as the biological activity vij of the studied molecule i over the target j. The conditions c1, c2, c3, and c4 encode the experimental conditions characterizing the assay (protein accession, cell line, organism of assay, and subtype or strain), as seen in Table 2. We also calculated Multi-condition Moving Average (MMA) operators to quantify the deviation of the molecular descriptors of the drug from the expected value of these descriptors for a group of conditions cj. These MMA operators resemble the classic Box-Jenkins MA PT operators used as input in other works.43; 44

Table 2. Input PT operators in the PTML models of equation (2)

| Conditions (cj) a | Symbol | Formula | Information |
|---|---|---|---|
| c0, activity type | f(vij)ref | n(f(vij)obs = 1)/nj | The function of reference expresses the expected value of the probability with which f(vij)obs = 1 for the activity vij of c0 |
| c0, activity type | f(vij)obs | | Observed probability value for the activity vij of c0 |
| cj = [c1, c2, c3, c4], multiple conditions | ΔD1(cj) / ΔD2(cj) | ΔDk(cj) = Dk(mi) - ⟨Dk(cj)⟩ | Quantification of the changes in hydrophobicity (D1 = LOGPi) or polar surface area (D2 = PSAi) of compound i, which depend on chemical structure, with respect to the average over multiple assay conditions (c1 = Protein Accession; c2 = Cell Line; c3 = Organism of Assay; c4 = Subtype / Strain) |

a MMA operators with subset of multiple conditions are used in equation (3)

The desirability of each biological activity parameter, d(c0), takes the value 1 or -1. A value d(c0) = 1 indicates that the desired biological effect corresponds to an increase of the parameter, and d(c0) = -1 indicates that it corresponds to a decrease. The observed value f(vij)obs is 1 when vij is above the cutoff and d(c0) = 1, and also when vij is below the cutoff and d(c0) = -1. Otherwise, f(vij)obs is 0. An observed value f(vij)obs = 1 means that the compound exerts a strong response over the target. Random, stratified, and representative criteria were followed to select the training and validation groups.56 To maintain stratification in sampling, all assays were ordered by type of biological activity (c0).

PTML linear model. After data preprocessing and curation, we developed a PTML linear model of the present dataset. The Perturbation Theory-Machine Learning (PTML) modeling technique is an important tool for building predictive models of complex Big Data sets.57; 58 The PTML technique was used to predict the values of the function f(vij)calc for compound i in preclinical assay j under the subset of multiple conditions cj = (c0, c1, c2, c3, c4). As input variables we used the function of reference f(vij)ref and the MMA operators calculated in the previous sections.43; 44 Compounds can then be classified as active or non-active with the new linear PTML models, developed using Linear Discriminant Analysis (LDA),56 with the following equation:

$$ f(v_{ij})_{calc} = a_0 + a_1 \cdot f(v_{ij})_{ref} + \sum_{k=1}^{k_{max}} a_k \cdot D_k + \sum_{k=1,\,j=0}^{k_{max},\,j_{max}} a_{kj} \cdot \Delta D_k(\mathbf{c}_j) \qquad (1) $$
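The labeling rule for f(vij)obs described in this section can be written as a short function. This is a minimal sketch: the function name `observed_class` is hypothetical, the example cutoffs are the ones listed for Inhibition(%) and IC50(nM) in Table 3, and the example activity values are invented for illustration.

```python
def observed_class(v_ij, cutoff, desirability):
    """Return f(vij)obs: 1 for an active case, 0 for a non-active one."""
    if desirability not in (1, -1):
        raise ValueError("desirability d(c0) must be 1 or -1")
    if desirability == 1:
        # Higher values are desired, e.g. Inhibition(%).
        return 1 if v_ij > cutoff else 0
    # Lower values are desired, e.g. IC50(nM).
    return 1 if v_ij < cutoff else 0

# Inhibition(%): d(c0) = 1, cutoff = 60.00 (illustrative activity values).
inhibition_active = observed_class(75.0, 60.0, 1)
inhibition_inactive = observed_class(40.0, 60.0, 1)

# IC50(nM): d(c0) = -1, cutoff = 1.00 (illustrative activity values).
ic50_active = observed_class(0.5, 1.0, -1)
ic50_inactive = observed_class(50.0, 1.0, -1)
```

A value exactly equal to the cutoff is labeled non-active here because the rule in the text uses strict inequalities.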

■ RESULTS AND DISCUSSION

Antiretrovirals PTML model. The developed model, applied to ARVs, calculates the probability of interaction of a molecule i with different retroviruses under a set of multiple assay conditions cj. The dataset contains more than 140,000 ARV experimental preclinical assays for the retroviruses HIV, HTLV, SIV, HBV, MLV, RSV, and FeLV. HBV has been included because coinfection with HIV and HBV is common in patients. HIV prolongs HBV viremia and increases the rates of chronicity as well as the risk of cirrhosis and liver-related morbidity. For this reason, treatment should be specifically coordinated in patients with both HIV and HBV infections.59 Some studies have found ARV drugs that are effective and have significant activity in the treatment of certain types of resistant HBV in HIV/HBV-coinfected patients.60 Yang's group suggests that, in case of coinfection, ARV therapy should include drugs with activity against both viruses.61 The multi-condition PT operators were calculated using combinatorial or multiple moving averages (MMA). The PTML model equation obtained with LDA is:

$$ f(v_{ij})_{calc} = -16.6473 + 10.6828 \cdot f(v_{ij})_{ref} + 5.4195 \cdot D_1 - 5.0349 \cdot \Delta D_1(\mathbf{c}_j) + 0.0512 \cdot \Delta D_2(\mathbf{c}_j) \qquad (2) $$

n = 140,644    χ2 = 37,710.77    p < 0.05
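As an illustration of how equation (2) is evaluated, the sketch below codes the linear score with the published coefficients. The function name `f_calc` and the input values in the example are hypothetical. No classification threshold is applied, since the text does not state one at this point.

```python
# Coefficients copied from equation (2).
A0 = -16.6473
A_REF = 10.6828
A_D1 = 5.4195
A_DELTA_D1 = -5.0349
A_DELTA_D2 = 0.0512

def f_calc(f_ref, d1, delta_d1, delta_d2):
    """Linear PTML score of equation (2).

    f_ref    -- reference function f(vij)ref for the activity type c0
    d1       -- descriptor D1 of the compound (LOGP)
    delta_d1 -- MMA operator for LOGP under conditions cj
    delta_d2 -- MMA operator for PSA under conditions cj
    """
    return (A0 + A_REF * f_ref + A_D1 * d1
            + A_DELTA_D1 * delta_d1 + A_DELTA_D2 * delta_d2)

# Hypothetical inputs, chosen only to show the arithmetic.
score = f_calc(f_ref=0.385, d1=3.0, delta_d1=0.2, delta_d2=-10.0)
print(round(score, 4))
```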

In the equation, f(vij)calc is the output variable, a non-dimensional scoring function for the original biological activity vij(c0). The input variables are as follows. f(vij)ref is the function of reference for the biological activity of molecule i under the subset of multiple conditions cj. The terms Dk and ΔDk(cj) then add the perturbation effects of the molecular structure to the equation. The expected value f(vij)ref used in the equation corresponds to the expected probability of finding an active drug for each biological activity measured, pj(c0) = n(f(vij)obs = 1)/nj(c0), obtained by dividing the number of active compounds n(f(vij)obs = 1) for a specific biological activity vij(c0) by the total number of compounds in the assay, nj(c0). Compounds were classified as active according to the desirability criterion of the biological activity vij(c0) and its relation to the cutoff. Compounds were classified as active (f(vij)obs = 1) when vij > cutoff and the a priori desirability function d(c0) = 1, or when vij < cutoff and d(c0) = -1. Otherwise, compounds were classified as non-active (f(vij)obs = 0). d(c0) was defined as 1 if the interest is to increase the biological activity vij(c0), e.g., Activity(%). In contrast, d(c0) = -1 means that the biological activity value must decrease, e.g., IC50 (nM). See Table 3. In any case, the values of d(c0) for the same property may be customized (switched) if the situation so requires.62

Table 3. Selected pharmacological parameters (c0) in the ChEMBL data set of antiretroviral preclinical experimental assays

| c0 = Activity parameter for vij(c0) (Units) a | nj(c0) b | dj(c0) c | Cutoff d | n(f(vij)obs = 1) e | pj(c0) f |
|---|---|---|---|---|---|
| IC50(nM) | 39245 | -1 | 1.00 | 1107 | 0.028 |
| EC50(nM) | 31383 | -1 | 250.00 | 12093 | 0.385 |
| Potency(nM) | 17375 | -1 | 100.00 | 48 | 0.003 |
| Inhibition(%) | 8048 | 1 | 60.00 | 3364 | 0.418 |
| Ratio CC50/EC50() | 5878 | 1 | 10000.00 | 398 | 0.068 |
| Ki(nM) | 5572 | -1 | 100.00 | 3464 | 0.622 |
| Ratio EC50() | 3241 | 1 | 1056.57 | 209 | 0.064 |
| AC50(nM) | 3028 | -1 | 100.00 | 3 | 0.001 |
| IC90(nM) | 2079 | -1 | 100.00 | 975 | 0.469 |
| CC50(nM) | 1881 | -1 | 100.00 | 15 | 0.008 |
| IC95(nM) | 1869 | -1 | 100.00 | 824 | 0.441 |
| ED50(uM) | 1521 | -1 | 0.10 | 353 | 0.232 |

a See detailed list for all parameters in Table SI01. b nj(c0) represents the number of compounds reported with experimental values vij(c0) in the ChEMBL dataset. c As mentioned before, the desirability dj(c0) was assigned a value of 1 or -1 according to the need to increase or decrease the activity. d The cutoff is the threshold value of the biological activity vij(c0) that determines whether a compound is active. e n(f(vij)obs = 1) represents the number of biologically active (f(vij)obs = 1) compounds according to the experimental values vij(c0) reported for the parameter j. f pj(c0) is the expected probability that a compound is considered active in an assay with the same parameter c0 (= n(f(vij)obs = 1)/nj(c0)).
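The expected probability pj(c0) in the last column of Table 3 is a simple ratio, and it can be checked against the counts in the same table. The short sketch below does this for three rows. The function name `expected_probability` and the dictionary layout are illustrative choices, while the counts are copied from Table 3.

```python
def expected_probability(n_active, n_total):
    """p_j(c0) = n(f(vij)obs = 1) / n_j(c0)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_active / n_total

# (n(f(vij)obs = 1), n_j(c0)) pairs copied from Table 3.
table3_counts = {
    "IC50(nM)": (1107, 39245),
    "EC50(nM)": (12093, 31383),
    "Ki(nM)": (3464, 5572),
}

p_ref = {
    name: round(expected_probability(n_active, n_total), 3)
    for name, (n_active, n_total) in table3_counts.items()
}
print(p_ref)
```

Rounded to three decimals, the ratios reproduce the tabulated values 0.028, 0.385, and 0.622, which are the f(vij)ref inputs for those activity types.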

The PT operator ΔD1(cj) = ΔLOGP(cj) was the next input term used in this equation. ΔPSA(cj) was not included, as it did not improve the specificity or sensitivity of the model. ΔLOGP(cj) is an MMA PT operator, which considers perturbations in multiple variables at the same time,62; 63 and gives the deviation ΔD1(cj) of the descriptor of the molecule, D1(mi), from the expected value ⟨D1(cj)⟩ obtained for compounds assayed under the same subset of multiple conditions (ΔD1(cj) = D1(mi) - ⟨D1(cj)⟩). The values have been calculated for all combinations; some selected examples are shown in Table 4. The complete list of calculated values can be found in the file SI02.xlsx.

Table 4. Selected examples of average values used to calculate MMA operators

Columns c0 to c4 are the experimental conditions included in the PTML MMA model. a

| c0 = vij(c0) (Units) | c1 = Protein accession | c2 = Cell name | c3 = Assay organism | c4 = Subtype/Strain | ⟨LOGP⟩ | ⟨PSA⟩ | n (cases) |
|---|---|---|---|---|---|---|---|
| EC50 (nM) | Q72874 | ND | ND | ND | 3.89990715 | 134.8321599 | 36 |
| EC50 (nM) | ND | ND | Hepatitis B virus | ND | 2.79956631 | 104.8764887 | 452 |
| EC50 (nM) | ND | ND | Homo sapiens | ND | 2.1699879 | 115.2590018 | 337 |
| EC50 (nM) | ND | MT4 | Human immunodeficiency virus type 2 (ISOLATE ROD) | ND | 4.19960379 | 86.29472868 | 858 |
| EC50 (nM) | ND | CCRFCEM | Homo sapiens | ND | 2.4339234 | 128.319966 | 573 |
| EC50 (nM) | ND | ND | Human immunodeficiency virus 1 | JR-FL | 4.40202154 | 57.37584093 | 3 |
| IC50 (nM) | Q72874 | ND | ND | ND | 3.89990715 | 134.8321599 | 960 |
| IC50 (nM) | ND | ND | Hepatitis B virus | ND | 2.79956631 | 104.8764887 | 407 |
| IC50 (nM) | ND | ND | Homo sapiens | ND | 2.1699879 | 115.2590018 | 270 |
| IC50 (nM) | ND | MT4 | Human immunodeficiency virus type 2 (ISOLATE ROD) | ND | 4.19960379 | 86.29472868 | 144 |
| IC50 (nM) | ND | CCRFCEM | Homo sapiens | ND | 2.4339234 | 128.319966 | 230 |
| IC50 (nM) | ND | ND | Human immunodeficiency virus 1 | JR-FL | 4.40202154 | 57.37584093 | 1 |
| Inhibition (%) | Q72874 | ND | ND | ND | 3.89990715 | 134.8321599 | 198 |
| Inhibition (%) | ND | ND | Hepatitis B virus | ND | 2.79956631 | 104.8764887 | 111 |
| Inhibition (%) | ND | ND | Homo sapiens | ND | 2.1699879 | 115.2590018 | 20 |
| Inhibition (%) | ND | MT4 | Human immunodeficiency virus type 2 (ISOLATE ROD) | ND | 4.19960379 | 86.29472868 | 0 |
| Inhibition (%) | ND | CCRFCEM | Homo sapiens | ND | 2.4339234 | 128.319966 | 4 |
| Inhibition (%) | ND | ND | Human immunodeficiency virus 1 | JR-FL | 4.40202154 | 57.37584093 | 1,200 |
| Ki (nM), Potency (nM) | Q72874 | ND | ND | ND | 3.89990715 | 134.8321599 | 966 |
| Ki (nM), Potency (nM) | ND | ND | Hepatitis B virus | ND | 2.79956631 | 104.8764887 | 0 |
| Ki (nM), Potency (nM) | ND | ND | Homo sapiens | ND | 2.1699879 | 115.2590018 | 0 |
| Ki (nM), Potency (nM) | ND | MT4 | Human immunodeficiency virus type 2 (ISOLATE ROD) | ND | 4.19960379 | 86.29472868 | 0 |
| Ki (nM), Potency (nM) | ND | ND | Human immunodeficiency virus 1 | JR-FL | 4.40202154 | 57.37584093 | 0 |

a ND: No data available
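The MMA operator defined above, ΔD1(cj) = D1(mi) - ⟨D1(cj)⟩, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the procedure used to generate SI02.xlsx: the function name `mma_deviations`, the dictionary keys, and the toy cases are hypothetical.

```python
from collections import defaultdict

def mma_deviations(cases, descriptor="LOGP",
                   conditions=("c1", "c2", "c3", "c4")):
    """Return D(mi) - <D(cj)> for every case.

    <D(cj)> is the mean of the descriptor over all cases that share
    the same combination of assay conditions cj.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for case in cases:
        key = tuple(case[c] for c in conditions)
        totals[key][0] += case[descriptor]
        totals[key][1] += 1

    deviations = []
    for case in cases:
        key = tuple(case[c] for c in conditions)
        total, count = totals[key]
        deviations.append(case[descriptor] - total / count)
    return deviations

# Toy cases (hypothetical values): two share the same conditions.
cases = [
    {"c1": "ND", "c2": "MT4", "c3": "HIV-2", "c4": "ND", "LOGP": 2.0},
    {"c1": "ND", "c2": "MT4", "c3": "HIV-2", "c4": "ND", "LOGP": 4.0},
    {"c1": "ND", "c2": "ND", "c3": "HBV", "c4": "ND", "LOGP": 5.0},
]
delta_logp = mma_deviations(cases)
print(delta_logp)
```

A case that is the only member of its condition group gets a deviation of zero, because its descriptor equals the group average.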

The sensitivity Sn (%), specificity Sp (%), and accuracy Ac (%) obtained with the chosen MMA operators D1(cj) = D1(c1,c2,c3,c4) and ΔDk(cj) = ΔDk(c1,c2,c3,c4) are shown in Table 5. We reached more than 73% sensitivity and accuracy for both training and validation cases, and more than 85% specificity. These values, obtained for a very large number of cases (> 140,000) with highly complex data, indicate that this is a simple but very useful prediction model. The full list of pharmacological parameters (c0) in the ChEMBL data set of antiretroviral preclinical experimental assays can be found in the supporting information document SI01.doc, see Table S1. For the full list of average values used to calculate MMA operators, refer to Table S2, document SI02.xlsx. Table S3, file SI03.xlsx, shows the outcomes of the MMA model, including relevant information and the dataset used for the calculation.

Table 5. Results of the PTML MMA linear model for ARVs

| Series | Class | Param. a | % | f(vij)obs = 0 | f(vij)obs = 1 |
|---|---|---|---|---|---|
| Train | f(vij)pred = 1 | Sn | 73.05 | 61,175 | 22,573 |
| Train | f(vij)pred = 0 | Sp | 86.61 | 2,910 | 18,825 |
| Train | Total | Ac | 75.84 | | |
| Validation | f(vij)pred = 1 | Sn | 73.10 | 20,439 | 7,520 |
| Validation | f(vij)pred = 0 | Sp | 87.17 | 924 | 6,278 |
| Validation | Total | Ac | 75.98 | | |

a Sn = Sensitivity, Sp = Specificity, and Ac = Accuracy.
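The percentages in Table 5 can be recomputed from the training counts. The sketch below assumes that each rate is taken along one row of the table, with 61,175 and 18,825 as the correctly classified cases and 22,573 and 2,910 as the misclassified ones, and it follows the Sn and Sp labels as printed in the table. The function name `rate` is a hypothetical helper.

```python
def rate(correct, misclassified):
    """Percentage of correctly classified cases in one row of the table."""
    return 100.0 * correct / (correct + misclassified)

# Training counts copied from Table 5 (row layout is an assumption).
sn_correct, sn_wrong = 61175, 22573
sp_correct, sp_wrong = 18825, 2910

sn_train = rate(sn_correct, sn_wrong)
sp_train = rate(sp_correct, sp_wrong)

# Accuracy pools both rows: all correct cases over all cases.
total_cases = sn_correct + sn_wrong + sp_correct + sp_wrong
ac_train = 100.0 * (sn_correct + sp_correct) / total_cases

print(round(sn_train, 2), round(sp_train, 2), round(ac_train, 2))
```

The four counts add up to 105,483, the size of the training series in Table 1, and the rounded rates match the tabulated 73.05, 86.61, and 75.84.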

PTML prediction of other targets for ARV candidates. As mentioned, a compound was considered active a priori when the desirability value d(c0) = 1 and the experimental value vij > cutoff, or when d(c0) = -1 and the experimental value vij < cutoff. Using the developed model, we carried out a predictive study to analyze different biological activities c0 for the retroviruses HIV, HTLV, SIV, HBV, MLV, and RSV in more than 60,000 cases. To do so, we used the model to calculate the probability p(f(vij) = 1)min-max with which the model predicts a posteriori that a compound is active. The probability was calculated as p(f(vij) = 1)min-max = [f(vij)calc - f(vij)min]/[f(vij)max - f(vij)min].64 The values of f(vij)calc were calculated directly from the equation of the PTML model. The values f(vij)max and f(vij)min are the maximum and minimum values predicted for all assays with the same c0 for all compounds assayed against the same retrovirus species. Table 6 compares the average values of p(f(vij) = 1)min-max predicted for these assays.

Table 6. PTML prediction of different biological activities c0 for retroviruses a

| Antiviral activity c0 c | SIV | HIV | HTLV | HBV | MLV | RSV |
|---|---|---|---|---|---|---|
| Cases b | 47 | 59,979 | 223 | 1,986 | 218 | 101 |
| EC50(ug. mL-1) | 0.532 | 0.501 | 0.491 | 0.47 | 0.445 | 0.417 |
| Ki(nM) | 0.517 | 0.485 | 0.474 | 0.453 | 0.427 | 0.398 |
| IC90(nM) | 0.503 | 0.471 | 0.46 | 0.439 | 0.413 | 0.384 |
| IC50(ug. mL-1) | 0.499 | 0.469 | 0.458 | 0.438 | 0.412 | 0.385 |
| IC95(nM) | 0.5 | 0.469 | 0.458 | 0.437 | 0.41 | 0.382 |
| Inhibition(%) | 0.498 | 0.466 | 0.456 | 0.435 | 0.408 | 0.38 |
| EC50(nM) | 0.495 | 0.463 | 0.453 | 0.432 | 0.405 | 0.377 |
| IC50(nM) | 0.49 | 0.447 | 0.443 | 0.418 | 0.406 | 0.355 |
| ED50(uM) | 0.481 | 0.449 | 0.439 | 0.418 | 0.391 | 0.363 |
| Activity(%) | 0.47 | 0.44 | 0.43 | 0.409 | 0.383 | 0.356 |
| Ratio CC50/EC50 | 0.466 | 0.435 | 0.424 | 0.403 | 0.376 | 0.348 |
| Ratio EC50 | 0.466 | 0.434 | 0.424 | 0.402 | 0.376 | 0.347 |
| Selectivity Index | 0.46 | 0.43 | 0.419 | 0.399 | 0.373 | 0.346 |
| FC | 0.46 | 0.429 | 0.419 | 0.398 | 0.372 | 0.345 |
| CC50(nM) | 0.461 | 0.429 | 0.418 | 0.397 | 0.371 | 0.342 |
| Potency(nM) | 0.46 | 0.429 | 0.418 | 0.397 | 0.37 | 0.342 |
| AC50(nM) | 0.46 | 0.428 | 0.418 | 0.397 | 0.37 | 0.342 |
| Ratio | 0.452 | 0.421 | 0.411 | 0.391 | 0.365 | 0.337 |

a Retrovirus abbreviations: SIV: Simian immunodeficiency virus; HIV: Human immunodeficiency virus; HTLV: Human-T lymphotropic virus; HBV: hepatitis B virus; MLV: Murine Leukemia virus; RSV: Rous sarcoma virus. b Number of cases predicted for each retrovirus. c Type of biological activity parameter (c0).
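The min-max probability used for these predictions, p(f(vij) = 1)min-max = [f(vij)calc - f(vij)min]/[f(vij)max - f(vij)min], is a rescaling of the model scores within one group of assays (same c0, same retrovirus species). A minimal sketch follows. The function name `min_max_probability` and the toy scores are hypothetical and are not values from the study.

```python
def min_max_probability(f_calc_values):
    """Rescale f(vij)calc scores of one group to the [0, 1] range."""
    f_min = min(f_calc_values)
    f_max = max(f_calc_values)
    if f_max == f_min:
        raise ValueError("scores are all equal; min-max scaling is undefined")
    return [(f - f_min) / (f_max - f_min) for f in f_calc_values]

# Toy scores (hypothetical) for assays sharing the same c0 and virus species.
scores = [-2.0, 0.0, 2.0, 6.0]
probabilities = min_max_probability(scores)
average_probability = sum(probabilities) / len(probabilities)
print(probabilities, average_probability)
```

Averaging the rescaled values per activity type and virus gives one number per cell, which is the kind of summary reported in Table 6.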

Table 6 reports the average predicted values for the different biological activities and types of retrovirus. Of all the biological activities analyzed, EC50(ug. mL-1), a potency measure, reached the highest values for the different retroviruses, which suggests that it is the most sensitive parameter for prediction. It exceeded 0.5 for SIV, as did Ki(nM) and IC90(nM). Nevertheless, the behavior of all parameters is similar for each retrovirus, which indicates that the model is consistent: it maintains its predictive capacity for each retrovirus regardless of the biological activity parameter. Interestingly, SIV showed the highest prediction probabilities across biological activities, even though it did not contribute the largest number of ChEMBL experimental assays to our database. This could mean that SIV is more sensitive to different ARV drugs. On the other hand, Rous sarcoma virus showed low probability values. The lowest prediction value obtained in these examples was near 0.3, which means that it is the least sensitive virus when determining the suitability of a compound. The model made it possible to predict all cases, considering multiple conditions for all retroviruses. This prediction suggests that different ARVs could be candidates against other retroviruses, and not only against those for which they were assayed. It should be noted that, because the ChEMBL database includes different types of biological activities and different units, some biological activities are presented separately when the experimental assays used different units.

Comparison to other PTML models for ARV compounds. To validate the proposed model, we applied alternative equations using covariance, moment, and exponential terms. Table 7 displays the corresponding equations. Using MMA operators, these alternative models, which were trained and validated, consider different conditions at the same time. For example, ΔD1(c1,c2,c3,c4) = ΔLOGP(c1,c2,c3,c4) = LOGP(mi) - ⟨LOGP(c1,c2,c3,c4)⟩ quantifies the deviation of the compound descriptor from its expected value over all molecules assayed under the same conditions cj = (c1, c2, c3, c4): protein accession, cell line, organism of assay, and subtype or strain. This means that ΔD1(c1,c2,c3,c4) is a 4-fold MMA operator that covers four different assay conditions in a single term (cj).

Table 7. Antiretroviral PTML MMA linear model equations
PT

Equation

Operators Moving

37,710 .77

Dk(cj) 𝑓(vij)𝑐𝑎𝑙𝑐

Dk(cj)· Dk’(cj) Moments (Dk(cj))q q=2 Moments (Dk(cj))q

𝑓(vij)𝑐𝑎𝑙𝑐

Average

p