Computational Chemistry
A Machine Learning Approach for Predicting HIV Reverse Transcriptase Mutation Susceptibility of Biologically Active Compounds
Thomas Maxwell Kaiser, Pieter B. Burger, Christopher J. Butch, Stephen Pelly, and Dennis C. Liotta
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00475 • Publication Date (Web): 28 Jun 2018
Journal of Chemical Information and Modeling
A Machine Learning Approach for Predicting HIV Reverse Transcriptase Mutation Susceptibility of Biologically Active Compounds
Thomas M. Kaiser1,2,*, Pieter B. Burger1,3, Christopher J. Butch1,4, Stephen C. Pelly1, Dennis C. Liotta1,*
1 Department of Chemistry, Emory University, 201 Dowman Drive, Atlanta, Georgia, United States 30322
2 Current Address: St Peter’s College, University of Oxford, New Inn Hall St, Oxford, United Kingdom OX1 2DL
3 Department of Drug Discovery and Biomedical Sciences, College of Pharmacy, Medical University of South Carolina, 280 Calhoun St. MSC 141, Charleston, South Carolina, United States 29425-1410
4 Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1-IE-1 Ookayama, Meguro-ku, Tokyo, Japan 152-8550
ABSTRACT: HIV resistance emerging against antiretroviral drugs represents a great threat to the continued prolongation of HIV-infected patients' lifespans. Methods capable of predicting resistance susceptibility during compound development are therefore greatly needed. Targeting the major reverse transcriptase residues Y181, K103 and L100, we used the biological activities of compounds against these mutant enzymes and the wild-type reverse transcriptase to create Naïve Bayes Networks. Through this machine learning approach, we could predict with high accuracy whether a compound would be susceptible to loss of potency due to resistance. We could also perfectly predict, retrospectively, whether compounds would be susceptible to both a K103 mutant RT and a Y181 mutant RT. In the study presented here, our method outperformed a traditional molecular mechanics approach. This method should be of broad interest beyond drug discovery efforts, and serves to expand the utility of machine learning for the prediction of physical, chemical or biological properties using the vast information available in the literature.
Keywords: HIV, Drug Resistance, Machine Learning, Bayesian Networks, Reverse Transcriptase, NNRTI, Chemical Biology, Drug Discovery, Chemoinformatics
INTRODUCTION In the past three decades, more than 25 antiretroviral drugs and drug combinations have been developed for the treatment of HIV-1. However, HIV still has no known cure and remains a major public health threat, with an estimated 30 million infected individuals worldwide.1 Moreover, strict patient adherence (>95% of doses) to the prescribed combination therapy is needed to ensure suppression of HIV viral load, and interruptions in dosing lead to a loss of viral response due to mutation.2 The percentage of patients with an individual adherence rate >95% was found to be 53% for older subjects and only 26% for younger subjects in the United States.3 Resistance emerging from this high rate of poor patient adherence represents a great threat to the continued success of antiretrovirals, especially when coupled with the high error rate of reverse transcriptase, which produces mutant virions.4 As a result, resistance susceptibility is a concern in HIV drug development, yet no method currently exists for predicting it. Methods capable of predicting resistance susceptibility in the preclinical development of compounds acting against HIV are therefore greatly needed. Given our and others' success in predicting the activity of compounds using Naïve Bayes Networks (NBNs), we decided to extend our machine learning methodology to predict the resistance susceptibility of non-nucleoside reverse transcriptase inhibitor (NNRTI) compounds.5 We additionally demonstrate that NBNs, trained solely with knowledge of the small-molecule scaffold and activity, identify non-susceptible compounds more accurately than a more traditional docking-based approach, which requires both knowledge of the protein structure and considerable computational time.
Of the 24 single agents currently approved for treating HIV, 13 target reverse transcriptase (RT), making RT the most frequently targeted component of HIV.6 Reverse transcriptase is the enzyme responsible for converting the single-stranded RNA genome of HIV into the double-stranded DNA needed for integration into the genome of
the host.7 Inhibitors of RT approved for the clinic are in either the nucleoside reverse transcriptase inhibitor (NRTI) family or the NNRTI family. While NRTIs directly act at the active site of RT, NNRTIs bind to a hydrophobic pocket within the palm subdomain of p66 and exert an inhibitory effect through allosterism.8 Key interactions between the allosteric pocket and NNRTIs involve the residues Tyr 181, Lys 101, Tyr 188, Trp 229, Tyr 318, Leu 100, and Val 106. Single amino acid substitution is often sufficient to confer resistance to inhibitors, and key mutations observed in the clinic that confer resistance to approved NNRTIs are listed in Table 1.8-11 As can be seen from Table 1, a majority of patients on ART experience the K103N mutation and multiple mutations are not uncommon in patients receiving therapy. However, most of the work regarding drug resistance prediction revolves around sequencing viral genomes in patients and making predictions about which drugs would be unusable in a patient due to viral resistance.12-14 A general method capable of predicting resistance susceptibility against clinically observed mutants for hit-to-lead development has not been published at the time of this study.15

Table 1. Prevalent Mutations Associated with NNRTI Resistance and Percentage of Patients with Mutations in Reverse Transcriptase

Drug: Nevirapine; Efavirenz; Etravirine; Rilpivirine
Prevalent Mutations: L100I; K101E; K103N; Y181C; Y188C; K103N; K103N/Y181C; K103N/E478Q; Y181C; K101H; V106I/V179D; L100I; Y181I/V; Y188L

Mutation    Percentage in ART-Patients
L100I       7.7
K101E       16
K103N       61
K103S       0.7
V106M       15
Y181C       33
Y181I       3.6
Y188L       8.8
G190A       36
G190S       16
P225H       10
Our machine learning approach would have to delineate between two sets of compounds: those compounds that retained wild-type activity against a single-amino-acid mutant, and those compounds that lost activity when a residue was mutated. Furthermore, our machine learning approach would be ignorant of any 3D structural information regarding active site-ligand interactions.
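The two-class problem described above can be illustrated with a generic Bernoulli naive Bayes classifier over binary structural features. This is only a minimal sketch: the text here does not describe the authors' NBN implementation or descriptors, so the three feature columns, the toy labels, and the Laplace smoothing constant `alpha` are all hypothetical. Note that the model sees only 2D feature bits and class labels, with no 3D structural information.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model.

    X: list of equal-length 0/1 feature lists (hypothetical substructure bits).
    y: list of labels, 1 = lost activity against the mutant, 0 = retained it.
    alpha: Laplace smoothing constant (an assumption, not from the source).
    """
    n_features = len(X[0])
    model = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = (len(rows) + alpha) / (len(X) + 2 * alpha)
        # Smoothed probability that each feature bit is set within this class.
        p_on = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (math.log(prior), p_on)
    return model

def predict(model, x):
    """Return the label with the highest log-posterior for feature vector x."""
    scores = {}
    for label, (log_prior, p_on) in model.items():
        score = log_prior
        for bit, p in zip(x, p_on):
            score += math.log(p if bit else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set: three hypothetical feature bits per compound.
X = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = train_bernoulli_nb(X, y)

print(predict(model, [1, 0, 0]))  # compound carrying the first feature
print(predict(model, [0, 1, 0]))  # compound lacking it
```

In this toy set the first feature bit is present in every compound labeled as losing activity, so it dominates the prediction; real descriptor sets would be far larger and noisier.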
RESULTS AND DISCUSSION Our ultimate goal for this study was to create a workflow for constructing a machine learning model that would predict resistance susceptibility for compounds when a residue was mutated to any other amino acid. If this were to fail, we would then separate out the individual mutant types (e.g., Y181C) and perform a more limited analysis. We used the ChEMBL database as our source of data on compounds known to be active against RT or any mutant form of RT.16 ChEMBL was selected because of the rigorous curation that activity data undergo before being incorporated.17 We found 3899 entries concerning RT activity, and we decided to focus first on the Y181 mutation, one of the major mutations responsible for a 50-fold loss of efficacy of nevirapine.11 We focused only on single-mutant data, as only a handful of compounds in the ChEMBL dataset had double-mutant data (both the Y181 and K103 residues being altered).
We then processed the ChEMBL dataset as shown in the schematic workflow in Figure 1. Selecting for compounds with Y181 mutant data gave 340 compounds out of the 3899. Removing compounds that were tested against the Y181-K103 double mutant gave 311 compounds that had data against Y181 mutants only. We then selected those compounds from the 3899 that had data against wild-type (WT) RT and filtered them for compounds that were also in the set of 311 Y181 mutant compounds, giving 308 compounds with WT data. Finally, we filtered the Y181 compounds using the WT compounds to give a Y181 mutant set of 308 compounds, which were common to both sets. The logic behind this was to find a set of compounds that had both Y181 mutant and WT activity, which allowed us to quantify the degree of activity loss by calculating the fold change for each compound.
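The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration on invented records, not the authors' workflow. The `Record` layout, the target naming (`"WT"`, `"Y181C"`, `"Y181C/K103N"`), the compound identifiers, and the use of IC50 values in nM are all assumptions. The fold change is taken here as mutant activity divided by WT activity, which is also an assumption about the direction of the ratio.

```python
from collections import defaultdict
from typing import NamedTuple

class Record(NamedTuple):
    compound: str   # hypothetical compound identifier
    target: str     # "WT", or mutations such as "Y181C" or "Y181C/K103N"
    ic50_nm: float  # assumed activity measure and unit

def mutated_residues(target):
    """Map 'Y181C/K103N' to {'Y181', 'K103'}; 'WT' maps to an empty set."""
    if target == "WT":
        return set()
    return {m.strip()[:-1] for m in target.split("/")}

def fold_changes(records):
    """Apply the selection steps and return {compound: mutant/WT ratio}."""
    by_compound = defaultdict(list)
    for r in records:
        by_compound[r.compound].append(r)

    # Step 1: compounds with any Y181 mutant data.
    y181 = {c for c, rs in by_compound.items()
            if any("Y181" in mutated_residues(r.target) for r in rs)}
    # Step 2: drop compounds tested against the Y181-K103 double mutant.
    double = {c for c, rs in by_compound.items()
              if any({"Y181", "K103"} <= mutated_residues(r.target) for r in rs)}
    y181_only = y181 - double
    # Step 3: keep only compounds that also have wild-type data.
    wt = {c for c, rs in by_compound.items()
          if any(r.target == "WT" for r in rs)}
    common = y181_only & wt

    # Step 4: fold change per compound (first measurement of each kind).
    result = {}
    for c in sorted(common):
        wt_val = next(r.ic50_nm for r in by_compound[c] if r.target == "WT")
        mut_val = next(r.ic50_nm for r in by_compound[c]
                       if "Y181" in mutated_residues(r.target))
        result[c] = mut_val / wt_val
    return result

records = [
    Record("cpd-A", "WT", 10.0), Record("cpd-A", "Y181C", 500.0),
    Record("cpd-B", "WT", 5.0), Record("cpd-B", "Y181C", 5.0),
    Record("cpd-B", "Y181C/K103N", 900.0),   # double-mutant data: removed
    Record("cpd-C", "Y181I", 20.0),          # no WT data: removed
    Record("cpd-D", "WT", 2.0), Record("cpd-D", "K103N", 40.0),  # no Y181 data
]
print(fold_changes(records))  # {'cpd-A': 50.0}
```

Using set operations keeps each selection step separate and easy to check against the compound counts reported at that step. Replicate measurements would need an explicit aggregation rule, which this sketch sidesteps by taking the first value.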
To ensure that we had a molecularly diverse set of compounds, we clustered the Y181 dataset, resulting in 16 clusters with an average of 19.5 molecules per cluster (Scheme 1 and Supporting Information). Visual inspection showed a wide degree of chemical diversity across the cluster centers for clusters containing more than one molecule. Table 2 summarizes the wild-type RT activities for the 308 compounds and highlights that the analysis performed in this study is concerned with the loss of activity for highly active compounds resulting from a Y181 RT mutant (75% of compounds show