Autoencoder-based detection of dynamic allostery triggered by ligand

Yuko Tsuchiya 1*, Kei Taneishi 2, Yasushige Yonezawa 3. 1) Artificial ... 2) Cluster for Science, Technology and Innovation Hub, RIKEN, 6-7-3 Minatoji...
0 downloads 0 Views 702KB Size
Subscriber access provided by Nottingham Trent University

Bioinformatics

Autoencoder-based detection of dynamic allostery triggered by ligand binding based on molecular dynamics Yuko Tsuchiya, Kei Taneishi, and Yasushige Yonezawa J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.9b00426 • Publication Date (Web): 06 Aug 2019 Downloaded from pubs.acs.org on August 7, 2019

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Autoencoder-Based Detection of Dynamic Allostery Triggered by Ligand Binding based on Molecular Dynamics Yuko Tsuchiya 1*, Kei Taneishi 2, Yasushige Yonezawa 3

1) Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan 2) Cluster for Science, Technology and Innovation Hub, RIKEN, 6-7-3 Minatojimaminamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan 3) High Pressure Protein Research Center, Institute of Advanced Technology, Kindai University, 930 Nishimitani, Kinokawa, Wakayama, 649-6493, Japan

Keywords: autoencoder, molecular dynamics simulation, dynamic allostery, clustering analysis, drug discovery, disease mechanism

ACS Paragon Plus Environment

1

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 39

Abstract Dynamic allostery on proteins, triggered by regulator binding or chemical modifications, transmits information from the binding site to distant regions, dramatically altering protein function. It is accompanied by subtle changes in side-chain conformations of the protein, indicating that the changes in dynamics, and not rigid or large conformational changes, are essential to understand regulation of protein function. Although a lot of experimental and theoretical studies have been dedicated to investigate this issue, the regulation mechanism of protein function is still debate. Here, we propose an autoencoder-based method that can detect dynamic allostery. The method is based on the comparison of time fluctuations of protein structures, in the form of distance matrices, obtained from molecular dynamics simulations in ligand-bound and -unbound forms. Our method detected that the changes in dynamics by ligand binding in the PDZ2 domain led to the reorganization of correlative fluctuation motions among residue pairs, which revealed a different view of the correlated motions from the PCA and DCCM. In addition, other correlative motions were also found as a result of the dynamic perturbation from the ligand binding, which may lead to dynamic allostery. This autoencoderbased method would be usefully applied to the signal transduction and mutagenesis systems involved in protein functions and severe diseases.

ACS Paragon Plus Environment

2

Page 3 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1. Introduction Regulator binding, such as a ligand, induces a change in protein dynamics wherein the information of the dynamics changes are transmitted to distant regions in the protein without large conformational deformations.1–6 This event, known as dynamic allostery, regulates protein functions and is accompanied by changes in the conformational fluctuation of the protein around the stable state.1,7 The introduction of mutations into proteins can also induce dynamic allostery, which is triggered by the transformations in side-chain conformational dynamics at the mutation sites and can be transmitted to the distant region. It alters protein function and its interactions with other molecules, which frequently lead to severe diseases. An understanding of dynamic allostery can provide information that is essential to understand disease mechanisms and for drug discovery. In this study, we only focused on the allosteric motion of proteins produced by the change of side-chain dynamics, which cannot be explained by the (relatively large) main-chain conformational changes and rigid-body motions. Dynamic allostery has been analyzed by experimental and computational techniques, such as nuclear magnetic resonance (NMR) and molecular dynamics (MD) simulations. Fuentes et al. 8 and Zhang et al.9 measured side-chain methyl dynamics by NMR to reveal the allosteric behavior in the second PDZ domain (PDZ2) of human tyrosine phosphatase 1E (hPTP1E) by binding the ligand (Ras-associated guanine nucleotide exchange factor 2, RA-GEF2). The PDZ2 domain is a 96-residue protein that plays important roles in the regulation of multiple

ACS Paragon Plus Environment

3

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 39

receptor-coupled signal transduction pathways.10 Therefore, it is often been used as a model protein for studying dynamic allostery by ligand binding. Additionally, the structural features of the PDZ2 domain protein were revealed by another NMR study.10 The authors reported that the domain consists of six β-strands and two α-helices. These β-strands form two β-sheets, wherein one sheet consists of β1, β6, β4, and β5 strands and another sheet consists of β2, β3, and β4 strands. The two sheets also form a β-sandwich topology. The hydrophobic core is formed by the L20, V22, V58, V61, V75, L78, and L87 residues. A portion of the core comprising a hydrophobic pocket formed by the L18, L20, V22, and L78 residues is filled by the V8 C-terminal ligand residue. NMR studies by Fuentes et al. 8 and Zhang et al.9 have shown that eight residues, which display significant changes in picosecond-nanosecond (ps-ns) side-chain dynamic structure parameters in the ligandbound form, are located in regions distant from the ligand binding site. This suggests that the energy of ligand binding is transmitted from the direct ligand binding residues through a residue–residue interaction network to the distal regions.8 For instance, several computational studies revealed key residues and residue pairs for PDZ2 allostery by analyzing protein dynamic motions, such as the degree of conformational fluctuation and correlation between fluctuations of two residues, which were obtained from MD simulations of ligand-bound and -unbound PDZ2 domains.11–14 The key residues and pairs found in these studies were then compared with those found in the experimental study by

ACS Paragon Plus Environment

4

Page 5 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Fuentes et al,8 in the respective manuscripts, which showed partial overlaps between computational and experimental studies. However, it is not known whether largely fluctuating residues and correlative pairs of those residues are involved in dynamic allostery. Dynamic allostery should result from the changes in protein fluctuations triggered by ligand binding or introduction of mutations without large conformational changes.8,9,11 Thus, analyses of subtle changes, particularly side-chain conformational fluctuations, are essential to understand allosteric effects. These subtle changes may be disregarded if the averaging of structural coordinates in MD trajectories is executed by least-square fitting of the coordinates during the simulation, e.g., into the initial coordinate, due to incomplete structural superimposition. In many studies, this kind of averaging is usually executed. Here, we propose a novel method to analyze dynamic allostery triggered by ligand binding or mutagenesis, without dropping structural information by averaging, which detects motion patterns observed only in the ligand-bound form or mutant (holo form), but not in the ligandunbound form or wild-type (apo form). The main component of this method comprises a multilayer autoencoder, an unsupervised neural network having an encoder–decoder architecture, wherein the encoder part learns and represents an input model via lowerdimensional features and the decoder part reconstructs the original input model using lowdimensional features.15–21 The autoencoder is widely used for dimensional reduction, feature extraction, representation learning, and pattern classification. The encoder part of the

ACS Paragon Plus Environment

5

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 39

autoencoder learns hidden relational information of the original input data and extracts its representative features by compressing the data into a low-dimensional code. The decoder part reconstructs the original data from the low-dimensional encoded data. The autoencoder learns the weights by minimizing a mean squared error of differences between the input and reconstructed data. For example, Chicco et al.22 constructed a deep autoencoder that predicts new gene ontology annotations from the features of known annotations extracted by the autoencoder during the training step. Wang et al.18 used the autoencoder to learn and extract the feature of amino acid sequences of drug target proteins to predict drug–protein interactions. Tan et al.23 extracted key biological principles from breast cancer gene expression data to classify tumor and normal samples predicting estrogen receptor status and enriched signaling pathways. In our new method, the autoencoder was first trained based on the data of “time fluctuations of protein motions” in the apo form, and the data in the holo form was then applied to the trained autoencoder. The output showed motion patterns that were modified and exaggerated according to the features in the apo data. The difference between the input (original data) and output (modified data), referred to as DIO, can represent essential and unique motion patterns observed only in the holo form. This paper introduces our novel method by explaining the procedure for analyzing dynamic allostery in the hPTP1E PDZ2 domain triggered by binding of the ligand Rap guanine nucleotide exchange factor 6 (RAPGEF6). First, MD simulations of PDZ2 with

ACS Paragon Plus Environment

6

Page 7 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

and without the ligand were performed. The distance between the centers of mass of side-chain atoms in a pair of residues was calculated in respective time steps from 50 ns to 200 ns at intervals of 0.1 ns (1500 time steps) in the MD trajectories. The autoencoder was designed to accept an input of a 1500-dimensional vector of the distances in each residue pair and to produce an output of a 1500-dimensional vector. We trained the autoencoder based on the data in ligand-unbound (apo) form and used it to inspect data in both the ligand-bound (holo) and unbound (apo) forms. The clustering of the DIOs in both the holo and apo forms together into several groups based on their similarity detected correlatively fluctuating protein–protein and protein–ligand residue pairs in the holo form, with the key residues that lead to the correlative motions. These correlative motions result from the change in dynamics by ligand binding, which may be involved in dynamic allostery. The comparison with the conventional analyses, such as principal component analysis (PCA), dynamical cross-correlation matrix (DCCM) and NMR study, revealed that our new method detected several outstanding residues that not only fluctuate in a long time scale and with a high correlation, as revealed in the conventional methods, but also lead other residues to fluctuate correlatively in a specific pattern. We believe that this method can be used to detect essential protein residues leading to changes in protein functions during mutagenesis. This information will help in understanding of disease mechanisms and in drug discovery.

ACS Paragon Plus Environment

7

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 39

2. Methods The procedure for the analysis of dynamic allostery is summarized in Figure 1. MD simulations of the PDZ2 domain in ligand-bound and -unbound forms MD simulations were performed twice for the apo and holo forms of the PDZ2 domain in human PTPN13 (identical with PTPN1E). We used crystal structures in apo (chain A of PDBID: 3lnx, containing the coordinates of the residues from 1 to 94) and holo (chains A and B of PDBID: 3lny, in which the A and B chains contain the coordinates of the PDZ2 protein residues 1 to 94 and ligand residues 3 to 8) forms as the initial structures of simulations, wherein the peptide ligand RAPGEF6 is identical to RA-GEF2.9 Ligand binding did not lead to substantial conformational changes because of the small RMSD (0.29Å) of the Cα atoms between the apo and holo forms and only altered fluctuation patterns. In this study, all simulations were performed using the Gromacs package 2018.24 The AMBER99SB-ILDN force field was used as the all-atom potential energy function. After energy minimization for approximately 50,000 steps and equilibration for 1 ns (500 ps NVT and then 500 ps NPT steps), we performed MD production runs of 200 ns at 300 K in dodecahedron boxes with a margin of 10 Å from the boundary, which were filled with TIP3P water molecules. MD product runs were terminated at 200 ns because RMSD of the whole system was sufficiently stabilized. Hereafter, the datasets of MD trajectories in apo

ACS Paragon Plus Environment

8

Page 9 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

and holo forms are referred to as apo1 and apo2, and holo1 and holo2, respectively. Distance fluctuation between residues in proteins The distance between the centers of mass of side-chain atoms in a pair of residues was calculated in respective time steps from 50 ns to 200 ns by 0.1 ns increments (1500 steps) in each of the four trajectories obtained by MD simulations. The data of the distances were then reconstituted as a 1500-dimensional vector for each residue pair. We did not consider hydrogen atoms in this calculation. The 1500-dimensional vector represents the time fluctuation of protein motions between two residues at certain time points. The vector was used as input data into the autoencoder to detect the features of correlative motions (Figure 2). For all residue pairs, including the ligand residues in all the trajectories, distance-based input vectors were prepared. Multilayer pyramidal autoencoder The autoencoder is an unsupervised neural network, which is widely used for dimensional reduction, feature extraction, and pattern classification.15,17–21 Our multilayer autoencoder consisted of nine fully-connected layers containing 1500, 1000, 500, 200, 100, 200, 500, 1000, and 1500 neurons, respectively, to obtain the best result in terms of loss and accuracy in the training and validation steps among various combinations of hyperparameters, such as the number of layers and neurons. The autoencoder was trained using the 1500-dimensional input vectors of the apo form, in which the dataset was randomly divided into training (80% of apo

ACS Paragon Plus Environment

9

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 39

data; vectors of 3497 residue pairs) and validation data (20%; vectors of 874 residue pairs). We used the Adam optimizer with a 0.00001 coefficient of learning rate. The training data were divided into data of every 100 residue pairs (batch size = 100). The training was iterated by 30000 epochs and terminated when validation loss decreased. All the datasets in the apo and holo forms were then inspected by the trained autoencoder. DIO The difference between each element in the input vector, showing the time fluctuation of protein motions, and that in the output vector, showing the time fluctuations of protein motions exaggerated according to the features in apo data (Figure 3), was calculated for each residue pair. The vector of the difference is referred to as DIO. As shown in Figures 4A and S1, many residue pairs in holo forms have non-zero DIOs, while most of the residue pairs in apo forms have near-zero DIOs. We assumed that the residue pairs with non-small values of DIO elements fluctuated in unique motion patterns, which were not observed in the apo form. Therefore, the analyses of DIOs were performed to detect the unique features of protein motions in the holo forms. Clustering of residue pairs in the apo and holo forms The protein–protein, protein–ligand, and ligand–ligand residue pairs in the apo and holo forms were clustered based on DIO vectors using the “dist” and “hclust” functions in the R program.25 We used “cosine” and “ward.D2” methods in the dist and hclust functions,

ACS Paragon Plus Environment

10

Page 11 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

respectively, because we assumed that clustering of DIO vectors based on cosine similarity would segment residue pairs according to the correlativity of the motion patterns at certain time steps. The “cuttree” function, which cuts a dendrogram tree into several groups by specifying the desired number of clusters, was also used to observe the process of dividing residue pairs in the apo and holo forms. We executed the clustering of four combinations of apo and holo datasets (apo1 and holo1, apo1 andholo2, apo2 and holo1 and apo2 and holo2). For each combination, the clustering was executed six times, with the cuttree value from two to seven for each of the sets (Figures 4B and S2). We also clustered the input and output vectors for confirmation (Figures 4C and S3 and 4D and S4, respectively), which revealed no clear separations between residue pairs in the apo and holo forms. This may be because the absolute values of the elements in the input and output vectors are much larger than those of the difference between apo and holo forms.

3. Results We performed the autoencoder-based regression method (Figure 2) to detect unique motion patterns observed only in holo forms, which resulted from the changes in protein dynamics triggered by ligand binding. Figure 1 summarizes the procedure for the analysis of dynamic allostery, which is described in detail in the Methods section. In this study, we focused on allostery in the PDZ2 domain of human tyrosine phosphatase PTPN13 triggered by the binding

ACS Paragon Plus Environment

11

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 39

of the ligand RAPGEF6. First, we performed MD simulations twice for the apo and holo forms. This was done because the time fluctuation of protein motions in certain time steps may be diverse in respective MD runs, even in the same protein. The autoencoder was used for the comparison of dynamic futures between the apo and holo forms, where the encoder part learned and represented an input model into a latent space which abstracts the trajectory patterns in the form of the lower-dimensional vector space, and the decoder part reconstructed the original input model from the latent space using the low-dimensional feature in the hidden layer. We constructed two multilayer autoencoders, each of which was trained based on either dataset of the two runs of MD simulations for the apo form (apo1 and apo2 datasets). Hereafter, the autoencoders trained using apo1 and apo2 datasets are referred to as apo1-trained and apo2-trained autoencoders, respectively. The inspections by the two autoencoders were performed for each of the holo1 and holo2 datasets, respectively, which outputted the data in holo forms modified and exaggerated according to the dynamic features in apo data (Figure 3). Each element of an input vector to the autoencoders represents the distance between the centers of mass of side-chain atoms in a pair of residues at a time step, and the vector shows the time fluctuations of protein motions in 1500 time steps. An output vector in the inspection step shows the time fluctuations of protein motions modified and exaggerated

ACS Paragon Plus Environment

12

Page 13 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

according to the features in the apo data (Figure 3), and DIO is the difference between the input and output vectors, as described above. Thus, DIOs in holo forms represent subtle changes in side-chain dynamics of the protein by ligand binding, while those in apo forms are almost zero (Figure 4A) because of the function of the autoencoder. This implies that DIOs should be valuable for analyzing dynamic allostery resulting from ligand binding and is accompanied by only subtle changes in side-chain conformations. Clustering of DIOs in apo and holo forms To extract unique patterns of correlative motions between residue pairs observed only in the holo form, we clustered DIOs for all the residue pairs in both the apo and holo forms based on the cosine similarity between the DIO vectors, i.e., the correlativity of motion pattern between the residue pairs. We performed the clustering of both apo and holo data together to confirm whether these data were separated or not. Clustering was executed for each of the four combinations of apo and holo datasets. The first was the dataset of DIOs obtained from the inspection of apo1 and holo1 by the apo1-trained autoencoder (apo1–holo1 dataset). The second was the dataset of apo1 and holo2 by the apo1-trained autoencoder (apo1–holo2). The third was the dataset of apo2 and holo1 by the apo2-trained autoencoder (apo2–holo1). The fourth was the dataset of apo2 and holo2 by the apo2-trained autoencoder (apo2–holo2). The result of clustering, i.e. dendrograms, for the four datasets of apo and holo DIOs are shown in Figures 4B and S2A-S2D. The division process of residue pairs from two to seven

ACS Paragon Plus Environment

13

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 39

groups are summarized in Figure S2E. Tables 1 and S1 summarize the essential protein and ligand residues in each group in each dataset. The residues whose frequency of appearance in the group were higher than 80%, shown in Table 1 and in the third column of Table S1, may lead to the correlative motion characteristics in the group, referred to as “leading” residues. Residues whose frequency was from 60% to 80% in the fourth column of Table S1 accompany the leading residues and must be related to dynamic allostery. We compared the clustering results of the four apo and holo datasets. The four clustering results showed that the residue pairs in the apo form and those in the holo form were separated into different groups in the first division. In addition, the results of the following clustering, from the second to sixth divisions, featured similar sets of the leading protein and ligand residues in the holo form. However, several leading resides were differently treated in every dataset as follows: Residue set 1: Leading protein residues were T28, S29, V30, R31, H32, and E67. Protein residues from T28 to H32 were located in the β-turn region between the β2 and β3 strands, which were close to the N-terminal end of the ligand. Protein residue E67 was located between β5 and α2 in the opposite side of the ligand (Figure 5A). Protein residues from T28 to R31 had direct contact with the N-terminal ligand residues E3 and Q4. Two residues have a “direct contact” when any atoms in a residue are located within 4 Å from atoms in the other residue. Although this criterion is rather strict, it can cover hydrogen bonds.

ACS Paragon Plus Environment

14

Page 15 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Protein residues T28, S29, and V30 were featured as leading residues along with protein residue H32 and R31 or E67 in clustering of the apo1–holo1 and apo2–holo1 datasets, although they were not separated from the largest cluster containing the remaining residue pairs, during the six-times clustering in the apo1–holo2 and apo2–holo2 datasets. However, the leading protein residue R31 was featured in all datasets and accompanied ligand residues E3 and Q4 in the apo1–holo1 dataset and the protein residue H32 in the other datasets (apo1–holo2, apo2– holo1, and apo2–holo2). Residue set 2: Leading protein residues K72, Q73, E76, R79, and N80 were located on the α2 helix and were close to the ligand (Figure 5B). Protein residues K72 and R79 had direct contact with ligand residues Q4 and S6 and S6, A7, and V8, respectively. Protein residues K72, E76, and R79 were featured in the apo1–holo2, apo2–holo1, and apo2–holo2 datasets along with the ligand residues E3 and Q4, as leading residues. These protein residues were not separated from the largest cluster only in the apo1–holo1 dataset. Residue set 3: Leading protein residues N14, D15, and N16 were present in the β-turn region between β1 and β2 strands, Q43 (close to the N-terminal of α1 helix), K54 (between α1 and β4), and Q83 (close to the N-terminal of β6 strand) (Figure 5C). These residues led to correlative fluctuations accompanying surrounding residues, such as the protein residues A12 and K13 (on the β1 strand and close to the C-terminal of β1); S17, L18, and G19 (in the β-turn region between β1 and β2); I20, S21, and V22 (on β2); K38, A39, and V40 (on β3); I41 and

ACS Paragon Plus Environment

15

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 39

P42 (in the linker between β3 and α1); G44, A45, A46, E47, and S48 (on α1); N62 (close to the C-terminal of β4); T81 and G82 (in the linker between α2 and β6); V84 (on β6); and ligand residue A7 (Figure 5D). Protein residues K13, N16, S17, L18, G19, I20, S21, and V22 had direct contact with the C-terminal of the ligand, particularly ligand residues A7 and V8. Protein residue N16 was featured along with N14 and Q83 in the apo1–holo1, apo1– holo2, and apo2–holo1 datasets as a leading residue, whereas residue N16 was grouped with protein residues K72, E76, and R79 and ligand residues E3 and Q4 in the apo2–holo2 dataset as an accompanying residue. Residue set 4: Leading ligand residues E3, Q4, and V5 were observed in N-terminal region of the ligand (Figure 5B), which were featured as leading residues in all datasets. However, ligand residue V8 was not featured in any clustering, i.e., the ligand residue V8 did not lead to correlative motions in any groups in any of the datasets. The ligand residues E3 and Q4 lead to the correlative motions together with protein residue R31 in the apo1–holo1 dataset, whereas they were featured along with protein residues K72, E76, and R79 in other datasets (apo1–holo2, apo2–holo1, and apo2–holo2). Residue set 5: Leading protein residues P1, K2, P3, Q93, and S94 and the accompanying residue D5 were located in N- and C-terminal flexible regions of the protein (Figure 5E). Regarding the N- and C-terminal protein regions, we did not mention the feature of their motions because we assumed that these motions occurred owing to their high flexibility, rather

ACS Paragon Plus Environment

16

Page 17 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

than ligand binding and dynamic allostery.

4. Discussion Clustering of DIOs extracted four sets of essential representative residues that led to the respective correlative motions, viz., protein residues T28, S29, V30, R31, H32, and E67 (residue set 1); protein residues K72, Q73, E76, R79, and N80 (residue set 2); protein residues N14, D15, N16, Q43, K54, and Q83 (residue set 3); and ligand residues E3 and Q4 (residue set 4). The clustering results also implied that ligand binding altered the protein dynamics as observed in the holo forms and reorganized the residue pairs into at least two large dynamic clusters in which the residue pairs fluctuated correlatively. One cluster contained the leading protein residues in residue sets 1 and 2, which were particularly affected by the binding of Nterminal ligand residues (residue set 4). Another cluster consisted of residues in residue set 3 that were particularly affected by the binding of C-terminal ligand residues. Dynamic cluster 1 Protein residues K72, E76, and R79 (residue set 2) were most affected by the binding of the N-terminal ligand residues E3 and Q4, indicating that the fluctuations involving these protein residues often correlated with those of the N-terminal region of the ligand. However, protein residues from T28 to H32, particularly R31 (residue set 1), were directly affected by the N-terminal ligand residues E3 and Q4. The fluctuations originating from these protein

ACS Paragon Plus Environment

17

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 39

residues correlated with those from ligand residues when protein residues K72, E76, and R79 fluctuated correlatively with those particularly involving the ligand residue V5. These observations suggest that the correlative fluctuations between residue pairs triggered by the binding of N-terminal ligand residues can be interchanged between the above two sets of residues, which may lead to dynamic allostery (Figure 5F): when the fluctuations arising from the protein residue R31 correlate with those of N-terminal ligand residues, protein residues from T28 to V30 and fluctuations from H32 lead to other correlative fluctuations along with protein residue E67, which is located on the opposite side of the ligand. At that point, fluctuations involving protein residues K72, E76, and R79 were affected by binding of the ligand residue V5. However, when fluctuations from protein residues K72, E76, and R79 were directly affected by the binding of the N-terminal region of the ligand, protein residue R31 led to correlative fluctuations along with protein residues from T28 to V30, and particularly H32. Protein residues from T28 to V30 could lead to other fluctuations along with protein residues on the opposite side of the ligand, such as E67, which may also be involved in dynamic allostery. Dynamic cluster 2 Protein residues in residue set 3 seem to form a large “structural cluster” that contains a three-stranded β-sheet that includes C-terminal ligand residues along with the α1 helix, which overlaps with a part of the hydrophobic core as described above (Figure 5D).

ACS Paragon Plus Environment

18

Page 19 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Therefore, they fluctuate correlatively. Protein residue N14 along with N16 and Q83 (residue set 3) would have the most effect on the correlative motion in residue set 3 (Figure 5C). Residues N14 and Q83 are located close to the C-terminus of the ligand but do not come in direct contact with the ligand. However, protein residue N16 makes direct contact with ligand residue V8, and the fluctuations involving residue N16 can correlate with those involving protein residues K72 and R79 (residue set 2) and the N-terminal ligand residues E3 and Q4 (residue set 4) in the apo2–holo2 dataset. Residues K72 and R79 had direct contact with ligand residues Q4 and V8, respectively, and their motion patterns were affected by the N-terminal region of the ligand, as described above. The fluctuations of ligand residue V8 were clustered into the group containing the protein residues that make direct contact. These observations suggest that protein residues N16 and R79 and ligand residue V8 contribute to the change in dynamics, as is the case with protein residues R31 and K72 and ligand residue E3 (Figure 5F). We believe that these changes in dynamics would lead to dynamic allostery. Comparisons with other methods We also analyzed the correlative motions in the PDZ2 domain by other methods, such as principal component analysis (PCA)26 and dynamical cross-correlation matrix (DCCM),27,28 based on our MD trajectories. Figures S5A and S5B show the results of the PCA in apo1 and holo1 datasets, respectively. Since our new method searches the relative motion in each residue pair, differently from the PCA that finds the collective motion of the whole structure, it cannot

ACS Paragon Plus Environment

19

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 39

directly compare the result by PCA with that by our method. Therefore, we examined the relationship between the “average Cα-Cα distances between two residues” and the “variances of the distance” in the snapshots from 1401 to 1500 frames (time steps) that are included in the cluster colored yellow in Figures S5A and S5B. Figures S5C and S5D show the distance-variance relationships in apo1 and holo1 datasets, respectively, where most of the residue pairs with a large variance comprise the residues in the flexible N-terminal or C-terminal regions (residue numbers from 1 to 5 or from 93 to 94, respectively). In apo forms, the residue pairs involving the residues with a large side-chain conformation in the ligand binding site, such as Lys72 and Arg79 (cross marks in Figure S5C), also have large variances. It suggests that the PCA of our datasets focuses on the residues with large fluctuations. Figures S6A and S6B show the DCCMs of the motions of Cα atoms in apo1 and holo1 datasets, respectively. The results overall indicate the higher correlation of atomic motions in holo form than those in apo form. As described in the legend of Figure S6 in detail, the thirty-three residues included in the blue-colored annotations on the figures, are located in the ligand binding site, and fluctuate correlatively in respective annotated regions, together with the ligand if it’s in holo form. Among these residues, our new method detected all the thirty-three residues (thirteen leading residues: D15, N16, S21, N27, T28, S29, V30, R31, K38, K72, Q73, E76, and R79, and twenty accompanying residues: S17, V22, T23, G24,

ACS Paragon Plus Environment

20

Page 21 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

G25, V26, G34, I35, Y36, V37, R57, V58, G68, A69, T70, H71, A74, V75, T77, and L78) as important residues which lead to the correlative fluctuations in the respective groups, listed in Tables 1 and S1. This comparison suggests that several outstanding residues that fluctuate together with a high correlation, cooperatively lead other residues to fluctuate correlatively in a group-specific fluctuation pattern. Thus, our analyses detected not only the residue pairs fluctuating with a high correlation, as with the DCCM analysis, many of which were detected as leading and accompanying (i.e. outstanding) residues, but also the patterns of correlative fluctuations that are led by the outstanding residues. In addition, differently from the PCA and DCCM analyses, our method does not require the least-square fitting of the MD trajectories. The least-square fitting may generate severe uncertainty in the results and lead to the loss of important information. These observations indicate that our new method has an advantage in the MD simulation-based analysis of dynamic allostery particularly produced by subtle changes of side-chain conformations. The study by Fuentes et al.8 is one of the fascinating studies to determine dynamic allostery in the PDZ2 domain based on the NMR analysis. The authors identified twenty residues, which displayed significant changes in ps-ns side-chain methyl dynamics parameters in ligand-bound form, suggesting that these residues are involved in dynamic allostery. In our analyses, all the twenty residues (eight leading residues: D15, N16, N27, T28, V30, R31, V64, and V85, and twelve accompanying residues L18, I20, V22, V26, G34, A39, V40, V61, L66, A69, L78, and

ACS Paragon Plus Environment

21

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 39

T81), were detected as important residues as shown in Tables 1 and S1. It is noteworthy that our approach is basically different from the Fuentes’s study; they have correctly measured the fluctuations of residues in ps-ns time scale, while we detected the correlativeness in the fluctuation patterns of residue pairs with respect to the timefluctuation of the distance between two residues. The comparison result suggested the possibility that the residues fluctuating on a long time scale lead other residues to fluctuate correlatively in a specific pattern and our new method detects these important residues, as leading and accompanying residues.

It should be noted that the characteristics of the four input datasets should not be identical, even between apo1 and apo2 and between holo1 and holo2, because they were obtained from four independent runs of MD simulations. Therefore, it resulted in various dividing processes and clustering results of DIOs among the four datasets (Figures 4B and S2). However, as described above, all four clustering results featured similar sets of residues. For confirmation, we tried clustering of input and output vectors instead of DIO vectors under the same conditions. This resulted in no division of the apo and holo data, as shown in Figures 4C, 4D, S3, and S4. These observations indicate that DIOs can represent subtle conformational and motion changes caused by ligand binding, and suggest that the analysis of DIOs may detect changes in dynamics and dynamic allostery.

ACS Paragon Plus Environment

22

Page 23 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

In conclusion, the autoencoder-based analysis of the side-chain dynamics that detected correlative fluctuations among residue pairs in the apo and holo forms of the PDZ2 domain, demonstrating that ligand binding altered the side-chain dynamics and reorganized correlatively fluctuating residue pairs into (at least) two dynamic clusters in the holo forms. The clusters contained the residue pairs affected by the N- or C-terminal ligand residues. Several protein and ligand residues that led to the respective correlative motions, referred to as leading residues, were identified in each dynamic cluster. In addition, motion signals resulting from ligand binding could be interchanged between the leading protein residues R31 and K72 in the first cluster and between N16 and R79 in the second cluster, which may lead to different correlative motions, i.e. dynamic allostery (Figure 5F). This conception may be similar to “allosteric communication networks” of (permanent and transient) ligand binding pockets revealed through the analysis of pocket crosstalk of the component atoms observed in the MD simulations.29 The comparisons with the conventional methods, such as PCA, DCCM and NMR analyses, indicated that our method detected the residues that fluctuate in a long time scale and with a high correlation, as leading or accompanying residues, and these residues cooperatively lead several other residues to fluctuate correlatively in respective specific patterns, which were detected by the clustering of DIOs. Thus, our new method extracted important clues to understand dynamic allostery that occurs as a result of ligand binding. The success relied on the use of distance matrices of time fluctuations of protein motions without

ACS Paragon Plus Environment

23

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 39

any averaging of structural coordinates, and the use of the autoencoder-based regression method for the detection of subtle conformational and dynamic changes in the comparison of the fluctuation data in the apo and holo forms.

ACS Paragon Plus Environment

24

Page 25 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figures

MD simulations twice for apo forms

Training of autoencoder by apo1 dataset

Inspections by apo1-trained autoencoder for (1) apo1, (2) holo1, (3) holo2 datasets

MD simulations twice for holo forms

Training of autoencoder by apo2 dataset

Inspections by apo2-trained autoencoder for (1) apo2, (2) holo1, (3) holo2 datasets

Clustering of DIOs in (1) apo1-trained apo1 holo1, (2) apo1-trained apo1 holo2 (3) apo2-trained apo2 holo1, (4) apo2-trained apo2 holo2 datasets

Figure 1. Architecture of the autoencoder-based detection method for determining dynamic allostery MD simulations were performed twice for the apo and holo forms. Two sets of time fluctuation data obtained from the MD simulations of the apo form (apo1 and apo2 datasets) were used as input data to train autoencoders. The autoencoder trained based on apo1 dataset was used for the inspection of time fluctuation data in the first and second holo datasets (holo1 and holo2) and the apo1 dataset. Similarly, the autoencoder trained based on apo2 dataset was used for the inspection of the holo1, holo2, and apo2 datasets. The clustering of DIOs in the apo1 and holo1 datasets, DIOs in apo1 and holo2 datasets obtained by the apo1trained autoencoder, and DIOs in apo2 and holo1 datasets and DIOs in apo2 and holo2 datasets obtained by the apo2-trained autoencoder, were executed to classify protein–protein and protein–ligand residue pairs according to the correlative fluctuations.

ACS Paragon Plus Environment

25

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Encoder part

Decoder part ynr-1

xnr-1 xnr-2

Page 26 of 39

ynr-2

hnr-1

hnr-2

● ●





● ●

xnr-nt



hnr-nh





hidden layer

input layer

ynr-nt output layer

input vector: vinr = {xnr-1, …, xnr-nt}

output vector: vonr = {ynr-1, …, ynr-nt}

nr : residue pair#, nt : #time step, nh: #neuron in hidden layer

Figure 2. Encoder–decoder architecture of autoencoder In an autoencoder, the encoder part learns and represents an input model via lower-dimensional feature in the hidden layer and the decoder part reconstructs the original input model using the low-dimensional feature. In this study, the encoder and decoder consist of four layers, each of which contains 1500, 1000, 500, and 200 neurons in the encoder and 200, 500, 1000 and 1500 neurons in the decoder. The hidden layer consists of 100 neurons. The input (vi) or output (vo) vector consists of 1500 elements, which are the distance data in 1500 time steps.

ACS Paragon Plus Environment

26

Page 27 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

A)

B)

C)

apo dataset

apo-trained autoencoder

apo dataset

holo dataset

holo-trained autoencoder

holo dataset

holo dataset

apo-trained autoencoder

holo’ dataset

input data (vi)

output data (vo)

Figure 3. Inspection of holo data by apo-trained autoencoder The autoencoder trained by apo data (apo-trained autoencoder) was used to inspect both the holo and apo data. A) The inspection of apo data by the apo-trained autoencoder outputted the data that was almost the same as the original apo data. B) If the autoencoder trained by holo data (holo-trained autoencoder) is used for the inspection of holo data, it will output almost the holo data. C) However, the inspection of holo data by the apo-trained autoencoder outputted the data that showed the motion patterns modified and exaggerated according to the feature in the apo data.

ACS Paragon Plus Environment

27

Journal of Chemical Information and Modeling

apo1-apo1

apo1-holo1

1.0

1.0

0.8

0.8

relative frequency

relative frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 39

0.6 0.4

0.2 0.0

0.6 0.4

group1

g2 g3 g4 g5 g6

group7

0.2 0.0

-3 -2 -1

0

1

2

3

DIO

-3 -2 -1

0

1

2

3

DIO

A) histograms of DIOs

B) DIO clustering

C) input clustering

D) output clustering

Figure 4. Clustering results in the apo1-holo1 dataset A) Histograms of elements of DIOs based on apo1-trained autoencoder in apo1 (left) and holo1 (right) data. B) The clustering of DIOs in the apo1–holo1 dataset separated the protein– protein, protein–ligand, and ligand–ligand residue pairs into seven groups according to the correlativity of the motion patterns. The order of separations, i.e. group IDs (group1, group2 (g2), group3 (g3), group4 (g4), group5 (g5), group6 (g6), and group7), was decided according to the dendrogram. C-D) Dendrograms of the clustering of input and output vectors in apo1–holo1 dataset, respectively. The dendrograms and the detailed descriptions of the clustering of DIO, input and output vectors in all the datasets are summarized in Figures S2, S3 and S4, respectively.

ACS Paragon Plus Environment

28

Page 29 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

A)

B)

C) Interchange of dynamics

Dynamic cluster1

Interchange of dynamics Dynamic cluster2

D)

E)

F)

Figure 5. Leading protein and ligand residues in the holo form Leading protein and ligand residues are shown as ball and stick models on the crystal structure of the PDZ2 domain with the ligand (PDBID: 3lny). The structure of PDZ2 is colored gradually from blue at the N-terminus to red at the C-terminus. The ligand residue is colored grey. A) Leading protein residues from T28 to H32 (on the β turn) and E67 (back side of the ligand) in residue set 1. B) Leading protein residues K72, Q73, E76, R79, and N80 (on α2 helix) in residue set 2 and ligand residues E3 and Q4 (CPK colored upper and lower residues, respectively) in residue set 4. C) Leading residues N14, D15, N16 (on the β turn), Q43 (close to α1 helix), K54 (close to β4 strand), and Q83 (behind N16, colored orange), and ligand residue V8 (colored CPK). D) Structure cluster formed by the residues in residue set 3. E) Leading protein residues P1, K2, and P3 (colored blue) and Q93 and S94 (colored red). F) Interchanges of dynamics. The motion signals resulting from ligand binding can be interchanged between the leading protein residues R31 and K72, and N16 and R79.

ACS Paragon Plus Environment

29

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 39

Table 1. Leading residues of correlative motions based on DIO clustering Datasets 1

Group 2

Group 3

Group 4

Group 5 2

Group 6 2

apo1-holo1

T28, S29, V30, H32, E67

P1, K2, Q93, S94

N14, N16, Q83

R31 Lig E3, Q4

Q43

apo1-holo2

P1, K2, P3, Q93, S94

N14

N16, Q43, K54, Q83

R31, H32

K72, E76, R79 Lig E3, Q4, V5

apo2-holo1

T28, S29, V30, R31, H32

P1, K2, Q93, S94

N14, N16

Q83

K72, Q73, R79, N80 Lig E3, Q4

apo2-holo2

P1, K2, Q93, S94

-

N14, D15

R31, H32

K72, E76, R79 Lig E3, Q4

1) Four combinations of the apo–holo datasets. The residues whose frequency of appearance was higher than 80% in a group according to the dendrogram of the clustering result, which are designated as leading residues, are shown in each dataset. All the residues in apo forms were clustered in group 1, which had a frequency higher than 80%, in all the datasets. No residues with frequency of higher than 80% were clustered into group 1 in the holo forms. The remaining residue pairs at the sixth-time clustering were clustered into group 7. The information of groups 1 and 7 are shown in Table S1. 2) The residues appeared after “Lig”, such as E3 or Q4, belong to the ligand molecule.

ACS Paragon Plus Environment

30

Page 31 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Corresponding author Dr. Yuko Tsuchiya Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan Tel: +81-3-3599-8227 E-mail: [email protected]

Author contributions Y.T. designed and performed the molecular dynamics simulations and the autoencoder, analyzed the data, and wrote the manuscript. K.T. designed the autoencoder and wrote the manuscript. Y.Y. revised the manuscript and supervised all of the research.

Funding Sources This work was in part supported by Grants-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (Grant Numbers 17K00405 to Y.T. and 15K07037, 17K05620 to Y.Y.). This work was partly supported by “Structural studies of the proteins working in novel cell membrane potential signal” in JST CREST “Structural Life Science and

ACS Paragon Plus Environment

31

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 39

Advanced Core Technologies for Innovative Life Science Research” Grant Number JPMJCR14M3. We also thanks to the HPC computing center in the Kindai University for allowing us to use computing resource.

Acknowledgments This work was in part supported by Grants-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (Grant Numbers 17K00405 to Y.T. and 17K05620 to Y.Y.). We thank Prof. H. Wako and Dr. Y. Eguchi for advice on the data analyses.

Abbreviations MD, molecular dynamics; RMSD, root-mean-square deviations; NMR, nuclear magnetic resonance

Competing interests No competing interests declared.

ACS Paragon Plus Environment

32

Page 33 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Supporting Information Supporting Information Available: List of leading and accompanying residues in all the clusters (Table S1), Histograms of DIOs (Figure S1), Clustering results of DIO, input and output vectors (Figures S2-S4), PCA and DCCM results (Figures S5-S6) (PDF)

ACS Paragon Plus Environment

33

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 34 of 39

References (1)

Cooper, A.; Dryden, D. T. F. Allostery without Conformational Change - A Plausible Model. Eur. Biophys. J. 1984, 11 (2), 103–109. https://doi.org/10.1007/BF00276625.

(2)

Pandini, A.; Fornili, A.; Fraternali, F.; Kleinjung, J. Detection of Allosteric Signal Transmission by Information-Theoretic Analysis of Protein Dynamics. FASEB J. 2012, 26 (2), 868–881. https://doi.org/10.1096/fj.11-190868.

(3)

Dubay, K. H.; Boman, G. R.; Geissler, P. L. Fluctuations within Folded Proteins: Implications for Thermodynamic and Allosteric Regulation. Acc. Chem. Res. 2015, 48 (4), 1098–1105. https://doi.org/10.1021/ar500351b.

(4)

Guo, J.; Zhou, H. X. Protein Allostery and Conformational Dynamics. Chem. Rev. 2016, 116 (11), 6503–6515. https://doi.org/10.1021/acs.chemrev.5b00590.

(5)

Liu, J.; Nussinov, R. Allostery: An Overview of Its History, Concepts, Methods, and Applications. PLoS Comput. Biol. 2016, 12, e1004966. https://doi.org/10.1371/journal.pcbi.1004966.

(6)

Greener, J. G.; Sternberg, M. J. Structure-Based Prediction of Protein Allostery. Curr. Opin. Struct. Biol. 2018, 50, 1–8. https://doi.org/10.1016/j.sbi.2017.10.002.

(7)

Motlagh, H. N.; Wrabl, J. O.; Li, J.; Hilser, V. J. The Ensemble Nature of Allostery. Nature 2014, 508, 331–339. https://doi.org/10.1038/nature13001.

(8)

Fuentes, E. J.; Der, C. J.; Lee, A. L. Ligand-Dependent Dynamics and Intramolecular

ACS Paragon Plus Environment

34

Page 35 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Signaling in a PDZ Domain. J. Mol. Biol. 2004, 335 (4), 1105–1115. https://doi.org/10.1016/j.jmb.2003.11.010. (9)

Zhang, J.; Sapienza, P. J.; Ke, H.; Chang, A.; Hengel, S. R.; Wang, H.; Phillips, G. N.; Lee, A. L. Crystallographic and Nuclear Magnetic Resonance Evaluation of the Impact of Peptide Binding to the Second PDZ Domain of Protein Tyrosine Phosphatase 1E. Biochemistry 2010, 49 (43), 9280–9291. https://doi.org/10.1021/bi101131f.

(10)

Kozlov, G.; Banville, D.; Gehring, K.; Ekiel, I. Solution Structure of the PDZ2 Domain from Cytosolic Human Phosphatase HPTP1E Complexed with a Peptide Reveals Contribution of the Β2-Β3 Loop to PDZ Domain-Ligand Interactions. J. Mol. Biol. 2002, 320 (4), 813–820. https://doi.org/10.1016/S0022-2836(02)00544-2.

(11)

Dhulesia, A.; Gsponer, J.; Vendruscolo, M. Mapping of Two Networks of Residues That Exhibit Structural and Dynamical Changes upon Binding in a PDZ Domain Protein. J. Am. Chem. Soc. 2008, 130 (28), 8931–8939. https://doi.org/10.1021/ja0752080.

(12)

Ho, B. K.; Agard, D. A. Conserved Tertiary Couplings Stabilize Elements in the PDZ Fold, Leading to Characteristic Patterns of Domain Conformational Flexibility. Protein Sci. 2010, 19 (3), 398–411. https://doi.org/10.1002/pro.318.

(13)

Kalescky, R.; Zhou, H.; Liu, J.; Tao, P. Rigid Residue Scan Simulations Systematically Reveal Residue Entropic Roles in Protein Allostery. PLoS Comput.

ACS Paragon Plus Environment

35

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 36 of 39

Biol. 2016, 12 (4), e1004893. https://doi.org/10.1371/journal.pcbi.1004893. (14)

Botlani, M.; Siddiqui, A.; Varma, S. Machine Learning Approaches to Evaluate Correlation Patterns in Allosteric Signaling: A Case Study of the PDZ2 Domain. J. Chem. Phys. 2018, 148 (24), 241726. https://doi.org/10.1063/1.5022469.

(15)

Xu, J.; Xiang, L.; Liu, Q.; Gilmore, H.; Wu, J.; Tang, J.; Madabhushi, A. Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images. IEEE Trans. Med. Imaging 2016, 35 (1), 119–130. https://doi.org/10.1109/TMI.2015.2458702.

(16)

Deng, L.; Fan, C.; Zeng, Z. A Sparse Autoencoder-Based Deep Neural Network for Protein Solvent Accessibility and Contact Number Prediction. BMC Bioinformatics 2017, 18, 569. https://doi.org/10.1186/s12859-017-1971-7.

(17)

Wang, Y.; Mao, H.; Yi, Z. Protein Secondary Structure Prediction by Using Deep Learning Method. Knowledge-Based Syst. 2017, 118, 115–123. https://doi.org/10.1016/j.knosys.2016.11.015.

(18)

Wang, L.; You, Z.-H.; Chen, X.; Xia, S.-X.; Liu, F.; Yan, X.; Zhou, Y.; Song, K.-J. A Computational-Based Method for Predicting Drug–Target Interactions by Using Stacked Autoencoder Deep Neural Network. J. Comput. Biol. 2017, 25 (3), 361–373. https://doi.org/10.1089/cmb.2017.0135.

(19)

Cerveri, P.; Belfatto, A.; Baroni, G.; Manzotti, A. Stacked Sparse Autoencoder

ACS Paragon Plus Environment

36

Page 37 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Networks and Statistical Shape Models for Automatic Staging of Distal Femur Trochlear Dysplasia. Int. J. Med. Robot. Comput. Assist. Surg. 2018, 14 (6), e1947. https://doi.org/10.1002/rcs.1947. (20)

Katsuki, T.; Ono, M.; Koseki, A.; Kudo, M.; Haida, K.; Kuroda, J.; Makino, M.; Yanagiya, R.; Suzuki, A. Risk Prediction of Diabetic Nephropathy via Interpretable Feature Extraction from EHR Using Convolutional Autoencoder. Stud. Health Technol. Inform. 2018, 247, 106–110. https://doi.org/10.3233/978-1-61499-852-5-106.

(21)

Chen, W.; Ferguson, A. L. Molecular Enhanced Sampling with Autoencoders: On-theFly Collective Variable Discovery and Accelerated Free Energy Landscape Exploration. J. Comput. Chem. 2018, 39 (25), 2079–2102. https://doi.org/10.1002/jcc.25520.

(22)

Chicco, D.; Sadowski, P.; Baldi, P. Deep Autoencoder Neural Networks for Gene Ontology Annotation Predictions. Proc. 5th ACM Conf. Bioinformatics, Comput. Biol. Heal. Informatics - BCB ’14 2014, 533–540. https://doi.org/10.1145/2649387.2649442.

(23)

Tan, J.; Ung, M.; Cheng, C.; Greene, C. S. Unsupervised Feature Construction and Knowledge Extraction from Genome-Wide Assays of Breast Cancer with Denoising Autoencoders. Biocomput. 2015 2014, 132–143. https://doi.org/10.1142/9789814644730_0014.

ACS Paragon Plus Environment

37

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(24)

Page 38 of 39

Toxvaerd, S.; Heilmann, O. J.; Dyre, J. C. Energy Conservation in Molecular Dynamics Simulations of Classical Systems. J. Chem. Phys. 2012, 136 (22), 224106. https://doi.org/10.1063/1.4726728.

(25)

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. 2018.

(26)

Stock, G.; Hamm, P. A Non-Equilibrium Approach to Allosteric Communication. Philos. Trans. R. Soc. B Biol. Sci. 2018, 373 (1749), 20170187. https://doi.org/10.1098/rstb.2017.0187.

(27)

Morra, G.; Verkhivker, G.; Colombo, G. Modeling Signal Propagation Mechanisms and Ligand-Based Conformational Dynamics of the Hsp90 Molecular Chaperone FullLength Dimer. PLoS Comput. Biol. 2009, 5 (3), e1000323. https://doi.org/10.1371/journal.pcbi.1000323.

(28)

Rivalta, I.; Lisi, G. P.; Snoeberger, N. S.; Manley, G.; Loria, J. P.; Batista, V. S. Allosteric Communication Disrupted by a Small Molecule Binding to the Imidazole Glycerol Phosphate Synthase Protein-Protein Interface. Biochemistry 2016, 55 (47), 6484–6494. https://doi.org/10.1021/acs.biochem.6b00859.

(29)

La Sala, G.; Decherchi, S.; De Vivo, M.; Rocchia, W. Allosteric Communication Networks in Proteins Revealed through Pocket Crosstalk Analysis. ACS Cent. Sci. 2017, 3 (9), 949–960. https://doi.org/10.1021/acscentsci.7b00211.

ACS Paragon Plus Environment

38

Page 39 of 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Table of Contents graphic

apo data holo data

apo-based Autoencoder

apo data holo’ data

Clustering

Dynamic Interchange Allostery of dynamics

holo-clusters

apo-cluster

ACS Paragon Plus Environment

39