Protein Dynamics and Contact Topology Reveal ... - ACS Publications

Oct 10, 2016 - Taiwan. ‡. Bioinformatics Program, Taiwan International Graduate Program and. ∥. Institute of Biomedical Sciences, Academia Sinica,...
0 downloads 9 Views 2MB Size
Subscriber access provided by CORNELL UNIVERSITY LIBRARY

Article

Protein dynamics and contact topology reveal protein-DNA binding orientation Aravind Chandrasekaran, Justin Chan, Carmay Lim, and Lee-Wei Yang J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.6b00688 • Publication Date (Web): 10 Oct 2016 Downloaded from http://pubs.acs.org on October 11, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Protein dynamics and contact topology reveal protein-DNA binding orientation Aravind Chandrasekaran1, 6, Justin Chan1, 2, Carmay Lim3,4,*, and Lee-Wei Yang1,2,5,* 1

Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Hsinchu 30013, Taiwan 2

Bioinformatics Program, Institute of Information Science, Taiwan International Graduate Program, Academia Sinica, Taipei 11574, Taiwan 3

Department of Chemistry, National Tsing Hua University, Hsinchu 30013, Taiwan 4

5

6

Institute of Biomedical Sciences, Academia Sinica, Taipei 11574, Taiwan

Physics Division, National Center for Theoretical Sciences, Hsinchu 30013, Taiwan,

Present address: Department of Chemistry and Biochemistry, University of Maryland, College Park, MD 20742, USA

*

To whom correspondence should be addressed. Tel: +88635742467; Fax +88635715934;

Email: [email protected] Correspondence may also be addressed to Carmay Lim Tel: +8862 2652-3031; Fax: +8862 2788-7641; Email: [email protected] Keywords DNA binding proteins, FT-Dock, protein dynamics, Elastic Network Model, Gaussian Network Model, Anisotropic Network Model, Intrinsic Dynamics Domains

1 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 36

Abstract

Structure-encoded conformational dynamics are crucial for biomolecular functions. However, there is insufficient evidence to support the notion that dynamics plays a role in guiding protein-nucleic acid interactions. Here, we show that protein-DNA docking orientation is a function of protein intrinsic dynamics but the binding site itself does not display unique patterns in the examined spectrum of motions. This revelation is made possible by a novel technique that locates ‘dynamics interfaces’ in proteins across which protein parts are anticorrelated in their slowest dynamics. A striking statistic is that such interfaces intersect the DNA in 97% of the 104 examined cases. These findings were then used to screen decoys generated by rigid-body docking of DNA molecules onto DNA-binding proteins. Using our method, the chance to discern near-native poses from non-native decoys increased by 2.5 and 1.6-fold, as compared to a random guess and methods based on surface complementarity, respectively. Hence, dynamically allowed protein-DNA docking orientations can work as new filters to cull and re-rank docking poses and therefore enhance the predictability of DNA-binding sites that themselves do not have distinct dynamics features. Computer software

implementing

the

method

can

http://dyn.life.nthu.edu.tw/IDD/DNA.htm

2 Environment ACS Paragon Plus

be

accessed

via

Page 3 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

1.

Introduction

Interactions between proteins and DNA are essential for regulating the transcription, degradation, and packaging of genes.1–3 Predicting the binding sites and orientations of interacting DNA-protein complexes is thus of great scientific and medicinal interest. Structures of protein-DNA complexes accrued over the past few decades have enabled a better understanding of protein-DNA binding. Knowledge-based principles obtained from statistical analysis of protein-DNA structures4–7 have been used to predict DNA-binding sites in proteins based solely on the protein sequence8 or coupled with structure information.9,10 Another way to search for potential DNA-binding sites in proteins is to exhaustively search for native docking poses between proteins and DNA on the basis of shape/electrostatic complementarity, as exemplified in FT-Dock,11 which accelerates such a search translationally and rotationally by fast Fourier transform and convolution theory. DP-Dock and ParaDock build upon this idea using a better scoring scheme and protein flexibility.12,13 DP-Dock translates and rotates the protein on the DNA, and the resulting ~104 complexes are filtered on the basis of knowledge-based potentials and clustering.12 ParaDock,13 on the other hand, accounts for DNA flexibility using solutions from rigid body docking of several DNA conformers to determine the bound conformation. In addition, several packages such as HADDOCK use knowledge-based potentials to drive correct docking of DNA to a protein12,14 taking into account the DNA flexibility in the refinement stage.15 Thus, there are two key approaches to predicting DNA-binding sites. The first is a onebody problem: Given that the protein of interest binds DNA, DNA-binding sites are predicted on the basis of the physicochemical, dynamical properties of the protein alone. The second is a two-body problem: For a given protein and DNA molecule, the binding sites and the docking orientation are predicted based on an exhaustive search of the geometric and energetic complementarity between the two bodies. This study falls in the former category,

3 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

taking into account the intrinsic dynamics of the protein. Soft normal modes have been used to guide docking of small molecules onto DNA.16 However, little attention has been paid to dynamics features of binding sites, except for a study based on elastic network models (ENMs) suggesting that the DNA molecules are anchored preferentially by rigid residues.10 ENMs have been applied to predict (i) conformational changes,17 (ii) enzyme active sites,18 (iii) ratcheting motion of ribosome,19,20 (iv) ensemble motions within RNA structures,21 (v) maturation of the bacteriophage HK97 capsid,22,23 and (vi) conformational transitions of 19 DNA-dependent polymerases.24 They have also been used to study the process by which DNA ‘selects’ an engaging conformation of a vibrating/rotating apoprotein.25 However, ENMs have not been used to identify correct docking poses, to our knowledge. In this study, we examine whether intrinsic protein dynamics can reveal possible DNAprotein binding orientations and DNA-binding sites using physical models26,27 that well describe protein conformational changes upon binding DNA. We devise a new dynamics descriptor, “domain (D)-plane”, as the interface between two anti-correlated groups of residues in the apoprotein moving in opposite directions. Such a D-plane is found to traverse the DNA in 97% of all DNA-protein complexes, but the DNA-binding sites themselves are not strictly distributed near these D-planes. Thus, intrinsic protein dynamics can reveal DNAprotein binding orientations, but not DNA-binding residues, whose slow/fast dynamics cannot be distinguished from those of non-binding residues. Using D-plane as a filter to discern near-native poses from 76 protein-DNA decoy sets, we show a statistically significant enrichment of chances in finding the near-native poses, as compared to conventional methods. As explained in the discussion, we argue that DNA binding occurs in two stages – first, the protein non-specifically lands on a DNA with its surface residues near the D-plane, followed by a search for specific DNA-binding site via intrinsic and collective motions of rigid-body rotation and vibrational ‘clamping’ about axes lying on the D-plane.

4 Environment ACS Paragon Plus

Page 4 of 36

Page 5 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

2.

Materials and Methods

2.1 Dataset of DNA-bound and DNA-free proteins. We started from a reported nonhomologous set comprising 55 pairs of DNA-bound and DNA-free crystal structures9 (see Supporting Information). The functional oligomeric states of the 55 proteins were carefully determined by literature survey or from the number of chains in biological assembly files28 of DNA-bound structures in the Protein Data Bank (PDB).29 Biological assemblies are obtained from experimental evidence, which when not found, are determined by the author or determined computationally.30,31 Only DNA-binding proteins that bind as monomers were considered to avoid complexities arising from protein-protein interactions. This yielded 22 nonredundant proteins in the dataset. Proteins undergo structural changes upon binding of particular DNA sequences32 and cofactors,33 hence a apoprotein can have more than one bound form in complex with different DNAs, which are all ≥5 bases in this study. For example, the DNA-free photolyase structure (PDB ID 1owm) has only one monomer, whereas the DNA-bound structure (PDB ID 1tez) has four protein monomers that bind independently to four DNA molecules in the asymmetric unit (Figure S1). From the 22 apoproteins, we obtained a dataset of 104 DNAbound protein structures (see Supporting Information Table S1). The functional diversity of these proteins is summarized in Supporting Information Table S2. 2.2 Gaussian network model (GNM). GNM is a coarse-grained model based on a harmonic approximation of protein dynamics around the equilibrium state corresponding to X-ray structures or energy-minimized conformations. The coarse-grained nodes (Cα atoms in the protein) within a cut-off distance  are connected by Hooke’s springs, which assume a quadratic potential. The dynamics, in terms of atoms’ positional covariance, can be described as a superposition of normal modes. The modes constituted by small eigenvalues are the slow

5 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 36

(low frequency) modes, describing collective changes of residue positions, which are considered crucial to biological functions.25 The size of residue fluctuations around the protein’s equilibrium conformation can be obtained from the Gaussian network model (GNM)34 35 whose potential takes the form,          = ∑ ,  −   .  −   Θ( −   )

= Δ !×

 ("⨂I) ×  Δ ×

(1)

where % is the force constant, assumed to be the same for all springs, independent of the residue type, N is the number of residues in a protein, Rij and R0ij are the instantaneous and

equilibrium distance vectors between residues i and j, respectively, and Θ is the Heaviside   is less than the cutoff distance  and zero  step function, which is unity when   

otherwise. Thus, the summation in Eq 1 involves only those residue pairs with a native

separation less than the threshold  . Δ represents deviations from the equilibrium states in the x, y, and z directions for the N residues, arranged in a 3N−dimensional vector; I is a 3×3 identity matrix, and " is the Kirchhoff connectivity matrix, which is given by, −1 -. - ≠ 0 123 4 ≤  Γ = + 0 -. - ≠ 0 123 4 >  − ∑ ,8 Γ -. - = 0

(2)

The N×N variance-covariance matrix 9: containing cross-correlation between residue

fluctuations is given by, 9 =

;< =

(" > )× =

;< =

> = ∑ ;@ Λ; ?; ?;

(3)

where AB is the Boltzmann constant, C is the temperature, while Λk and Uk are the eigenvalue and eigenvector corresponding to the mode k obtained by diagonalizing ", respectively. The eigenvalue and eigenvector representing the frequency and shape of a GNM mode, respectively, were obtained using cutoff distances of 6.5, 7.5, and 10 Å in this study.

6 Environment ACS Paragon Plus

Page 7 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

2.3 Anisotropic Network Model (ANM) and conformational changes upon DNA binding. The potential energy function of ANM penalizes changes of the Euclidean distance between two residues i and j from the equilibrium state: !      D = ∑ , E  E −    Θ( −   ) = Δ ×

 F ×  Δ ×

(4)

H is the Hessian (force constant) matrix, comprising second derivatives of the potential function with respect to the x, y, and z coordinates. It is readily obtained given the atomic

coordinates of protein at the equilibrium state.27,36 The cutoff distance  is 15 Å for ANM.

The 3N×3N covariance matrix 9G: that provides fluctuation correlations between residue

pairs can be derived as, 9



=

;< =

(F > )

× 

=

;< =

 = ∑;@J λ> ; I; I;

(5)

where λk and Vk are the non-zero eigenvalue and corresponding eigenvector of mode k obtained from diagonalizing F. The conformational change of protein predicted by the kth ANM mode is given as (ΔK); =



;< = IL

MNL

(6)

The correlation coefficient, Ik, between the experimentally observed displacement upon binding DNA (∆rexp) and the displacement specified by the kth ANM mode is given as,

Ik =

(7)

|∆ rexp .∆ rk | |∆ rexp || ∆ rk |

where ΔOPQR is equal to the difference between the superimposed DNA-free and DNA-

bound protein structures. To compare with the GNM results, the norm of each of the N U;

values (the kth GNM mode) is compared with the corresponding magnitude of ΔOPQR ; i.e., MΔ] + Δ_ + Δz .

2.4 Intrinsic dynamics domains, IDDs. IDDs, introduced by Li et al.,37 are characterized by clustering residues that have correlated motion into ‘domains’. The clustering is carried

7 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 36

out in reduced dimensions of essential motions extracted from a set of experimentally determined structures, MD simulation trajectories, or ENM results. The intuition to treat proteins as a set of coupled domains has been explored from MD trajectories38 and through crystal structure contact maps.39 Residues from two different IDDs of a protein move anticorrelated to each other. GNM-based IDDs were determined from the slowest mode; i.e., the first mode with a non-zero eigenvalue (see Supporting Information). 2.5 Determining GNM-based IDDs and D-plane. Each GNM mode has N elements, corresponding to N coarse-grained nodes or residues. Two residues with the same sign (+ or – ) move concomitantly in this mode, whereas those with opposite signs move in opposite directions. Based on the signs of residues, residues in a protein are classified into Res+ (red) or Res– (blue) (Figure 1b). A ‘dynamics domain’ (IDD) grows from any two residues with the same dynamics attribute with at least a pair of heavy atoms within 4 Å (Figure 1b). Several domains (usually 2 to 4 for the slowest GNM mode) could be formed in a protein. Using the softest GNM mode, we examined only the two largest IDDs with opposite dynamics traits. The dividing ‘interface’ that best separates the largest two domains, referred to as the D-plane, can be found by linear discriminant analysis37,40. The normal of the plane is a vector onto which the projection of atom coordinates from the two dynamics domains can be maximally separated. This means that the variance (CB) between projections of the two sets of residues (namely, Res+ and Res–) can be maximized on this vector given the variance within each set (CW) being held constant. Therefore, the objective is to maximize the function F(n) derived from Eqs S1 through S6. a (b) = bc9 d b bc 9 b

(8)

e

The vector n that maximizes the F(n) is the normal of the D-plane (Figure 1c). Mathematical details on how the D-plane normal was derived can be found in the Supporting Information section. If two residues with opposite dynamics attributes are sequence

8 Environment ACS Paragon Plus

Page 9 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

neighbors, their mid-point defines a transition point (the stars in Figure 1d). Both the D-plane and its normal pass through the mean of all the identified transition points (Figure 1e).

Figure 1. Determining the D-plane of a free protein and corresponding DNA-bound protein. (a) DNA-free structure represented as a cartoon. The Cα network of the protein and the connectivity matrix are obtained (see Materials and Methods). The slowest GNM mode V1 is derived from diagonalization of the connectivity matrix. (b) Based on the sign of the coarse-grained node displacement (+/–), dynamics properties are assigned to protein residues, which belong to either Res+ (red) or Res– (blue). Only the largest two clusters with opposite signs were considered for analyses. (c) To define a plane that best separates the two clusters, linear discriminant analysis (see Supporting Information) was used to find an axis (green

9 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

arrowhead) onto which the projections of Res+ (red) and Res– (blue) can be best separated, as compared to other possible axes; e.g., projections onto the axis with red arrowhead overlap. (d) The transition points (yellow stars) are the centers of any two consecutive residues with different dynamics properties. (e) The axis found in (d) called the ‘domain (D)plane normal and the D-plane (light green) are required to go through the geometric center of the transition points (black sphere). (f) The corresponding DNA-bound protein is then superimposed onto the free protein to yield the position of the DNA relative to the D-plane. 2.6 Generation of decoy sets. The docking orientation favored by the protein’s intrinsic dynamics was used as a criterion to enrich near-native poses in protein-DNA docking decoys. FTDock41 was used to perform rigid-body docking of the apoprotein onto DNA obtained from DNA-bound proteins. Since the charge allocation scheme in FT-Dock is devised for double-stranded DNA (dsDNA),14 we restricted our docking study to the 76 out of 104 cases where the protein binds to single molecule of dsDNA (Table S1 shows details of 5 proteins interacting with multiple molecules of dsDNA). FTDock performs extensive search by translating and rotating one of the two molecules (mobile molecule) on the surface of other (static molecule), and evaluates docking quality based on both surface complementarity and electrostatics11. We generated two sets of decoy poses – one treating DNA as the static molecule with a translating/rotating protein (referred to as the sDNA decoy set) and the other treating protein as the static molecule with a mobile DNA (referred to as the mDNA decoy set). Molecules were initially docked using the default parameters (grid size 0.7 Å, rotation step of 12°). The first 10,000 poses, ranked according to surface complementarity (SCrank), were retained. 2.7 Scoring decoys to determine native poses. The native interface for each of the 76 native protein/dsDNA complexes is defined by the set of amino acids and nucleotides whose heavy atoms are within 5 Å of each other. We used two criteria to delineate near-native poses from the non-native ones based on similarity to or retained fraction of the native contact

10 Environment ACS Paragon Plus

Page 10 of 36

Page 11 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

interface: the root mean square deviation of interface atoms (I-RMSD) and the percentage of native contacts retained (%NC).11,12,14 The I-RMSD of the DNA base C2, ribose C4’, and P atoms and the protein Cα atoms in a given decoy from those in the native interface was computed (see Figures S2 and S3 for details). The native contact distance was defined as the distance between a Cα protein atom and the C1’ DNA ribose atom. If a contact in a decoy deviated from the contact distance observed in the crystal structure by ≤4 Å, it is considered as ‘native’. The fraction of such native contacts retained in a decoy is defined as %NC. Using the two measures, two sets of near-native complexes were identified. A decoy is considered as a IRMSD-hit if the I-RMSD is ≤ 4.3 Å or a %NC-hit if %NC is ≥ 65%.11 As explained in Figure S3, I-RMSD is more sensitive to the translation and rotation at the interface than %NC, which considers only the fraction of native contacts retained. Therefore, two decoys with a minor difference in the orientation from the native structure might differ significantly in I-RMSD values but less so in %NC. Ten out of 76 proteins undergo large conformational changes upon DNA binding (Cα rmsd > 4 Å), so rigid-body docking of these apoproteins returned no hits. Out of the remaining 66 cases (Cα rmsd < 4 Å), six and four in the ‘sDNA’ and mDNA decoy sets, respectively, required a docking protocol with a finer grid size of 0.5 Å and a smaller rotation step of 9° to obtain hits (see Table S3 for details on the PDB IDs). The 10,000 decoys with the highest surface complementarity scores were obtained and considered for further analyses. 2.8 Enrichment factor of hits, EFH. EFHD,SC is the enrichment of chances for finding hits in the decoy set when the decoys were filtered by (i) surface complementary scores and (ii) whether DNA fragments are dissected by D-plane.

fag(h)i,jk =

lmnop (jkqrs;tu)/llwxyz(jkqrs;tu) mnop /lwxyz

(9)

In Eq. 9, X ranges from 1,000 to 10,000 in increment of 1,000; NDDecoy(SCrank≤X) refers to the top X decoys based on surface complementary scores whose DNAs are traversed by D-

11 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

planes and NDhits(SCrank≤X) is the number of hits among NDDecoy. NDecoy is the total number of decoys and Nhits is the number of hits among all NDecoy decoys.

3

RESULTS

3.1

Slow GNM modes predict conformational change upon DNA binding. The

correlations (Ik) between mode k, one of the slowest 30 normal modes, and the experimentally determined conformational changes from free to DNA-bound structures were calculated using GNM (rc =7.5 Å) and ANM potentials (Eq. S10). Using GNM, the slowest modes consistently have the highest Ik with an average of 0.79±0.01 for all but one of the 104 structure pairs, whereas using ANM, one of the five slowest modes has the highest Ik in only 91 of the 104 cases with a lower average of 0.50±0.02.42 Thus, GNM was chosen for describing the dynamics of DNA-binding proteins (see Supporting Information). 3.2 Folding core or flexible residues in a protein are not co-localized in DNA-binding sites. Squared displacement profiles (∆ri2) (see Figure 2 for illustrative examples) and the distance fluctuations between residues i and j (10) were calculated from the fastest normal modes for the 22 apoproteins. In the fastest GNM mode, the peaks of the ∆ri2 profiles suggest folding cores.43 In the slowest modes (describing global or collective motions), the local maxima in the ∆ri2 profiles correspond to flexible residues, whereas the local minima form key mechanical sites, which have been found co-localized with enzyme active sites44 and metal-binding sites.45. From the %sensitivity data shown in Table 1, it is clear that about 75% to 80% of the DNA-binding residues are located neither at the peaks nor the troughs of squared displacements assembled by the slowest or the fastest few modes. Only a few folding core residues and flexible residues in a protein were found in DNA-binding sites. Thus, DNA-binding sites have indiscernible patterns in the slowest and fastest dynamics of proteins.

12 Environment ACS Paragon Plus

Page 12 of 36

Page 13 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 2. Fluctuation profiles of DNA-binding residues (open circles) in the slowest and fastest GNM modes. Fluctuations of DNA-free proteins 1enk (137aa), 1q0s (241aa) and 1owm (474aa) and 1xhx (571aa) in the slowest and fastest GNM modes along the protein sequence. The profiles have been normalized according to Eq. S11. Blue circles in slow mode profiles correspond to residues that are within 5% rank from the D-plane (see Section 3.4).

13 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 36

Table 1. % Sensitivity, specificity, precision, accuracy, and Mathew’s correlation coefficient (MCC) calculated as defined in Equation S14 for different sets of GNM modes and cut-off distance when DNA-binding residues in the peaks or troughs are considered as positives (P)

Mode

P

% sensitivity

% specificity

% precision

% accuracy

MCC

slowest

Peak Trough

8.7 11.6

91.3 90.5

12.3 14.5

81.2 80.9

0.00 0.02

1-3 Peak slowest Trough

7.7 11.8

90.4 89.5

10.0 13.6

80.3 80.0

–0.22 0.01

Peak Trough

8.1 10.9

90.5 88.3

10.5 11.5

80.4 78.8

0.00 –0.01

1-3 Peak slowest Trough

9.4 15.0

89.2 88.0

10.8 14.5

79.4 79.0

–0.02 0.03

GNM 6.5 Å

5 fastest

Peak

0.9

97.6

5.0

85.7

–0.03

DBPa

c



3.5

96.1

11.0

84.7

–0.01

DBP10b

c



11.1

87.6

16.1

76.6



Method

GNM 7.5 Å

GNM 10 Å

a

slowest

DBP denotes DNABINDPROT program employing GNM with a cutoff distance of 6.5 Å

on our dataset of 22 apoproteins. aDBP(10) denotes DNABINDPROT program and method reported in ref. 10. cThe fastest mode describing highest-frequency molecular motions is given by the eigenvector corresponding to the largest eigenvalue. DNABINDPROT results make use of squared displacement between nodes i and j (Equation S14). To find residues



〉 fluctuations in the fastest mode, 〈∆4 〉 was arranged in descending order with high 〈∆4

and the value C at which the profile plateaued was determined; residue pairs with values greater than 1.1% of C are considered as DNA-binding residues.

3.3 D-planes of the free proteins intersect DNA from DNA-bound proteins. Although intrinsic protein dynamics do not seem to dictate the DNA-binding site, could they play a role in determining the docking orientation of the DNA to a protein? To address this question, a ‘D-plane’ separating two groups of residues moving in opposite directions was found for each apoprotein. The proteins in the 104 DNA-bound and free structures were superimposed 14 Environment ACS Paragon Plus

Page 15 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

to yield the position of the DNA relative to DNA-free protein (see Figure 3, S4). In a striking 101 out of 104 (97%) cases, an average of 1.7% of DNA atoms was found within 0.3 Å of the protein’s D-plane, suggesting that the planes intersect the DNA (see Figure 3, S4).

Figure 3. D-planes intersect superimposed DNA fragments, as exemplified by four proteins. Res+/Res– domains of the protein are in red and blue, respectively, and dsDNA is in black. The PDB and chain IDs with and without parentheses correspond to the DNAbound and free proteins, respectively. The protein in the bound structure is superimposed onto that in the apoprotein structure. D-plane normal is in red and the two orthogonal vectors spanning the D-plane are in blue and green. Cα atoms shown as yellow spheres are the set union of all binding residues obtained from each of the corresponding DNA-bound structures in our dataset. For example, in the case of 1enk chain A, each of the 37 Cα atoms shown interacts with DNA in at least one of the two DNA-bound complexes (2fcc chains A and B, Table S1).

The three exceptions are meiosis specific transcription factor NDT80 (PDB ID:1m6u), HCV RNA helicase (8ohm) and origin binding domain of large T antigen (2itj), where the closest DNA atom is 8.7 Å (1mnn, Chain A), 3.3 Å (3kqn, Chain A) and 3.3 Å (2itl, Chain B) from the D-planes, respectively. Although the D-plane derived from HCV apoprotein (8ohm)

15 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

did not intersect the DNA in the 3kqn chain A structure, it intersected the DNAs in the other 12 DNA-bound HCV helicase structures (see PDB entry 8ohm in Table S1). Similarly, the DNAs derived from other crystal forms of origin binding domain of large T antigen (2itl chain A, 2nl8 chain A) are indeed intersected by D-planes. Since there is only one DNAbound structure for meiosis specific transcription factor NDT80 (PDB ID: 1mnn), we searched for homologous DNA-bound structures (with sequence identity ≥50% using pBLAST46) from other organisms that are not in our database. No structures of meiosis specific transcription factor homologs could be found in the PDB.47 To evaluate the statistical significance of the observed 97% chance for D-planes to cut DNAs, we computed the probability that a random plane going through the protein’s mass center would intersect the DNA. We placed a randomly oriented vector (for details on its generation, see Supporting Information) at an apoprotein’s mass center to define ‘random planes’ and checked if any atom of the superimposed DNA was within 0.3 Å from the random planes. The probability for such an event after 1,000 trials for each of the 104 free proteins is 69.5±1.5%. It drops further to 57.9±1.4% if the random vector was placed on a randomly chosen Cα atom of the free structure. The null hypothesis that the D-plane behaves as a random plane passing through a randomly chosen Cα was rejected with a p-value of 0.003. Thus our finding that the D-plane intersects the DNA is statistically significant. Since the D-plane nearly always intersects the DNA in protein-DNA complexes, does it mean that the protein long axis partially aligns with the DNA helical axis to maximize the number of possible protein-DNA contacts? To answer this question, we computed the acute angle θ between the first principal component (PC1) of the free protein Cα positions and the PC1 of the superimposed DNA heavy atom coordinates (see Supplementary Methods for determination of PC1). We found that the protein and its DNA partner are not aligned in parallel: the angle θ (57.8±2.1°) is close to the angle between two randomly chosen axes

16 Environment ACS Paragon Plus

Page 16 of 36

Page 17 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(57.3°). Thus, the superimposed DNA appears to be randomly oriented with respect to the apoprotein long axis. Since the apoprotein’s long axis is at an angle of 9.4±0.9° with respect to the D-plane normal in our dataset, we examined whether the DNA can be intersected by a plane perpendicular to the apoprotein’s long axis defined by PC1, which we term as the ‘Splittingplane’ or S-plane. When the S-plane was centered at a Cα atom closest to the apoprotein’s mass center, it intersected the superimposed DNA in 88/104 (84.6%) cases (Table S5). However, when it was placed at the same center used for the D-plane; viz., the mean of all the transition points defined in Figure 1e (see Methods), it intersected the superimposed DNA in 99/104 (95.2%) cases (Table S5). This underscores the importance of placing the plane at the center of ‘transition points’ defined by GNM’s slowest dynamics, which in our dataset, is on average 6.1±0.4 Å from the protein’s mass center. In contrast, a random plane placed at the center of the transition points in a protein intersected the bound DNA with a probability (68.4±2.1%) close to that obtained when it is placed at the apoprotein’s mass center (69.5%). The difference between D- and S-planes in intersecting DNA was even more pronounced for shorter DNA fragments. As shown in Table S5, for dsDNA < 12 base pairs, the S-planes intersected dsDNA in 60% of the cases, whereas the D-planes always intersected the dsDNA in accord with earlier findings that small ligands tend to stay at the interface of two groups of nodes undergoing anti-correlated motions, as entropic contributions dominate binding.17 Intuitively, the longer the DNA, the more likely it would be intersected by a plane placed anywhere in a protein. Indeed, with increasing DNA length, a random plane centered on a randomly chosen amino acid intersected DNA with monotonically increasing probability, whereas a S-plane placed at the apoprotein’s mass center intersected DNA with increasing then decreasing probability (Table S5). As the DNA becomes longer, enthalpic effects start to dictate DNA-protein interactions due to enhanced electrostatic interactions.

17 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

3.4 Are DNA-binding residues close to D-planes? Although nearly all the DNAs in our dataset are traversed by D-planes, the DNA-binding residues are distributed along the Dplane normal. Thus, we measured how far a DNA-binding residue was from the D-plane by the distance between its Cα atom and the D-plane normal: As the proteins have different numbers of residues, the residues in each protein were ranked such that a residue on the Dplane was ranked 0%, while that farthest from the D-plane was ranked 100%. For the 22 proteins, ~58 out of 769 DNA-binding residues (7.5±1.7%) were located within 5% rank from D-plane (Figure 4, left panel). Although ~71% of the DNA-binding sites exhibit 75% rank from the D-plane.

Figure 4. % DNA-binding residues, % cumulative binding residues and overall enrichment ratio variations based on the % rank. % rank of a residue depends on its distance from D-plane. The profiles are aggregated from 22 apoproteins, whose DNAbinding sites comprise an average of 35 residues. In the case of 1vsr (very short patch repair endonuclease) with 134 residues and 26 DNA-binding residues, 7 residues are located within 5% distance from D-plane, out of which two bind DNA. Thus 2/26 = 7.7% DNA-binding residues are located within the first 5% rank. This data from 22 proteins are collected and represented by the first bar in the left panel (7.5±1.7% between 0 and 5% rank). The enrichment ratio for 1vsr for 5% rank is given by (2/26)/(7/134) ~ 1.5.

18 Environment ACS Paragon Plus

Page 18 of 36

Page 19 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

3.5 IDD-based restraints enrich near-native poses. Although searching for DNA-binding residues along the D-plane normal is only moderately effective, can the D-plane be used to enrich hits obtained from FTDOCK, a popular tool to dock DNA and protein together? Using FTDOCK, we performed rigid-body docking of DNAs to 76 apoproteins to obtain decoys and ‘hits’ (docking poses with contact interfaces similar to those observed experimentally). Two criteria, I-RMSD and %NC (see definitions in section 2.7), were used to define whether a decoy is a hit. For each apoprotein, 10,000 decoys were obtained, totaling 7.6×105 decoys for the test set. Among them, 746 IRMSD (1,506 %NC) hits were obtained for the mDNA decoy set (derived from mobile DNAs translating/rotating on the surface of static proteins), whereas 868 IRMSD (1,882 %NC) hits were obtained for the sDNA decoy set (proteins translating/rotating on the surface of static DNAs). Compared with I-RMSD, a higher number of hits was obtained using %NC, as %NC is less sensitive to the different interface arrangements than I-RMSD. Allowing proteins to move freely to dock onto static DNAs resulted in more hits, suggesting that protein mobility may be more important than DNA mobility in searching for binding sites on relatively rigid and large dsDNA molecules. To see whether the above hits would be enriched if the docking orientation favored by the protein’s intrinsic dynamics was used, both decoy sets were filtered based on whether DNA atoms were found within 0.3 Å from the protein D-planes. Decoys were discarded if the closest DNA atom was >0.3 Å from the corresponding D-plane. Out of a total of 7.6×105 decoys, 46.8±0.9% in sDNA (46.9±0.8% in mDNA) were filtered out. The resulting sDNA (mDNA) decoy set retained 84.5 (85.5) %NC-hits and 82.7% (78.3%) IRMSD-hits. Enrichment ratio calculations suggest a notable enrichment in hits compared with the default result from FTDOCK (Figure 5 and Tables 1, S4): The default filtration in FTDOCK is to filter out decoys with surface complementarity scores lower than a certain SCrank cutoff. Thus, we computed EFHs from filtering decoys at three commonly used SCrank

19 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 36

cutoff values of 1,00014, 4,00011, 10,00012 yielding EFH1,000, EFH4,000, EFH10,000, respectively. A EFH10,000 enrichment ratio of 1 signifies that the odds of finding a near-native pose in the filtered decoy set is as good as a random search without filtration (Table 2). EFH1000 and EFH4000 are > 1, indicating that filtering decoys with surface complementarity ranking ≤ 4000 enriched the hits in decoys. Next, we filtered decoys based on whether DNA atoms were within 0.3 Å from the D-plane of the free protein in addition to the SCrank filter. Enrichment due to this coupled filtration is benchmarked against background chances of finding hits in the entire decoy set (EFHD,SC, Figure 5). When the coupled filter is used, EFH improved considerably (Table 2); e.g., the %NC/IRMSD-hits in the sDNA decoy set using a SCrank ≤ 1000 increased significantly from a EFH1000 of 1.6 to 2.5 using the coupled filter (the p value using a two sample t-test is < 0.05). This trend was independent of the choice of static molecule (mDNA vs sDNA decoy sets) or criteria to define hits (IRMSD vs %NC).

Table 2. EFHs obtained when decoys were filtered based on SCrank and closeness of DNA fragments to D-plane

Enrichment for hits

a

factor

mDNA decoy set %NC-hits

IRMSD-hits

sDNA decoy set %NC-hits

IRMSD-hits

EFH1,000a

1.35

1.50

1.57

1.62

EFH4,000a

1.37

1.13

1.19

1.20

EFH10,000a

1.0

1.0

1.0

1.0

EFHD,SC≤1,000b

2.12

2.27

2.46

2.47

EFHD,SC≤4,000b

1.80

1.78

1.91

1.84

EFHD,SC≤10,000b

1.59

1.56

1.61

1.47

Both %NC-hits and IRMSD-hits were filtered based on SCrank (1000, 4000, 10000) and the

resulting EFHs are denoted as EFH1000, EFH4000, and EFH10000, respectively. bThese are compared with the EFH when both SCrank and D-plane are used as filters (EFHD, SC ≤1000, EFHD,SC≤4000, EFHD,SC≤10000). Using D-plane as a filter means that only decoys whose 20 Environment ACS Paragon Plus

Page 21 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

DNAs were intersected by D-planes were retained. See Table S4 for EFHs of proteins that undergo small conformational changes (< 4Å) upon binding DNA.

Figure 5. Plot of EFHD, SC with increasing SCrank cutoff. Definition of each EFH is given in the Methods section. The enrichments for %NC-hits and IRMSD-hits in sDNA and mDNA decoy sets are computed.

4

DISCUSSION 4.1 Enthalpic considerations disfavor folding core residues from binding DNA.

Experimentally found folding cores have been found co-localized with residues having pronounced fluctuations in the fastest few modes43. In a previous study,10 a few residues with high 〈Δ 〉 in the fast modes were deemed to be DNA-binding residues or their neighbors. According to our dataset, the chance to locate a DNA-binding residue is 0.9% for the top five highest GNM modes and 3.5% for the fastest mode from DNABINDPROT using distance fluctuations (see Table 2) – this probability is even slimmer than a random chance to find these binding residues (769/6282×100% = 12.2%). This is likely because residues exhibiting the fastest dynamics (shown by the peaks of the fastest modes) have the highest number of contact neighbors; hence, they will be forced to break intramolecular contacts before binding to DNA, thus incurring a higher enthalpic cost of binding.

21 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

4.2 DNA stays close to protein regions with minimal angular momentum and low vibrational entropy. Binding entropy includes changes in translational, rotational, and (sidechain) conformational entropy.48–50 When DNA binding near the D-planes is compared with that occurring at the “tips” of protein, far away from the D-plane, the translational component of the binding entropy ∆Stran can be similar. This assumes the binding is similarly tight (hence similar Stran of protein being in the binding site) and the DNA is much bigger than the protein so that the Stran of the complex is dominated by the DNA.51 Taking any vector on the D-plane as the rotation axis, we had previously shown that the protein could rotate with maximal moment of inertia (I) about this axis, as compared to its rigid rotation about a randomly chosen axis.37 Since equipartition theorem suggests that

rotation over a particular degree of freedom has a kinetic energy of AB C⁄2 such that € =

AB C⁄2 ( is angular velocity), the protein rotates the slowest when the rotation axis lies on the D-plane because I would be among the largest, as compared to other choices of a random rotation axis;37 i.e., residues closest to the D-plane rotate the slowest and have the lowest perresidue angular momentum (€  = ‚  , where i denotes the ith residue). Thus, binding near the D-plane could yield minimal entropy loss since the rovibrational entropy near Dplane is already low,52 in contrast to binding occurring at the ‘tip’ of a protein. A residue/domain near the protein’s center of mass is already low in rotational and vibrational entropy,52 and will therefore have minor loss in side-chain/domain conformational entropy53 upon DNA binding. In contrast, a residue/domain far from the mass center would require freezing significant side chain entropy/domain vibrational entropy to bind DNA. As D-plane is closer to protein center of mass (6.1 Å) than a residue at the ‘tip’, 37 the vibrational entropy loss in the protein would be smaller if the DNA binds to residues near the D-plane. Indeed, more binding residues can be found near the D-plane (where entropy is already low before DNA binding) than far away from it (see Figure 4).

22 Environment ACS Paragon Plus

Page 22 of 36

Page 23 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

4.3

A plausible role for dynamics in determining binding orientations. Because

DNA-binding sites do not lie exactly at the entropically favored D-plane, a possible role for residues on the D-plane could be to engage in non-specific, electrostatically-driven interactions with the DNA54–58 in searching for specific DNA-binding sites. Such search process heavily involves sliding, whereby proteins diffuse on DNA along its major/minor grooves59,60. Surface residues near the D-plane are likely involved in sliding and facing the DNA grooves61. Highly collective protein motions including vibrational ‘clamping’ and ‘rotation’ about an axis on the D-plane can facilitate simultaneous anchoring of several residues at the correct sequence and being constantly rejected by the ‘wrong’ sites during the sliding with a minimal expense of energy. When only rigid body rotation effects are considered, a plane perpendicular to the protein’s long axis passing through the Cα atom closest to the apoprotein’s mass center intersects the DNA in ~85% of the DNA-bound proteins examined. However, when ro-vibrational mechanisms are considered, the D-planes intersect DNA in ~97% cases. This could reflect the ‘final’ snapshots of DNA sliding and anchoring processes captured by X-ray crystallography.

5

CONCLUSION

This study carefully analyzes monomeric DNA-binding proteins to obtain valuable insights on protein-DNA binding using simple physical models. Conformational changes predicted by GNM agree closely with experimentally observed conformational changes upon DNAbinding, validating the use of GNM in predicting protein conformational changes upon complexation with large ligands such as DNA. However, protein dynamics-based attributes could not be found for DNA-binding residues. Interestingly, DNA orientation can be identified by an ‘interface of dynamics domains’ or the ‘D-plane’ of apoproteins, which was found to intersect DNA in 97% of the protein-DNA complexes. Residues near the D-plane

23 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 36

exhibit minimal dynamics in the protein’s vibrational and rotational degrees of freedom and engage in non-specific DNA binding. Subsequent collective dynamics of the protein facilitate sliding

search

on

~30

base

pairs

for

a

specific

DNA-binding

site

before

dissociation/association to another site.62 Using ‘dynamically correct’ orientations of docking decoys as a filter yielded a 2.5- and 1.6-fold enrichment in near-native hits, compared to a random guess and standard FTDOCK, respectively. The D-plane filter introduced herein can be readily combined with other important DNA binding-site attributes such as sequence conservation and electrostatics63–66 to further improve predictions for DNA-binding sites and binding orientations. Supporting Information 4 Tables, 3 Figures, Supplemental Methods, Results for both PPD and enzyme catalytic residues studies are provided. This material is available free of charge via the Internet at http://pubs.acs.org. Author Information Corresponding author Lee-Wei Yang, Associate Professor, Institute of Bioinformatics and Structural Biology, National Tsing Hua University, 101, Sec 2, Kuang-Fu Rd., Hsinchu, Taiwan, R.O.C. Voice: +886-3-574-2467 ;

Fax: +886-3-571-5934.

Carmay

+886-2-2652-3031;

Lim

Tel:

Email: Fax:

[email protected]

+886-2-2788-7641;

[email protected] Notes The authors declare no competing financial interest. Author Contributions

24 Environment ACS Paragon Plus

Email:

Page 25 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

LWY and CL proposed the study. AC and LWY prepared theoretical tools for the study. AC carried out the study and JC prepared distributed software codes. The manuscript was jointly written by AC, CL and LWY. All authors have given approval to the final version of the manuscript. Funding Sources This work was supported by research grants from the Ministry of Science and Technology (grant number 102-2627-M-007-003- and 103-2627-M-007-001-), Taiwan. Acknowledgements We thank the National Center for High-performance Computing in Taiwan for computer time and facilities. We also thank Dr. Yao-Chi Chen for kindly providing the DNA-binding protein set and Mr. Yuan-Yu Chang for programming assistance. We also thank Dr. Pemra Ozbek for the valuable discussions on threshold definitions used in DNABINDPROT. Abbreviations DNA, Deoxyribonucleic acid; IDD, intrinsic dynamics domains; PDB, Protein Data Bank; GNM, Gaussian Network Model; ANM, Anisotropic Network Model; MD, molecular dynamics; D-plane, domain plane, EFH, enrichment factor for hits; ER, enrichment ratio; RMSD, root mean square deviation; I-RMSD, interface root mean square deviation; %NC, % native contacts.

REFERENCES (1)

Sekinger, E. A.; Moqtaderi, Z.; Struhl, K. Mol. Cell 2005, 18, 735–748.

(2)

Deplancke, B.; Mukhopadhyay, A.; Ao, W.; Elewa, A. M.; Grove, C. A.; Martinez, N.

25 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

J.; Sequerra, R.; Doucette-Stamm, L.; Reece-Hoyes, J. S.; Hope, I. A.; Tissenbaum, H. A.; Mango, S. E.; Walhout, A. J. M. Cell. 2006, 125, 1193–1205. (3)

Mott, M. L.; Berger, J. M. Nat. Rev. Microbiol. 2007, 5, 343–354.

(4)

Chen, Y. C.; Lim, C. Nucleic Acids Res. 2008, 36, 7078–7087.

(5)

Morozov, A. V; Havranek, J. J.; Baker, D.; Siggia, E. D. Nucleic Acids Res. 2005, 33, 5781–5798.

(6)

Marcovitz, A.; Levy, Y. Biophys. J. 2013, 104, 2042–2050.

(7)

Privalov, P. L.; Dragan, A. I.; Crane-Robinson, C.; Breslauer, K. J.; Remeta, D. P.; Minetti, C. a S. a. J. Mol. Biol. 2007, 365, 1–9.

(8)

Wang, L.; Brown, S. J. Nucleic Acids Res. 2006, 34, W243–W248.

(9)

Chen, Y. C.; Wright, J. D.; Lim, C. Nucleic Acids Res. 2012, 40, W249–W256.

(10)

Ozbek, P.; Soner, S.; Erman, B.; Haliloglu, T. Nucleic Acids Res. 2010, 38, W417– W423.

(11)

Aloy, P.; Moont, G.; Gabb, H. a; Querol, E.; Aviles, F. X.; Sternberg, M. J. Proteins 1998, 33, 535–549.

(12)

Gao, M.; Skolnick, J. PLoS Comput. Biol. 2009, 5, e1000341.

(13)

Banitt, I.; Wolfson, H. J. Nucleic Acids Res. 2011, 39, e135.

(14)

Parisien, M.; Freed, K. F.; Sosnick, T. R. PLoS One 2012, 7, e32647.

(15)

van Dijk, M.; van Dijk, A. D. J.; Hsu, V.; Boelens, R.; Bonvin, A. M. J. J. Nucleic Acids Res. 2006, 34, 3317–3325.

26 Environment ACS Paragon Plus

Page 26 of 36

Page 27 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(16)

Zacharias, M.; Sklenar, H. J. Comput. Chem. 1998, 20, 287–300.

(17)

Tobi, D.; Bahar, I. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 18908–18913.

(18)

Yang, L.-W.; Bahar, I. Structure 2005, 13, 893–904.

(19)

Wang, Y.; Rader, A. J.; Bahar, I.; Jernigan, R. L. J. Struct. Biol. 2004, 147, 302–314.

(20)

Li, H.; Chang, Y.-Y.; Yang, L.-W.; Bahar, I. Nucleic Acids Res. 2015, 44, 415–422.

(21)

Zimmermann, M. T.; Jernigan, R. L. RNA 2014, 792–804.

(22)

Chennubhotla, C.; Rader, A. J.; Yang, L.-W.; Bahar, I. Phys. Biol. 2005, 2, S173– S180.

(23)

Rader, A. J.; Vlad, D. H.; Bahar, I. Structure 2005, 13, 413–421.

(24)

Delarue, M.; Sanejouand, Y.-H. J. Mol. Biol. 2002, 320, 1011–1024.

(25)

Bahar, I.; Lezon, T. R.; Yang, L.-W.; Eyal, E. Annu. Rev. Biophys. 2010, 39, 23–42.

(26)

Bahar, I.; Atilgan, A. R.; Erman, B. Fold. Des. 1997, 2, 173–181.

(27)

Yang, L.-W. Biophys. J. 2011, 100, 1784–1793.

(28)

Rose, P. W.; Beran, B.; Bi, C.; Bluhm, W. F.; Dimitropoulos, D.; Goodsell, D. S.; Prlić, A.; Quesada, M.; Quinn, G. B.; Westbrook, J. D.; Young, J.; Yukich, B.; Zardecki, C.; Berman, H. M.; Bourne, P. E. Nucleic Acids Res. 2011, 39, 392–401.

(29)

Berman, H. M.; Battistuz, T.; Bhat, T. N.; Bluhm, W. F.; Philip, E.; Burkhardt, K.; Feng, Z.; Gilliland, G. L.; Iype, L.; Jain, S.; Fagan, P.; Marvin, J.; Padilla, D.; Ravichandran, V.; Thanki, N.; Weissig, H.; Westbrook, J. D. 2002, 899–907.

(30)

Krissinel, E.; Henrick, K. J. Mol. Biol. 2007, 372, 774–797.

27 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(31)

Henrick, K.; Thornton, J. M. Trends Biochem. Sci. 1998, 23, 358–361.

(32)

Badis, G.; Berger, M. F.; Philippakis, A. A.; Talukder, S.; Gehrke, A. R.; Jaeger, S. A.; Chan, E. T.; Metzler, G.; Vedenko, A.; Chen, X.; Kuznetsov, H.; Wang, C.-F.; Coburn, D.; Newburger, D. E.; Morris, Q.; Hughes, T. R.; Bulyk, M. L. Science 2009, 324, 1720–1723.

(33)

Slattery, M.; Riley, T.; Liu, P.; Abe, N.; Gomez-Alcala, P.; Dror, I.; Zhou, T.; Rohs, R.; Honig, B.; Bussemaker, H. J.; Mann, R. S. Cell 2011, 147, 1270–1282.

(34)

Bahar, I.; Atilgan, A. R.; Erman, B. Fold. Des. 1997, 2, 173–181.

(35)

Haliloglu, T.; Bahar, I.; Erman, B. Phys. Rev. Lett. 1997, 79, 3090–3093.

(36)

Atilgan, A. R.; Durell, S. R.; Jernigan, R. L.; Demirel, M. C.; Keskin, O.; Bahar, I. Biophys. J. 2001, 80, 505–515.

(37)

Li, H.; Sakuraba, S.; Chandrasekaran, A.; Yang, L.-W. J. Chem. Inf. Model. 2014, 54, 2275–2285.

(38)

Potestio, R.; Pontiggia, F.; Micheletti, C. Biophys. J. 2009, 96, 4993–5002.

(39)

Poornam, G. P.; Matsumoto, A.; Ishida, H.; Hayward, S. Proteins Struct. Funct. Bioinforma. 2009, 76, 201–212.

(40)

Fisher, R. Ann. Eugen. 1936, 7, 179–188.

(41)

Gabb, H. A.; Jackson, R. M.; Sternberg, M. J. J. Mol. Biol. 1997, 272, 106–120.

(42)

Atilgan, A. R.; Durell, S. R.; Jernigan, R. L.; Demirel, M. C.; Keskin, O.; Bahar, I. Biophys. J. 2001, 80, 505–515.

(43)

Demirel, M.; Atilgan, A.; Bahar, I. Protein Sci. 1998, 7, 2522–2532.

28 Environment ACS Paragon Plus

Page 28 of 36

Page 29 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(44)

Yang, L.-W.; Bahar, I. Structure 2005, 13, 893–904.

(45)

Dutta, A.; Bahar, I. Structure 2010, 18, 1140–1148.

(46)

Johnson, M.; Zaretskaya, I.; Raytselis, Y.; Merezhuk, Y.; McGinnis, S.; Madden, T. L. Nucleic Acids Res. 2008, 36, 5–9.

(47)

Montano, S. P.; Coté, M. L.; Fingerman, I.; Pierce, M.; Vershon, A. K.; Georgiadis, M. M. Proc. Natl. Acad. Sci. U. S. A. 2002, 99, 14041–14046.

(48)

Bahar, I.; Wallqvist, A.; Covell, D. G.; Jernigan, R. L. Biochemistry 1998, 37, 1067– 1075.

(49)

McQuarrie, D. A. Statistical Mechanics; University Science Books, 2000.

(50)

Schlitter, J. Chem. Phys. Lett. 1993, 2, 617–621.

(51)

Amzel, L. M. Proteins Struct. Funct. Bioinforma. 1997, 28, 144–149.

(52)

Chang, C. A.; Chen, W.; Gilson, M. K. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 1534–1539.

(53)

Bromberg, S.; Dill, K. A. Protein Sci. 1994, 3, 997–1009.

(54)

Jen-jacobson, L.; Engler, L. E.; Ames, J. T.; Kurpiewski, M. R.; Grigorescu, A. Supramol. Chem. 2000, 12, 143–160.

(55)

Ferreiro, D. U.; de Prat-Gay, G. J. Mol. Biol. 2003, 331, 89–99.

(56)

Winkler, F.; Banner, D.; Oefner, C. EMBO J. 1993, 12, 1781–1795.

(57)

Tzeng, S.-R.; Kalodimos, C. G. Nature 2012, 488, 236–240.

(58)

Kalodimos, C. G.; Biris, N.; Bonvin, A. M. J. J.; Levandoski, M. M.; Guennuegues,

29 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

M.; Boelens, R.; Kaptein, R. Science 2004, 305, 386–389. (59)

Gorman, J.; Greene, E. C. Nat. Struct. Mol. Biol. 2008, 15, 768–774.

(60)

Sun, J.; Viadiu, H.; Aggarwal, A. K.; Weinstein, H. Biophys. J. 2003, 84, 3317–3325.

(61)

Sakata-sogawa, K.; Shimamoto, N. Proc. Natl. Acad. Sci. 2004, 101, 14731–14735.

(62)

Gowers, D. Proc. Natl. Acad. Sci. 2005, 2005, 15883–15888.

(63)

Chen, Y. Y. C.; Wu, C. Y. C.; Lim, C. Proteins Struct. Funct. … 2007, 680, 671–680.

(64)

Ahmad, S.; Gromiha, M. M.; Sarai, A. Bioinformatics 2004, 20, 477–486.

(65)

Wang, L.; Brown, S. J. Nucleic Acids Res. 2006, 34, W243–W248.

(66)

Stormo, G. D. Bioinformatics 2000, 16, 16–23.

30 Environment ACS Paragon Plus

Page 30 of 36

Page 31 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Table of contents graphics.

Apo form structure of site specific DNA methyltransferase (PDB ID -1q0s) is shown in cartoon with dynamics domains (+ and -) colored in blue and red respectively. The D-plane separating the two is shown in orange plane and is cut through by DNA superimposed from bound form structure 1yfl.

31 Environment ACS Paragon Plus

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Determining the D-plane of a free protein and corresponding DNA-bound protein. (a) DNA-free structure represented as a cartoon. The Cα network of the protein and the connectivity matrix are obtained (see Materials and Methods). The slowest GNM mode V1 is derived from diagonalization of the connectivity matrix. (b) Based on the sign of the coarse-grained node displacement (+/–), dynamics properties are assigned to protein residues, which belong to either Res+ (red) or Res– (blue). Only the largest two clusters with opposite signs were considered for analyses. (c) To define a plane that best separates the two clusters, linear discriminant analysis (see Supporting Information) was used to find an axis (green arrowhead) onto which the projections of Res+ (red) and Res– (blue) can be best separated, as compared to other possible axes; e.g., projections onto the axis with red arrowhead overlap. (d) The transition points (yellow stars) are the centers of any two consecutive residues with different dynamics properties. (e) The axis found in (d) called the ‘domain (D)-plane normal and the D-plane (light green) are required to go through the geometric center of the transition points (black sphere). (f) The corresponding DNA-bound protein is then superimposed onto the free protein to yield the position of the DNA relative to the D-plane.

ACS Paragon Plus Environment

Page 32 of 36

Page 33 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Figure 1 84x119mm (300 x 300 DPI)

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Fluctuation profiles of DNA-binding residues (open circles) in the slowest and fastest GNM modes. Fluctuations of DNA-free proteins 1enk (137aa), 1q0s (241aa) and 1owm (474aa) and 1xhx (571aa) in the slowest and fastest GNM modes along the protein sequence. The profiles have been normalized according to Eq. S11. Blue circles in slow mode profiles correspond to residues that are within 5% rank from the D-plane (see Section 3.4). Figure 2 127x190mm (600 x 600 DPI)

ACS Paragon Plus Environment

Page 34 of 36

Page 35 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

D-planes intersect superimposed DNA fragments, as exemplified by four proteins. Res+/Res– domains of the protein are in red and blue, respectively, and dsDNA is in black. The PDB and chain IDs with and without parentheses correspond to the DNA-bound and free proteins, respectively. The protein in the bound structure is superimposed onto that in the apoprotein structure. D-plane normal is in red and the two orthogonal vectors spanning the D-plane are in blue and green. Cα atoms shown as yellow spheres are the set union of all binding residues obtained from each of the corresponding DNA-bound structures in our dataset. For example, in the case of 1enk chain A, each of the 37 Cα atoms shown interacts with DNA in at least one of the two DNA-bound complexes (2fcc chains A and B, Table S1). Figure 3 84x90mm (300 x 300 DPI)

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

% DNA-binding residues, % cumulative binding residues and overall enrichment ratio variations based on the % rank. % rank of a residue depends on its distance from D-plane. The profiles are aggregated from 22 apoproteins, whose DNA-binding sites comprise an average of 35 residues. In the case of 1vsr (very short patch repair endonuclease) with 134 residues and 26 DNA-binding residues, 7 residues are located within 5% distance from D-plane, out of which two bind DNA. Thus 2/26 = 7.7% DNA-binding residues are located within the first 5% rank. This data from 22 proteins are collected and represented by the first bar in the left panel (7.5±1.7% between 0 and 5% rank). The enrichment ratio for 1vsr for 5% rank is given by (2/26)/(7/134) ~ 1.5. Figure 4 28x9mm (600 x 600 DPI)

ACS Paragon Plus Environment

Page 36 of 36

Page 37 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Plot of EFHD, SC with increasing SCrank cutoff. Definition of each EFH is given in the Methods section. The enrichments for %NC-hits and IRMSD-hits in sDNA and mDNA decoy sets are computed. Figure 5 61x44mm (600 x 600 DPI)

ACS Paragon Plus Environment