The Renormalization Group and Its Applications to Generating Coarse

Feb 7, 2017 - Department of Computer Sciences and Genome Center, University of California, Davis, Davis, California 95616, United States. ‡ Departme...
0 downloads 9 Views 3MB Size
Subscriber access provided by University of Colorado Boulder

Article

The renormalization group and its applications to generating coarse-grained models of large biological molecular systems Patrice Koehl, Frédéric Poitevin, Rafael Navaza, and Marc H. Delarue J. Chem. Theory Comput., Just Accepted Manuscript • DOI: 10.1021/acs.jctc.6b01136 • Publication Date (Web): 07 Feb 2017 Downloaded from http://pubs.acs.org on February 11, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Theory and Computation is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

The renormalization group and its applications to generating coarse-grained models of large biological molecular systems Patrice Koehl,∗,† Fr´ederic Poitevin,‡ Rafael Navaza,¶ and Marc Delarue§ †Department of Computer Sciences and Genome Center, University of California, Davis, CA 95616, USA ‡Department of Structural Biology, Stanford University, CA, USA and Stanford PULSE Institute, SLAC National Accelerator Laboratory, Menlo Park, CA, USA ¶Platform of Crystallogenesis and Crystallography, CiTech, Institut Pasteur, 75015 Paris, France §Unit´e de Dynamique Structurale des Macromol´ecules, UMR 3528 du CNRS, Institut Pasteur, 75015 Paris, France E-mail: [email protected]

Abstract Understanding the dynamics of biomolecules is the key to understanding their biological activities. Computational methods ranging from all-atom molecular dynamics simulations to coarse-grained normal mode analyses based on simplified elastic networks provide a general framework to studying these dynamics. Despite recent successes in studying very large systems with up to a hundred million atoms, those methods are currently limited to studying small to medium size molecular systems due to computational limitations. One solution to circumvent these limitations is to reduce the

1

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

size of the system under study. In this paper, we argue that coarse-graining, the standard approach to such size reduction, must define a hierarchy of models of decreasing sizes that are consistent with each other, i.e. that each model contains the information of the dynamics of its predecessor. We propose a new method, Decimate, for generating such a hierarchy within the context of elastic networks for normal mode analysis. This method is based on the concept of the renormalization group developed in statistical physics. We highlight the details of its implementation, with a special focus on its scalability to large systems of up to millions of atoms. We illustrate its application on two large systems, the capsid of a virus and the ribosome translation complex. We show that highly decimated representations of those systems, containing down to 1% of their original number of atoms, still capture qualitatively and quantitatively their dynamics. Decimate is available as an OpenSource resource.

1

Introduction

A major goal of molecular biology is to understand at the atomic level the function of macromolecules and/or biological nano-machines, which is believed to be intimately related to the dynamics of their structures. Our current understanding of the dynamics of macromolecules is, however, largely incomplete. This arises as only a few experimental techniques are capable of collecting time-resolved structural data, and those that can are usually limited to a narrow time window. In parallel, state-of-art computational methods are limited in scope, both for time-scale (usually micro-seconds) and length-scale (with systems of up to hundred thousand atoms), because of limitations in computing power. The limitations in computing power are directly related to the size of the molecular system considered. Any simulations of molecular dynamics require numerous calculations of the energy of the system under study. Any such calculation is at least of order O(N log(N )), where N is the number of atoms included in the system (usually the sum of the numbers of atoms from the molecule and from the solvent). These calculations need to be repeated many

2

ACS Paragon Plus Environment

Page 2 of 49

Page 3 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

times: for example, 1012 times for a 1 ms molecular dynamics simulation with a time step ∆t that is equal to one to a few femtoseconds. For large N , in the hundreds of thousands, such calculations become prohibitive. There are two main options to alleviate those problems. One option is to work on the implementation of the software designed to perform these calculations. This includes the design of specific hardware tailored to the specific functions called by the software and the design of new algorithms that can harvest the power of this new hardware as well as the new computing infrastructures currently available. 1–7 Such approaches have enabled simulations of systems of up to a hundred million atoms. 8,9 The second option is to work on the representation of the system itself and develop models with reduced numbers of particles N to make the calculation more tractable. This is usually identified as coarse-graining and has attracted a lot of attention in the recent literature. It is the topic of this paper. Coarse-grained models have long been used for studying protein folding and aggregation. They enable the exploration of large length scales and time scales that are usually inaccessible for all-atom models in explicit solvent. 10,11 Combined with enhanced configuration search methods, these simplified models with various levels of granularity offer the possibility to determine equilibrium structures and to compare folding kinetics and thermodynamics quantities with the corresponding values obtained by experimental techniques. In their pioneering work from 1976, Levitt and Warshel 12 developed the foundation of coarse-graining for protein folding. Using a two-bead representation for each amino acid, namely the Cα and the centroid of the side chain, as well as an effective implicit solvent force field - such that the atoms of the solvent need not be considered explicitly - and successive minimizations and normal mode thermalization, they were able fold the 58-residue BPTI protein within 6.5 ˚ A from its experimental structure. 12 Since then, various levels of granularity have been developed, from lattice to multi-bead representations and from single atom to multiple-atom residue-level representations. 11 The positions of those beads are either defined by known atoms (usually the Cα), or by fitting their positions to capture the dynamics of the full

3

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

molecular systems. 13–15 In this paper we are mainly interested in methods that implement coarse-graining beyond the Cα-only model. The main difficulty in coarse-graining is to design potential energy functions or force fields that retain the physics of the all-atom explicit solvent system in terms of structure, thermodynamics, and dynamics. 16 Two very popular models for representing the energy of a coarse-grained system are the G¯o-like models 17,18 and the Elastic Network Model (ENM) 19 and its variants. 20 The G¯o potential mimics the semi-empirical potential used in traditional molecular mechanics (MM), with bonded and non-bonded energy terms, and a simplified representation of the molecule with either one or two atoms per residue. The significant difference between the G¯o potential and MM potential is that the native structure of the protein, as usually determined by X-ray crystallography, defines the ground state of the former, while standard stereochemistry is imposed by the latter. G¯o-like models have been used extensively for studying protein folding(for review, see for example Dill et al. 21 ). In the ENM, a residue in a protein is usually represented by a bead located at the position of its Cα atom. Each pair of Cα atoms within a given cutoff distance RC is connected by a harmonic spring with a uniform force constant, whose equilibrium conformation is set to the conformation found in the X-ray structure of the protein. The total potential energy of the Elastic Network is then expressed as the sum of the simple harmonic potentials of these springs. This simple quadratic form allows for the decomposition of the motions of the protein of interest into vibrational modes with different frequencies, known as normal modes. This approach has been widely used in computational studies of macromolecules since its introduction nearly two decades ago (for recent reviews, see Sanejouand, 22 Sinitskiy and Voth 23 and Lopez-Blanco and Charcon 20 ). By definition, the G¯o and EN potentials of a protein have one minimum, usually set to be the X-ray structure. As such, the are not directly applicable to study the transition between two states of a protein. Several groups, however, have expanded either the G¯o model or the ENM to describe a double-well energy surface. In those models, each well is characterized by either the G¯o or EN model of the

4

ACS Paragon Plus Environment

Page 4 of 49

Page 5 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

conformational state it represents, and the full energy surface is obtained by combining those two wells using simple mixing models, 24–27 or simply by considering the lower envelope of the energy functions representing the two wells. 28,29 While significant efforts have been made to guarantee that the coarse-grained model and its potential capture the complexity of the all-atom molecular system, 10,16,30,31 we note that much less has been done to generate a true multi-scale representation of this system, i.e. to define a hierarchy of coarse-grained models with an actual coupling between those models. 10 Our intention is to generate such a hierarchy; for this purpose, we will rely on the concept of renormalization group (RG) that is widespread in physics. 32 RG is an iterative coarse-graining procedure that is designed to tackle difficult problems in physics that involve multiple length scales. The central goal of RG is to extract relevant features of a physical system for describing phenomena at large length scales by integrating out short distance degrees of freedom. In practice, fluctuations are integrated out starting at the microscopic scale and then moving iteratively on to fluctuations at larger scales. Thus defined, RG is an ideal framework for hierarchically coarse-graining a supra-molecular system. We note that in general, it is impossible to carry out the renormalization procedure exactly. Therefore, a number of approximate RG procedures have been developed in the theoretical physics community. Of particular interest to this proposal are the decimation method of electron system proposed by Aoki, 33 the spectral method used by Amir et al. to study random matrices, 34 and the decimation method introduced by Garel and Monthus to study random elastic networks. 35 The approach we propose, named Decimate, is derived from these methods. We start by applying the method to myoglobin, a very simple protein, to establish the validity of the model and then switch to much larger biological systems. We use viral capsids as a first very large system to illustrate the performance of Decimate. Viral capsids are multiprotein assemblies that serve as containers for the genetic material of viruses. 36 One of the most important roles of a capsid is to protect its contents. The capsid also needs to change

5

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

conformation to release its content into a host cell during infection. Thus, the stability of the capsid and its dynamics are key issues in the life cycle of a virus. 36 Many viral capsids are known structurally. However, much less is known about their dynamics. The spatial and temporal resolution of experimental techniques are too low to capture the dynamics of individual capsid proteins. In addition, those capsids are usually too big for comprehensive analyses using molecular dynamics simulations. Advances related to the use of specialized hardware and better software for molecular dynamics simulations give hope that those large systems can be studied on a computer. 37–39 Those simulations rely on coarse-graining of the capsid. Methods have been proposed to generate the corresponding coarse-grained models that rely on an understanding of the geometry of the capsids. 40,41 None of these methods, however, combine consistently coarse-graining with force-field updates. This is the specific feature of Decimate. Our second very large test system is the ribosome, a large molecular machine that is made of structural RNA and proteins. The ribosome plays an essential role in the cell as it translates the genetic information carried by the messenger RNA into protein products. During protein synthesis, the ribosome-mRNA-tRNA assembly undergoes a series of necessary, highly correlated movements. Therefore, understanding the dynamics of the ribosome is essential for understanding its function. 42–44 Computational methods are expected to play an important role when studying these dynamics. 45–47 The ribosome, however, is a very large nano-machine. While there have been all-atom molecular simulations of full ribosome structures published, 48,49 significant efforts are currently being put into designing accurate coarse-grained representations of the ribosome. 14,45 We test the quality of the coarse-grained models generated with Decimate in the context of normal modes as well as for the generation of transition pathways between the initiation and the elongation states of the ribosome. 50 The paper is organized as follows. In the next section, we review the renormalization procedure and its application to generate coarse-grained representations of molecules in the context of normal mode analysis. The following section covers the details of the implemen-

6

ACS Paragon Plus Environment

Page 6 of 49

Page 7 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

tation of this method, focusing on the specific problems encountered with large molecular systems. The next section illustrates the process on two large molecular systems, the satellite Tobacco Mosaic Virus capsid, and the ribosome. We conclude with a discussion on future developments related to the normalization group and its applications to coarse-graining for biomolecules.

2

The strong disorder renormalization procedure and its applications to molecular systems

We follow the concept of strong disorder renormalization, using the implementation provided by Monthus and Garel. 35 Here we provide the details that are relevant to its application to decimate an elastic network model representing a molecular system.

2.1

The elastic network model

The ENM is defined by attaching elastic springs between selected pairs of atoms in the ground-state (X-ray) structure. A quadratic energy VEN M is then assigned to the corresponding network of springs that depend on the inter-atomic distances. Let B be a biomolecule containing N atoms. For two atoms i and j of B, we set rij and rij0 to be their Euclidean distances in any conformation C and in the ground-state conformation C 0 , respectively. VEN M is then defined as: N

VEN M =

N

1 XX H(rij0 )(rij − rij0 )2 2 i=1 j=1

(1)

where kij is the force constant of the spring between i and j, and H is an indicator function that indicates if the pair (i, j) is included in the ENM. There are currently two main alternative definitions of the function H. The traditional approach is to include a pair of atoms (i, j) if the distance between them in the X-ray structure is below a given cutoff Rc . There are 7

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

no theoretical reasons that validate a choice of value for Rc , and different implementations offer different options. The alternative is to construct first the Delaunay triangulation for all the atoms of the molecule, and select the edges of the resulting triangulation as the edges included in the ENM. This Delaunay-based ENM was shown to provide good fit between the dynamics described by its normal modes and the experimental B-factors. 51 In the original ENM introduced by Tirion, 19 the elastic constants kij are set to the same value for all pairs of atoms. In other models, kij vary for different pairs of atoms. For example, Ming and Wall employed an enhanced ENM in which the interactions of neighboring Cα atoms on the backbone were strengthened to reproduce the correct bimodal distribution of density-of-states from an all-atom model 52 . Kondrashov et al. used a strategy in which they classified residue interactions into several categories corresponding to different physical properties 53 . The elastic constants can also be adjusted to have the fluctuations of the atoms of the molecule computed from normal modes to match the atomic fluctuations captured experimentally and reported as B-factors. Many methods have been developed for that purpose (see for example Xia et al. 54,55 and references therein). Such refinement, however, may lead to overfitting. 56 We therefore follow a conservative approach and assign the same value for all kij

2.2

The renormalization procedure

Let G be the network representation of an ENM. We write G = (V, E), where V = {vi } and E = {eij } denote its vertices and edges, respectively. The network G is weighted, as each of its vertices is assigned a weight, mi , namely the mass of the atom it represents, and each of its edges eij is assigned a weight, kij , the elastic constant of the spring it represents. The key idea behind the renormalization of G is to identify its vertices and/or edges that are associated with high frequencies in the dynamics of the network of springs. If such a frequency is high, its associated mode will not be excited in the low-energy modes of the whole system. As those are the modes of interest to study the dynamics of a biomolecule, 8

ACS Paragon Plus Environment

Page 8 of 49

Page 9 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

one may then eliminate it. This observation has led to the following iterative renormalization procedure: a) Initialization. Set N (0) = |V| and N E(0) = |E| to be the initial numbers of vertices and edges in the ENM. For each edge eij , set Kij = kij For all pairs (i, j) that do not form an edge in G, set Kij = 0. b) Find the renormalization scale. At a step k of the procedure, there are N (k) remaining vertices vi with weight mi and N E(k) remaining edges eij with weights Kij . With each edge eij , we associate the frequency Ωij given by Ω2ij

= Kij



1 1 + mi mj



(2)

With each vertex vi , we associate the frequency Ωi given by N (k)

Ω2i =

X

Kij

j=1

mi

(3)

The renormalization scale Ω is defined as the highest frequencies among all vertices and edges,

Ω = max{Ωi , Ωij }

(4)

c) Decimation of the mode associated with the highest frequency Ω. i) If the highest frequency is associated with the edge ei0 j0 : the two vertices vi0 and vj0 are replaced by their center of mass C(i0 , j0 ) with mass mC = mi0 + mj0 . Any edge attached to i0 or j0 is replaced with an edge to the center of mass C(i0 , j0 )

9

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 49

with a weight:

KjC = Kji0 + Kjj0

(5)

Note that if j is attached to i0 and j0 , we only create one new edge between j and C(i0 , j0 ). ii) If the highest frequency is associated with the vertex vi0 : the vertex vi0 is removed, and the weights of the edge between two neighbors vi and vj of vi0 is renormalized according to:

Kij = Kij +

Kii0 Kji0 mi Ω2i0

(6)

The update of the weights is applied even if the edge (i, j) was not present in G, in which case it is added. After renormalization, the subset of V(k)that includes the neighbors of i0 forms a clique, i.e. its subgraph is complete. d) Iteration. Return to step b) to update the frequencies associated with the surviving vertices and to the surviving renormalized edges. The procedure described above allows for iterative removal of the high frequency vertices and edges in the network. In the special case of an ENM computed either from the Cα or from all atoms of a molecule, we note that the weights on the vertices (i.e. the masses) are very similar to each other. As a consequence, it is extremely unlikely that the renormalization scale is associated with an edge. In all the test cases described in the Result sections, we have not observed the removal of an edge.

2.3

Validation

The renormalization procedure described above is designed to remove the vertex or edge with the highest frequency at each iteration. We show here that if it is a vertex that is removed, 10

ACS Paragon Plus Environment

Page 11 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

the frequencies associated with the surviving vertices either remain constant or are reduced, which guarantees that the modes of the remaining network have lower frequencies. Let vi0 be the vertex identified in step c) of the renormalization procedure at step k. There are two types of vertices j in V(k) that differ from vi0 : those that are attached to vi0 (i.e. such that eji0 ∈ E(k)), and the others. All vertices j in the latter category will have their frequencies Ωj unchanged. Let us consider now a vertex vj of the first category. Once the vertex vi0 is removed, there are two updates that affect j: i) The edge eji0 is removed, which leads to Kji0 = 0. ii) All “edges” between vj and another neighbor vl of vi0 have their weights updated according to Equation 6. Note that these “edges” may be virtual before the update, i.e. they do not belong to V(k) and their weight Kjl is set to 0. After the update, however, they are part of the network, with a non zero weight. Based on those observations, we can update the frequency associated with j:

Ω2j

=

Ω2j

N (k) Kji0 1 X Kji0 Kli0 + − mj mj l=0;l6=j mi Ω2i0

N (k) X Kji0 Kji0 + = − Kli mj mj mi Ω2i0 l=0;l6=j 0   N (k) X Kji0 Kji0  = Ω2j − Kli0 − Kji0  + mj mj mi Ω2i0 l=0

Ω2j

 Kji0 Kji0 mi Ω2i0 − Kji0 + 2 mj m j m i Ωi 0 Kji2 0 Kji0 Kji0 = Ω2j − + − mj mj mj mi Ω2i0 Kji2 0 2 = Ωj − mj mi Ω2i0

= Ω2j −

(7)

Therefore Ωj is guaranteed to decrease. Note that Equation 7 provides a closed form solution to the update of the frequencies of the vertices of the network. 11

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

3

Algorithmic implementation

3.1

A general algorithm for decimation by renormalization

The renormalization procedure described above translates trivially into the Algorithm 1. For sake of clarity, we write it here only for the removal of vertices. Algorithm 1 Decimate: a renormalization procedure for Elastic Network Models of Biomolecules Input: G = (V, E) the elastic network. {mi } the masses of all vertices in V and {kij } the weights of all edges in E. Initialize: Let A be the adjacency matrix associated with G. Set A(i, j) = kij for all (i, j) such that eij ∈ E and A(i, j) = 0 otherwise. Set N = |V|. for k = 0, . . . do (1) Compute frequencies of all remaining vertices: for i = 1, N do if mi > 0 then N X 1 Ω2(i) = mi A(i, l) l=1

end if end for (2) Find vertex with highest frequency: i0 = argmin Ω2(i) i

(3) Define Li0 = {j | A(i0 , j) 6= 0} (4) Update A: for i ∈ Li0 do for j ∈ Li0 j 6= i do 0 )A(j,i0) A(i, j) = A(i, j) + A(i,i mi0 Ω2(i0 ) end for end for (5) Remove i0 : for i = 1, N do A(i, i0 ) = 0 A(i0 , i) = 0 end for Set mi0 = 0; Set V(k + 1) = V(k) − {vi0 } end for Output: The new set of vertices V(k); the matrix A that contains the updated elastic constants kij .

12

ACS Paragon Plus Environment

Page 12 of 49

Page 13 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

3.2

A data structure adapted to decimating very large networks

Algorithm 1 is an efficient method to handle the network (graph) defined by the ENM as it relies on the adjacency matrix A. As such, the search for and removal of an edge in the network are both operations with O(1) (constant) complexity, i.e. optimal. The drawback, however, of using the adjacency matrix is the memory requirement. As written, the procedure requires the storage of the matrix A of size N × N , where N is the number of vertices in the initial network. Even considering that this storage can be reduced to (N (N − 1))/2, as A is symmetric, it becomes prohibitively expensive for large N (N = 40, 000 for example would require more than 6 GB of memory). The usual alternative to adjacency matrices for large networks is to use an adjacency list. The adjacency list is efficient in memory as it only stores the actual edges of the network. As such, its memory complexity is of order O(|E|). The gain is significant when the network is sparse, but only limited when the network is dense. The use of an adjacency list for Decimate, however, would render the algorithm significantly more complex. Indeed, the network is not static as it is updated iteratively, with vertices being removed, and edges being removed and added at each iteration. To handle such a dynamic network, we have implemented a different data structure based on an array of hash tables. To define which data structure may work best for Decimate, we make first two observations. First, in steps 1 and 3 of the algorithm 1 described above, we need to find the list Li of all vertices vj that are connected with an edge to a given vertex vi in the network. Second, in step 4, we need to check all pairs of vertices vi and vj in the list Li to see if they already define edges. If they do, we modify their weight, while if they don’t, we create the edge and define its weight. From the first observation, it is clear that we need an efficient way to access information about vertices. As we start the procedure with all vertices, the simplest solution is to create an array of pointers to data structures, where the array is of size N , the number of vertices. There are thus several options for the data structure specific to a given vertex vi . We need to store in that data structure the list of neighbors of vi . We 13

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

need to search, insert, and delete from this list. The best data structure for the combination of those operations is a hash table. Such a hash table is characterized by three elements: its size, its hash function, and the procedure put into place for dealing with possible collision. We discuss all three elements below. The size (defined as SIZE in the implementation) for a hash table is usually the result of a compromise between efficiency in computing time and efficiency in memory requirement. A large table has the advantage of limiting its load and reducing the number of collisions, but it does require more storage. For the molecular systems studied here, we found that setting this size to 50 led to an efficient implementation. The choice of a good hash function is probably even more important than the size of the hash table. A poor hash function will lead to many collisions and therefore increase the computing cost. In the case of the molecular systems considered in this study, we note that the index given to the hash function is an atom number. From the format of PDB files, we know that the sequential order of atoms in the file map to spatial proximity. As such, the indices of atoms that are close to each other in 3D space are often consecutive. Consequently, we need a hash function that handles well the lower digits of the key. The best choice for such a hash function is usually the remainder function, i.e. the key associated to an index i is the remainder of the division of i by SIZE. Unless the hash function is the identity function, there is always a risk of a collision when using hash tables, i.e. that two indices lead to the same key. There are many options to avoid/handle such collisions, such as linear probing or linked lists. We have implemented the latter.

3.3

Avoiding network saturation

Step 4 of the algorithm described above leads to the possible addition of many edges in the network, since the subgraph formed by the vertices linked to the vertex being removed becomes complete. In theory, this would lead to the whole network becoming complete, in 14

ACS Paragon Plus Environment

Page 14 of 49

Page 15 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

which case the advantage of the data structure described above would disappear. The whole procedure would then become limited to small systems because of memory requirements. To avoid this behavior, we monitor the correction term defined in Equation 6. If this correction term becomes smaller than a cutoff ǫ (set to 1 × 10−5 in our implementation), we do not include any new edge that would be assigned this correction term. Application of this cutoff keeps the procedure tractable even for large networks. The whole procedure was implemented into a C++ program, Decimate, which is available under OpenSource format upon request.

4

Molecular dynamics based on normal modes

4.1

Normal mode analysis based on elastic network models

In the normal mode framework, the potential VEN M given by Equation 1 is approximated with a second-order Taylor expansion in the neighborhood of the ground state X 0 : 1 VEN M (X) ≈ VEN M (X 0 ) + ∇VEN M (X 0 )T (X − X 0 ) + (X − X 0 )T H(X − X 0 ) 2

(8)

where ∇VEN M and H are the gradient and Hessian of VEN M , respectively. Note that based on Equation 1, VEN M (X 0 ) = 0 and ∇VEN M (X 0 ) is the null vector (i.e. X 0 is a global minimum of VEN M by definition). The ENM energy is then simply 1 VEN M (X) ≈ (X − X 0 )T H(X − X 0 ) 2

(9)

In Cartesian coordinates, the equations of motion defined by the potential VEN M are derived from Newton’s equation: d2 X = −H(X − X 0 ) dt2 15

ACS Paragon Plus Environment

(10)

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 49

Writing the solution to this equation as a linear sum of intrinsic motions (the “normal modes” of the system),

Xj =

3N X

Ajk αk cos(ωk t + δk )

(11)

k=k0

we get a standard eigenvalue problem,

HA = M AΩ

(12)

The eigenfrequencies ω are given by the elements of the diagonal matrix Ω, namely ωi2 = Ω(i, i). The eigenvectors are the columns of the matrix A, and the amplitudes and phases, αk and δk , are determined by initial conditions. The matrix M is a diagonal matrix containing the masses of the atoms. We note that the first six eigenvalues in Ω are equal to 0, as they correspond to global translations and rotations of the biomolecule. To characterize the internal motions of the biomolecule, the sum in Equation 12 runs then from k0 = 7 up to 3N , the number of degrees of freedom of the system.

4.2

Atomic fluctuations computed from normal modes

From the normal modes of the ENM, it is possible to compute the mean square fluctuations of the positions of the atoms according to: < ∆X2i >=

m kB T X A2ik mi k=7 ωk2

(13)

where ∆Xi and mi are the displacement vector and mass of vertex i, respectively, kB is the Boltzmann’s constant, T is the temperature considered, Aij is the i-th component of the j eigenvector Aj of the Hessian, and ωi is its associated eigenvalue. The summation should run over all the modes of the system (excluding the six modes for rigid body transformation); it is truncated here to the first m = 100 modes, as those low frequency modes are usually 16

ACS Paragon Plus Environment

Page 17 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

responsible for most of the atomic fluctuations (see above).

4.3

Correlated motions within a biomolecule

The Boltzmann distribution for the approximate potential of the ENM (see Equation 9) is described by a multivariate Gaussian distribution with a covariance matrix proportional to the inverse of the Hessian H. Because of the six rigid motions captured by the six normal modes with 0 frequencies, the inverse of H is in fact not properly defined. We can, however, compute a pseudo-inverse by ignoring those zero energy modes; this pseudo-inverse can be regarded as a covariance matrix of internal deformation: M X 1 C= A AT 2 k k ω k=7 k

(14)

where ωk and Ak are the k − th eigenvalues and eigenvectors, respectively. Note that C is a 3N × 3N matrix. The summation extends from k = 7, the first non-zero mode, to M , the highest mode considered (up to 3N ). To obtain a scalar quantification of the correlation of the motions of two atoms i and j, a correlation matrix P is computed, following: 57 tr(Cij ) Pij = p tr(Cii )tr(Cjj )

(15)

where Cij is a 3×3 block matrix extracted from the overall covariance matrix C, corresponding to atoms i and j. The values Pij range from −1 to +1, with a negative value indicating an anticorrelated motion, and a positive value identifying a correlated pattern of dynamics between the two atoms considered. These values are stored into a cross-correlation matrix CCM that is used to visualize correlations of motion within the molecule under study.

17

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

4.4

Page 18 of 49

Relative motions of two subunits

The relative motions of two subunits SI and SJ of a molecular system are computed as the orientation correlation between the center of mass of the displacements of the two subunits. 58 Let ai for i ∈ [1, |SI |] be the set of atoms corresponding to the subunit SI , with |SI | being the cardinality of SI . The center of mass displacements for a mode k are defined as:

∆RI (k) =

|SI | X

Ak (ai )

(16)

i=1

where Ak is the k-th eigenvector of the Hessian, and Ak (ai ) is the vector of the three components of Ak corresponding to atom ai . The orientation correlation between the two displacement vectors ∆RI (k) and ∆RJ (k) for mode k is computed as:

CIJ (k) =

∆RI (k) · ∆RJ (k) |∆RI (k)||∆RJ (k)|

(17)

where |R| is the norm of vector R. The orientation correlation is then averaged over a subset of all modes of the elastic network considered using the weighted sum:

< CIJ >= PM

1

k=7

ωk−2

 M  X CIJ (k) k=7

ωk2

(18)

The values CIJ range from −1 to +1, with a negative value indicating an anti-correlated motion, and a positive value identifying a correlated pattern of dynamics between the two subunits considered.

18

ACS Paragon Plus Environment

Page 19 of 49

106

100

Completeness (%)

20

105

# of links

15 10 5 0 0

C)

B)

A) Maximum frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

4

10

3

10

2

10 200

400

600

800 1000 1200

101 0

200

400

600

800 1000 1200

# of atoms removed

# of atoms removed

80 60 40 20 0 0

200

400

600

800 1000 1200

# of atoms removed

Figure 1: Decimating the elastic networks of myoglobin. The structure of myoglobin (PDB code 1A6M) includes 1204 heavy atoms. Starting with the elastic network constructed with all those atoms and a cutoff of 10 ˚ A, we decimated iteratively this network using the renormalization procedure described in Section II. We monitor the highest frequency in the network (A), the number of edges in the network (B), and its completeness (C), i.e. the ratio (in %) of the number of edges over the total number of possible edges.

5

Results

A simple system: myoglobin To illustrate the application of Decimate, we consider a small globular protein: sperm whale myoglobin. Many structures of this myoglobin are available in the Protein Data Bank (PDB, 59 ); we have used the structure with PDB code 1A6M, as it is a high resolution structure (1 ˚ A). 60 We isolated from the PDB all 1204 heavy atoms. We then generated an elastic network with all those atoms using a cutoff procedure, with the cutoff set to 10 ˚ A. The corresponding network contains 1204 vertices and 81810 edges. We then ran the Decimate algorithm presented above on this network, removing atoms iteratively until only 0.5% of the atoms remain, i.e. 6 atoms. Characteristics of the decimation process are illustrated in Figure 1. As expected, the highest frequency in the network decreases as the decimation proceeds (Figure 1A). As a vertex is removed from the network, the updates of the edges lead to the subgraph formed by the vertices attached to that vertex to be complete. It is therefore expected that the total number of edges in the network increases. However, as the number of vertices keep decreasing, the number of edges does eventually decrease. This is

19

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

indeed observed for the decimation of 1A6M (Figure 1B). The whole network, however, fills up fast and is already complete when 50% of the vertices remain, i.e. 600 atoms. (Figure 1C). All Atoms

DEC50

DEC10

Figure 2: Three snapshots of the coarse-grained representations of myoglobin. The vertices (top) and networks (bottom) are shown at the beginning (ALL), at 50% decimation, i.e. when only 50% of the atoms remain (DEC50), and at 10% decimation (DEC10). These networks contain 1204, 602, and 120 atoms, and 81810, 180536, and 7140 edges, respectively. In Figure 2, we show examples of the elastic networks for myoglobin as we go from the initial network down to a network with only 10% of the atoms. Clearly, the whole geometry and shape of the molecule is conserved through the decimation process. Interestingly, in both the DEC50 and DEC10 networks, the remaining atoms tend to cluster at the surface of the protein; this is expected as myoglobin is globular and has a roughly constant density and its dynamics can be captured from information on its surface. The main advantage of the method presented here is that the decimated networks are expected to maintain the low energy modes for the dynamics of the system. We tested this property by computing the normal modes associated with the elastic networks at different levels of decimation. From those normal modes we computed the mean square fluctuations 20

ACS Paragon Plus Environment

Page 20 of 49

Page 21 of 49

of the positions of the vertices (atoms) according to Equation 13. We computed these fluctuations for the 120 atoms in the DEC10 network, for the three elastic networks ALL, DEC50, and DEC10. Results are given in Figure 3. The correlation coefficients between the atomic fluctuations computed from the ALL network and the fluctuations computed from the DEC50 and DEC10 networks are 0.98 and 0.90, respectively, and 0.91 between DEC10 and DEC01. These high correlation values indicate that the dynamics of the myoglobin are well preserved during the decimation process. In parallel, the correlation coefficients between the experimental B-factors for myoglobin and the computed atomic fluctuations are 0.44, 0.43, and 0.38 for the ALL, DEC50, and DEC10 networks, respectively. Those values are modest. We note that it would be possible to obtain significantly better correlations if the elastic constants kij assigned to the links of the networks were fitted to improve the match between B-factors and computed fluctuations. This is not the focus of this paper. 10 9

All atoms DEC50 DEC10

8

Atomic Fluctuation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

7 6 5 4 3 2 1 0 0

20

40

60

80

100

120

Atom index

Figure 3: Atomic fluctuations in myoglobin. The atomic fluctuations computed using Equation 13 for the three elastic networks ALL, DEC50, and DEC10 are given for the 120 atoms included in DEC10.

21

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Coarsening the Satellite Tobacco Mosaic Virus The satellite tobacco mosaic virus is a small, icosahedral plant virus that worsens the symptoms of infection by the tobacco mosaic virus (TMV). The entire STMV particle consists of 60 identical copies of a single protein that make up the viral capsid, and a 1063-nucleotide single-stranded RNA genome which codes for the capsid and one other protein of unknown function. Its high-resolution structure was recently obtained by X-ray crystallography; it is, in fact, the highest resolution native virus structure currently known. The structure is available in the PDB with the identifier 4OQ9. The PDB file 4OQ9 provides the conformation of one quarter of the full viral capsid, namely 15 out of the 60 copies of its coat protein. After generating the structure of the full capsid using crystallographic symmetry, and after selecting one of the positions for atoms with alternate conformations (i.e. the conformation with the highest occupancy), we isolated from the corresponding reconstruction all heavy atoms of the viral capsid. While the coat protein is 159 residue long, we note that the positions of residues 1-15 are not provided in the original PDB file. Therefore, there are 8640 residues present in the full capsid, for a total of 67080 heavy atoms. We generated first the all-atom elastic network for the virus capsid using a cutoff procedure, with the cutoff set to 6 ˚ A. The corresponding network contains 67080 vertices and 1397150 edges. We then ran the Decimate algorithm on this network, removing atoms iteratively until only 1% of the atoms remain, i.e. 670 atoms. In Figure 4, we show examples of the vertices and networks for the STMV as we go from the initial network down to a network with only 1% of the atoms. Clearly, the whole geometry of the molecule is conserved through the decimation process. On all three examples, the limits of the capsid envelop are clearly visible. We computed the hundred lowest frequency normal modes associated with three elastic networks of STMV: the all-atom network (ALL), the network with 90% of the atoms removed, DEC10, and the network with 99% of the atoms removed, DEC01. To measure the similarity 22

ACS Paragon Plus Environment

Page 22 of 49

Page 23 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Full capsid (67080 atoms)

Full elastic network

Decimated capsid 10% (6708 atoms)

Decimated network (10%)

Decimated capsid 1% (670 atoms)

Decimated network (1%)

Figure 4: Three snapshots of the coarse-grained representations of the capsid of the satellite tobacco mosaic virus (STMV). The vertices (top) and networks (bottom) are shown at the beginning (full), at 10% decimation, i.e. when only 10% of the atoms remain, and at 1% decimation. These networks contain 1397150, 2839884, and 211297 edges, respectively. between two normal modes i and j derived from two of those different models, we computed the overlap O(i, j) between their directions:

O(i, j) =

|ai . · bj | |ai ||bj |

(19)

where ai and bj are the i-th and j-th normalized eigenvectors of the Hessian matrix. Note that O(i, j) corresponds to the absolute value of the cosine of the angle between the two vectors ai and bj . This value is between 0 and 1, with a large value indicating a high similarity of two modes. We computed the overlaps between modes 7 to 11 of the ALL elastic network with the two decimated networks over all atoms present in the smallest network. Results are given in Table 1. From Table 1, we observe that the first five non-zero modes in the ALL network (modes 7-11) match with the same five first non-zero modes in the decimated networks. Modes 8, 23

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 49

Table 1: Overlaps between the normal modes of the full elastic network and decimated elastic networks of STMV Network Modes 7 c) 8 9 10 11

7 a) 0.116 d 0.972 0.001 0.197 0.000

8 0.649 0.224 0.000 0.725 0.001

DEC10 9 10 0.000 0.752 0.000 0.044 0.178 0.000 0.001 0.656 0.983 0.001

11 0.000 0.001 0.983 0.000 0.178

Sum b) 1.000 0.997 0.997 0.994 0.997

7 a) 0.001 0.136 0.004 0.957 0.042

8 0.415 0.028 0.285 0.042 0.824

DEC01 9 10 0.198 0.854 0.065 0.001 0.920 0.070 0.003 0.019 0.224 0.455

11 0.002 0.954 0.071 0.132 0.015

Sum b) 0.940 0.934 0.939 0.936 0.937

a)

Mode number based on the decimated network. Sum of the squared overlap values over the first five modes. c) Mode number based on the full network ALL. d) Overlap between the two modes, computed using Equation 19. The highest values are highlighted in bold. b)

9, and 11 in the full network overlap significantly (> 0.9, most even > 0.95) with modes 7, 11, and 9 in the DEC10 network, respectively, and with modes 11, 9, and 7 in the DEC01 network, respectively. Such shuffling of the modes between two models is not unusual. Interestingly, modes 7 and 10 in the ALL network appear in two modes both in the DEC10 network (modes 8 and 10) and as single modes in the DEC01 network. These results indicate that the eigenvectors corresponding to the 5 lowest frequency modes of the Hessian for ENM are conserved during the decimation process. Those eigenvectors define the directions of the motions in the network. From the normal modes of the elastic networks, we compute also the mean square fluctuations of the positions of the atoms according to Equation 13. We computed these fluctuations for the 670 atoms in the DEC01 network, for the three elastic networks ALL, DEC10, and DEC01. We compared those fluctuations to the experimental B-factors available in the PDB file. Results are given in Figure 5. There are significant similarities between the experimental B-factors and the computed atomic fluctuations for all three networks. The corresponding correlation coefficients are 0.59, 0.05, and 0.15 for the ALL, DEC10, and DEC01 networks, respectively. We note, however, that the two latter values are low due to the presence of a few large atomic fluctuations. Those values are also large in the B-factors, but with a much smaller range. If we remove these outliers (28 of them), the correlation coefficients between the B-factors and 24

ACS Paragon Plus Environment

Page 25 of 49

70 Experimental 60

50 FULL

B-factors

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

40

DEC10

30

20

DEC01 10

0 0

1

2

3

4

Atom #

5

6

7 # 10

4

Figure 5: Atomic fluctuations in the capsid of STMV. The experimental B-factors and the corresponding atomic fluctuations computed using Equation 13 for the three elastic networks FULL, DEC10, and DEC01 are given for the 670 atoms included in DEC01. The different curves have been shifted for clarity. See text for details. the computed atomic fluctuations are 0.59, 0.35, and 0.21 for the ALL, DEC10, and DEC01 networks, respectively. Those values are modest. More significant, however, is the fact that the correlation coefficients between the atomic fluctuations computed from the ALL network and the fluctuations computed from the DEC10 and DEC01 networks are 0.87 and 0.68, respectively, and 0.90 between DEC10 and DEC01. These high correlation values indicate that the range of the dynamics of the atoms of SMTV is well captured by the decimated elastic network models. In Figure 6 we compare the correlations of the atomic fluctuations of the 670 atoms in the DEC01 network computed from the first hundred non-zero normal modes of the three elastic networks ALL, DEC10, and DEC01, using Equation 15. This is a more stringent test than simply looking at atomic fluctuations, as it assesses the extent to which collective

25

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

A)

B)

ALL

Page 26 of 49

DEC10

1

1

0.8

0.8

100

100 0.6

0.6

200

200

0.4

0.4

0.2

0.2

300

300 0

400

0

-0.2

400

-0.2

-0.4

500

-0.4

500 -0.6

600

-0.8

100

200

C)

300

400

500

600

-0.6

600

-1

-0.8

100

200

300

400

500

600

-1

D)

DEC01 1 0.8

100 0.6

200

0.4 0.2

300 0

400

-0.2 -0.4

500 -0.6

600

-0.8

100

200

300

400

500

600

-1

Figure 6: Correlated motions in the capsid of STMV. Cross Correlation Matrices (CCM) obtained from the 100 first non-zero modes computed from the ALL (A), DEC10 (B), and DEC01 (C) elastic networks of the capsid to STMV. Those plots show correlations between the motions of atoms in each elastic network considered. Both axes of a matrix are the atom indices. Only the 670 atoms included in the DEC01 network are represented. Each cell in a matrix shows the correlation between the motions of two atoms in the capsid on a range from -1 (anti-correlated, blue) to 1 (correlated, red), with 0 conferring no correlation. D) The full virus capsid is shown in cartoon mode, with the positions of the 670 atoms of the DEC01 network superposed as blue spheres. Panel D was generated using Pymol (http://www.pymol.org). motions are maintained. The three cross correlation matrices (CCM) corresponding to the three elastic networks show identical patterns. These results confirm that the decimation is conservative in that it retains the overall dynamics of the viral capsid.

26

ACS Paragon Plus Environment

Page 27 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Coarsening the ribosome A) Ribosme - tRNA complex

C) DEC10

B) ALL

D) DEC02

E) DEC01

Figure 7: Snapshots of the coarse-grained representations of the ribosome-mRNAtRNA complex. A) The full translation complex is shown in cartoon mode (for the 50S and 30S) and space-filling mode (and the tRNAs), with the following color code: 50S in orange, 30S in green, A-tRNA in red, P-tRNA in purple, and E-tRNA in blue. The mRNA is too small to be seen at this scale. (B) to (E): the vertices of the ALL, DEC10, DEC02, and DEC01 networks, with the same color coding as in panel (A). Note that while the overall shape of the complex is maintained in the DEC10 and DEC02 networks, the 30S unit becomes underrepresented in the DEC01 network. The ribosome is a large nano-machine formed with two large subunits, named 50S and 30S, that are themselves composed of large ribosomal RNA molecules bound to a large collection of proteins. The ribosome-mRNA-tRNA assembly undergoes a series of necessary, highly correlated movements during protein synthesis. This complex is a harder challenge than the STMV capsid considered above, as it is not symmetric and it includes subunits of significantly different sizes. We have considered the recently published all-atom highresolution structure (R=2.6 ˚ A) of this complex from T. thermophilus obtained by X-ray crystallography. 61 It is available in the PDB with the identifier 5J4B. 27

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 49

Table 2: Number of atoms in the different elastic networks of the ribosome 50S 30S mRNA A-tRNA P-tRNA E-tRNA Total a)

ALL 91157 (62.2 %) a 50545 (34.5 %) 277 (0.2%) 1438 (1.0 %) 1543 (1.1 %) 1438 (1.0%) 146397

DEC10 8916 (60.9%) 5052 (34.5%) 42 (0.3%) 222 (1.6%) 198 (1.3%) 209 (1.4%) 14639

DEC02 1833 (62.6 %) 864 (29.5 %) 9 (0.3%) 81 (2.7%) 61 (2.1%) 79(2.7%) 2928

DEC01 1138 (77.8%) 141 (9.6%) 5 (0.3%) 67 (4.6%) 42 (2.9%) 70 (4.8%) 1463

number of atoms, and, in parenthesis, percentage with respect to the total number of atoms for the elastic network considered

We isolated from the PDB all heavy atoms of the ribosome complex. As mentioned above, it includes 6 different units, the 50S and 30S subunits of the ribosome itself, a short mRNA, and three tRNA at positions A, P, and E (see Figure 8D) . We generated first the all-atom elastic network for the complex using a cutoff procedure, with the cutoff set to 5 ˚ A. The corresponding network, ALL, contains 146397 vertices and 1884434 edges. We then ran the Decimate algorithm on this network, removing atoms iteratively until 10% 2%, or 1% of the atoms remain, i.e. 14639, 2928, and 1463 atoms, respectively. The corresponding decimated networks are referred to as DEC10, DEC02, and DEC01. In Figure 7 we show visualizations of the full and decimated networks, and in Table 2 we compare the relative distributions of vertices in the six units of the translation complexes for the four elastic networks. The Decimate procedure maintains remarkably well those distributions, until at least 98% removal of atoms. As the decimation goes further to only 1% of the atoms remaining, the proportions of atoms in the 50S and 30S units change significantly, from 62% and 34% in the ALL and DEC10 networks, to 78% and 10% in the DEC01 network. The 30S becomes highly under-represented, as observed in Figure ??E. The reasons for this effect remain mostly unclear. We suspect that it reflects the differential nature of the motions experienced by the 30S compared to the 50S. 58 It is possible however that this effect is due to the fact that our implementation of the decimation process is only an approximation, as we only include new edges when their force constants are larger than a given cutoff.

28

ACS Paragon Plus Environment

Page 29 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

A)

ALL

B)

DEC10

1

1

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

-0.2

-0.2

-0.4

-0.4

-0.6

-0.6

-0.8

-0.8

-1

50S

C)

30S A P E

D)

DEC02

50S

-1

50S

30S A P E

DEC01

1

1

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

-0.2

-0.2

-0.4

-0.4

-0.6

-0.6

-0.8

-0.8

-1

-1

50S

30S A P E

30S A P E

Figure 8: Correlated motions in the ribosome. Cross Correlation Matrices (CCM) obtained from the 100 first non-zero modes computed from the ALL (A), and 40 first non zeros modes computed from the DEC10 (B), DEC02 (C), and DEC01 (D) elastic networks of the ribosome-mRNA-tRNA complex. Those plots show correlations between the motions of atoms in each elastic network considered. Both axes of a matrix are the atom indices. The positions of the different units of the complex are marked, with a color coding matching the color coding in Figure 7. Only the 1463 atoms included in the DEC01 network are represented. Each cell in a matrix shows the correlation between the motions of two atoms on a range from -1 (anti-correlated, blue) to 1 (correlated, red), with 0 conferring no correlation. In Figure 8 we compare the correlations of the atomic fluctuations of the 1463 atoms in the DEC01 network computed from the four elastic networks ALL, DEC10, DEC02, and DEC01, using Equation 15. The three cross correlation matrices (CCM) corresponding to the three elastic networks ALL, DEC10, and DEC02 show very similar patterns, both within each unit of the translation complex, and between the different units. Network DEC01 was 29

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

found to be different in its compositions of vertices in the different units of the translation complex (see Table 2). Interestingly however, most of the correlations of atomic fluctuations remain the same. The only noticeable difference is found in the region 710-790 within the 50S unit; the dynamics of this region are found to be strongly anti-correlated to the rest of the ribosome-mRNA-tRNA complex in the DEC01 network, and only moderately anticorrelated in the ALL and DEC10 network. Interestingly, this region corresponds to helices 76 to 78 of the 23S rRNA, a region involved in directing tRNA movement. 62 This region is part of the L1 stalk, a highly mobile protuberance of the 23S rRNA. It is therefore not surprising that it shows higher dynamics in the highly decimated network. Table 3: Correlations of the motions of the different components of the ribosome over the first 100 non-zero normal modes for the ALL and DEC10 networks 50S 50S 30S -0.926 b mRNA -0.747 A-tRNA -0.505 P-tRNA -0.554 E-tRNA -0.352

30S -0.842 0.719 0.421 0.457 0.216

a

mRNA A-tRNA P-tRNA E-tRNA -0.466 -0.396 -0.320 -0.258 0.645 0.516 0.472 0.251 0.849 0.877 0.694 0.579 0.906 0.631 0.667 0.756 0.780 0.480 0.382 0.566

a)

Upper triangle part: correlations computed with Equation 18 for the 100 first non-zero normal models of the ALL elastic network b) Lower triangle part (in italic): correlations computed with Equation 18 for the 100 first non-zero normal models of the DEC10 elastic network

Jernigan and colleagues studied the dynamics of the ribosome-mRNA-tRNA complex using normal modes of a coarse grained network that includes the Cα of the protein residues and the phosphate atoms P of the nucleic acids. 58,63 Using orientation correlation between the center of mass of the displacement of the six different units of the complex, they were able to illustrate the ratchet motion between the 50S and 30S units. We repeated their analysis using the four elastic networks ALL, DEC10, DEC02, and DEC01. Comparative results between ALL and DEC10, and between DEC02 and DEC01 are given in Tables 3 and 4, respectively. All four networks reproduce well the ratchet, anti-correlated motion

30

ACS Paragon Plus Environment

Page 30 of 49

Page 31 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Table 4: Correlations of the motions of the different components of the ribosome over the first 100 non-zero normal modes for the DEC02 and DEC01 networks 50S 50S 30S -0.530 b mRNA -0.327 A-tRNA -0.285 P-tRNA -0.240 E-tRNA -0.416

30S -0.771 0.301 0.067 0.141 -0.005

a

mRNA A-tRNA P-tRNA E-tRNA -0.296 0.023 -0.016 -0.166 0.324 -0.167 -0.060 -0.083 0.227 0.438 0.041 0.262 0.444 -0.125 0.553 0.475 -0.074 0.352 -0.063 0.041

a)

Upper triangle part: correlations computed with Equation 18 for the 100 first non-zero normal models of the DEC02 elastic network b) Lower triangle part (in italic): correlations computed with Equation 18 for the 100 first non-zero normal models of the DEC01 elastic network

between the 30S and 50S subunits, albeit with a lower correlation for the DEC01 network, as expected from its misrepresentation of the 30S unit (see above). Similar to the results reported by Jernigan and colleagues, 58,63 we find positive correlations of motion among the tRNAs and mRNA and a positive correlation of the motion of the mRNA with respect to the motion of the 30S unit.

Structural transitions of the ribosome The functions of many bio-molecules strongly correlate with conformational changes in their structure space, a process usually referred to as their activations. This process is very much at the core of enzymatic activity, as an enzyme and its substrate usually go through structural transitions that favor the chemical reaction. Many computational methods have been developed for studying such transitions and define the transition state along its path. Given two conformations for a bio-molecule, those methods attempt to find a plausible path between these conformations along the energy surface for the molecule, where plausible refers to a path with minimal frustration, either the minimum energy path, or minimum action path, depending on the type of frustration that is considered. 64,65 The computational complexities of these methods are directly related to the complexity of the potential energy

31

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

for the system of interest. While some studies have considered all-atom empirical energy functions, the use of such energy functions becomes computationally inefficient with an increasing size of the system and have therefore been limited to small proteins. 66–68 To reduce the computational costs, many recent studies have considered simplified, coarsegrained representations of the molecular system of interest. Here we are interested in testing how such coarse-grained representations compare to a full-atom representation within our own approach for computing transition paths, MinActionPath. 28,69 With this method, we compute the minimum action path between two conformations of a protein on a simplified, two-well free energy surface derived from the elastic network models of the two conformations. The energy in each well is computed as a second order Taylor expansion of the energy of the corresponding elastic network, and the combined energy at any conformation is defined as the minimum of the energy functions derived from the two wells. Using this energy functional, the equations of motion corresponding to the minimum action path are solved analytically in each well, and continuity conditions at the crossing between the two wells define the transition point. MinActionPath has proved useful in characterizing the structural reaction path of a tryptophanyl-tRNA synthetase that involves three conformationally distinct states. 70–72 Here we use MinActionPath to generate transition paths between the initiation and elongation complexes of the ribosome of Thermus Thermophilus, 50 using elastic networks based on all atoms representations of the complexes, as well as decimated networks with different levels of coarse-graining. The high-resolution structures of the Thermus Thermophilus ribosome forming the initiation complex (PDB codes 3I9D and 3I9E for the 30S and 50S subunits, respectively) and forming the elongation complex (PDB codes 3I8H and 3I8I for the 30S and 50S subunits, respectively) were determined by X-ray crystallography. 50 As we are interested in changes of conformations of the ribosome itself, we isolated from the PDB all heavy atoms of the 30S and 50S subunits for both complexes. As MinActionPath requires a consistent number of atoms in the two complexes, we trimmed them accordingly; the resulting complexes contain

32

ACS Paragon Plus Environment

Page 32 of 49

Page 33 of 49

B) ALL vs DEC08 2

2

1.5

1.5

DEC08

CA

A) ALL vs CA

1

1

0.5

0.5

0

0

ALL

ALL

C) ALL vs DEC02

D) DEC02 vs DEC08 2

2

1.5

1.5

DEC02

DEC02

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

1

1

0.5

0.5

0

0

ALL

DEC08

Figure 9: A comparison of computed trajectories between the initiation and elongation complexes of the Thermus Thermophilus ribosome. All trajectories were generated using MinActionPath, 28,69 using four different coarse-grained models for the conformations of the ribosome, namely all atoms (ALL), CA-only (CA), and the all-atom elastic network simplified with Decimate such that only 8 % (DEC08) or 2 % (DEC02) of the atoms remain. Two trajectories are compared by generating the contour plot of the betweenstructure cRMS values along the trajectories. The cRMS are computed using the atoms of the smaller model considered.

33

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

143,297 atoms. We generated first the all-atom elastic network for the two complexes using a cutoff procedure, with the cutoff set to 5 ˚ A. The corresponding networks are referred to as ALL. The simplest associated coarsened networks are obtained by selecting only one atom per residue, namely the Cα atom for amino acids and the phosphate atom P for nucleotides. We generated elastic networks for those simplified representations of the complexes, with a cutoff set to 14 ˚ A. The corresponding networks contain 19293 atoms. Finally, we ran the Decimate algorithm on the all-atom network, removing atoms iteratively until 8% or 2% of the atoms remain, i.e. 10288, and 2865 atoms, respectively. To ensure that the same atoms remain in the initiation and elongation complexes, the decimation processes were run simultaneously, using the initiation network to guide the choices of atoms to be removed. We terminated the decimation at 8% as it leads to networks with approximately the same size as the CA networks, and at 2% based on the results of the previous section. The corresponding decimated networks are referred to as DEC08 and DEC02. We note that all networks have a consistent representation of the 30S and 50S subunits: the 30S include 36 %, 38 %, 37 %, and 35 % of all atoms in the ALL, CA, DEC08, and DEC02 networks, respectively. MinActionPath trajectories were generated for all four types of elastic networks. Those trajectories include 100 conformations equally spaced with a dt = 0.5 for the ALL and CA networks, and dt = 0.01 for the DEC08 and DEC01 networks. Note that those time values are arbitrary. We performed a systematic comparison of the corresponding trajectories. For two trajectories T 1 = {a1 , . . . , an } and T 2 = {b1 , . . . , bn }, where ai and bj are conformations at discrete time values along the transition, we compute all cRM S(ai , bj ) and report the values as contour plots. Results for the four networks are reported in Figure 9. The trajectories generated from the four elastic networks ALL, CA, DEC08, and DEC02, share many similarities. These are very encouraging results. First, the DEC08 elastic network, with only 8% of the atoms, enables the generation of a trajectory that resembles the all-atom trajectory (Figure 9B). This result by itself is interesting, but not sufficient as similar results are obtained with the CA network that corresponds to the same level of

34

ACS Paragon Plus Environment

Page 34 of 49

Page 35 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

coarse-graining (Figure 9A). We do find, however, that the DEC02 trajectory mimics nearly exactly the DEC08 trajectory (Figure 9D). This is much more significant, as DEC02 only contains one fourth the number of atoms of DEC08. It is unclear how the CA elastic model could have been further simplified to the same level, and still enables the generation of a trajectory that remains consistent with those generated with more detailed networks.

Computing time Decimate is based on an iterative procedure in which vertices are removed one by one, with a reorganization of the elastic network in the vicinity of the vertex that is removed at each step. As discussed in the Algorithmic Implementation section, this procedure is efficient, of order O(N 2 ) if the full adjacency matrix can be stored. For large systems, however, storing this matrix is prohibitively expensive in memory. We have relied on specific data structures, namely arrays of hash tables, to circumvent this difficulty. The time complexity of the corresponding algorithm is more difficult to establish. We have therefore experimented with systems of varying size to estimate this complexity. We have applied Decimate on STMV, the ribosome complex, and 10 virus capsids varying in size from one hundred and eighty thousand atoms to one million atoms; the structures for these capsids were downloaded from the web enabled relational database VIPERdb2. 73 All these capsids are highly symmetric (icosahedral) and repetitive; we do not make use of these symmetries. For all capsids, we extracted all the heavy atoms, computed an all-atom elastic network with a cutoff of 5˚ A, and coarse grained this network down to 1% using Decimate. All those experiments were performed on a Macbook Pro laptop with a 3.1 GHz Intel Core I7 processor, with 16 GB of memory. The computing times for Decimate are plotted against the initial numbers of atoms and edges in the all-atom elastic networks in Figure 10. We tested two implementations of the Decimate program, one with the parameter SIZE (i.e. the size of the hash tables, see Implementation) set to 50, the other with SIZE set to 500. For relatively small systems, i.e. with less than 250,000 atoms, the two implementations 35

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

A) 9

x 104 SIZE: 50 SIZE: 500

CPU time (seconds)

8 7 6 5 4 3 2 1 0 0

2

4

6

8

10 x 105

# of atoms in structure B) 10

x 104 SIZE: 50 SIZE: 500

CPU time (seconds)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 36 of 49

8

6

4

2

0 0

2

4

6

8

# of edges in initial EN

10

12 6 x 10

Figure 10: Running time for Decimate. The running time of the decimation procedure is plotted against the initial number of atoms (A), and the initial number of edges in the corresponding all-atom elastic network (B). The timings are computed on a single Intel Core I7 processor running at 3.1 GHz with 16 GB of RAM. Two settings for the hash tables implemented in Decimate are considered, with the size of the table set to 50 (red curves) or 500 (blue curves). Missing points on the blue curve correspond to cases where the calculations could not be completed because of memory limitations. The dashed lines show the best fit to a quadratic polynomial.

36

ACS Paragon Plus Environment

Page 37 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

perform similarly. For larger systems, setting the hash table size to a larger value improves the running time, however, at a large memory requirement cost. For systems with more than 450,000 atoms, computations with SIZE set to 500 could not complete due to memory limitation. The computation with SIZE set to 50 always completed. We find large fluctuations for the computing times of Decimate for systems with a similar number of atoms, see Figure 10A; those fluctuations are much smaller when we consider the number of edges in the initial elastic network as a variable, see Figure 10B. Both curves can be fitted well with a quadratic polynomial, indicating that the running time of Decimate is likely to be O(N 2 ), at least for systems of up to one million atoms.

6

Conclusion

Computational methods from all-atom molecular dynamics simulations to coarse-grained normal mode analyses based on simplified elastic networks as introduced by Tirion 19 provide a general framework to study the dynamics of biomolecular systems. 74,75 Despite recent successes in studying vary large systems with up to hundred million atoms, 8,9 those methods are currently limited to studying small to medium size molecular systems due to computational limitations. The hope to circumvent those limitations rests on improved computational resources and architecture and on the design of coarse-graining methods that reduce the size of the systems under study. In this paper, we have argued that coarse-graining must generate a consistent hierarchy of systems of reduced sizes, such that the system at one level contains the dynamics information of its parent level. We have described and implemented a new method, Decimate, that generates such a hierarchy in the context of elastic networks for normal mode analysis. This method is based on the concept of the renormalization group that is widespread in physics. We discuss in details its implementation and parametrization, focusing on issues that arise with very large systems, with millions of atoms. We have illustrated its application to study the dynamics of a large virus capsid and of the

37

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

ribosome-mRNA-tRNA translation complex. In both cases, we have shown that Decimate generates highly simplified coarse grained representations that still capture the dynamics of the all-atom systems. Decimate is based on a specialized data structure that allows us to incrementally update a network that is stored in sparse format. For small systems, this approach is as efficient as a direct method that is based on the storage of the full adjacency matrix of the network. For large systems, the latter becomes prohibitively expensive in terms of memory requirement. Decimate still performs well on such large systems, with some limitations on the choice of its parameters, such as the size of the hash tables that store the neighbor lists in the networks. Decimate, however, has an apparent time complexity that is O(N 2 ), where N is the number of edges in the all atom elastic network of the system under study. For large systems, with more than five hundred thousand atoms, this leads to significant computing time in the order of hours. While this is not by itself a deterrent to the application of Decimate - as it only needs to be used once on a system- there is definitely room for improvement. For symmetric systems such as viral capsids, their symmetric organization may provide simplification. 76 More generic systems require aggressive optimizations of the current implementation of Decimate. We reserve those improvements for future studies. We have shown applications of Decimate to generate coarsened representations of the ribosome complex to study structural transitions between two conformations of this complex. We have found that the transitions generated with the decimated networks show strong similarities with the transition generated with an all atom model. This is very encouraging for the development of methods that predict structural transition paths for bio-molecular systems. Indeed, such methods are computationally intensive and often require simplifications of the molecular systems. For medium size systems, simple simplifications based on atoms names (i.e. selecting for example a unique representative per residue) have been found to work well. 69,77 For very large systems, it is unclear how these coarse representations can be further simplified. The renormalization procedure described here provides a framework for

38

ACS Paragon Plus Environment

Page 38 of 49

Page 39 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

generating hierarchies of such coarse representations that are consistent with each other. Finally, we anticipate that the procedure implemented in Decimate may prove useful for predicting allosteric networks. As the weights of the edges between the vertices are modified during the decimation process, we expect that preferred paths between two sites of the molecules will be revealed. Those paths are expected to map to allosteric paths. There has been considerable interest in the field of structural molecular biology to identify such allosteric pathways, with several new methods being published recently (for a recent review, see Feher et al. 78 ). We will test the hypothesis that Decimate reveals allosteric pathways in future extensions of this work.

7

Acknowledgements

Patrice Koehl acknowledges support from the Ministry of Education of Singapore through Grant Number: MOE2012-T3-1-008. Marc Delarue and Fr´ed´eric Poitevin acknowledge support from a Bioinfo:BipBip grant from the Agence Nationale de la Recherche (France). Fr´ed´eric Poitevin also thanks the National Institute of Health (NIH), Grant Number 1R01-GM115749-01 (PI: Michael Levitt). We thank YinChen Hsieh for carefully editing the manuscript.

References (1) Shaw, D.; Deneroff, M.; Dror, R.; Kuskin, J.; Larson, R.; Salmon, J.; Young, C.; Batson, B.; Bowers, K.; Chao, J.; Eastwood, M.; Gagliardo, J.; Grossman, J.; Ho, C.; Ierardi, D.; Kolossv´ary, I.; Klepeis, J.; Layman, T.; McLeavey, C.; Moraes, M.; Mueller, R.; Priest, E.; Shan, Y.; Spengler, J.; Theobald, M.; Towles, B.; Wang, S. Anton, a Specialpurpose Machine for Molecular Dynamics Simulation. Commun. ACM 2008, 51, 91–97.

39

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(2) Stone, J.; Hardy, D.; Ufimtsev, I.; Schulten, K. GPU-accelerated molecular modeling coming of age. J. Molec. Graph. Modelling 2010, 29, 116–125. (3) Wang, Y.; Harrison, C.; Schulten, K.; McCammon, J. Implementation of accelerated molecular dynamics in NAMD. Comput. Sci. Discov. 2011, 4, 015002. (4) Pierce, L.; Salomon-Ferrer, R.; C.A.F. de Oliveira, J. M.; Walker, R. Routine access to millisecond time scale events with accelerated molecular dynamics. J. Chem. Theori. Comput. 2012, 8, 2997–3002. (5) Sweet, J.; Nowling, R.; Cickovski, T.; Sweet, C.; Pande, V.; Izaguirre, J. Long Timestep Molecular Dynamics on the Graphical Processing Unit. J. Chem. Theori. Comput. 2013, 13, 3267–3281. (6) Shaw, D. E.; Grossman, J. P.; Bank, J. A.; Batson, B.; Butts, J. A.; Chao, J. C.; Deneroff, M. M.; Dror, R. O.; Even, A.; Fenton, C. H.; Forte, A.; Gagliardo, J.; Gill, G.; Greskamp, B.; Ho, C. R.; Ierardi, D. J.; Iserovich, L.; Kuskin, J. S.; Larson, R. H.; Layman, T.; Lee, L. S.; Lerer, A. K.; Li, C.; Killebrew, D.; Mackenzie, K. M.; Mok, S. Y. H.; Moraes, M. A.; Mueller, R.; Nociolo, L. J.; Peticolas, J. L.; Quan, T.; Ramot, D.; Salmon, J. K.; Scarpazza, D. P.; Schafer, U. B.; Siddique, N.; Snyder, C. W.; Spengler, J.; Tang, P. T. P.; Theobald, M.; Toma, H.; Towles, B.; Vitale, B.; Wang, S. C.; Young, C. Anton 2: Raising the Bar for Performance and Programmability in a SpecialPurpose Molecular Dynamics Supercomputer. SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. 2014; pp 41–53. (7) Eastman, P.; Pande, V. OpenMM: A Hardware Independent Framework for Molecular Simulations. Comput. Sci. Eng. 2015, 12, 34–39. (8) Zhao, G.; Perilla, J.; Yufenyuy, E.; Meng, X.; Chen, B.; Ning, J.; Ahn, J.; Gronenborn, A.; Schulten, K.; Aiken, C.; Zhang, P. Mature HIV-1 capsid structure by cryoelectron microscopy and all-atom molecular dynamics. Nature 2013, 497, 643–646. 40

ACS Paragon Plus Environment

Page 40 of 49

Page 41 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(9) Sener, M.; Strumpfer, J.; Singharoy, A.; Hunter, C.; Schulten, K. Overall energy conversion efficiency of a photosynthetic vesicle. Elife 2016, 5 . (10) Saunders, M.; Voth, G. Coarse-graining Methods for Computational Biology. Annu. Rev. Biophysics 2013, 42, 73–93. (11) Kmiecik, S.; Gront, D.; Kolinski, M.; Wieteska, L.; Dawid, A.; Kolinski, A. Coarsegrained protein models and their applications. Chem. Rev. 2016, 116, 7898–7936. (12) Levitt, M.; Warshel, A. Computer simulation of protein folding. Nature 1976, 253, 694–698. (13) Zhang, Z.; Lu, L.; Noid, W.; Krishna, V.; Pfaendtner, J.; Voth, G. A systematic methodology for defining coarse-grained sites in large biomolecules. Biophys. J. 2008, 95, 5073–5083. (14) Zhang, Z.; Sanbonmatsu, K.; Voth, G. Key intermolecular interactions in the E. coli 70S ribosome revealed by coarse-grained analysis. J. Am. Chem. Soc. 2011, 133, 16828– 16838. (15) Li, M.; Zhang, J.; Xia, F. A new algorithm for construction of coarse-grained sites of large biomolecules. J. Comp. Chem. 2016, 37, 795–804. (16) Riniker, S.; Allison, J.; vanGunsteren, W. On developing coarse-grained models for biomolecular simulation: a review. Phys. Chem. Chem. Phys. 2012, 14, 12423–12430. (17) G¯o, N.; Abe, H. Noninteracting local-structure model of folding and unfolding transition in globular proteins. I. Formulation. Biopolymers 1981, 20, 991–1011. (18) Abe, H.; G¯o, N. Noninteracting local-structure model of folding and unfolding transition in globular proteins. II. Applications to two-dimensional lattice proteins. Biopolymers 1981, 20, 1013–1031.

41

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(19) Tirion, M. Large amplitude elastic motions in proteins from a single parameter, atomic analysis. Phys. Rev. Lett. 1996, 77, 1905–1908. (20) Lopez-Blanco, J.; Chacon, P. New generation of elastic network models. Curr. Opin. Struct. Biol. 2016, 37, 46–53. (21) Dill, K.; Ozkan, S.; Shell, M.; Weikl, T. The protein folding problem. Annu. Rev. Biophys. 2008, 37, 289–316. (22) Sanejouand, Y. Elastic network models: theoretical and empirical foundations. Methods Mol. Biol. 2013, 914, 601–616. (23) Sinitskiy, A.; Voth, G. Coarse-graining of proteins based on elastic network models. Chem. Phys. 2013, 422 . (24) Maragakis, P.; Karplus, M. Large amplitude conformational change in proteins explored with a plastic network model: adenylate kinase. J. Mol. Biol. 2005, 352, 807–822. (25) Zheng, W.; Brooks, B.; Hummer, G. Protein conformational transitions explored by mixed elastic network models. Proteins: Struct. Func. Bioinfo. 2007, 69, 43–57. (26) Chu, J.; Voth, G. Coarse-grained free energy functions for studying protein conformational changes: A double-well network model. Biophys. J. 2007, 93, 3860–3871. (27) Yang, S.; Roux, B. Src kinase conformational activation: Thermodynamics, pathways, and mechanisms. PLoS Comput. Biol. 2008, 4, e1000047. (28) Franklin, J.; Koehl, P.; Doniach, S.; Delarue, M. MinActionPath: maximum likelihood trajectory for large-scale structural transitions in a coarse grained locally harmonic energy landscape. Nucl. Acids. Res. 2007, 35, W477–W482. (29) Das, A.; Gur, M.; Cheng, M. H.; Jo, S.; Bahar, I.; Roux, B. Exploring the conformational transitions of biomolecular systems using a simple two-state anisotropic network model. PLoS Comput. Biol. 2014, 10, e1003521. 42

ACS Paragon Plus Environment

Page 42 of 49

Page 43 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(30) Zhang, Z. Systematic methods for defining coarse-grained maps in large biomolecules. Adv. Exp. Med. Biol. 2015, 827, 33–48. (31) Na, H.; Jernigan, R.; Song, G. Bridging between NMA and Elastic Network Models: Preserving All-Atom Accuracy in Coarse-Grained Models. PLoS Comput. Biol. 2015, 11, e1004542. (32) Wilson, K. The renormalization group: critical phenomena and the Kondo problem. Rev. Mod. Phys. 1975, 47, 773–840. (33) Aoki, H. Real-space renormalisation-group theory for Anderson localisation : Decimation method for electron systems. J. Phys. C 1980, 13, 3369–3386. (34) Amir, A.; Oreg, Y.; Imry, Y. Localization, anomalous diffusion, and slow relaxations: A random distance matrix approach. Phys. Rev. Lett. 2010, 105, 070601. (35) Monthus, C.; Garel, T. Random elastic networks: a strong disorder renormalization approach. J. Physics A: Math. Theor. 2011, 44, 085001. (36) Hagan, M. Modeling virla capsid assembly. Adv. Chem. Phys. 2014, 155, 1–68. (37) Ode, H.; Nakashima, M.; Kitamura, S.; Sugiura, W.; Sato, H. Molecular dynamics simulation in virus research. Front. Microbiol. 2012, 3, 258. (38) Huber, R.; Marzinek, J.; Holdbrook, D.; Bond, P. Multiscale molecular dynamics simulation approaches to the structure and dynamics of viruses. Prog. Biophys. Molec. Biol. 2016, in press. (39) Perilla, J.; Hadden, J.; Goh, B.; Mayne, C.; Schulten, K. All-Atom Molecular Dynamics of Virus Capsids as Drug Targets. J. Phys. Chem. Lett. 2016, 7, 1836–1844. (40) Arkhipov, A.; Freddolino, P.; Schulten, K. Stability and dynamics of virus capsids described by coarse-grained modeling. Structure 2006, 14, 1767–1777. 43

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(41) Polles, G.; Indelicato, G.; Potestio, R.; Cermelli, P.; Twarock, R.; Micheletti, C. Mechanical and assembly units of viral capsids identified via quasi-rigid domain decomposition. PLoS Comput. Biol. 2013, 9, e1003331. (42) Frank, J.; Gonzalez Jr, R. Structure and dynamics of a processive Brownian motor: the translating ribosome. Annu. Rev. Biochem. 2010, 79, 381–412. (43) Moore, P. How should we think about the ribosome? Annu. Rev. Biophys. 2012, 41, 1–19. (44) Frank, J. Whither Ribosome Structure and Dynamics Research? (A Perspective). J. Mol. Biol. 2016, 428, 3565–3569. (45) Sanbonmatsu, K. Computational studies of molecular machines: the ribosome. Curr. Opin. Struct. Biol. 2012, 22, 168–174. (46) Aqvist, J.; Lind, C.; Sund, J.; Wallin, G. Bridging the gap between ribosome structure and biochemistry by mechanistic computations. Curr. Opin. Struct. Biol. 2012, 22, 815–823. (47) Perilla, J.; Goh, B.; Cassidy, C.; Liu, B.; Bernardi, R.; Rudack, T.; Yu, H.; Wu, Z.; Schulten, K. Molecular dynamics simulations of large macromolecular complexes. Curr. Opin. Struct. Biol. 2015, 31, 64–74. (48) Whitford, P.; Onuchic, J.; Sanbonmatsu, K. Connecting energy landscapes with experimental rates for aminoacyl-trna accommodation in the ribosome. J. Am. Chem. Soc. 2010, 132, 13170–13171. (49) Whitford, P.; Blanchard, S.; Cate, J.; Sanbonmatsu, K. Connecting the Kinetics and Energy Landscape of tRNA Translocation on the Ribosome. PLoS Comput. Biol. 2013, 9, e1003003.

44

ACS Paragon Plus Environment

Page 44 of 49

Page 45 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(50) Jenner, L.; Demeshkina, N.; Yusupova, G.; Yusupov, M. Structural rearrangements of the ribosome at the tRNA proofreading step. Nature Struct. Mol. Biol. 2010, 17, 1072–1078. (51) Xei, F.; Tong, D.; Lifeng, W.; Dayong, H.; Steven, C.; Koehl, P.; Lu, L. Identifying essential pairwise interactions in elastic network model using the alpha shape theory. J. Comp. Chem. 2014, 35, 1111–1121. (52) Ming, D.; Wall, M. Allostery in a Coarse-Grained Model of Protein Dynamics. Phys. Rev. Lett. 2005, 95, 198103. (53) Kondrashov, D.; Cui, Q.; Jr., G. P. Optimization and evaluation of a coarse-grained model of protein motion using X-ray crystal data. Biophys. J. 2006, 91, 2760–2767. (54) Xia, F.; Tong, D.; Lu, L. Robust Heterogeneous Anisotropic Elastic Network Model Precisely Reproduces the Experimental B-factors of Biomolecules. J. Chem. Theory Comput. 2013, 13, 3704–3714. (55) Xia, F.; Tong, D.; Yang, L.; Wang, D.; Doi, S.; Koehl, P.; Lu, L. Identifying essential pairwise interactions in elastic network model using the alpha shape theory. J. Comp. Chem. 2014, 35, 1111–1121. (56) Fuglebakk, E.; Reuter, N.; Hinsen, K. Evaluation of Protein Elastic Network Models Based on an Analysis of Collective Motions. J. Chem. Theory Comput. 2013, 9, 5618– 28. (57) Ichiye, T.; Karplus, M. Collective motions in proteins: a covariance analysis of atomic fluctuations in molecular dynamics and normal mode simulations. Proteins: Struct. Func. Genet. 1991, 11, 205–217. (58) Wang, Y.; Rader, A.; Bahar, I.; Jernigan, R. Global ribosome motions revealed with elastic network model. J. Struct. Biol. 147, 302–314. 45

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(59) Berman, H.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.; Weissig, H.; Shindyalov, I.; Bourne, P. The Protein Data Bank. Nucl. Acids. Res. 2000, 28, 235–242. (60) Vojtechovsky, J.; Chu, K.; Berendzen, J.; Sweet, R.; Schlichting, I. Crystal structures of myoglobin-ligand complexes at near-atomic resolution. Biophys. J. 1999, 77, 2153– 2174. (61) Melnikov, S.; Soll, D.; Steitz, T.; Polikanov, Y. Insights into RNA binding by the anticancer drug cisplatin from the crystal structure of cisplatin-modified ribosome. Nucl. Acids. Res. 2016, 44, 4978–4987. (62) Trabuco, L.; Schreiner, E.; Eargle, J.; Cornish, P.; Ha, T.; Luthey-Schulten, Z.; Schulten, K. The role of L1 stalk-tRNA interaction in the ribosome elongation cycle. J. Mol. Biol. 2010, 402, 741–760. (63) Zimmermann, M.; Jia, K.; Jernigan, R. Ribosome mechanics informs about mechanism. J. Mol. Biol. 428, 802–810. (64) E, W.; Vanden-Eijnden, E. Transition-path theory and path-finding algorithms for the study of rare events. Annu. Rev. Phys. Chem. 2010, 61, 391–420. (65) Vanden-Eijnden, E. Transition path theory. Adv. Exp. Med. Biol. 2014, 797, 91–100. (66) Eastman, P.; Gronbech-Jensen, N.; Doniach, S. Simulation of protein folding by reaction path annealing. J. Chem. Phys. 2001, 114, 3823. (67) Ghosh, A.; R, R. E.; Scheraga, H. An atomically detailed study of the folding pathways of protein A with the stochastic difference equation. Proc. Natl. Acad. Sci. (USA) 2002, 99, 10394–10398. (68) Faccioli, P.; Sega, M.; Pederiva, F.; Orland, H. Dominant pathways in protein folding. Phys. Rev. Lett. 2006, 97, 108101.

46

ACS Paragon Plus Environment

Page 46 of 49

Page 47 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

(69) Koehl, P. Minimum action transition paths connecting minima on an energy surface. J. Chem. Phys. 2016, 145, 184111. (70) Laowanapiban, P.; Kapustina, M.; Vonrhein, C.; Delarue, M.; Koehl, P.; Carter, C. W. Independent saturation of three TrpRS subsites generates a partially assembled state similar to those observed in molecular simulations. Proc. Natl. Acad. Sci. (USA) 2009, 106, 1790–1795. (71) Weinreb, V.; Li, L.; Chandrasekaran, S. N.; Koehl, P.; Delarue, M.; Carter, C. W. Enhanced amino acid selection in fully evolved tryptophanyl-tRNA synthetase, relative to its urzyme, requires domain motion sensed by the D1 switch, a remote dynamic packing motif. J. Biol. Chem. 2014, 289, 4367–4376. (72) Chandrasekaran, S.; Dhas, J.; Dokholyan, N.; Carter Jr, C. A modified PATH algorithm rapidly generates transition states comparable to those found by other well established algorithms. Struct. Dyn. 2016, 3, 012101. (73) Carrillo-Tripp, M.; Shephered, C.; Borelli, I.; Venkataram, S.; Lander, G.; Natarajan, P.; Johnson, J.; III, C. B.; Reddy, V. VIPERdb2: an enhanced and web API enabled relational database for structural virology. Nucl. Acids. Res. 2009, 37, D436– D442. (74) Ovchinnikov, V.; Karplus, M. Investigations of helix-sheet transition pathways in a miniprotein using the finite-temperature string method. J. Chem. Phys. 2014, 140, 175103. (75) Levitt, M. Birth and future of multiscale modeling for macromolecular systems (Nobel Lecture). Angew. Chem. Int. Ed. Engl. 2014, 53, 10006–10018. (76) Karalus, S.; Krug, J. Symmetry-based coarse-graining of evolved dynamical networks. Eur. Phys. Lett. 2015, 111, 38003.

47

ACS Paragon Plus Environment

Journal of Chemical Theory and Computation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(77) Weiss, D. R.; Levitt, M. Can morphing methods predict intermediate structures? J. Mol. Biol. 2009, 385, 665–674. (78) Feher, V.; Durrant, J.; Van Wart, A.; Amaro, R. Computational approaches to mapping allosteric pathways. Curr. Opin. Struct. Biol. 2014, 25, 98–103.

48

ACS Paragon Plus Environment

Page 48 of 49

Page 49 of 49

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Theory and Computation

Decimate

146397 atoms

2928 atoms

Figure 11: TOC graphic.

49

ACS Paragon Plus Environment