An Efficient Low Storage and Memory Treatment of Gridded

However, such grids may easily become large, and the storage and memory ... suited for molecular docking simulations,(2-11) free-energy calculations,(...
0 downloads 0 Views 6MB Size
Article pubs.acs.org/JCTC

An Efficient Low Storage and Memory Treatment of Gridded Interaction Fields for Simulations of Macromolecular Association Musa Ozboyaci,†,‡ Michael Martinez,† and Rebecca C. Wade*,†,¶,§ †

Heidelberg Institute for Theoretical Studies (HITS), Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences, Heidelberg University, INF 205, 69120 Heidelberg, Germany ¶ Zentrum für Molekulare Biologie der Universität Heidelberg, DKFZ-ZMBH Allianz, INF 282, 69120 Heidelberg, Germany § Interdisciplinary Center for Scientific Computing, Heidelberg University, INF 205, 69120 Heidelberg, Germany ‡

S Supporting Information *

ABSTRACT: Computer simulations of molecular systems often make use of regular rectangular grids with equidistant spacing to store information on their molecular interaction fields, e.g., electrostatic potential. These grids provide an easy way to store the data as they do not require any particular specification of the structure of the data. However, such grids may easily become large, and the storage and memory demands may become so high that calculations become infeasible. To overcome this problem, we show here how the data structure DT-Grid can be adapted and applied to efficiently represent macromolecular interaction grids by exploiting the nonuniformity of information on the grid; at the same time, this data structure ensures fast random data access. We demonstrate use of the DT-Grid data structure for potential of mean force and Brownian dynamics simulations of protein-surface binding and macromolecular association with the SDA software. We further demonstrate that the DT-Grid structure enables systems of large size, such as a viral capsid, and high resolution grids to be handled that are beyond current computational feasibility.

1. INTRODUCTION Atomically detailed rigid-body models of macromolecules, such as proteins, and nanoparticles have long been employed in molecular association studies.1 Owing to the greatly reduced number of degrees of freedom in comparison with fully flexible modeling, rigid-body modeling makes simulations of longer time- and length-scales feasible. A number of simulation algorithms and protocols for rigid-body simulation of molecules have been developed. Examples include techniques suited for molecular docking simulations,2−11 free-energy calculations,12 as well as Brownian13,14 and Langevin15 dynamics simulations. A major advantage of rigid-body modeling is that the interaction field of a solute experienced by the other solutes in a system can be computed once and saved on a threedimensional grid centered on that solute before running a simulation. This procedure speeds up calculations that would otherwise be demanding for flexible models, which rely on summing up particle−particle interactions rather than computing a single particle−grid interaction for determining the forces and energies for each particle.16 In addition to molecular interaction fields (MIFs) describing interactions such as the long-range electrostatic or short-range van der Waals (VDW) interactions, structural information (e.g., excluded volume, predicted binding cavities on a protein surface) about solutes can be precomputed and stored on grids. Differences in the underlying interaction functionals (e.g., the Poisson−Boltz© XXXX American Chemical Society

mann equation for solvated electrostatics or Lennard-Jones (LJ) term for VDW interactions) lead to different requirements for the grids. For instance, the grid of a long-range electrostatic potential of a macromolecule requires a large volume around the molecule, which is typically much larger than the volume occupied by the molecule itself. On the other hand, grids for short-range interactions, such as for the LJ interaction, require higher resolution, especially close to the molecular surface. Owing to their simplicity and ease of use, regular (Cartesian) grids are commonly preferred in grid-based molecular simulations. Regular grids can be likened to N-dimensional matrices in that each entry (vertex) of a grid is addressed by its index (i1, i2, ..., iN) with coordinates ((i1 − 1)*d1, (i2 − 1)*d2, ..., (iN − 1)*dN), where dN stands for the grid spacing in the Nth dimension. A three-dimensional regular interaction grid therefore has the shape of a rectangular prism. Macromolecules and their short-range interaction fields, on the other hand, usually have irregular shapes. Storing these interaction fields on regular grids therefore leads to sparse rectangular grids, particularly for the LJ MIF. Similarly, storing excluded volume data on a large grid results in two uniform regions on the grid with only the points at the boundary of the molecular surface providing important information. Sparse regular grids waste Received: April 7, 2016

A

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 1. Two-dimensional regular grid and the structure of its corresponding DT-Grid. A DT-Grid of an N-dimensional regular grid consists of N recursive DT-Grid structures. (a) The 2D grid in the (x, y) plane. The tubular region is shown in dark and light green colors with each square representing a grid point. The dark-colored squares represent the minimum and maximum coordinates of each of the connected components in the y direction, whereas the light-colored squares represent the coordinates inside the connected components. Some arbitrary values (φ) are stored on these grid points, and they are shown inside the squares. For simplicity, the φ are indexed in a lexicographic order from 1 to 20. (b) Structure of the corresponding 2D DT-Grid. The 2D DT-Grid consists of acc, value, and icoord arrays and the proj1D DT-Grid. proj1D holds pairs of position indices of points inside the value and icoord arrays that store the first value and the minimum coordinate (on the y axis) for every p-column. The acc array additionally stores pointers in the value array for quick access to each connected component. (c) 1D grid on the x axis. This grid corresponds to the 1st dimension of the DT-Grid and is the same as the proj1D. (d) Structure of the 1D DT-Grid. The grid consists of acc, value, and icoord arrays. The icoord array of the 1D DT-Grid stores the minimum and maximum coordinates on the x axis.

memory and storage resources. This limits the size of systems that can be simulated. Furthermore, when sparse grids are used, the number of interactions modeled and/or the resolution at which these interactions is modeled is limited by the computer memory on the available hardware.17 In numerical analysis, matrices are typically stored in regular arrays. To reduce the memory requirements for sparse matrices, which unlike dense matrices mostly contain zero entries, various other formats, e.g., compressed sparse row (CSR),18 have been employed to store their entries. Although these formats allow a substantial reduction in the storage requirements, they are designed for efficient matrix operations and are often not efficient for accessing matrix elements randomly.19 On the other hand, fast random access operations are vital for grid-based molecular simulations. In a step of a grid-based simulation, of all the values stored on an interaction grid of a molecule, only those lying on the points that directly surround the interacting particles (e.g., atoms) of another molecule within a defined distance cutoff are required to compute the interaction energy/force. The distance cutoff depends on the interaction type, being short for short-range LJ and long for long-range electrostatics. The number of grid points accessed in a simulation is, therefore, at most comparable to the number of interacting atoms/particles of the other molecules, which is typically on the order of thousands for proteins. This number often constitutes a very small portion of the total number of grid points, which is already on the order of millions for a regular cubic grid of over 1003 points. As a result, traversing all of the grid points in each time step of a simulation is computationally inefficient. The need for a fast random access operation therefore makes direct application of standard sparse matrix storage formats to molecular interaction grids inappropriate. An alternative to using regular grids in molecular simulations is to use adaptive Cartesian grids. In an adaptive grid, the

spacing varies in different regions of the problem space and is determined locally based on the resolution requirements. Adaptive grids are commonly employed for solutions of partial differential equations (PDEs) and applied to various different problems such as the electrostatic potentials of large biological molecules20 and the approximation of polar solvation energies21 as well as astrophysical fluid dynamics.22 Adaptive grids are often based on hierarchical octree-based data structures. These structures, however, have either large memory footprints or complex structures that lead to slow access times.23 Moreover, they may require complete traversal of grid data structures to perform a single operation.24 Furthermore, using adaptive grids instead of regular grids in finite difference approximations introduces another level of complexity: hanging nodes. Hanging nodes appear when one of the neighbor grid cells that share a face is divided. The new nodes on this face from the divided cell are called hanging nodes. Interpolating to a point from the values stored at grid entries surrounding that point necessitates modification of the interpolation stencil to increase the local accuracy.25 In summary, for a data structure for use in grid-based molecular simulations, the following requirements should be met: (i) fast random access, (ii) low memory footprint, and (iii) ease of adaptability to grid-based molecular simulation software. We found that none of the data structures mentioned above were able to satisfy these requirements efficiently for data sets with varying sparsity measures from very sparse (e.g., excluded volume data) to semisparse (short-range LJ interactions). However, a level set data structure originally developed by Nielsen and Museth23 for shape simulations matches our requirements optimally. The data structure, named dynamic tubular grid (DT-Grid), allows uniform sampling in a tubular region around a dynamic shape interface. The simplicity of the concept of DT-Grid makes it possible to define a closed region with an arbitrary shape using Cartesian coordinates within B

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

An N-dimensional DT-Grid consists of N-recursive DT-Grid structures. Each of these DT-Grid structures consists of 3 distinct arrays that store the values to compress a regular grid and to access its data. Further, each of these structures have the DT-Grid structure of one dimension lower nested inside, except for the 1D DT-Grid structure. The structure of a simple DT-Grid is given as

which data can be accessed. It was also shown that DT-Grid outperforms many different data structures in terms of both random access operation times and memory and storage requirements.23 Owing to its advantages, DT-Grid has been employed in level set simulations26,27 as well as efficient volumetric rendering of sparse data.28−30 In the present study, the DT-Grid data structure was applied to store LJ MIFs and excluded volume data for grid-based, rigid-body simulations of macromolecules. The results show that using DT-Grid reduces the memory requirements for MIF grids, as well as structural information, and that it is applicable to the simulation of the interactions of large molecular systems.

2. THEORETICAL METHODS 2.1. Dynamic Tubular Grid (DT-Grid). Before describing the structure of a DT-Grid, let us first look at the structure of a regular grid. A regular grid is usually stored in an array (with a defined data type, e.g., integer, Boolean, floating point number) with its constituent elements stored serially in computer memory. This way, the need for a complex mapping from a set of given indices to the values stored on the requested grid coordinates is unnecessary. An example of a two-dimensional 8 × 8 uniform regular grid is shown in Figure 1a. This grid demonstrates the case in which only the grid points that represent a certain geometric shape are of interest, and therefore, only the values stored in those points are required for later use. Note that, for simplicity, the values stored are represented by φi and can take any data types, e.g., integer or real. The DT-Grid keeps the dimension of the original grid and its spacing value, and it allows its elements to be accessed with the same indices as used for accessing the elements in the regular grid. For the DT-Grid to have the basic attributes that regular grids have, the grid structure should be efficiently compressed and stored and should be decompressed quickly when access to a data point is requested. In the following subsections, the details of the underlying data structure for the index compression and the basic data access algorithms required for efficient use of DT-Grid for macromolecular association simulations are described. A more detailed description of DT-Grid data structure and other algorithms that are not described here can be found in the original paper by Nielsen and Museth.23 2.1.1. Data Structure Components. An N-dimensional DTGrid consists of a number of projection columns (p-columns). A p-column XN−1 (x1, x2, ..., xN−1) in the Nth dimension is the set of grid points that projects to (XN−1, 1) (note that the indexing starts from 1) by an orthogonal projection onto the previous N − 1 dimensions. A p-column is always 1dimensional and consists of one or more connected components. A connected component is a set of grid points that are positioned consecutively within a p-column. To explain the structure and constituents of DT-Grid, let us consider the 2-dimensional grid in Figure 1a as an example. For the sake of simplicity, some arbitrary φ values with indices from 1 to 20 are stored in this grid in lexicographic order of the coordinates in (x, y). In this 2D grid, there are a total of 6 pcolumns, numbered from 1 to 6, and 8 connected components. The p-column number 4, for example, consists of two connected components, (5, 2:3) and (5, 6:7), each of which include grid points indexed (5, 2) and (5, 3) and those indexed (5, 6) and (5, 7), respectively. Each of these grid points stores a corresponding value from φ11 to φ14.

The constituting arrays, value, icoord, and acc as well as the constituent proj(N − 1)D are explained in detail below. value. The value array in the Nth level structure stores the data located at the coordinate points of a desired stencil or shape on a regular grid. The data type of the array depends on the values it stores and, therefore, can be character, integer, or real. For instance, for a molecular potential or energy grid, the data type of the value array is usually a floating point number. However, the data type can also be a user defined class, e.g., to store the interaction potentials for different atom types or length scales. The elements of the value array are stored in lexicographic storage order, and therefore, the values on connected components are located consecutively in computer memory. An example value array of a 2D DT-Grid is shown in Figure 1b with some arbitrary values stored in it. The value arrays in the (1, ..., N − 1)th level structures of the Ndimensional DT-Grid, on the other hand, store pairs of indices (IndexPair type) into icoord and value arrays of the (2, ..., N)th level structures and constitute the main components of the proj(N − 1)D arrays of the (2,..., N)th level structures (see below for the definition of proj(N − 1)D). icoord. The icoord array stores the indices of the minimum and maximum coordinates of the connected components in pcolumns in the Nth dimension. This array is used to look up whether a given coordinate lies inside a connected component upon an array indexing request. For a two-dimensional grid, two separate icoord arrays (see Figure 1b,d) exist (each of which is a member of one of the two levels of the two-dimensional DT-Grid), and they store coordinates for the y and x axes, respectively. acc. The acc array (see Figure 1b,d) stores a set of pointers to the corresponding entry in the value array for the first element of each of the connected components in the tubular grid. The acc array is used for fast random access operations. proj(N − 1)D. The proj(N − 1)D is defined recursively as a DT-Grid in one dimension lower and holds a set of pairs of indices into the value and icoord arrays for the first grid point in each of the p-columns. The proj(N − 1)D serves as a connector between the dimensions of an N-dimensional DT-Grid. Because of its connector role, no proj0D array is required for a complete DT-Grid structure. A proj1D is shown in Figure 1b that is defined as a one-dimensional DT-Grid (Figure 1c). Finally, the storage and memory requirements of DT-Grid are determined by the number of tubular grid points. If the number of grid points in an N-dimensional tubular grid of a DT-Grid structure is MN, then the memory and the storage C

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

DT-Grid, for instance, a set of vertex coordinates (x, y, and z) are input. The algorithm then searches for the values stored in 8 coordinates: (x, y, z), (x, y, z + 1), (x, y + 1, z), (x, y + 1, z + 1), (x + 1, y, z), (x + 1, y + 1, z), (x + 1, y, z + 1), and (x + 1, y + 1, z + 1) and returns their values in an array. The time complexity varies significantly for different algorithms of DT-Grid. Sequential access to grid points and to stencil grid points requires constant time operations, whereas random access to grid points and cells and neighbor access algorithms require logarithmic times. The logarithmic time complexity arises from the binary search operation performed within a p-column that consists of connected components. Noteworthy is that a constant time random access operation is not possible in theory due to the binary search operation. On the other hand, the logarithmic behavior will not be significant in practice when there are only a small number of connected components within a p-column. 2.2. General Workflow. We implemented the DT-Grid data structure and its algorithms, as well as the necessary tools for file conversion and generating topographies of the excluded volume of macromolecules and their interaction fields, and we modified the Simulation of Diffusional Association (SDA) software (version 7)31 to enable input and processing of DTGrid files. The code was written in the C++ language using the template feature of C++ for portability of the code for different (generic) data types. The general workflow of a simulation using DT-Grid is as follows: (i) a specified MIF is computed and written to a grid file in University of Houston Brownian Dynamics32,33 (UHBD) format, (ii) the UHBD format file is converted into DT-Grid format and written to a formatted file in disk storage, and (iii) the DT-Grid file in disk storage is input to SDA and the DTGrid access functions are called from SDA. Although the MIF grids should first be computed and then converted to DT-Grid format, excluded volume information can be computed using DT-Grid tools and written directly as a DT-Grid file to hard disk. SDA and several other computational software suites, e.g., APBS,34 read input and/or write output data grid files in UHBD format. Furthermore, molecular visualization programs such as VMD,35 PyMOL (PyMOL Molecular Graphics System; Schrödinger, LLC), and Chimera36 can read grid files in UHBD format. Therefore, two separate programs were implemented to convert a regular grid stored in a UHBD format file into a DTGrid stored in a file format similar to that of UHBD and vice versa (UHBD2dtgrid and dtgrid2UHBD, see Figure 2). The DT-Grid file format shares the same header structure as the UHBD file format except that one of the redundant flag values is reserved for the identity of the DT-Grid file. This way, the file type is recognized and distinguished from the UHBD format. Conversion from UHBD to DT-Grid file format requires not only encoding the data structure but also that the topography of the molecule be generated first. In the following section, the details of building a molecular excluded volume and a molecular “skin” region are explained. Here, the molecular skin is defined as the region around the excluded volume extending out to a defined distance in which the MIF of interest is effective, i.e., nonzero. This is the region of interest in which the grid points used in the simulations lie. 2.2.1. Building of the Excluded Volume and the Molecular Skin. The basic workflow for building the topography of an excluded volume or a molecular skin, and storing the generated grid in a DT-Grid file, is shown in Figure 3. An excluded

requirements of that grid will be on the order of O(MN) by induction (from O(MN−1 + MN)). 2.1.2. Algorithms. The original implementation of DT-Grid by Nielsen and Museth23 includes several algorithms necessary for efficient level set simulations such as sequential access to grid points, rebuilding and dilating the tubular grid, and random access and sequential access to stencil grid points. Because of its significance in this study, the random access algorithm for grid points will be discussed in detail along with a random access algorithm for grid cells that was developed in this study for our purposes based on the neighbor and random access algorithms described in the original work by Nielsen and Museth.23 Random Access to Grid Points. A fast random access method is vital for calculations that require frequent efficient random access operations. DT-Grid, compared to other index compression schemes, in particular to tree-based methods, provides a very efficient and fast random access algorithm. In this method, the index for a value requested is searched recursively in all dimensions of the DT-Grid. If any of the c1, ..., cN coordinates of an index requested happens to be outside the tubular grid, a predefined γ value is returned. If, on the other hand, the coordinates are all found to be inside the tubular grid, the value at the corresponding index is returned. A pseudocode of the algorithm is given in Algorithm S1 with comments shown in gray. The algorithm first checks whether the c(N − 1)th coordinate lies inside any connected components within a p-column. If the coordinate is not found, then a not in the list (NIL) signal is returned along with γ. If the coordinate is found inside the tubular region, an index value into the (N − 1)th level value array is returned. Noteworthy is that, at this index, the (N − 1)th level value array stores pairs of indices in the Nth level icoord and value arrays for the first grid point in each p-column. The minimum and the maximum coordinates of grid points within the p-column are then checked using a binary search. If cN is found inside a connected component, its corresponding value stored inside the Nth level value array is returned along with its index. The value at the index is accessed using the acc array that holds the pointers for the first grid point in all connected components. If cN is not found inside a connected component, a NIL signal is returned. Random Access to the Grid Cells. As crucial as a fast random access algorithm to the grid points is fast random access to the grid cells. In many rigid body molecular simulations, access to the vertices around an intermediate coordinate point is required for a value at that coordinate to be interpolated. Because the DT-Grid has a uniform spacing between its vertices, the grid cells represent cuboid volume spaces, and each cell has 8 vertices in which data are stored. Instead of making 8 random access operations to obtain the data stored in these vertices, a single operation can be made. We present a new algorithm for faster random access to grid cells based on the neighbor access algorithm described in the original DT-Grid paper by Nielsen and Museth.23 The operation for random access to a grid cell is based on a series of recursive operations on all dimensions of the DT-Grid. Algorithm S2 shows the pseudocode of our implementation of random access for grid cells. The algorithm is similar to that shown in Algorithm S1 except that, in every dimension of the DT-Grid structure, indices are returned not only for the coordinate cN but also for its neighbor cN + 1. The size of the returned array is multiplied by two in every dimension, summing up to 2N for an N dimensional DT-Grid. For a 3D D

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

geometric center of the molecule with a grid spacing chosen by the user. In the case of a molecular skin grid, however, the value of the grid spacing of the input grid is used, and therefore, the manual input value is ignored. The grid spacing defines the resolution of the excluded volume grid; smaller values give a better definition of the molecular surface. The excluded volume in this algorithm is defined by the region enclosed by the surface around the molecule accessible to the center of a spherical probe of radius σ. The grid points that lie inside this region are assigned to the excluded volume. Finally, the grid points in cavities that are buried completely inside the protein are identified and included in the excluded volume. This is done initially by identifying the exterior points on the grid boundaries and those connected to them through at least one neighbor that is an exterior point. By subtracting the exterior region that extends from the boundary of the grid to the surface of the molecule from the entire grid, the excluded volume that includes the buried cavities is obtained. This step helps to reduce the complexity of the topography of the excluded volume. In the case of an excluded volume grid, a DT-Grid structure is then generated that stores only the topography of the excluded volume. The structure of a DT-Grid that stores excluded surface information is shown in Figure 4. This structure is particularly well-suited for grids that store data from only a small set of values, and therefore, it reduces the memory and storage requirements significantly compared to a standard rectangular grid. See below for the evaluation of this structure for the excluded volume of a spherical shell. Converting a regular grid that stores a MIF to DT-Grid requires computation of the molecular skin region in addition to the computation of the excluded volume. The molecular skin defines the topography of the MIF. The data values inside the molecular skin are therefore stored in the final DT-Grid with the remaining data in the original regular grid omitted. For the skin around a molecule to be calculated, the same approach as employed in PIPSA37,38 is used. In this approach, the skin is defined as the volumetric region in a shell of thickness δ surrounding the VDW surface of the molecule enclosed by two surfaces that are accessible to the centers of spherical probes of radii (σ) and (σ + δ), respectively (see Figure 5). Values of σ and δ are input manually by the user. The δ value should be chosen to match the effective distance range of the interaction field; σ, on the other hand, should have the same value as used for the excluded volume of the molecule.

Figure 2. Components of the DT-Grid implementation for molecular simulations, which consists of three major parts: DT-Grid data structures, tools to generate the macromolecular volume and surface skin, and file converters.

3. SIMULATION METHODS SDA (version 7) was compiled using GNU 4.9.2 compilers. All calculations were performed on a compute-cluster node containing two Intel Xeon E5-2650 8-core 2.60 GHz processors except for the calculations for the evaluation of the data structure for a spherical test system, which were performed on a node containing four AMD Opteron 6174 12-core 2.4 GHz processors. 3.1. Force Calculations for Protein−Surface Systems. For the DT-Grid methodology using the SDA software to be evaluated,13,14,31 the forces between a number of proteins and a gold surface were calculated. A model of an uncharged Au(111) surface, as described by Kokh et al.40 and Iori et al.,41 was used in the calculations. The calculations were performed for four different proteins, namely, barstar, β-lactamase protein 1 (TEM1), paraoxonase 1 (PON1), and cytolytic toxin 2Ba (Cyt2Ba). The crystal structures of the proteins listed (PDB IDs: 1BRS,39 1JTG,42 1V04,43 and 2RCI44), which have

Figure 3. Workflow of the program to compute the molecular skin.

volume topography is built initially and then used to construct an excluded volume or a molecular skin DT-Grid structure. The topography of the excluded volume is initially saved on a threedimensional regular grid (of type Boolean) centered on the E

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 4. Two-dimensional regular grid that stores a molecular excluded surface and the structure of its corresponding DT-Grid. The regular grid shown is the same as that in Figure 1 except for the data types and values stored in them. (a) 2D grid in the (x, y) plane. The tubular region is shown in dark and light green colors. The dark-colored squares represent the minimum and maximum coordinates of each of the connected components in the y direction, whereas the light-colored ones represent the coordinates inside the connected components. True values are stored in these grid points, whereas False values are stored in the rest. (b) Structure of the corresponding 2D DT-Grid. The 2D DT-Grid consists of acc, value, and icoord arrays and proj1D DT-Grid. Unlike the second dimension value array shown in Figure 1, the corresponding value array used for excluded volume has a fixed length (the number of different types of information that need to be stored) regardless of the size of the original grid. (c) 1D grid on the x axis. (d) Structure of the 1D DT-Grid.

protein studied. The grid spacing was 0.1 Å for all proteins. The dimensions of these grids are listed in Table 1. Two sets of calculations were conducted for each of the protein−surface systems: one with regular Cartesian grids and a second with DT-Grids. A probe of radius σ 1.4 Å was used to compute the accessible molecular surface, and a molecular skin thickness δ of 10 Å was used. The molecular skins were then used to construct the DT-Grids for the LJ term. Finally, force calculations were performed for 105 randomly chosen orientations of each protein with the height of the protein center above the surface being 19.4, 27.7, 34.4, and 24.7 Å for barstar, TEM1, PON1, and Cyt2Ba, respectively. The height of each protein center was chosen according to the maximum distance of an atomic coordinate from the center of geometry of the structure of the protein. 3.2. Potential of Mean Force (PMF) Calculations. A second set of tests of the DT-Grid methodology were performed by computing the PMF for a tripeptide of three histidine residues (3H) binding to the same Au(111) surface. The starting structure of the 3H peptide was built using the MODELLER program.46 To this end, 1000 different sets of coordinates were generated and then minimized for a maximum of 1000 steps using the conjugate-gradient method implemented in MODELLER. Of these, the structure with the smallest angles between each of the three imidazole planes was chosen for further calculations. Protons were assigned to the histidine residues at pH 7.0 (0e charge) using the PDB2PQR software (version 1.9)47,48 with the Amber99 force field.49 The N- and C-termini of the peptide were capped with acetyl and N-methyl amide groups, respectively. The method40 implemented in the SDA software package31 was used for the potential of mean force (PMF) calculations. The structures were treated as rigid bodies. The interaction terms of each structure were computed and stored on grids prior to the simulations. The electrostatic potential of the peptide was calculated using the APBS software34 and stored on a grid of size 1503 with a regular spacing of 1.0 Å. The ionic

Figure 5. Illustration of a molecular skin. A cross-section through the structure of the protein barstar (PDB ID: 1BRS39) and its interaction field. The atomic structure of the protein is shown in stick representation with the carbon, nitrogen, and oxygen atoms shown in cyan, blue, and red, respectively. The dots around the molecule represent its VDW surface. Shown in white is the probe-based molecular accessible surface of the protein. The molecular skin is shown in pink. Values of the molecular interaction field in the skin region are stored in the DT-Grid structure.

resolutions of 2.0, 1.73, 2.2, and 1.8 Å, respectively, were used. The protonation states of the residues were computed at pH 7 using the H++45 web server. For the protein−metal interaction forces to be computed for evaluation purposes, only the LJ term with the parameters described in the ProMetCS model40 was used. The LJ term for the force calculations was computed for two different surface atom types and stored in two corresponding grids for each F

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Table 1. Assessment of the DT-Grid Structure for LJ Grids Used in Force Calculations for Different Protein−Gold Surface Systems disk storage (MB)

a

protein

grid size

regular

Barstar Cyt2Ba TEM1 PON1

550,520,570 680,570,650 670,750,660 750,900,830

622 962 1331 2150

memory (MB)

DT-Grida 276 450 539 711

regular

(44%) (47%) (40%) (33%)

1243 1922 2530 4274

execution time (s)

DT-Grida 546 890 1066 1406

(44%) (46%) (42%) (33%)

regular 34.2 41.4 46.6 57.7

DT-Grida 47.6 71.6 85.5 112.5

(39%) (73%) (83%) (95%)

The compression efficiency of DT-Grid for disk storage and memory, as well as the increase in execution times, are shown in parentheses.

ionizable side-chains were assigned at a pH of 7.0, and hydrogen atoms were added to the structures using PDB2PQR 1.8 program47,48 with Amber99 force-field49 parameters. The total charges of the nucleosome and gH5 structures were −237e and +11e, respectively. For the electrostatic potential calculations, the nonlinear Poisson−Boltzmann equation was solved on a Cartesian grid of size 257 × 257 × 257 centered on each of the structures with a spacing of 1.0 Å. The temperature and the ionic strength were assigned values of 298.15 K and 100 mM, respectively. The dielectric constants of the solute and solvent were assigned as 2.00 and 78.54, respectively, and the VDW surface of the solutes was used to define the dielectric boundary. After computing the electrostatic potentials, the effective charges on the nucleosome and on the gH5 structure were computed using the ECM method.54 The Brownian dynamics (BD) simulations were carried out using the SDA31 software. The solute molecules were treated as rigid bodies. The coordinates of the nucleosome were fixed at the origin with the gH5 structure allowed to move around the nucleosome. The starting position of the mobile gH5 was chosen randomly at a distance of 280 Å from the center of the nucleosome. A variable time-step was used to accelerate the sampling. The time-step was 20 ps when the distance between the centers of the solutes was larger than 260 Å, and it was decreased linearly to 0.5 ps as this distance approached 160 Å. The trajectory was stopped if the center-to-center distance exceeded 500 Å. Translational and rotational diffusion coefficients of the linker histone molecule were set to 0.0185 Å2/ps and 5.04 × 10−5 rad2/ ps, respectively. Three sets of BD simulations, each consisting of 50000 trajectories, were performed for the association of the nucleosome−gH5 complex. In each of these sets of simulations, a different excluded volume grid of resolution 0.1, 0.25, or 0.5 Å, stored in DT-Grid format, was used. The encounter complexes, which had a center-to-center distance of less than 73 Å and a nucleosome dyad-gH5 separation distance of less than 40 Å, were recorded during the simulations. When a new encounter complex that satisfied these geometric criteria was found to have an RMSD of less than 1.0 Å from one of the previously saved encounter complexes, their energy values were compared. If the interaction energy of the new complex was more favorable, it became the new representative of that cluster, otherwise, the old representative was kept. The cluster size was incremented by 1 in both cases. Among the encounter complexes obtained from the 50000 trajectories, the 5000 complexes with the lowest energy values were recorded during the simulations. These 5000 complexes were then clustered into 6 separate groups using the averagelinkage method described by Motiejunas et al.55 3.4. Binding Energy Calculations. To evaluate the DTGrid data structure for a very large molecular system, we computed the binding energy between a viral capsid and a gold

strength was set to 10 mM to match the experimental conditions used in the study by Cohavi et al.50 The temperature was set to 300 K. The dielectric constants of the solute and the solvent were assigned values of 4.00 and 78.00, respectively. Desolvation terms (nonpolar and electrostatic) and LJ interaction energies were calculated as described for the ProMetCS force field.40 Because there are two different types of gold atoms in the gold model used, the LJ energy terms were computed for each of the atom types and then stored in two separate grids, each of which had dimensions of 320 × 360 × 360 points with a regular spacing of 0.1 Å. All of the interaction MIFs were stored in three-dimensional Cartesian grids. The LJ interaction energy grids were then compressed using the DTGrid scheme and stored in DT-Grid format to be used later in the simulations. For the PMF calculations, the reaction coordinate was defined by the distance between the center of geometry of the peptide and the gold surface in the direction perpendicular to the surface plane (along the z axis). Calculations were performed at increments of 0.1 Å along the z axis. At each position along the reaction coordinate, the center of the peptide structure was positioned at different x and y coordinates to account for the energetic variations arising from the periodic atomic structure of the surface. An xy plane of size 6 Å × 6 Å was previously shown to be adequate for sampling the interaction energy variation over a Au(111) surface.40 Therefore, in this study, the x and y coordinates of the peptide center were sampled uniformly from 0 to 6 Å in increments of 1 Å for each of the x and y axes. The rotational degrees of freedom at each position of the peptide center were sampled over each of the Euler rotation angles (θ, ϕ, ψ) uniformly. The 120, 240, and 240 coordinates were used for θ ∈ (0, π], ϕ ∈ (0, 2π], and ψ ∈ (0, 2π] angles, respectively. For each position and orientation of the peptide, the interaction energy with the Au(111) surface was computed with the ProMetCS force field. The calculations were carried out on a single node using four of its processor cores. 3.3. Brownian Dynamics (BD) Simulations. An example system that consists of a nucleosome and the globular domain of a linker histone H5 (gH5) was chosen for test purposes due to the large size of the nucleosome and the presence of concave regions on its surface. Simulation parameters similar to those used in the study by Pachov et al.51 were used. The crystal structures of gH5 (chain B, PDB ID: 1HST;52 2.5 Å resolution) and of the nucleosome (PDB ID: 1KX5;53 1.9 Å resolution) were used as starting structures. The nucleosome structure was modified to remove the flexible histone tails from the structure and to add two 10 base-pair linker DNA segments to the entry−exit sites of the nucleosome. The electrostatic potentials of gH5 and the nucleosome structures were calculated using APBS 1.4.34 Prior to the electrostatic potential calculations, the protonation states of G

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 6. Evaluation of the DT-Grid structure using various different grid edge lengths for a spherical shell. All grids store data values of type floating-point number. Evaluation results for grids stored and saved in DT-Grid file type are shown by red bars and lines, and the results for Cartesian grids saved in a regular array structure and stored in UHBD file format are shown by purple bars and lines. (a) Memory and (b) hard disk storage requirements for grid edge lengths from 100 to 800. All files are saved in binary (unformatted) format. Time required to perform a random access operation averaged over 107 operations to (c) grid points and (d) grid cells.

containing grid’s edge length and with a fixed thickness of 50 grid points. These 8 regular grids were then converted to the DT-Grid format to store the grid point values that lie in the shell region. First, we evaluated the DT-Grid structure of the model systems. Because the arrays required to compress the shape of a grid are much smaller in size than the value array, the computer memory occupied by the DT-Grid structure was determined mainly by the size of the value array. For the spherical shells used in the comparison tests, the DT-Grid structure sizes varied from approximately 50% to as little as 17% of their corresponding regular arrays with the same grid lengths (see Figure 6a). A second aspect to evaluate is the size of the files required to store the data on a hard disk. Storing the data may help to decrease the amount of time required for the setup of rigidbody simulations under different simulation conditions. Figure 6b shows a comparison of the sizes of regular and DT-Grids stored with UHBD and DT-Grid file formats, respectively. Similarly to the data structure size, the size of a DT-Grid stored on a disk is mainly determined by the size of its value arrays. Therefore, the memory and storage efficiency of a DT-Grid depends on how sparse the shell/stencil points are with respect to a complete regular grid. As mentioned previously, two different random access algorithms for the DT-Grid structure were implemented in this study. Comparison of these algorithms with a simple array access showed that the required random access time increases more quickly with increasing edge length for regular grids than it does for DT-Grids. The required times for a random point access when using DT-Grids vary between 0.8 and 1.0 times and for a random cell access between 0.9 and 1.2 times those required using regular grids (Figure 6c,d). Next, we evaluated the DT-Grid structure tailored for excluded volume and similar types of information grids (see

surface. The crystal structure of cowpea mosaic virus (CPMV) capsid (PDB ID: 1NY7;56 3.0 Å resolution) was input (without water molecules) to the VIPERdb Web server57 to generate the complete capsid structure. The same Au(111) model as employed above was used as the surface structure. Standard protonation states at pH 7 were assigned to ionizable side chains, and the hydrogen atoms were added to the CPMV capsid structure using the PDB2PQR Web server (version 2.0.0)47 with PARSE58 force-field parameters. The total charge of the capsid was calculated to be −540e. The nonpolar desolvation and LJ interaction energies were calculated as described for the ProMetCS force field.40 The LJ energy terms were computed for each of the two gold atom types described in the gold model and then stored in two separate Cartesian grids, each of which had dimensions of 1600 × 1600 × 1600 points with a regular spacing of 0.2 Å. The LJ interaction energy grids were then compressed using the DTGrid scheme and stored in DT-Grid format. The binding energy of the complete capsid adsorbing to the gold surface was computed using SDA.31

4. RESULTS AND DISCUSSION 4.1. Evaluation of the Data Structure. 4.1.1. Memory and Time Complexity Analysis for a Spherical Test System. To evaluate the memory and storage requirements of the data structure and the structure files, as well as the complexity of the basic algorithms used to access the values stored, a 3D spherical object was chosen as an initial test system. The sphere may be considered as a simplified representation of a globular protein. Its surface shell represents the skin around the surface of a protein. To this end, 8 regular cubic grids of edge lengths 100, 200, 300, 400, 500, 600, 700, and 800 were generated in which randomly generated floating point numbers were stored on grid points that lie on predefined spherical shells. Each of these spherical shell regions was defined with an outer diameter of its H

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 7. Evaluation of the representation of an excluded volume grid for a sphere using DT-Grid or a rectangular array for a range of grid edge lengths. All grids are three-dimensional and cubic. The evaluation results for grids stored and saved in DT-Grid file type are shown by red bars and lines, and the results for regular grids saved in built-in arrays and stored in UHBD file format are shown by purple bars and lines. (a) Memory and (b) storage requirements on the hard disk. All files are saved in binary (unformatted) format. Time required to perform a random access operation averaged over 107 operations for (c) grid points and (d) grid cells.

Figure 4) and the data access algorithms using this structure. As shown in Figure 7a, this type of DT-Grid structure requires significantly less memory than a regular 3D array of Boolean type representing the same excluded volume. The result is not surprising given that the size of the value array is only 1 in the excluded volume DT-Grid structure. As shown in Figure 7b, comparison of the sizes of the UHBD and DT-Grid files created to store the excluded volume grids in disk, similar to that of the memory requirements, shows that the DT-Grid file type is significantly more advantageous. For instance, whereas the size of a UHBD file that stores a grid with an edge length of 800 points reaches 1500 MB, the size of the DT-Grid that stores the same grid is only 10 MB. Comparison of the random access algorithm for accessing points stored in regular arrays randomly is shown in Figure 7c. Times required for the DT-Grid algorithm, as well as the regular access method, increase linearly with the size of the sphere used in evaluations. However, although the time required for the DT-Grid method for a grid of edge length 100 is equal to that required for a regular grid of the same size, it reduces to 0.6 fold for the grid with an edge length of 800. Evaluation of the second method implemented, the random access method for grid cells, shows that the time required increases linearly with the grid size (see Figure 7d). The increase in the time usage for the regular access method compared to that of the DT-Grid is much faster, and the usage exceeds that of the latter for grids of size 200 or more. Along with its extremely low memory footprint and file size on a disk, the data access speed clearly shows that using the DT-Grid method for large grids is much more advantageous than using built-in arrays and storing them in a disk in UHBD format. A system of complex shape, with many cavities and bumps on its surface, has more topographical data to store than a spherical system when using DT-Grid. This is due to the increasing numbers of connected components in the tubular

grid and, therefore, the increasing complexity of the shape of the MIF region it compresses. Even though the tests were conducted with a simple model system, the slow increase in the time required suggests that using a system of complex shape instead, with many cavities and bumps, would not increase the random access operation execution time significantly. Although DT-Grids may be less efficient for storing potential values than storing information grids, such as an excluded volume grid, the time required for random access operations for grid points is lower than that using a regular grid. This is due to the fact that the small footprint of a DT-Grid structure for index compression allows it to fit into a memory cache. This is evident, in particular, for excluded volume-type information grids whose size is so small that the entire DT-Grid structures can easily be stored in the cache. Important to note is that the evaluation tests performed here provide a measure for the time requirement for random access operations to random grid points. In real-life applications, however, the grid points accessed will be located closely in computer memory and therefore will not be entirely random. Figure 6d suggests that the time requirements become comparable for DT-Grid and regular grids when a random grid cell operation (a random grid point with direct neighbors) is performed. 4.1.2. Force Calculations for Proteins Docking to a Flat Surface. To assess the performance of the DT-Grid structure for a more realistic application with extensive memory access operations, we performed force calculations for four proteins docked to an atomically detailed flat gold surface. The LJ interaction energies were computed for two different gold atom types and were stored in rectangular grids of various sizes (Table 1). The times for the force calculations performed with DT-Grid and with regular arrays are given in Table 1. Because each of the proteins has a different grid size determined by the extent of its MIF and a different interaction interface, the number of surface atoms interacting with the protein is I

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

increase the binding propensity of β-lactamase inhibitor protein (BLIP) significantly when fused to the N-terminal end of the protein.50 Our results show that 3H has a favorable adsorption free energy (ΔG) of −92 kJ/mol from bulk water to Au(111) for the peptide conformation used. This energy value is consistent with the results of previous computational studies40,59 that reported binding free energy calculations of single histidine amino acids on gold surfaces to be between −32 and −47 kJ/mol. The ΔG of the 3H−Au(111) system thus corresponds to approximately 2−3 fold that of the single H− Au(111) system. Indeed, at a distance of 5.65 Å from the surface, the 3H peptide adopts an energetically very favorable orientation (−102 kJ/mol in the PMF) with all three histidines interacting closely with the gold, thereby explaining the almost linear increase in affinity with the number of residues in the peptide (see Figure 8).

different for each case. Therefore, the total number of calls to the force calculation function for each protein−surface system is different. Inspection of the times required to perform the force calculations shows that the calculations performed using the DT-Grid structure require longer than those using regular arrays. Comparison of the two sets of calculations (with and without DT-Grid) demonstrates that the ratios of the required times for force calculations are 1.4, 1.7, 1.8, and 1.9 for Barstar, Cyt2Ba, TEM1 and PON1, respectively. This increase in time is mainly due to fact that the memory access operations are performed on grid points that are more spatially correlated rather than being entirely random. Other factors that affect the time required to carry out a simulation using DT-Grid include the number of other arithmetic operations performed every time after a random access operation for each of the accessed grid points as well as the memory access operations to other arrays that are not stored in DT-Grid format. Because the number of these operations is likely to be higher in other simulations, the ratios of the required times for the calculations are likely to be smaller. In the force calculations, when a surface atom was beyond the predefined cutoff distance from the center of geometry of the protein, the force function was not called for that atom, and the next atom on the surface atom list was checked. The effect of the memory access operations on the total simulation time therefore also depends significantly on the interaction cutoff values. When the mobile solute is further away from the surface than the cutoff, the calculations will be skipped, and therefore, no random access operation to interaction data will be required. We consider the memory requirements of the DT-Grids and arrays used for the force calculations and the sizes of the grid files in the two different formats. Table 1 shows that a regular array that stores LJ energy values within 10 Å of the protein surface consists of up to 67% (in the case of PON1) redundant numerical values, because the coordinates of these points are beyond 10 Å from the surface or are inside the excluded volume of the protein. Eliminating the redundant points by using a DTGrid structure helps to decrease the memory requirement to approximately 33% of the original array because the compression data stored in DT-Grid are often negligible or very small compared to the actual numerical values stored. Similarly to the memory usage, the file sizes are determined mostly by the number of numerical values stored in these files. Table 1 shows a comparison of the sizes of files with UHBD and DT-Grid formats and shows the same trend as that of the memory usage. Finally, we performed the same calculations using the SDA software compiled for parallel execution of the calculations for different random orientations of the same protein. Results show that the scaling performance for the calculations with DT-Grids is slightly higher than that for the calculations with regular grids (Table S1 and Figure S1). 4.2. Applications. 4.2.1. PMF Calculations for a 3H Peptide Binding to a Gold Surface. We applied our DT-Grid implementation to compute the PMF profile of a histidine tripeptide (3H) on a Au(111) surface using the method implemented in the SDA software. Histidine has been shown to have a very high affinity toward gold surfaces by both experimental and computational studies.50,59 Therefore, fusion peptides rich in histidines have been designed and used to enhance the binding propensities of proteins that otherwise bind weakly to gold surfaces. 3H, in particular, was shown to

Figure 8. PMF profile of the 3H peptide on a Au(111) surface. The curve (shown in blue) represents the PMF along the reaction coordinate given by the separation between the center of mass of the peptide and the top atom layer of the surface. The orientation of the 3H adsorbed onto the gold at the free energy minimum is shown in the inset. The CPU times for the same PMF calculations at each step with (dark green) and without (light green) a DT-Grid storage scheme for the LJ interaction energy grids are shown in bars.

This example further provides statistics on the performance of our DT-Grid implementation when used for a nonoptimal application. Each of the two LJ interaction energy grids used in this example, which are the largest of all the grids, requires a memory allocation of 316 MB and a disk space of 159 MB for the grid file (unformatted/binary). Because of the small size of the 3H peptide, the ratio of the volume of the molecular skin (the volume that occupies the space between the surface of the excluded volume of the tripeptide and a 10 Å cutoff distance away from it, in which the LJ interaction field is effective) to the total volume of the grid is relatively large, and therefore, the DT-Grid compression is not optimal. The memory and disk spaces allocated for the compressed LJ grid were 130 and 66 MB, respectively, corresponding to ∼42% of the allocated memory and disk space of their Cartesian grid and UHBD file counterparts. Because the solutes studied in macromolecular simulation studies are usually much larger than a tripeptide, the compression efficiency will generally be higher than the 58% obtained in this example. Because of the systematic sampling of the rotational and translational degrees of freedom of the peptide in the PMF calculations, we could directly compare the execution times of J

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation Table 2. Clustering of the Binding Modes of the gH5−Nucleosome Complex Obtained from Different Simulations simulation resolution (Å) 0.10

0.25

0.50

binding modea

occurrence time (ns)

representative energy (kJ/mol)

1 2 3 1 2 3 1 2 3

446 (65%) 167 (24%) 40 (6%) 404 (68%) 132 (22%) 31 (5%) 338 (58%) 191 (32%) 25 (4%)

−69.1 −69.1 −68.1 −68.8 −69.3 −68.8 −68.8 −69.3 −68.6

average energy (kJ/mol) −69.1 −69.1 −68.1 −69.1 −68.8 −68.1 −69.3 −69.1 −68.3

(1.2) (1.0) (0.5) (1.2) (1.0) (0.5) (1.2) (1.0) (0.5)

StDb (Å)

rangec (Å)

1.0 0.8 0.6 0.9 0.8 0.7 1.0 0.8 0.8

5.6 4.8 3.8 5.9 4.8 3.8 5.8 4.8 4.5

a

Binding modes 1, 2, and 3 correspond to clusters whose representatives are shown in Figure 9 in green, orange, and purple, respectively. bStandard deviation (StD) of RMSD values within a cluster. cMaximum RMSD value (range) between any two members of a cluster.

Figure 9. Different docked binding orientations of gH5 on the nucleosome (shown in cartoon representation in black) from side (top left) and top (top right) views. Representatives of the first (green), second (orange), and third (purple) largest clusters among six distinct clusters from the BD simulations (res = 0.1 Å) are shown in a cartoon representation. The N-terminus of the gH5 is marked with a blue point in each of the three representative orientations. The regions occupied by the center of geometry of the gH5 during the docking simulations (res = 0.1 Å) are shown in red wireframe patches. A close-up view of the occupied regions in the center, as well as views from the other two simulations, are shown below.

the simulations performed with DT-Grids (for LJ interaction energies) discussed above and those with Cartesian grids only. Figure 8 shows that the CPU times for the PMF calculations of the 3H peptide at varying distances above the surface when using the LJ grids stored in DT-Grid format range from 0 to 23% longer than those for the calculations using the same grids stored in regular Cartesian format. The differences in the execution times are due to the changing number of interacting atoms and, therefore, the number of energy values accessed on the LJ grids. Noteworthy is that the PMF calculations with the DT-Grids took 1696 CPU hours in total, which is only 7% longer than the calculations in which no DT-Grid was used (1586 CPU hours). These results show that our implementation does not cause a significant computational overhead when applied to a real use case. 4.2.2. BD Simulations of Nucleosome-Linker Histone Complexation. Nucleosomes are the building blocks of chromosomes in eukaryotes. They are composed of ∼147 base pairs of DNA wrapped around eight core histone proteins.

In addition to these histone proteins that make up of the core of the nucleosome, there is a separate family of histones, H1/ H5 proteins. The H1/H5 proteins, also called linker histones, bind to the linker DNAs that extend out from the nucleosome, and thus, they contribute to the compaction of chromatin. Further, the linker histone proteins take part in transcription and DNA replication processes in nuclei. Although the linker histones are known to bind to the nucleosome, the exact position of the binding of linker histone proteins is still controversial. Studies suggest two main models of the gH1/ gH5−nucleosome complex60 with the linker histone either binding symmetrically to the nucleosomal DNA on the dyad axis and interacting with both of the linker DNAs61−63 or binding asymmetrically between one of the linker DNAs and the nucleosomal DNA.64−66 Zhou and colleagues64 argued that an explanation of why there are multiple proposed positions could be due to the multiple modes of binding adopted by the linker histone on the nucleosome. K

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Figure 10. Interactions of a cowpea mosaic virus (CPMV) with a Au(111) surface. (a) Crystal structure of a complete CPMV capsid used in this study. The structure is shown in stick representation with a color assignment determined by the radial distance (from red to blue, for a distance interval from 100 to 160 Å) of each atom from the center of the capsid. (b, c) Views of the molecular surface of the capsid in two orientations from the side facing the gold surface (left) and their corresponding interaction energies (right). The energies were computed at separations measured from the center of geometry of the capsid to the center of the uppermost layer of gold atoms. The total interaction energy calculated for this application consists of the LJ interaction energy (light green) and the nonpolar surface desolvation energy (dark green).

In a previous study by Pachov et al.,51 the binding modes of gH5 to a nucleosome were investigated by means of BD simulations. Analysis of the docking modes revealed a dominant binding mode in which the gH5 protein was bound asymmetrically to one of the linker DNAs and the nucleosomal DNA. Following a similar computational procedure in this study, docking simulations of the gH5−nucleosome complex were performed with an excluded volume grid of 0.5 Å resolution with electrostatic interaction forces only. The geometric structure of the nucleosome exhibits many concave surface regions due to a number of structural features such as the DNA grooves and grooves between superhelical turns. For an accurate description of complicated concave surfaces, such as that of a nucleosome, high structural resolution is necessary. To assess the impact of the resolution chosen for the excluded volume grid, we performed two additional docking simulations with resolutions of 0.25 and 0.1 Å. Clustering of the binding modes obtained from the simulations revealed two major binding modes (Table 2 and Figure 9). The cluster with the highest population corresponds to that found in the study by Pachov et al.51 In the simulations in which an excluded volume grid of resolution 0.5 Å was used, the second cluster, corresponding to symmetric on-dyad binding, showed a 32% population. Increasing the grid resolution decreases uncertainties in the boundary region of the shape that the grid describes. Indeed, increasing the resolution from 0.5 to 0.25 Å changed the populations of the binding modes, but a further increase in resolution to 0.1 Å did not change the cluster populations significantly, giving 65% for the asymmetric binding mode and 22% for the symmetric binding mode. Noteworthy is that increasing the resolution from 0.5 to 0.25 Å and then to 0.1 Å not only changed the distribution of the cluster populations but also increased the residence times for the most common binding modes in the

simulations (Table 2). The two significant binding modes revealed by the BD simulations therefore suggest that gH5 can indeed have multiple binding modes. However, further computational and experimental investigations are required to prove this hypothesis. The results of the docking simulations show that, to obtain accurate and relatively converged docking modes, a grid resolution finer than 0.5 Å may be necessary. For large systems, however, this requirement can be a burden in terms of memory and/or storage resources. Owing to the efficient compression scheme used, the memory requirements for the grids used in this study were 40 and 6 MB for the resolutions 0.1 and 0.25 Å, respectively, although these grids correspond to regular Cartesian grids of sizes 1110 × 1442 × 781 (∼1200 MB) and 446 × 579 × 314 (∼80 MB). Further, the disk storage requirements were only 33 and 5 MB for the grids of resolution 0.1 and 0.25 Å, respectively, corresponding to 0.7 and 1.9% of the disk space requirements of their UHBD file counterparts. 4.2.3. Apolar Binding Energy Calculations for a Virus Capsid on a Gold Surface. Viral protein capsids can be used to encapsulate functional materials for specific nanotechnological applications, such as solid-state nanodevices.67,68 The first step to designing ordered structures of these capsid cages on solid supports is to understand the interactions between the capsid and the surface. A number of experimental studies on the interactions between these protein cages and various types of solid surfaces have been reported.67,69−71 In contrast, few atomic detail computational studies have been reported due to the limitations of the size on modeling and simulation of these structures. The CPMV capsid proteins are able to self-assemble into a protein cage without the nucleic acid material inside. The capsid has an icosahedral shape (Figure 10a) and has a diameter of ∼290 Å. In this study, we computed the apolar L

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation

Therefore, the assessment shows the speed of the random access algorithms in practice under frequent memory access calls. The time spent in a force calculation subroutine call using DT-Grid varied from 1.4- to 1.9-times that seen when using regular grid arrays. Although the difference in execution times may be significant, in particular for calculations where a reduction in the speed cannot be easily traded with a reduction in the memory and storage requirements, the observed differences will be smaller in many real applications. Indeed, the observed difference in time was only 7% in the PMF calculations for a 3H peptide on a gold surface. Reducing the use of redundant memory access operations, which are costly compared to the arithmetic operations, in simulation codes will further reduce execution times. The applications of DT-Grid to nucleosome and viral capsid systems show that DT-Grid enables calculations for larger systems or at higher resolutions. These systems cannot currently be simulated with the standard rectangular grid treatment without reducing the level of the resolution or decreasing the number of interaction types required for the same level of accuracy. In particular, large macromolecules such as viruses can only be simulated using computers with memory capacities much larger than those of conventional computers. Our application of DT-Grid to store the LJ interaction grids for a CPMV capsid to compute its binding energy on a gold surface, on the other hand, reduces the memory requirements from 64 GB down to 8 GB, thus rendering the simulations to be performed on a personal computer with a memory capacity less than 16 GB possible. With the growing interest in the modeling and simulation of ever larger macromolecular systems, scaling up from viruses to entire cells, the memory requirements with conventional interaction grids will easily exceed the capacities of even the largest memory machines, making the use efficient memory storage schemes like DT-Grid unavoidable. Here, we have demonstrated the efficient application of DTGrid for excluded volume and short-range LJ interactions. We expect that it can easily be extended to storing other types of short-range data with high efficiency. However, for long-range interactions, the current DT-Grid implementation would only offer a limited reduction in memory and storage needs. DTGrid currently does not offer a multiresolution structure. However, long-range interactions with resolution requirements changing as a function of distance can be represented with DTGrid structures of higher dimensions (N > 3). This way, although implicit, an efficient multiresolution grid can be realized. The same idea can easily be applied to higher dimension grids that store interaction grids for different atom types. A future direction could be to adopt a multiresolution representation of electrostatic interactions with a DT-Grid representation for high-resolution short-range interactions coupled with other representations for longer range interactions, such as DT-Grids for low-resolution interactions or a simple Debye−Hückel model.72

component of the interaction energy between a CPMV capsid and a Au(111) surface for two distinct binding orientations of the capsid at various surface separations (see Figure 10b and c). The results showed that the total apolar binding energies were dominated by the LJ term. The LJ interaction energy, on the other hand, is determined mainly by the number of capsid atoms in direct contact with the gold surface. Although a simple visual inspection of the orientation shown in Figure 10b suggests a large capsid surface patch with which the protein atoms can interact with the gold, the LJ interaction energies for this orientation are much less favorable than those computed for the second orientation shown in Figure 10c. This is mainly due to the protruding protein side chains that prevent the surrounding side chains from coming closer to the surface and thus forming strong favorable interactions. The orientation shown in Figure 10c, on the other hand, has a large favorable interaction energy owing to the relatively flat region that allows protein side chains to interact strongly with the gold surface. The analysis of the memory and disk storage requirements of the LJ interaction grids used in these calculations showed significant reductions in favor of DT-Grid. Each of the two LJ grids computed for two separate gold atom types required a disk storage space of ∼16 GB with UHBD format. The requirement was reduced to ∼2 GB when a grid was stored with DT-Grid format. In line with the disk storage requirements, the memory requirement to save each of the LJ grids was reduced from ∼32 to ∼4 GB, respectively, with a regular Cartesian grid and a DT-Grid. This application provides a good example of how these requirements can be reduced without sacrificing the accuracy of the calculations and by making it possible to perform these calculations on a conventional personal computer with a memory size of 16 GB.

5. CONCLUSIONS We have adapted and evaluated a compressed storage structure, DT-Grid, developed originally for level set simulations by Nielsen and Museth,23 for gridded MIFs of macromolecules for rigid-body simulations. DT-Grid matches the need for a single structure type that works well with different data types and varying data sparsity, as well as providing fast random data access and a significantly lower memory footprint than other compressed data storage structures. The application of DT-Grid to store molecular interaction data was shown to reduce the memory and storage requirements significantly. The degree of efficiency was determined mainly by the size of the unique data values stored due to the much smaller sizes of index storage arrays of the DT-Grid. Further, application of DT-Grid to excluded volume data reduced the storage and memory requirements dramatically while allowing very fast random access operation times. Comparison of the random access algorithms implemented for DT-Grid with direct memory look-ups of regular arrays of various different sizes showed that the DT-Grid algorithms were faster than the direct memory access operations for regular arrays. This is due to the higher cache-efficiency provided by the very low memory footprint of DT-Grid. To assess the performance of DT-Grid in realistic applications, we performed force calculations of proteins on a gold surface with their LJ interaction energies stored in compressed DT-Grids. The force calculation subroutine implemented in the SDA software package is one of the most time-consuming parts of the program, and the calculations were performed using values stored in LJ interaction energy grids.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jctc.6b00350. Descriptions of algorithms for random access to grid points and cells and results of assessment of parallel performance (PDF) M

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation



(24) Bungartz, H.-J.; Mehl, M.; Neckel, T.; Weinzierl, T. Comput. Mech. 2010, 46, 103−114. (25) Durbin, P.; Iaccarino, G. J. Comput. Phys. 2002, 181, 639−653. (26) Söderström, A.; Karlsson, M.; Museth, K. ACM Trans. Graph. 2010, 29, 1−17. (27) Gaspar, M.; Weichert, F. Comput. Aided Des. 2013, 45, 1170− 1181. (28) Mayerich, D. GPU-based Dynamic Tubular Grids for Sparse Volume Rendering. IEEE Visualization Conference, Vis 2010: Salt Lake City, UT, USA, 2010; pp 3−4. (29) Mayerich, D.; Abbott, L.; Keyser, J. IEEE Trans. Vis. Comput. Graphics. 2008, 14, 1611−1618. (30) Mayerich, D. M.; Keyser, J. Filament tracking and encoding for complex biological networks. Proceedings of the 2008 ACM symposium on Solid and physical modeling - SPM ’08; New York, NY, USA, 2008; p 353. (31) Martinez, M.; Bruce, N. J.; Romanowska, J.; Kokh, D. B.; Ozboyaci, M.; Yu, X.; Ö ztürk, M. A.; Richter, S.; Wade, R. C. J. Comput. Chem. 2015, 36, 1631−1645. (32) Davis, M. E.; Madura, J. D.; Luty, B. A.; McCammon, J. A. Comput. Phys. Commun. 1991, 62, 187−197. (33) Madura, J. D.; Briggs, J. M.; Wade, R. C.; Davis, M. E.; Luty, B. A.; Ilin, A.; Antosiewicz, J.; Gilson, M. K.; Bagheri, B.; Scott, L. R.; McCammon, J. A. Comput. Phys. Commun. 1995, 91, 57−95. (34) Baker, N. A.; Sept, D.; Joseph, S.; Holst, M. J.; McCammon, J. A. Proc. Natl. Acad. Sci. U. S. A. 2001, 98, 10037−10041. (35) Humphrey, W.; Dalke, A.; Schulten, K. J. Mol. Graphics 1996, 14, 33−38. (36) Pettersen, E. F.; Goddard, T. D.; Huang, C. C.; Couch, G. S.; Greenblatt, D. M.; Meng, E. C.; Ferrin, T. E. J. Comput. Chem. 2004, 25, 1605−1612. (37) Wade, R. C.; Gabdoulline, R. R.; De Rienzo, F. Int. J. Quantum Chem. 2001, 83, 122−127. (38) Blomberg, N.; Gabdoulline, R. R.; Nilges, M.; Wade, R. C. Proteins: Struct., Funct., Genet. 1999, 37, 379−387. (39) Buckle, A. M.; Schreiber, G.; Fersht, A. R. Biochemistry 1994, 33, 8878−8889. (40) Kokh, D. B.; Corni, S.; Winn, P. J.; Hoefling, M.; Gottschalk, K. E.; Wade, R. C. J. Chem. Theory Comput. 2010, 6, 1753−1768. (41) Iori, F.; Di Felice, R.; Molinari, E.; Corni, S. J. Comput. Chem. 2009, 30, 1465−1476. (42) Lim, D.; Park, H. U.; De Castro, L.; Kang, S. G.; Lee, H. S.; Jensen, S.; Lee, K. J.; Strynadka, N. C. Nat. Struct. Biol. 2001, 8, 848− 852. (43) Harel, M.; Aharoni, A.; Gaidukov, L.; Brumshtein, B.; Khersonsky, O.; Meged, R.; Dvir, H.; Ravelli, R. B. G.; McCarthy, A.; Toker, L.; Silman, I.; Sussman, J. L.; Tawfik, D. S. Nat. Struct. Mol. Biol. 2004, 11, 412−419. (44) Cohen, S.; Dym, O.; Albeck, S.; Ben-Dov, E.; Cahan, R.; Firer, M.; Zaritsky, A. J. Mol. Biol. 2008, 380, 820−827. (45) Gordon, J. C.; Myers, J. B.; Folta, T.; Shoja, V.; Heath, L. S.; Onufriev, A. Nucleic Acids Res. 2005, 33, W368−W371. (46) Šali, A.; Blundell, T. L. J. Mol. Biol. 1993, 234, 779−815. (47) Dolinsky, T. J.; Nielsen, J. E.; McCammon, J. A.; Baker, N. A. Nucleic Acids Res. 2004, 32, W665−W667. (48) Dolinsky, T. J.; Czodrowski, P.; Li, H.; Nielsen, J. E.; Jensen, J. H.; Klebe, G.; Baker, N. A. Nucleic Acids Res. 2007, 35, W522−W525. (49) Wang, J.; Cieplak, P.; Kollman, P. A. J. Comput. Chem. 2000, 21, 1049−1074. (50) Cohavi, O.; Reichmann, D.; Abramovich, R.; Tesler, A. B.; Bellapadrona, G.; Kokh, D. B.; Wade, R. C.; Vaskevich, A.; Rubinstein, I.; Schreiber, G. Chem. - Eur. J. 2011, 17, 1327−1336. (51) Pachov, G. V.; Gabdoulline, R. R.; Wade, R. C. Nucleic Acids Res. 2011, 39, 5255−5263. (52) Ramakrishnan, V.; Finch, J. T.; Graziano, V.; Lee, P. L.; Sweet, R. M. Nature 1993, 362, 219−223. (53) Davey, C. A.; Sargent, D. F.; Luger, K.; Maeder, A. W.; Richmond, T. J. J. Mol. Biol. 2002, 319, 1097−1113.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone: +49 (0)6221 533247. Fax: +49 (0)6221 533298. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We thank Peter Bastian (IWR, Heidelberg University) and Daria B. Kokh (HITS) for helpful discussions and Neil J. Bruce (HITS) for his careful reading of the manuscript. We thank Martin Reinhardt (HITS) for the improvement of the computational efficiency of the mk_LJ_grd tool implemented in SDA for creating the LJ interaction energy grids. M.O. acknowledges the support of Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences (HGS MathComp), Heidelberg University. We gratefully acknowledge the support of the Klaus Tschira Foundation.



REFERENCES

(1) Gabdoulline, R. R.; Wade, R. C. Curr. Opin. Struct. Biol. 2002, 12, 204−213. (2) Chen, R.; Weng, Z. Proteins: Struct., Funct., Genet. 2002, 47, 281− 294. (3) Gray, J. J.; Moughon, S.; Wang, C.; Schueler-Furman, O.; Kuhlman, B.; Rohl, C. a.; Baker, D. J. Mol. Biol. 2003, 331, 281−299. (4) Cheng, T. M.-K.; Blundell, T. L.; Fernandez-Recio, J. Proteins: Struct., Funct., Genet. 2007, 68, 503−515. (5) Sauton, N.; Lagorce, D.; Villoutreix, B. O.; Miteva, M. A. BMC Bioinf. 2008, 9, 184. (6) Macindoe, G.; Mavridis, L.; Venkatraman, V.; Devignes, M.-D.; Ritchie, D. W. Nucleic Acids Res. 2010, 38, W445−W449. (7) Pierce, B. G.; Hourai, Y.; Weng, Z. PLoS One 2011, 6, e24657. (8) Mirzaei, H.; Kozakov, D.; Beglov, D.; Paschalidis, I. C.; Vajda, S.; Vakili, P. A new approach to rigid body minimization with application to molecular docking. 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), 2012; pp 2983−2988. (9) Mirzaei, H.; Beglov, D.; Paschalidis, I. C.; Vajda, S.; Vakili, P.; Kozakov, D. J. Chem. Theory Comput. 2012, 8, 4374−4380. (10) Smith, J. A.; Edwards, S. J.; Moth, C. W.; Lybrand, T. P. Biochemistry 2013, 52, 5577−5584. (11) van Zundert, G. C. P.; Bonvin, A. M. J. J. In Methods in Molecular Biology; Kihara, D., Ed.; Clifton, NJ, USA, 2014; Vol. 1137, pp 163−179. (12) Tao, P.; Sodt, A. J.; Shao, Y.; König, G.; Brooks, B. R. J. Chem. Theory Comput. 2014, 10, 4198−4207. (13) Gabdoulline, R. R.; Wade, R. C. Biophys. J. 1997, 72, 1917− 1929. (14) Gabdoulline, R. R.; Wade, R. C. Methods 1998, 14, 329−341. (15) Carrivain, P.; Barbi, M.; Victor, J.-M. PLoS Comput. Biol. 2014, 10, e1003456. (16) Biesiada, J.; Porollo, A.; Velayutham, P.; Kouril, M.; Meller, J. Hum. Genomics 2011, 5, 497−505. (17) McGuffee, S. R.; Elcock, A. H. PLoS Comput. Biol. 2010, 6, e1000694. (18) Eisenstat, S. C.; Gursky, M. C.; Schultz, M. H.; Sherman, A. H. Int. J. Numer. Meth. Eng. 1982, 18, 1145−1151. (19) Gilbert, J. R.; Moler, C.; Schreiber, R. SIAM J. Matrix Anal. Appl. 1992, 13, 333−356. (20) Boschitsch, A. H.; Fenley, M. O. J. Chem. Theory Comput. 2011, 7, 1524−1540. (21) Brieg, M.; Wenzel, W. J. Chem. Theory Comput. 2013, 9, 1489− 1498. (22) Mignone, A.; Zanni, C.; Tzeferacos, P.; van Straalen, B.; Colella, P.; Bodo, G. Astrophys. J., Suppl. Ser. 2012, 198, 7. (23) Nielsen, M. B.; Museth, K. J. Sci. Comput. 2006, 26, 261−299. N

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Theory and Computation (54) Gabdoulline, R. R.; Wade, R. C. J. Phys. Chem. 1996, 100, 3868− 3878. (55) Motiejunas, D.; Gabdoulline, R.; Wang, T.; Feldman-Salit, A.; Johann, T.; Winn, P. J.; Wade, R. C. Proteins: Struct., Funct., Genet. 2008, 71, 1955−1969. (56) Lin, T.; Chen, Z.; Usha, R.; Stauffacher, C. V.; Dai, J.-B.; Schmidt, T.; Johnson, J. E. Virology 1999, 265, 20−34. (57) Carrillo-Tripp, M.; Shepherd, C. M.; Borelli, I. A.; Venkataraman, S.; Lander, G.; Natarajan, P.; Johnson, J. E.; Brooks, C. L.; Reddy, V. S. Nucleic Acids Res. 2009, 37, D436−D442. (58) Sitkoff, D.; Sharp, K. A.; Honig, B. J. Phys. Chem. 1994, 98, 1978−1988. (59) Hoefling, M.; Iori, F.; Corni, S.; Gottschalk, K.-E. ChemPhysChem 2010, 11, 1763−1767. (60) Zhou, H.-X. FEBS Lett. 2013, 587, 394−397. (61) Fan, L.; Roberts, V. A. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 8384−8389. (62) Syed, S. H.; Goutte-Gattat, D.; Becker, N.; Meyer, S.; Shukla, M. S.; Hayes, J. J.; Everaers, R.; Angelov, D.; Bednar, J.; Dimitrov, S. Proc. Natl. Acad. Sci. U. S. A. 2010, 107, 9620−9625. (63) Meyer, S.; Becker, N. B.; Syed, S. H.; Goutte-Gattat, D.; Shukla, M. S.; Hayes, J. J.; Angelov, D.; Bednar, J.; Dimitrov, S.; Everaers, R. Nucleic Acids Res. 2011, 39, 9139−9154. (64) Zhou, H.-X.; Wlodek, S. T.; McCammon, J. A. Proc. Natl. Acad. Sci. U. S. A. 1998, 95, 9280−9283. (65) Cui, F.; Zhurkin, V. B. Nucleic Acids Res. 2009, 37, 2818−2829. (66) Bharath, M. M. S. Nucleic Acids Res. 2003, 31, 4264−4274. (67) Suci, P. A.; Klem, M. T.; Douglas, T.; Young, M. Langmuir 2005, 21, 8686−8693. (68) Putri, R. M.; Cornelissen, J. J. L. M.; Koay, M. S. T. ChemPhysChem 2015, 16, 911−918. (69) Fang, J.; Soto, C. M.; Lin, T.; Johnson, J. E.; Ratna, B. Langmuir 2002, 18, 308−310. (70) Steinmetz, N. F.; Calder, G.; Lomonossoff, G. P.; Evans, D. J. Langmuir 2006, 22, 10032−10037. (71) Armanious, A.; Aeppli, M.; Jacak, R.; Refardt, D.; Sigstam, T.; Kohn, T.; Sander, M. Environ. Sci. Technol. 2016, 50, 732−743. (72) Mereghetti, P.; Martinez, M.; Wade, R. C. BMC Biophys. 2014, 7, 4.

O

DOI: 10.1021/acs.jctc.6b00350 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX