Article pubs.acs.org/JCTC
PARENT: A Parallel Software Suite for the Calculation of Configurational Entropy in Biomolecular Systems Markus Fleck, Anton A. Polyansky, and Bojan Zagrovic* Department of Structural and Computational Biology, Max F. Perutz Laboratories, University of Vienna, Campus Vienna Biocenter 5, Vienna, 1030, Austria S Supporting Information *
ABSTRACT: Accurate estimation of configurational entropy from the in silico-generated biomolecular ensembles, e.g., from molecular dynamics (MD) trajectories, is dependent strongly on exhaustive sampling for physical reasons. This, however, creates a major computational problem for the subsequent estimation of configurational entropy using the Maximum Information Spanning Tree (MIST) or Mutual Information Expansion (MIE) approaches for internal molecular coordinates. In particular, the available software for such estimation exhibits serious limitations when it comes to molecules with hundreds or thousands of atoms, because of its reliance on a serial program architecture. To overcome this problem, we have developed a parallel, hybrid MPI/openMP C++ implementation of MIST and MIE, called PARENT, which is particularly optimized for high-performance computing and provides efficient estimation of configurational entropy in different biological processes (e.g., protein−protein interactions). In addition, PARENT also allows for a detailed mapping of intramolecular allosteric networks. Here, we benchmark the program on a set of 1-μs-long MD trajectories of 10 different protein complexes and their components, demonstrating robustness and good scalability. A direct comparison between MIST and MIE on the same dataset demonstrates a superior convergence behavior for the former approach, when it comes to total simulation length and configurational-space binning. ln(8π2/C°), where kB is the Boltzmann constant and C° is the standard concentration, and is therefore also constant.9−11 Thus, at constant temperature, one may write the following equation:
1. INTRODUCTION Noncovalent interactions involving biomolecules are the basis for a large number of fundamental biological processes, including transcription, translation, and cell signaling. The Gibbs free energy change (ΔG) for a binding reaction valid in an NPT ensemble,
spatial total momentum ΔSsingle = ΔSsingle + ΔSsingle spatial = ΔSsingle
(1)
ΔG = ΔH − T ΔS
configurational roto‐trans = ΔSsingle + ΔSsingle
is a combination of enthalpic (ΔH) and entropic (ΔS) terms, and represents the key thermodynamic determinant of the likelihood that the reaction occurs. Considering the definition of the Gibbs free energy, however, binding affinities and binding mechanisms cannot be fully determined from structurebased (i.e., enthalpic) analyses alone.1 In fact, it has been shown that structurally identical proteins can bind their targets in thermodynamically unrelated ways2 and even one and the same protein can bind its various targets with radically different enthalpic and entropic contributions.3,4 For this reason, the role of entropy in biomolecular processes has come under increased scrutiny, with many fundamental and practical challenges still open.5−8 The total entropy change of a molecule in a given biomolecular process consists of a momentum component and a coordinate-dependent, spatial component. At constant temperature, the momentum component is constant9 and, therefore is, usually not considered. Furthermore, the contribution of the rotational and translational degrees of freedom to the spatial component is given by Sroto−trans = kB single © XXXX American Chemical Society
configurational = ΔSsingle
(2)
Note that the above equation refers to the entropy change of a single molecule upon a change of its environmental conditions. When applied to a binding process, the molecule’s binding partner, as well as the solvent, are treated as the environment and none of their degrees of freedom enter the above equation explicitly. A decomposition of the overall entropy change upon binding into the protein, ligand, and solvent components, as well as internal and roto-translational components, has been discussed in detail,7,12−14 but is outside the scope of this work. Overall, more insight into configurational entropy could greatly improve our understanding of basic biomolecular processes, but also aid in rational drug design by helping to overcome or even directly exploit enthalpy/entropy compensaReceived: December 23, 2015
A
DOI: 10.1021/acs.jctc.5b01217 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Theory and Computation
used the existing serial Perl/Fortran source code11,24 as a template and have developed a parallel, hybrid MPI27/ openMP28 implementation of MIST and MIE entirely in C+ +, called PARENT, which is particularly optimized for highperformance computing and provides an efficient estimation of configurational entropy effects in different molecular processes. In addition, PARENT also allows for a detailed mapping of intramolecular allosteric networks. Below, we first present the theory and our implementation of (1) the coordinate system used for configurational entropy calculations, (2) the MIE algorithm, and (3) the MIST algorithm, followed by a discussion of the key benchmarks of the PARENT suite.
tion. Moreover, protein activity could potentially be modulated or even designed by influencing configurational entropy in an allosteric fashion, a strategy used by nature itself.15 Atomic motions in biomolecules typically are strongly coupled, because of a high degree of packing and covalent connectivity. Such interdependencies invariably lower the configurational entropy and must be properly treated for its accurate estimation. Unfortunately, the determination of configurational entropy from classical molecular dynamics (MD) simulations is much more challenging than the calculation of free energy. This is largely due to the fact that one, in principle, needs full configurational-space coverage in order to accurately estimate the configurational probability density function (ρ) and calculate the full-dimensional entropy integral, which can be written in Cartesian coordinates (still including roto-translational degrees of freedom here) as SNspatial = −kB
∫ dq1...dqN ρ(q1...qN ) ln[ρ(q1...qN )]
2. METHODOLOGY 2.1. The Bond-Angle−Torsion Coordinate System. 2.1.1. Theory. In treatments of configurational entropy, the roto-translational degrees of freedom of the molecule are intentionally neglected. Using Cartesian coordinates, this is achieved by aligning the molecules to a common reference frame and, thus, removing the external degrees of freedom by applying a rigid-body roto-translational coordinate transformation. While this might be appropriate for rigid molecules, the accuracy of the procedure for conformationally flexible molecules is questionable. In fact, considering intrinsically disordered proteins, the assumption of a single well-defined reference frame might actually be a priori unwarranted. Moreover, Cartesian coordinates are known to introduce spurious correlations29 when describing the structural diversity of macromolecules. For molecules, the choice of an internal coordinate system consisting of bond lengths, angles between bonds and torsional angles involving three bonds (BAT11,30−33,26 coordinates, the Z-matrix formalism excluding external degrees of freedom) is therefore a practical choice, as it naturally accommodates a molecular topology as a connected node network with rigid internode distances (bond lengths). Since the angles between bonds also are generally rather stiff, this choice of a coordinate system also allows for a natural partitioning of the degrees of freedom into soft and hard ones. This helps to avoid spurious correlations29 inherent in atomic Cartesian coordinates. Note that, in principle, further optimization of the coordinate system can be achieved by analyzing a given ensemble and taking advantage of the additional specific intramolecular or intermolecular couplings. The Jacobian determinant11 for BAT coordinates assumes the convenient form: ⇀ ⇀ ⇀ ⇀ ⇀ ⇀ J( b , θ , ⇀ φ ) = J( b )J( θ )J(⇀ φ ) = J ( b )J ( θ ) (4)
(3)
Here, the qi terms represent the spatial degrees of freedom in the system, while N is their total number. A full coverage of the configurational space, as well as processing of such data for typical biomolecules, is virtually impossible, given contemporary computational resources. As a consequence, different approaches for deriving configurational entropy from molecular ensembles have been proposed, entailing different approximations, which may or may not critically affect the end result. For example, the widely used quasi-harmonic approaches16−20,5 only take into the account the linear couplings between the degrees of freedom and any anharmonicities are disregarded. Also, the use of Cartesian coordinates with the quasi-harmonic approach requires alignment of the trajectory to a reference frame in order to remove the roto-translational degrees of freedom, which may not always be well-defined, especially for flexible molecules. The approaches implemented in the present workthe maximum information spanning tree21,10 (MIST) and the mutual information expansion22,11,14,23,24 (MIE) approximationsrely on a numerical sampling of the configurational probability density function concerning the degrees of freedom, for which essentially a Z-matrix formalism25,26 is used. Although based on different mathematical formalisms (see below for details), the two estimators rely on the same terms to be computed, allowing one to derive both from a common computational framework. In both approaches, the amount of correlation between different degrees of freedom is incorporated into mutual information (MI) terms, by which the uncorrelated entropy value is corrected. While, in principle, correlations of any order could be taken into the account by an expansion series, convergence problems more or less restrict this method to second-order correlations, in analogy with the quasi-harmonic approaches.5 A major advantage of MI-based methods over the quasi-harmonic approaches, however, is that they naturally give entropy contributions for localized degrees of freedom, which could especially be useful in applications such as drug design or investigations of allostery. Furthermore, anharmonicities and nonlinear correlations between the degrees of freedom are treated without additional assumptions. On the other hand, both MIST and MIE approaches not only rely on exhaustive sampling when it comes to the generation of molecular ensembles, but their application to systems of biologically relevant sizes presents extremely high demands on computational resources. To address this problem, we have
with ⇀ J( b ) =
N i=2
⇀ J( θ ) = J(⇀ φ) =
N
∏ J(bi) = ∏ bi2 N
i=2 N
∏ J(θi)= ∏ sin θi i=3
i=3
N
N
∏ J(φi)= ∏ 1 = 1 i=4
i=4
(5)
where b represents the bond lengths, θ is the angle, and φ is the torsion angle. As illustrated in Figure 1, the BAT coordinate system is constructed by starting with three covalently bonded atoms as root atoms.11,33 Then, a graph-theoretical tree is built B
DOI: 10.1021/acs.jctc.5b01217 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Theory and Computation
bond length, angle, and torsion coordinates, is not unique. This is due to the arbitrary definition of root atoms for starting the iterative algorithm and, for circular molecules (e.g., aromatics), the iterative algorithm for adding atoms to the topology itself (see Figure 1). By default, the coordinate conversion tool in PARENT implements a breadth-first searching algorithm, whereby a list is generated to keep track of which nodes have been added to the topology in the previous step, and those nodes are then branched, adding all nodes directly connected to them during the current step (Figure 1A). Another searching algorithm is known as depth-first (Figure 1B). Here, during every step, new nodes (atoms) are added to the previously appended node in a recursive manner, until every branch has been explored to its end. The difference for the two algorithms lies in the treatment of circular molecules. Both algorithms exclude one chemical bond from the topology, but the identity of the bond differs, affecting also angular and torsional degrees of freedom. While, in principle, this will affect the results, our preliminary analysis shows that the fractional differences lie in the 1/1000 range for proteins with a typical number of aromatic side chains. Nevertheless, PARENT can also be compiled to use a depth-first search instead. The root atoms are determined automatically by the program using the following criteria:11 the first atom must be terminal, i.e., only connected to the second root atom, which itself can only be connected to two other root atoms and terminal atoms. The third root atom must be nonterminal. While at the current stage root atoms cannot be selected by the user, our preliminary analysis suggests that fractional differences between different choices of root atoms also reside in the 1/1000 range. Concerning systems of multiple molecules, a connected topology is created by the introduction of pseudo-bonds, which are physically not present as real chemical bonds. To achieve this, the algorithm first builds the topologies for all molecules in the system separately and then joins them by using pseudo-bonds between the third and the first root atoms, respectively. In this manner, three additional pseudo-torsions and two additional pseudo-angles are introduced alongside each pseudo-bond, yielding the correct number of 3n − 6 nonredundant degrees of freedom for the total system. By building the separate topologies first, consistency is guaranteed when comparing complexes to their constituents. An important aspect concerning the implementation of PARENT is the use of phase angles24,36 for reducing additional geometrical correlations. If two or more torsion angles share three atoms, one representative of this group is assigned to be the master torsion, and the other torsions are measured relative to it. Because this works best when the master torsion is rather rigid, the conversion program offers an option to specify the names of atoms defining rigid torsions (e.g., atom names of the backbone for a protein). If one torsion consists entirely of rigidtype atoms, its assignment as a parent torsion is prioritized by the conversion program. 2.2. Mutual Information Expansion. 2.2.1. Theory. The most direct way to estimate the configurational entropy from a given structural ensemble is to compute the full-dimensional entropy integral (eq 3) by sampling the corresponding fulldimensional configurational probability density function. Using a binning approach, the degree of sampling necessary for this histogram to be filled rather smoothly (but at least nonsparsely) grows with the power of the number of atoms in the system. This exponential increase in sampling requirements confines the described method to molecules with a very limited number
Figure 1. Algorithms for coordinate conversion. Tree construction using (A) breadth-first and (B) depth-first searching algorithms. Red nodes depict root atoms where construction is initiated, while search pathways are illustrated by arrows. The numbers attached to the arrows indicate the iteration step at which a node is added to the tree, while the numbers inside nodes are a (nonunique) measure for the closeness to the root atoms. A dashed line represents a chemical bond, which is not included in the tree topology, and its position is dependent on the choice of algorithm, with the difference only affecting ringlike topologies.
by successively adding atoms, which are connected to previous atoms by covalent bonds. A covalent bond connecting every newly added atom to the present topology is added to the coordinate topology as a bond-length degree of freedom. In addition, the first two bonds (including the newly added bond length) on the shortest path back to the root atoms define a new angle and the first three bonds define a new torsion angle (see Figure 1). Considering that the root atoms provide two bond lengths and one angle, this makes up for n − 1 bond, n − 2 angular, and n − 3 torsional degrees of freedom, with n denoting the number of atoms in the molecule. This adds up to 3n − 6 (internal) degrees of freedom in total, which is the same number as that for an equivalent Cartesian coordinate system when the rotational and translational degrees of freedom are disregarded. Given a system of two or more molecules, nonphysical pseudo-bonds are added arbitrarily to introduce connectivity to the system. 2.1.2. Implementation. The coordinate conversion tool in PARENT takes topology and trajectory files from an MD simulation package as input. In the present version, PARENT utilizes GROMACS34,35 input files and other formats are planned for future releases. The output is a binary uncompressed BAT trajectory file using a custom file format. In addition to the converted BAT coordinates for every frame of the trajectory, it contains further information in its header, including a version number for upward compatibility, the precision of the trajectory, the number of torsions in the system (from which the number of atoms can easily be calculated), the number of frames, the definition of all torsions, and, thus, all degrees of freedom in the system, the residue names and numbers, as well as the masses and names of all atoms in the system, according to the topology file of the MD simulation package used. Given that every frame includes its external degrees of freedom, as well as a time stamp and box vectors, an entirely functional back-conversion to a trajectory file identical up to machine precision to the original trajectory file from the MD simulation package is possible and offered by the program. Parenthetically, this file format may serve as a convenient interface to other applications, not necessarily aimed at the calculation of configurational entropy values. Importantly, the BAT topology, that is, the choice of a set of nonredundant C
DOI: 10.1021/acs.jctc.5b01217 J. Chem. Theory Comput. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Theory and Computation
molecules) are not only correctly dimensioned, but also correctly reflect the total entropy change.9 Finally, when working with internal BAT coordinates, a correction term depending on the Jacobian determinant (eqs 4 and 5) must be applied9 and PARENT performs this automatically. 2.2.2. MIE Implementation. PARENT performs the above calculations using Message Passing Interface (MPI) parallelization, together with an openMP parallelization on the thread level. It reads in a binary BAT trajectory file created with the coordinate conversion tool of the program suite and distributes the data equally between the MPI processes. In a homogeneous cluster, where all shared-memory computation nodes are equal in architecture, one MPI process per node is expected to yield the fastest results, because it saves communication overhead. On heterogeneous clusters, assigning more MPI processes to nodes with a larger memory capacity might solve problems of shortage in RAM. With the exception of special cases with a large number of atoms and a limited number of frames, the memory requirements of the program will mainly be dependent on the size of the BAT trajectory file. No temporary files are written to permanent storage media (e.g., hard disk). In order to take advantage of the multiple cores per computation node, openMP parallelization is used, with the number of threads set equal to the number of cores intended for each MPI process. Although originally not designed for that purpose, PARENT can run on shared-memory workstations and common desktop computers (e.g., quad-core CPU) as well. Concerning workstations with a high number of cores, the great reduction in communication overhead when using only one MPI thread turns out to be advantageous. After the trajectory is read in, 1D and 2D probability density functions must be sampled. To accomplish this, the extreme values of every degree of freedom in the system are determined first. Since torsion angles are circular, they are shifted to yield the smallest possible interval where the probability density function is nonzero,24 yielding improved accuracy by utilizing the available bins efficiently. Second, histograms are built using a common binning approach. For 1D histograms, this involves no communication between the MPI processes, as the trajectory is distributed in a per-degree-of-freedom fashion. Holding the entire trajectory of each of its assigned degrees of freedom, all MPI processes dispose of the necessary information to build the corresponding 1D histograms. Concerning 2D histograms, communication is necessary considering that two degrees of freedom may be hosted by different MPI processes. The MPI processes sequentially send out the trajectories of their hosted degrees of freedom to all other MPI processes. Every MPI process then calculates the 2D histograms of its hosted degrees of freedom against the currently distributed one. Redundant calculations for pairs which have already been treated are avoided. To improve efficiency in memory usage, entropy values are calculated onthe-fly during the histogram building process, immediately after each individual histogram has been sampled to completeness. Then, using the discrete probabilities pj obtained from the sampled histograms, the following formulas14,24 for the 1D and 2D entropy values including discretization are utilized:
of atoms (below 10 or so) and most biologically relevant molecules exceed this mark significantly. Hence, in order to treat such systems, the result of the analytically exact highdimensional problem must be approximated by mapping the problem to a lower-dimensional one. As it is instructive for introducing the Maximum Information Spanning Tree10,21 (MIST) approximation in the next subsection, we will first illustrate the Mutual Information Expansion22,11,14,23,24 (MIE) formalism for a system with three degrees of freedom Xi. First, in an analytically exact manner, the full-dimensional entropy integral can be expanded to S3(X1, X 2 , X3) = S1(X1) + S1(X 2) + S1(X3) − I2(X1; X 2) − I2(X1; X3) − I2(X 2 ; X3) + I3(X1; X 2 ; X3) (6)
with I2 (Xi; Xj) and I3(X1; X2; X3) defined as I2(Xi ; Xj) ≡ S1(Xi) + S1(Xj) − S2(Xi , Xj)
(7)
and I3(X1; X 2 ; X3) ≡ S1(X1) + S1(X 2) + S1(X3) − S2(X1, X 2) − S2(X1, X3) − S2(X 2 , X3) + S3(X1, X 2 , X3) (8)
where S and I denote entropy and mutual information, respectively, with the subscript indices describing their dimensionality. Now truncating S3 at pairwise order, which is equivalent to applying the Kirkwood Superposition approximation11,37 to the three-dimensional probability distribution function, yields S3(X1, X 2 , X3) ≈ S3MIE2(X1, X 2 , X3) =
∑ S(Xi) − ∑ I2(Xi ; Xj) i
i