Article pubs.acs.org/JPCB

Extraction of Protein Conformational Modes from Distance Distributions Using Structurally Imputed Bayesian Data Augmentation

Xun Sun, Thomas E. Morrell, and Haw Yang*

Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States

*S Supporting Information

ABSTRACT: Protein conformational changes are known to play important roles in assorted biochemical and biological processes. Driven by thermal motions of the surrounding solvent molecules, such structural remodeling often occurs stochastically. Yet, regardless of how random the conformational reconfiguration may appear, it can in principle be described by a linear combination of a set of orthogonal modes which, in turn, are contained in the intramolecular distance distributions. The central challenge is how to obtain the distribution. This contribution proposes a Bayesian data-augmentation scheme to extract the predominant modes from only a few distance distributions, be they from computational sampling or directly from experiments such as single-molecule Förster-type resonance energy transfer (smFRET). The inference of the complete protein structure from insufficient data was recognized as isomorphic to the missing-data problem in Bayesian statistical learning. Using smFRET data as an example, the missing coordinates were deduced given protein structural constraints and multiple, but a limited number of, smFRET distances; the Boltzmann weighting of each inferred protein structure was then evaluated using computational modeling to numerically construct the posterior density for the global protein conformation. The conformational modes were then determined from the iteratively converged overall conformational distribution using principal component analysis. Two examples are presented to illustrate these basic ideas as well as their practical implementation. The scheme described herein is based on the theory behind the powerful Tanner−Wong algorithm, which guarantees convergence to the true posterior density. However, instead of assuming a mathematical model to calculate the likelihood as in conventional statistical inference, here the protein structure is treated as a statistical parameter and is imputed from the numerical likelihood function based on structural information, a probability-model-free approach. The framework put forth here is anticipated to be generally applicable, offering a new way to articulate protein conformational changes in a quantifiable manner.



INTRODUCTION

Proteins are remarkable molecular machines; they carry out such diverse functions as catalyzing chemical transformations, relaying biophysical and biochemical information, and doing mechanical work. These tasks are accomplished in the presence of thermal fluctuations, where the solvent molecules surrounding the protein stochastically drive structural changes in the protein.1,2 The coupling between solvent thermal fluctuations and a protein is particularly important for the protein's low-frequency, large-amplitude conformational motions because of their comparable energy scales, ∼kBT (with kB being the Boltzmann constant and T the temperature in Kelvin).3 The large-amplitude conformational transitions on the μs-to-ms time scale are a manifestation of collective motions from the many degrees of freedom in a protein4 and are expected to play a central role in the protein's function.5,6 For example, through direct observation at the single-molecule level, access to the active site of an enzyme has been seen to be gated by its conformation, as in adenylate kinase7 as well as in protein tyrosine phosphatase B;8 the chemiosmotic synthesis of each ATP molecule by the FOF1-ATP synthase is coupled to the stepwise rotation of its subunits;9 and the mechanical contraction of muscles is carried out by molecular walkers such as myosin.10 However, the high dimensionality of proteins, compounded by the randomness of their conformational variations, poses a significant challenge for understanding, and eventually predicting, protein motions in general. To advance our understanding of protein dynamics, it is important to experimentally determine how a protein changes its conformation and what structural transitions are most relevant to its function. In this context, the orthogonal modes of protein motions are of particular interest because they provide a concrete physical picture of the stochastic conformational fluctuations. Regardless of how complicated the overall protein dynamics may appear, in principle, they can be expressed as a linear combination of orthogonal conformational modes.11,12 Moreover, conformational modes are expected to enable critical comparison of protein dynamics, allowing one to articulate changes in conformational dynamics between mutants or upon interaction with substrates, partner proteins, and allosteric effectors.

Can the conformational modes of a protein be recovered using conformational distributions as the experimental observable, which can be readily provided by single-molecule experiments such as single-molecule Förster-type resonance energy transfer (smFRET)? Although smFRET is a powerful technique affording experimental data that are otherwise not obtainable using other methods, it has its own limitations. For example, the conformational dynamics measured using smFRET are coarsely grained in time and space because of finite spatial and temporal resolutions; usually only one distance can be measured at a time in routine experiments. Moreover, there is limited control over the dye-pair labeling points while maintaining the protein's integrity and functionality. These shortcomings amount to the improbability of experimentally sampling all possible intramolecular distance distributions of a protein at the single-molecule level. That is, in practice, one must confront the challenge of reconstructing the global conformational modes from sets of single-molecule data with incomplete information. To surmount this obstacle, we propose to follow the strategies by which the "missing-data problem" in the statistics community is solved.13 Using toy models, we show how a statistical sampling scheme based on the Bayesian inference framework can be used to determine the missing coordinates (in a statistical sense) by integrating knowledge of the protein structure and multiple experimentally measured smFRET distance distributions. Finally, we demonstrate that it is possible to recover the statistically predominant coarse-grained conformational modes of large-amplitude protein conformational changes from smFRET distributions. We close by commenting on some practical points as well as possible applications.

Received: August 1, 2016
Revised: September 18, 2016

DOI: 10.1021/acs.jpcb.6b07767
J. Phys. Chem. B XXXX, XXX, XXX−XXX

THEORY

Background: Information of Dynamics Modes Contained in Conformational Distributions. The conformational modes can be recovered from experimental smFRET distributions. At first glance, this may seem counterintuitive because conformational modes are a consequence of dynamical structural changes, whereas conformational distributions are static. Therefore, for completeness, we clarify the connection between the atomistic movements, the collective coordinates, and the conformational modes. We start with a bottom-up view of protein large-amplitude motions, briefly reviewing how normal mode analysis (NMA) connects atomistic motions to orthogonal modes and distributions of protein conformations. From a top-down perspective, we recall the essence of how principal component analysis (PCA) helps to recover mutually independent modes of complex dynamics from distribution data. It should be emphasized at the outset that this work is not about PCA, nor does it imply that PCA is the optimal analysis tool. Instead, NMA (bottom-up view) and PCA (top-down view) are summarized here only to illustrate the connection between atomistic movements and conformational modes; they are used together as an example because they have been widely discussed in the literature and provide a physically intuitive picture. Readers who are familiar with these basic ideas are encouraged to skip to the next section. The applications of NMA and PCA to analyzing simulated protein trajectories have been reported in pioneering works starting from the early 1980s at selected coarse-grained levels.5,14−17 Methodological developments have been seen in both: examples include elastic network modeling, which simplifies the potential energy term in NMA using a set of connected uniform springs,18,19 and kernel PCA, which generalizes PCA to nonlinear cases.20 Although the connection between these two complementary approaches has been nicely discussed in the literature,21,22 to the best of our knowledge, no explicit statement has been made before this work regarding the distribution of protein conformations and its connection to conformational modes.

Consider a protein whose structure is defined by N unit-mass atoms, r_i, i = 1, ..., N, where r_i represents the Cartesian coordinate vector of the ith atom. The atoms move randomly under thermal agitation but remain constrained by the chemical bonds that hold them together. Combined cooperatively, these fluctuations may give rise to large-scale structural changes in the protein. Very generally, the motions can be expressed in terms of collective coordinates, {q}, which are linear combinations of the atom coordinates, q_j = Σ_{i=1}^{N} C_ji r_i, where C_ji is the coefficient matrix element for atom i in collective coordinate j, with j = 1, ..., (3N − 6) because we are concerned with only internal coordinates, omitting translation and overall rotation of the protein. It is possible to find coordinate transformations such that the resulting collective modes are orthonormal to one another, that is, q_j^T q_k = δ_jk. With an orthonormal collective coordinate basis set, the inherent or "natural" conformational variation of a protein, z, at thermal equilibrium can be described by a linear combination of the orthonormal basis set

z = Σ_{j=1}^{3N−6} a_j q_j   (1)

where a_j assigns the coefficient (or projection) to the jth mode of the protein's movement.

From Atomic Geometry to Conformational Modes: NMA. NMA defines a set of orthogonal collective coordinates, {q^NMA}. It requires explicit knowledge of the atomistic positions, {r}, of the equilibrium geometry and of the interaction forces.23 The Hessian matrix of the potential energy from an energy-minimized structure at conformational equilibrium is then diagonalized to yield a set of motion vectors as well as the corresponding vibrational frequencies. In other words, the normal modes are determined by numerically solving the eigenvalue equation

H Q^NMA = Q^NMA Λ   (2)

where Q^NMA is the matrix in which each column represents one orthogonal normal mode and Λ is the eigenvalue matrix that corresponds to the normal modes. H is the Hessian matrix, the elements of which are defined as [h_ij]_xy = ∂²E/(∂x_i ∂y_j). For simple model systems with a small number of atoms, the numerically equivalent GF method by Wilson24 may be advantageous because it provides a clear mathematical connection between the normal-mode coordinates and the internal coordinates via similarity transformations. The potential energy can readily be written in a second-order form similar to that of a harmonic oscillator21,22

E = (1/2) ΔR^T H ΔR   (3)

where ΔR ≡ R − ⟨R⟩ denotes a vector storing the three-coordinate displacements of the atoms from their minimum-energy positions, ⟨R⟩, with R representing the


structure of the protein. Importantly, the position variations in eq 3 naturally lead to a distribution of the coordinate displacements along a given normal mode.

From Distribution Data to Conformational Motions: PCA. PCA extracts a set of collective coordinates, {q^PCA}, that maximize the variance in the collective coordinates.25 Compared to NMA, PCA extracts motions from a collection of structures by replacing the Hessian matrix with a covariance matrix as the second moment of the coordinate distributions,26 and each principal axis is associated with an eigenvalue that denotes its relative weight in the motion decomposition. In contrast to NMA, PCA does not require a priori knowledge of the microscopic force constants;25 therefore, in principle, the distributions could also come from experiments. Let us now consider a collection of L protein structures sampled either from molecular dynamics (MD) simulations or from some sampling algorithm.22 The covariance matrix from these sampled structures is

M̅ ≡ (1/L) ΔS ΔS^T   (4)

that is, the sample average of ΔR ΔR^T, where ΔS denotes the N-by-L displacement matrix with elements

Δs_it = r_it − r̄_i

denoting the displacement of the ith atom in the tth sampled structure from its sample mean, r̄_i, with i = 1, ..., N and t = 1, ..., L. ΔR̅ denotes the sample estimate for the displacement vector (cf. eq 3). The overbar denotes sample averaging, (···)̅ ≡ (1/L) Σ_{t=1}^{L} (···). PCA finds orthogonal displacement vectors (modes) that recapitulate the structural variations in the sample. It is again an eigenvalue problem, for which the eigenequation of M̅ is

M̅ Q^PCA = Q^PCA Ξ   (5)

where Ξ is the eigenvalue matrix. Clearly, the eigenvector matrix from PCA, Q^PCA, also forms an orthogonal vector space. More importantly, whereas the term {Δr_it} carries the time-dependent information, the covariance matrix (M̅) is independent of time. Therefore, the principal-component conformational modes, although a consequence of dynamic motions, are contained in the static distributions from the input data. In addition, as the empirical covariance matrix, M̅, is constructed from sampled structures, it inevitably contains noise. Consequently, for numerical stability, eq 5 is usually solved numerically using singular value decomposition,22 formally written as

ΔR̅ = Q^PCA Ξ^{1/2} U^T   (6)

where U^T = Ξ^{−1/2} (Q^PCA)^T ΔR̅. The projection of the resampled structures onto a PCA axis is done by

ΔW = (Q^PCA)^T ΔR̅   (7)

where the matrix ΔW defines the range of the projected motions on a certain axis. The variance along each projected axis (Δw_j) reflects the ranking of the corresponding eigenvalue, ξ_j. The finite projected principal-component variances (eq 7) indicate that the protein will in general exhibit a distribution along some experimentally accessible coordinate, similar to eq 3; even if the experimental coordinate does not coincide with any of the principal-component or normal coordinates, it can always be reconstructed by a linear combination of an orthogonal basis set (cf. eq 1). Thus, PCA allows one to find the conformational modes from structural distributions, be they from computational modeling or from experiments. Naturally, it is implied that the resolution of the PCA modes will depend on the quality of the structure collection.

Asymptotic Relationship between NMA and PCA Modes. If the form of the potential of mean force (PMF) around an energy-minimum protein structure is quadratic with respect to structural variations (a Gaussian conformational distribution), there exists a very simple relationship between the results from these two methods. From eq 4, requiring that the harmonic oscillators in eq 3 obey the equipartition principle between the average potential and kinetic energies, each of the form (1/2) k_B T, we have

M = k_B T H^{−1}   (8)

where M denotes the ensemble average of the sample covariance matrix. In arriving at eq 8, we have used the fact that the sample average converges to the ensemble average, ⟨ΔR ΔR^T⟩ = lim_{L→∞} M̅. This expression has been documented earlier5,25,27 and can be derived alternatively by calculating the Boltzmann average of the covariance matrix.21 Equation 8 indicates that the same set of collective coordinates can satisfy both the NMA and PCA eigenequations. Accordingly, the eigenvalues from NMA (Λ) and PCA (Ξ) have the following reciprocal relationship

Ξ = k_B T Λ^{−1}   (9)

The physical meaning of NMA's eigenvalue λ_i is the vibrational frequency (ω_i = √λ_i) of the ith mode, whereas PCA's eigenvalue represents the mean-square fluctuation along the corresponding axis.25 That a harmonic potential is typical for a PMF can be understood from the central-limit-theorem perspective: Over a time scale much longer than that of the fluctuations, the coordinates themselves are Gaussian distributed,28 which naturally gives rise to a quadratic PMF after the logarithmic operation.29 In the remainder of this article, it is understood that the PCA approach will continue to be used to analyze the overall conformational distribution, but only to the extent of illustrating an application of extracting the conformational modes from the overall structural distributions, as inferred from low-dimensional experimental data.

Single-Molecule FRET as an Experimental Approach to Conformational Distribution. The above discussion connecting the atomistic variations to the distribution of a collective coordinate points to the idea of using the conformational distribution as an experimentally accessible variable for determining the conformational modes of a protein. At present, the single-molecule approach is the only way to experimentally determine the entire distribution of a protein's conformation beyond the first (mean) and second (variance) moments. This is important because a protein is likely to exist in multiple interconverting conformational states; normally, it would not be possible to characterize the conformational-state distribution using an ensemble-averaged method when the number of states is larger than two. Ideally, one would like to measure the conformational distribution without having to

assume any model in terms of the form of the distribution or the kinetics of conformational change.30,31 Many such methods have been developed and applied to experimental data. For example, statistical methods that afford a quantitative measurement of single-molecule distributions include the maximum-information method,32,33 photon statistics theory,34−36 and, most recently, the maximum-trajectory-entropy method.37−40 An smFRET experiment measures the distance as a function of time on a single molecule along a coordinate defined by the donor and acceptor transition dipoles.41 The dynamics of the protein are projected onto that particular coordinate, and the conformational distribution along the FRET coordinate is then constructed. Conceptually, this can be understood as follows: Let x_k be the vector connecting the donor and acceptor transition dipoles, located at r_k,D and r_k,A, respectively (here x_k is not to be confused with the x axis of a Cartesian coordinate). In terms of the Cartesian coordinates, x_k = r_k,A − r_k,D. The FRET coordinate can be expressed in terms of the collective coordinates. In the context of the present discussion, the conformational modes projected onto the FRET coordinate can be written as

x_k = Σ_{j=1}^{3N−6} b_kj q_j   (10)

where {q_j} is an orthonormal basis set that defines the protein's conformational modes from, for example, either NMA or PCA. Equation 10 shows that information about all of the modes is contained in an smFRET measurement. Therefore, as eq 10 is a linear equation, the conformational modes could in principle be completely determined if (3N − 6) independent smFRET measurements could be made. Some simple examples for which the internal modes can be completely determined are given below.

Examples. The first example is a diatomic model, a trivial case in which the vibrational mode is confined to one dimension (1D). This case is used to illustrate some basic ideas. There is only (3N − 5) = 1 internal degree of freedom because the molecule is linear. We used MD simulations to simulate the "true" dynamics of the molecule (see SI for simulation details). We then measured the distance between the two atoms (the FRET coordinate) as a function of time and built up the distance distribution over a period of observation (Figure S1A). Application of PCA yielded the expected stretching along the interatomic axis, shown in Figure S1D, which is identical to the vibrational mode built into the simulation (Figure S1C). On the other hand, if the molecular structure does not change as a function of time, there is no conformational distribution along the smFRET coordinate (Figure S1B). In other words, a conformational distribution results from time-dependent molecular structural changes.

The second example is a triatomic model with three internal degrees of freedom. We measured the distributions of all three interatom distances as in the previous example (Figure 1A). From the three measured distance distributions, we sampled each interatom distance given the observed probability density and triangulated the relative positions of all atoms (see SI for details). As a result, the equilibrium molecular structure was determined from the mean of the distributions. Using PCA, we recovered the vibrational modes (slate arrows in Figure 1B−D), which are seen to nicely recapitulate the motions of this model. Similar motions are seen via NMA (data not shown). These two simple and visually intuitive examples reinforce the idea that a full set of (3N − 6) independent measurements of intramolecular distance distributions allows the complete recovery of the equilibrium structure as well as the conformational modes, even when both are unknown.

Figure 1. Triatomic example. (A) Distance distributions from simulation (red) and those resampled on the basis of the simulated distribution (vertical bars). (B−D) Modes of motion ranked by decreasing eigenweights in PCA (slate) compared with those from simulation (purple). PCA of the resampled distribution is seen to capture the essence of the modes. Ball-and-stick images were created with PyMOL (1.7.0.5) throughout this work.

In general, however, it is highly unlikely that one is able to measure all (3N − 6) distinct distance distributions of a complex protein molecule using smFRET. Moreover, because of the mutations necessary to place the FRET probes, or because of the dye labeling, the protein may lose its function or simply not fold at all.8,42 Therefore, one is inevitably faced with incomplete information from smFRET measurements.

Separation of Time Scales, Rigid-Body Assumption, and Integration with Computational Modeling. Protein structures are usually known from X-ray crystallography as static snapshots or from nuclear magnetic resonance (NMR) as NMR-restraint-allowed models. What are generally unknown are the range of the large-amplitude conformational motions in solution, the number of conformational states in these motions, and the interconversion rates between them. For smFRET, one must simultaneously consider the time resolution and the distance precision because the technique averages the observable over time.43 Typically, motions slower than 1 ms can be followed quantitatively with 5 Å spatial precision.32 Structural changes that occur on a time scale faster than 1 ms will be averaged out; for example, side-chain fluctuations, torsional librations, and short-loop motions all occur on the nanosecond time scale. On the time scale of smFRET experiments, fast-moving structural components will appear as if they are static,44,45 akin to the Kubo−Anderson motional-narrowing concept in NMR.46,47 Knowledge of the protein structure and a general understanding of the time scales of fast-moving local structural fluctuations, both of which can be characterized using ensemble-averaged methods,48,49 allow one to make reasonable assumptions and treat domains and portions of a protein as rigid bodies on the single-molecule time scale. The rigid-body assumption significantly simplifies the consideration of conformational modes. Ideally, one would perform additional control experiments to validate the rigid-body assumption. However, even with the rigid-body assumption, the limited number of biochemically allowable FRET-pair labeling points may still be insufficient to recover the conformational modes. Computational modeling could in principle provide additional structural insight to make up for the missing information. For instance, the integration of computational modeling and smFRET has allowed detailed modeling of multiple conformations in short polyprolines as well as of the size-dependent persistence length.50 Combining computational modeling with multiple quantitative smFRET distance measurements51−53 and ensemble-averaged experimental techniques54−57 has generated a handful of modeled structures. We next formulate the problem of experimentally educing conformational modes from a Bayesian perspective13 and outline how the data-augmentation scheme in applied statistics58 allows us to formally integrate different pieces of information for extracting conformational modes. We note that the Bayesian approach has been utilized in smFRET data analysis, including parameter inference,59,60 model selection,61 state determination,62,63 and model-parameter posterior distribution inference.64

Missing Information, Data Augmentation, and Bayesian Inference Framework. Let R denote the structure of a protein. Then, because of thermal fluctuations, R will trace out a three-dimensional (3D) probability distribution in space. Let p(R) denote the probability density function for such a distribution. The goal is then to infer p(R) from smFRET distance distributions given the known structures of the protein (e.g., from NMR, X-ray crystallography, or homology modeling), the time scale of structural motifs (e.g., the rigid-body assumption), and possibly other additional information based on biochemical and biophysical knowledge. That is, we seek to compute p(R|{x_k}; I), where p(·|·) follows the convention of conditional-probability notation and the I after the semicolon denotes collectively the rigid-body assumptions and other structural knowledge presumed in the computation. It is understood that these pieces of given information are treated as "known facts" and will be dropped for brevity in the subsequent discussion: p(R|{x_k}) ≡ p(R|{x_k}; I).
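To make the resampling that this framework builds on concrete, the sketch below draws a set of scalar distances from a histogrammed distance distribution by inverse-CDF sampling, one standard way to realize x_k ∼ p(x_k). The Gaussian-shaped histogram and all variable names are illustrative stand-ins, not data or code from this work.

```python
import numpy as np

# Illustrative sketch (not from the paper's SI): resample scalar distances
# {x_k} from a measured, histogrammed smFRET distance distribution p_k(x)
# via inverse-CDF sampling. The "measured" histogram here is synthetic.
rng = np.random.default_rng(0)

# Synthetic stand-in for a measured distance distribution p_k(x):
edges = np.linspace(2.0, 8.0, 61)                  # distance bins (arbitrary units)
centers = 0.5 * (edges[:-1] + edges[1:])
p_k = np.exp(-0.5 * ((centers - 5.0) / 0.6) ** 2)  # Gaussian-shaped histogram
p_k /= p_k.sum()                                   # normalize bin probabilities

# Inverse-CDF resampling of N distances from the histogram:
N = 10_000
cdf = np.cumsum(p_k)
u = rng.random(N)
idx = np.searchsorted(cdf, u)                      # bin index for each draw
x_resampled = centers[idx]

print(abs(x_resampled.mean() - 5.0) < 0.05)        # sample mean tracks p_k(x)
```

Any sampler that reproduces the measured density would do here; inverse-CDF sampling on the histogram is simply the most direct choice when only binned data are available.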
The curly braces {x_k} denote a set of scalar distances (x_1, ..., x_K), with x_k = |x_k| resampled from the experimentally measured smFRET distributions, p(x_k), where k = 1, ..., K, with K being the number of distinct smFRET coordinates (cf. eq 10). Thus, the probability density function that describes the equilibrium structural distribution is

p(R) = ∫ p(R|x_1, ..., x_K) p({x_k}) d{x_k}   (11)

where p({x_k}) is the probability density of the distance sets and d{x_k} ≡ dx_1 ··· dx_K is a shorthand notation. Possible correlations between the independently and separately measured smFRET coordinates are contained in the conditional probability; this point will be elaborated shortly in the implementation of the method. From eq 11, we see that the likelihood of observing a particular structural snapshot R_t at time t is p(R_t). Following the convention in statistics, drawing a sample R_t from the probability density p(R) is denoted R_t ∼ p(R) as a shorthand notation. We note that p(R_t) is the likelihood of obtaining R_t and is a numerical value, whereas p(R) is the probability density function of R and is a distribution in 3D space. To minimize potential confusion, in the remainder of the article we will use dependent variables with and without subscripts in p(·) to denote its numerical value and its entire distribution, respectively. The form of eq 11 suggests that the Bayesian framework is a natural choice for computing the distribution of R conditioned on the experimentally measured distances; namely, we seek the posterior distribution p(R|{x_k}). Unfortunately, in most cases, directly computing p(R|{x_k}), as discussed earlier, is difficult because the experimental data, together with the prior knowledge, I, usually are not sufficient to determine R. On the other hand, R can be readily computed if auxiliary variables {ζ} are introduced, where an auxiliary variable could be the angle of a domain rotation, some reduced coordinate, or an additional unobserved distance, depending on the specifics of the protein system. In other words, (ζ_1, ..., ζ_V, x_1, ..., x_K) forms a complete data set from which it is straightforward to determine the corresponding structure. The auxiliary data, {ζ_v}, are sometimes called "missing data", "unobserved data", or "latent data" because, if they could be acquired experimentally, the posterior density of R could be readily computed. A moment's consideration leads us to realize that the problem at hand is isomorphic to the well-studied missing-data problem in applied statistics. In particular, as we are interested in the posterior distribution of R, the general data-augmentation scheme for solving such problems by Tanner and Wong appears to be directly applicable.65 Below, we follow closely the original reasoning by Tanner and Wong and outline how their data-augmentation algorithm can be applied to the structural distribution problem. In particular, we map the problem of structural distribution with missing information onto the Tanner−Wong data-augmentation framework. In conventional applications of the Bayesian data-augmentation scheme, one is usually interested in parameters that specify a probability density function for the observables with a well-defined mathematical form (e.g., the mean or variance of a normal distribution). Here, we expand the application scope and treat the protein structure R as a parameter of interest in the conventional statistical sense. Knowledge of {R} also allows one to specify the probability distribution of the experimental smFRET distances; however, in the current case, the probability density function will be determined numerically from the inferred geometric structure distribution instead of from a presumed mathematical model. Hence, the treatment described here can be considered a form of model-free inference of the structural distribution because it does not require knowledge of the specific form of the conditional probability.

Iterative Bayesian Inference. The sought-after structural distribution density is

p(R|{x_k}) = ∫_{ζ_v} p(R|{ζ_v, x_k}) p({ζ_v}|{x_k}) d{ζ_v}   (12)

where p(R|{x_k}) is the posterior density of the protein structure given the smFRET data, p(R|{ζ_v, x_k}) is the conditional density given the augmented data (ζ_1, ..., ζ_V, x_1, ..., x_K), and p({ζ_v}|{x_k}) is the predictive density of the missing data, {ζ_v}, given {x_k}. Using Bayes' theorem, the predictive density of ζ is related to the structural distribution density by

p({ζ_v}|{x_k}) = ∫_{R′} p({ζ_v}|R′, {x_k}) p(R′|{x_k}) dR′   (13)

where the integration is carried out over all possible structures (denoted by the intermediate integration variable R′). Replacing the p({ζ_v}|{x_k}) term in eq 12 with eq 13 gives

p(R|{x_k}) = ∫ K(R, R′) p(R′|{x_k}) dR′   (14)

where

K(R, R′) ≡ ∫ p(R|{ζ_v, x_k}) p({ζ_v}|R′, {x_k}) d{ζ_v}   (15)

can be understood as the kernel of a transfer functional, T, defined as

DOI: 10.1021/acs.jpcb.6b07767 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry B T[f (R )] =

∫ K(R , R′)f (R′) dR′

N

(16)

gi + 1(R ) =

Equations 14 and 16 point to an iterative solution for computing g_i(R), which approximates p(R|{x_k}) at the ith iteration:

g_{i+1}(R) = (T[g_i])(R)    (17)

(c) validate g_{i+1}(R) by resampling {p̂_k(x)} from g_{i+1}(R) and comparing with the experimentally measured {p_k(x)}. In the above, step a1 performs the operation in eq 13, and steps a2−a5 perform the operations in eqs 12 and 17. As pointed out by Tanner and Wong, the number of samples M does not have to be the same from iteration to iteration, and different latent variables ζ_v can be imputed sequentially.66 The prior knowledge and assumptions, I, about the protein structure, including computer modeling, are incorporated in step a2 when constructing the protein structure from the augmented data {ζ_v, x_k}. The equilibrium structural distribution in eq 11 is produced by step b, where g_n is the approximation to p(R|{x_k}_n) after the iterations in step a. The energy calculation in step a3 can be carried out by molecular mechanics modeling using popular packages such as CHARMM, GROMACS, or AMBER. One may also allow additional structural relaxation in this step using Langevin dynamics or all-atom MD simulations, at the cost of additional computational resources. Because the relative energies from these calculations are used to build the Boltzmann weights, the force field is expected to influence the overall quality of the final conformational distribution. We suspect that the choice of force field and energy-calculation method will again be system-dependent and may have to be optimized for each system. How these parameters may impact the overall conformational distribution, and hence the conformational modes, in the context of experimental uncertainties is beyond the scope of the current work and will be studied in follow-up work. In steps a5 and b, updating the density of R is accomplished by updating the statistical weight and atom positions for each of the N records that store the protein structure. These records are then used in conjunction with {x_k} to form the predictive density, p({ζ_v}|R, {x_k}), used in step a1, the structural imputation step.

For example, suppose only the position of a particular atom, ζ_1, is needed to furnish a complete data set for each {x_k}_n. Then step a1 can be computed with the following geometry-intersecting procedure:

Tanner and Wong gave an elegant formal proof that the iterative procedure converges to the true density,

lim_{i→∞} g_i(R) = p(R|{x_k})

if the following weak conditions are met: (1) K(R, R′) is uniformly bounded and equicontinuous in R, and (2) it is possible to generate {ζ_v} from p({ζ_v}|R, {x_k}) such that p(R|{ζ_v, x_k}) is nonzero.65 The first condition is expected to hold in the current application because the integrand in eq 15 is a product of two probability density functions defined over the range of protein conformations. The second condition will depend on how the auxiliary variable is defined, because here the conditional densities are computed as geometric projections from the protein structure ensemble rather than from a well-defined mathematical structure. In practice, some quick trial and error should suffice to find suitable auxiliary-variable candidates that meet this condition. Indeed, the Tanner−Wong method is powerful because it guarantees asymptotic convergence to the true posterior distribution so long as the conditional probability densities in the iteration steps can be computed, regardless of how they are calculated. Basic Algorithm for Data-Augmentation Sampling. The integration steps in eqs 12, 13, and 17 are usually difficult to carry out analytically but can be accomplished numerically using the Monte Carlo method.13,65 We modify the Tanner−Wong algorithm and use the statistical weighing idea from the sequential imputation scheme of Kong, Liu, and Wong66 so that it can handle the structural imputation problem at hand. Consider drawing N sets of {x_k} from the smFRET-measured {p_k(x)} to give {x_k}_n, n = 1, ..., N, such that the set {x_kn} = x_k1, x_k2, ..., x_kN represents a fair sampling of the density p_k(x).
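The drawing of the N distance sets can be sketched in a few lines of Python; the grid, the synthetic densities, and the function name below are illustrative choices of ours, not part of the published implementation.

```python
import numpy as np

def draw_distance_sets(x_grid, p_k_list, N, rng):
    """Draw N sets {x_k}_n, one distance per measured density p_k(x).

    x_grid   : 1D grid on which each density is tabulated
    p_k_list : list of K tabulated densities p_k(x) (need not be normalized)
    Returns an (N, K) array; row n is the set {x_k}_n.
    """
    K = len(p_k_list)
    out = np.empty((N, K))
    for k, p in enumerate(p_k_list):
        w = np.asarray(p, dtype=float)
        w /= w.sum()                         # normalize to a discrete pmf on the grid
        idx = rng.choice(len(x_grid), size=N, p=w)
        out[:, k] = x_grid[idx]              # independent draws reproduce each pk(x) marginally
    return out

rng = np.random.default_rng(0)
x = np.linspace(2.0, 8.0, 601)               # distance grid (e.g., in nm)
p1 = np.exp(-0.5 * ((x - 4.0) / 0.3) ** 2)   # two synthetic smFRET distance densities
p2 = np.exp(-0.5 * ((x - 6.0) / 0.4) ** 2)
sets = draw_distance_sets(x, [p1, p2], N=5000, rng=rng)
```

Because the x_k are drawn independently here, any cross-distance correlation is lost at this stage; as noted below, the energy-calculation step helps to restore it.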
Given the current approximation g_{n,i}(R) to p(R|{x_k}_n) at the ith iteration, the general algorithm reads as follows: (a) for each n and each v: (a1) generate M sets of {ζ_v} from the current approximation to p({ζ_v}|R, {x_k}_n) to give {ζ_v}_m, m = 1, ..., M; (a2) for each m, generate the corresponding structure, R_mn; (a3) calculate the energy of this structure, E_mn, and the corresponding Boltzmann factor, w_mn = exp[−E_mn/k_BT]; (a4) assign a numerical value to the conditional probability density

(a1.1) set up a 3D grid within a volume that encompasses all M possible positions of ζ_v from the M records of g_{n,i}; (a1.2) make a 3D density estimate by building up a histogram for the M positions of ζ_v, adding the energy-weighted likelihood inside each grid cell; (a1.3) construct the predictive density numerically by intersecting the 3D density with the geometrical constraint formed by the set of distances {x_k}_n; (a1.4) draw a new ζ_v from the predictive density. In step a1.4, the predictive density can be 1D or multidimensional because the geometrical constraint can be a line, a sphere, or an object with a complicated shape. In cases with more than one missing coordinate, the geometrical intersection steps a1.1−a1.4, however intuitive, may not be straightforward to implement. To deal with more general cases, rather than 3D geometrical intersections, the following structure-sampling procedure constructs the predictive density, p({ζ_v}|R, {x_k}), from the samples of the protein structure ensemble R:

p(R_mn|{ζ_v}_m, {x_k}_n) = w_mn/W_n

with W_n = Σ_{m=1}^{M} w_mn; (a5) update the approximation to p(R|{x_k}_n) by

g_{n,i+1}(R) = Σ_{m=1}^{M} p(R_mn|{ζ_v}_m, {x_k}_n)

(b) compute the approximation to p(R) by

g_{i+1}(R) = (1/N) Σ_{n=1}^{N} g_{n,i+1}(R)
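Numerically, steps a3−a5 and b reduce to building normalized Boltzmann weight tables and averaging them over the N data sets. A minimal sketch with synthetic energies (the function names, k_BT value, and energies are ours, not from the published code):

```python
import numpy as np

kB_T = 0.593  # kcal/mol at ~298 K (assumed unit convention)

def boltzmann_weights(E_mn):
    """Steps a3-a4: convert an (N, M) energy table into the conditional
    probabilities p(R_mn | {zeta_v}_m, {x_k}_n) = w_mn / W_n."""
    E = np.asarray(E_mn, dtype=float)
    # subtract the per-row minimum for numerical stability; the shift
    # cancels in the normalization below
    w = np.exp(-(E - E.min(axis=1, keepdims=True)) / kB_T)
    return w / w.sum(axis=1, keepdims=True)

def update_density(weights):
    """Steps a5 and b: form the per-n mixtures g_{n,i+1} and average them,
    g_{i+1} = (1/N) * sum_n g_{n,i+1}. With equal record counts this is
    just the column-wise mean of the weight table."""
    return weights.mean(axis=0)

# two drawn data sets (N = 2), three imputed structures each (M = 3)
E = np.array([[0.0, 1.0, 8.0],
              [0.5, 0.5, 9.0]])
p = boltzmann_weights(E)
g = update_density(p)
```

Note how the high-energy third structure in each row receives a vanishing weight: this is the mechanism by which unphysical imputations are discriminated against.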

DOI: 10.1021/acs.jpcb.6b07767 J. Phys. Chem. B XXXX, XXX, XXX−XXX

Article

(a1b.1) collect a subset of structures {R_0} ∈ {R} in which the smFRET coordinates {x_k,0} are close to the drawn set {x_k}_n to within ϵ_x, i.e., ∀k, |x_k,0 − x_k,n| ≤ ϵ_x; (a1b.2) construct the predictive density, p({ζ_v}|R, {x_k}), numerically by high-dimensional interpolation of p({R_0}) at fixed {x_k} while systematically varying {ζ_v}; (a1b.3) draw a new ζ_v from the predictive density.
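A sketch of the neighborhood selection in step a1b.1 and the draw in step a1b.3 follows; for brevity, the high-dimensional interpolation of step a1b.2 is replaced here by a simple weighted resampling from the retained subset, a crude stand-in that is only serviceable when the subset is dense. All data and names are synthetic.

```python
import numpy as np

def draw_zeta_from_ensemble(x_ens, zeta_ens, w_ens, x_target, eps_x, rng):
    """a1b.1: keep structures whose smFRET coordinates are all within eps_x
    of the drawn set (|x_k,0 - x_k,n| <= eps_x for every k);
    a1b.3: draw zeta from the retained, weight-normalized subset.

    x_ens           : (S, K) smFRET coordinates of S stored structures
    zeta_ens, w_ens : (S,) latent coordinate and statistical weight per record
    """
    keep = np.all(np.abs(x_ens - x_target) <= eps_x, axis=1)
    if not keep.any():
        raise ValueError("empty neighborhood; widen eps_x")
    z, w = zeta_ens[keep], w_ens[keep]
    return rng.choice(z, p=w / w.sum())

rng = np.random.default_rng(1)
S = 20000
zeta = rng.uniform(0.0, 180.0, S)            # synthetic ensemble of latent angles
x_ens = np.column_stack([3.0 + 0.01 * zeta + rng.normal(0, 0.05, S),
                         7.0 - 0.01 * zeta + rng.normal(0, 0.05, S)])
w = np.ones(S)
draws = np.array([draw_zeta_from_ensemble(x_ens, zeta, w,
                                          np.array([3.9, 6.1]), 0.1, rng)
                  for _ in range(200)])
```

In this toy setup the target distances correspond to ζ ≈ 90°, so the drawn ζ values cluster there.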

The exact form of the trial distribution does not play a significant role in the overall computation because the weight of the protein structure constructed from the augmented data, ({ζ_v}_n^(0), {x_k}_n), is based on the energy of the resulting structure and Boltzmann weighting. At the same time, distance constraints are drawn from p_k(x) to allow computation of the initial guess of p({ζ_v}|R, {x_k}_n). In addition to objectively removing unphysical structures, the energy-calculation step also helps to restore correlation among the otherwise independently drawn {x_k} from the experimental smFRET distributions, p_k(x). Convergence and Validation. In validation step c, the expected experimental distributions, {p̂_k(x)}, are generated from the approximate probability density, g(R). To quantify the difference between {p̂_k(x)} and the experimentally measured densities, {p_k(x)}, the Hellinger distance, D(p_k, p̂_k), provides a suitable metric.67 Its square has the following form

The reduction of the structure ensemble space in step a1b.1 is necessary because the high-dimensional interpolation in step a1b.2 quickly becomes computationally intensive as more structures are used for interpolation. ϵ_x can be determined empirically with a few trial runs. In addition, to enhance numerical stability during multidimensional interpolation, appropriate boundary conditions for the high-dimensional likelihood function are necessary. Specifically, this is achieved by setting p({x_k^bound}_n, {ζ_v^bound}_n) = 0, where the superscript "bound" denotes boundaries within which all practically possible values of that variable are enclosed. It should be noted, however, that although the idea of setting the boundary likelihood to zero is general, the implementation is likely to be case-dependent. A set of empirical guidelines has been laid out in the Supporting Information (see eq S3 and the related SI text for details). Protein Structure as a Parameter and an Independent Variable. The algorithm treats the protein structure R either as the parameter to be estimated in statistical inference or as an independent variable for calculating the likelihood. This is distinctly different from conventional statistical inference, in which a mathematical expression maps the numerical value of an independent variable (or a parameter) to a likelihood weight. Here, the mapping between a protein structure and its corresponding likelihood weight is achieved by searching through the records of a data structure containing, for example, the atom positions that define the protein structure, the likelihood weight, the energy, and so forth. The storage requirements are estimated to be ∼1.6 and ∼6.1 kB/kDa when stored in binary or in protein data bank (pdb) formats, respectively (Figure S3).
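One possible layout for such a record is sketched below; the field names and the extra bookkeeping scalars are our illustrative choices, not the paper's actual binary format.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class StructureRecord:
    """One entry of the ensemble that numerically represents g_i(R):
    atom positions plus the bookkeeping used in steps a3-a5."""
    positions: np.ndarray          # (n_atoms, 3) Cartesian coordinates, float64
    energy: float = 0.0            # E_mn from the force-field evaluation (step a3)
    weight: float = 1.0            # statistical / Boltzmann weight (steps a4-a5)
    dye_centers: np.ndarray = field(default_factory=lambda: np.zeros((2, 3)))
                                   # donor/acceptor transition-dipole centers (smFRET)

    def nbytes(self) -> int:
        # raw binary footprint: coordinate payload plus the two scalars
        return self.positions.nbytes + self.dye_centers.nbytes + 16

rec = StructureRecord(positions=np.zeros((100, 3)))
```

Storing coordinates in double precision, a 100-atom record occupies about 2.5 kB in this layout, of the same order as the binary estimate quoted above.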
For smFRET applications, the position records also include the coordinates of the most probable centers of the transition dipoles of the donor and acceptor dyes, recalculated for each structure, because the experimentally measured distances are defined between these two centers. As including the FRET-dye transition-dipole centers is straightforward to implement, this additional step was omitted when describing the algorithm earlier. Initialization. The structural imputation algorithm sketched above must be initialized. First, a reference structure with atomic coordinates is needed. The reference structure can come from X-ray crystallography, NMR, homology modeling, or all-atom MD simulations, to name a few sources. Second, in the initialization round (i = 0), the approximation g_{n,0}(R) does not yet exist; as such, p({ζ_v}|R, {x_k}_n) has to be constructed using a trial distribution. Here, we choose a uniform trial distribution for drawing {ζ_v}: {ζ_v}_n^(0) ∼ U({ζ_v}_min, {ζ_v}_max), where {ζ_v}_min and {ζ_v}_max are the bounds for {ζ_v}. Moreover, when drawing from the trial distribution (step a1b.2), one may wish to use a denser sampling of p_k(x) (greater N) to ensure a sufficient number of structures in the "pruned" subset to permit the high-dimensional interpolation (see SI). To speed up the computation after initialization, one may reduce N or M adaptively while monitoring the convergence.
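The i = 0 round is then a one-liner; the bounds and sizes below are placeholders for whatever a given system dictates.

```python
import numpy as np

def initialize_latent(zeta_min, zeta_max, N, V, rng):
    """Round i = 0: no g_{n,0}(R) exists yet, so each of the V latent
    coordinates zeta_v is drawn from the uniform trial distribution
    U(zeta_min, zeta_max) for every one of the N data sets."""
    return rng.uniform(zeta_min, zeta_max, size=(N, V))

rng = np.random.default_rng(2)
zeta0 = initialize_latent(0.0, 180.0, N=1000, V=2, rng=rng)
```

Because the subsequent Boltzmann weighting dominates, the flat trial distribution is quickly reshaped in the first few iterations.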

D²(p_k, p̂_k) ≡ (1/2) ∫ [p_k^{1/2}(x) − p̂_k^{1/2}(x)]² dx = 1 − ∫ [p_k(x) p̂_k(x)]^{1/2} dx    (18)

As shown in eq 18, the Hellinger distance is symmetric with respect to swapping its two arguments, D(p_k, p̂_k) = D(p̂_k, p_k). It satisfies 0 ≤ D(p_k, p̂_k) ≤ 1, where D(p_k, p̂_k) = 0 when the two densities are identical and D(p_k, p̂_k) = 1 when the two densities have no overlap. Here, the Hellinger distance is used to quantify how the expected densities, {p̂_k(x)}, generated from the approximation g_{i+1}(R), successively approach the experimentally observed densities, {p_k(x)}, iteration by iteration in step c of the basic algorithm outlined in this section. Ideally, one would have D(p_k, p̂_k) → 0 at convergence; however, because p̂_k is statistically inferred, there are bound to be statistical variations in p̂_k, and hence one would not expect D(p_k, p̂_k) → 0 in practice. The convergence may also be extremely slow, especially because the Tanner−Wong algorithm only guarantees convergence to the true distribution and says nothing about the convergence rate. As shall be seen in the Results and Discussion section, the Hellinger distance quickly approaches a leveling value and fluctuates around it. This observation suggests a practical convergence criterion, |δD(p_k, p̂_k)| < ϵ_H, applied after D(p_k, p̂_k) has visually reached a leveling value, where the convergence radius ϵ_H is to be determined empirically.
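Equation 18 and the practical convergence criterion can be sketched numerically as follows; the grid, test densities, window length, and function names are our illustrative choices.

```python
import numpy as np

def hellinger(p, q, dx):
    """Hellinger distance D between two densities tabulated on a uniform
    grid of spacing dx (eq 18); 0 <= D <= 1."""
    p = np.asarray(p, float) / (np.sum(p) * dx)   # normalize on the grid
    q = np.asarray(q, float) / (np.sum(q) * dx)
    d2 = 1.0 - np.sum(np.sqrt(p * q)) * dx
    return np.sqrt(max(d2, 0.0))                  # guard tiny negative round-off

def converged(D_history, eps_H, window=5):
    """Practical criterion: |delta D| < eps_H over the last few iterations,
    applied after D has visually leveled off."""
    tail = D_history[-window:]
    return len(tail) == window and max(tail) - min(tail) < eps_H

x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x ** 2)                  # a density and a shifted copy
q = np.exp(-0.5 * (x - 1.0) ** 2)
D_same = hellinger(p, p, dx)               # identical densities: D = 0
D_shift = hellinger(p, q, dx)              # partial overlap: 0 < D < 1
```

For two unit-variance Gaussians one mean apart, D = [1 − exp(−1/8)]^{1/2} ≈ 0.343, which the discrete sum reproduces.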



RESULTS AND DISCUSSION In this section, two toy models are used to illustrate the proposed conformational mode extraction method. Example One is a six-atom skeleton model consisting of two domains, where each domain is treated as a rigid body on the smFRET time scale. This example illustrates the main ideas of this contribution, namely, the smFRET distance distribution as the experimental observable for the extraction of conformational modes, the identification of the missing coordinate, and the implementation of Bayesian data augmentation using both structural imputation procedures a1.1−a1.4 and a1b.1−a1b.3. This simple example is also used to characterize the implementation of the algorithm. The second, Example Two, extends the first example by adding structural features to both domains as well as a flexible linker that connects them (see SI for simulation details). This example has two missing coordinates and is used to illustrate how to deal with more than one missing coordinate.


All of the algorithms presented in this work for Example One and Example Two were implemented under Linux operating systems using scripts written in bash/python to call programs written in MATLAB for structural imputation and probability density estimation and CHARMM68 for MD simulation and energy evaluation (see SI for programming details). In the discussions below, the smFRET observables simulated using MD outputs are designated "experimental", whereas structures and dynamics from the MD data are considered physically true when compared with those obtained from Bayesian inference.

Example One. As shown in Figure 2A, this skeleton model is an abstraction of a protein with two rigid domains, the top and the bottom domains. The two-domain architecture is a fair representation of a protein conformation class because many proteins are encoded with a two-domain architecture, such as insulin-degrading enzyme, cellulase, luciferase, acetyl-CoA synthetase, and maltose-binding protein, to name a few. This model is described in detail below because it is hoped that it demonstrates how a particular problem may be analyzed to identify the latent data and to devise a specific data-augmentation scheme to infer the overall structural distribution. Again, note that although the basic algorithm is general, its implementation for a particular problem can be case-dependent. In this toy model, the four atoms in the bottom domain are considered to lie in the same plane with all interatom distances fixed, and the distance between A5 and A6 in the top domain is also fixed. Although the intradomain distances are held fixed (the rigid-body assumption; see SI for simulation details), the interdomain distance between the top and the bottom domains and their relative orientation are allowed to vary. The goal is to recover the interdomain conformational modes given the four smFRET distance distributions. All six atoms, A1, ..., A6, are considered FRET-dye labeling points, with two on the top and the rest on the bottom, as shown in Figure 2A. Suppose four sets of interatom distance distributions (p_k(x), k = 1, ..., 4) are measured by smFRET. Let (x1, x2, x3, x4) be a particular set of distances drawn from the measured distributions, with x1 = |A1A5| and so forth, as indicated in Figure 2A. Even with the structural knowledge (the top domain must be on the top side), the rigid-body assumption (7 degrees of freedom in the bottom plane and one fixed bond length from the top domain), and four independently measured smFRET distances, the information is still insufficient to define a structure.

That is, one additional coordinate is needed to make up the total number of degrees of freedom (3 × 6 − 6 = 12); this missing coordinate is the missing, or latent, data in Bayesian missing-data problems. Next, the model is analyzed in more detail to identify the missing coordinate. Without loss of generality, the positions of all four bottom-domain atoms (r_A1, r_A2, r_A3, r_A4) are fixed because only the relative motion between the two domains is of interest. If the angle ζ between the A1A5A2 plane and the bottom-domain plane A1A2A3A4 is known, then the position of A5 can be uniquely determined (eq S2). With the rigid-body assumption for the top domain (fixed |A5A6| length), the unknown position of A6 traces out a spherical surface of radius |A5A6| centered at r_A5 (gray rectangular patches in Figure 2A). On the other hand, the x3 and x4 constraints require that r_A6 lie on a circle revolving around the A3A4 axis, as indicated by the dashed-dot magenta partial circle in Figure 2A. Thus, r_A6 is located at one of the two intersections of the circle and the spherical surface, A6 and A′6; the two candidates have drastically different energies, however, and the unphysical one is discriminated against at the Boltzmann weighting step. This analysis suggests that ζ is a viable latent datum for the missing coordinate, which in turn allows one to write the pseudocode for the structural imputation algorithm of this particular example, listed in Algorithm 1. In this example, both procedures a1.1−a1.4 and a1b.1−a1b.3 are implemented to further clarify the idea of structural imputation and its implementation.

Figure 2. Molecular structure of Example One and the concept of structural imputation. (A) The equilibrium geometry of the six atoms in Example One. Atoms in the top and the bottom domains are colored red and blue, respectively. Rigid distances are in green, and measured distances are black dashed lines. The dihedral angle, ζ, between planes A1A2A3A4 and A1A2A5 is the missing coordinate required to fully determine all relative coordinates (see the text for more details). A0A5 is perpendicular to A1A2, and A0 lies in both planes A1A2A3A4 and A1A2A5. A6 and A6′ are the two possible locations of the sixth atom with fixed x3 and x4 distances, but the latter has a much higher energy than the former. (B) The isosurface of the 3D likelihood for the position of r_A5 based on the current iteration, g_i(R) (see the text for more details). The coordinate of A5 calculated using x1 and x2 from a given {x_k}_n, as a continuous function of ζ, traces out a green circle intersecting the likelihood cloud (eq S2), which in turn allows the drawing step, ζ_m ∼ p(ζ|{x_k}_n). The 3D r_A5 likelihood cloud is colored by normalized likelihood (see legend for the likelihood weights). (C) A representative example of the predictive density (open circles) as a function of ζ, obtained from the geometrical intersection of the green line with the A5 probability density cloud. The dotted line is a spline interpolation and serves as a guide to the eye. (D) Cumulative distribution function (cdf) of the data shown in (C) obtained by numerical integration (dashed-dot line), from which M imputations of ζ are drawn. ζ = 90° corresponds to the equilibrium position.
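The dependence of the A5 position on the dihedral angle ζ can be sketched as follows; the coordinate frame below (A1A2 along x, bottom plane = xy) is an illustrative stand-in for the actual construction in eq S2.

```python
import numpy as np

def a5_position(zeta_deg, r_A0, u_axis, v_inplane, d_A0A5):
    """Position of A5 for dihedral angle zeta between the plane A1A2A5 and
    the bottom plane: rotate the in-plane unit direction v about the A1A2
    axis u by zeta, keeping |A0A5| = d_A0A5 fixed."""
    z = np.radians(zeta_deg)
    w = np.cross(u_axis, v_inplane)                  # out-of-plane unit direction
    direction = np.cos(z) * v_inplane + np.sin(z) * w
    return r_A0 + d_A0A5 * direction

# toy frame: A1A2 along x, bottom plane = xy-plane, A0 at the origin
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
r90 = a5_position(90.0, np.zeros(3), u, v, d_A0A5=2.0)  # equilibrium: straight up
```

At ζ = 90° the A1A2A5 plane is perpendicular to the bottom plane, so A5 sits directly above A0, consistent with the equilibrium geometry noted in the caption of Figure 2D.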


For the geometry-intersecting procedure a1.1−a1.4, the 3D probability density cloud of r_A5, computed using the equilibrium distances x1 and x2 after completing step a1.2, is visualized in Figure 2B. The green line threading through the probability density cloud depicts the geometrical intersection step a1.3. An example of the probability density as a function of the missing coordinate, ζ, is displayed in Figure 2C. Finally, a new ζ can be drawn from its cdf (Figure 2D) using a standard random-number drawing procedure69 (see SI for details). Figure 3A−C shows that the 3D probability density cloud quickly moves from the initial guess to the correct location after the second iteration (Movie S1). Subsequent iterations only incurred minor changes to the less probable regions (
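The cdf-based draw of Figure 2D is the standard inverse-transform method; a sketch on a synthetic predictive density (grid, density, and names are ours):

```python
import numpy as np

def draw_from_tabulated(zeta_grid, density, n_draws, rng):
    """Inverse-transform sampling from a tabulated predictive density:
    build its cdf by numerical integration, then map uniform random
    numbers through the inverse cdf (cf. Figure 2C,D)."""
    p = np.asarray(density, float)
    cdf = np.cumsum(p)
    cdf /= cdf[-1]
    u = rng.random(n_draws)
    idx = np.searchsorted(cdf, u)        # first grid point with cdf >= u
    return zeta_grid[idx]

rng = np.random.default_rng(3)
zeta = np.linspace(0.0, 180.0, 721)
dens = np.exp(-0.5 * ((zeta - 90.0) / 10.0) ** 2)  # synthetic predictive density
draws = draw_from_tabulated(zeta, dens, 20000, rng)
```

The drawn samples recover the mean and width of the tabulated density, which is all the imputation step requires.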