Subscriber access provided by TEXAS A&M INTL UNIV
Article
A Maximum-Likelihood Approach to Force-Field Calibration Bart#omiej Zaborowski, Dawid Jagie#a, Cezary Czaplewski, Anna Ha#abis, Agnieszka Lewandowska, Wioletta #mudzi#ska, Stanislaw Oldziej, Agnieszka Karczy#ska, Christian Omieczynski, Tomasz Wirecki, and Adam Liwo J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.5b00395 • Publication Date (Web): 11 Aug 2015 Downloaded from http://pubs.acs.org on August 17, 2015
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 70
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
1
REVISED
A Maximum-Likelihood Approach to Force-Field Calibration
BartÃlomiej Zaborowski,1 Dawid JagieÃla,1 Cezary Czaplewski,1 Anna HaÃlabis,2 Agnieszka ˙ Lewandowska,2 Wioletta Zmudzi´ nska,2 StanisÃlaw OÃldziej2 , Agnieszka Karczy´ nska,1 Christian Omieczynski,1 Tomasz Wirecki,1 and Adam Liwo1,3∗
1 2
Faculty of Chemistry, University of Gda´ nsk, ul. Wita Stwosza 63, 80-308 Gda´ nsk, Poland,
Laboratory of Biopolymer Structure, Intercollegiate Faculty of Biotechnology, University of Gda´ nsk and Medical University of Gda´ nsk, KÃladki 24, 80-922 Gda´ nsk, Poland
3
Center for In Silico Protein Structure and School of Computational Sciences, Korea Institute for Advanced Study, 87 Hoegiro, Dongdaemun-gu, 130-722 Seoul, Republic of Korea
The authors declare no competing financial interest.
∗
Corresponding author, phone:
+48 58 523 5124, fax:
[email protected] ACS Paragon Plus Environment
+48 58 523 5012, email:
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 2 of 70
2
Abstract A new approach to the calibration of the force fields is proposed, in which force-field parameters are obtained by maximum-likelihood fitting of the calculated conformational ensembles to the experimental ensembles of training system(s). The maximum-likelihood function is composed of logarithms of the Boltzmann probabilities of the experimental conformations, calculated with the current energy function. Because the theoretical distribution is given in the form of the simulated conformations only, the contributions from all simulated conformations, with Gaussian weights in the distances from a given experimental conformation, are added up to give the contribution to the target function from this conformation. In contrast to earlier methods for force-field calibration, the approach does not suffer from the arbitrariness of dividing the decoy set into native-like and non-native structures; however, if such a division is made instead of using Gaussian weights, application of the maximum-likelihood method results in the well-known energy-gap maximization. The computational procedure consists of cycles of decoy generation and maximum-likelihood-function optimization, which are iterated until converged. The method was tested with Gaussian distributions and then applied to the physics-based coarse-grained UNRES force field for proteins. The NMR structures of the tryptophan cage, a small α-helical protein, determined at three temperatures (T = 280 K, T = 305 K, and T = 313 K) by HaÃlabis et al. (J. Phys. Chem. B, 2012, 116, 6898-6907), were used. Multiplexed replica-exchange molecular dynamics was used to generate the decoys. The iterative procedure exhibited steady convergence. Three variants of optimization were tried: optimization of energy-term weights alone and use of the experimental ensemble of the folded protein only at T = 280 K (run 1), optimization of energy-term weights, and use of experimental ensembles at all three temperatures (run 2), and optimization of the energy-term weights and of the coefficients of the torsional and multibody energy terms, and use of of experimental ensembles at all three temperatures (run 3). The force fields were subsequently tested with a set of 14 α-helical and 2 α + β proteins. Optimization run 1 resulted in a better agreement with the experimental ensemble at T = 280 K compared to optimization run 2 and in a comparable performance on the test set but in poorer agreement of the calculated folding temperature
ACS Paragon Plus Environment
Page 3 of 70
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
3
with the experimental folding temperature. Optimization run 3 resulted in the best fit of the calculated to the experimental ensembles of tryptophan cage but in much poorer performance on the training set, this suggesting that use of a small α-helical protein for extensive force-field calibration resulted in over-fitting the data of this protein at the expense of transferability. The optimized force field resulting from run 2 was found to fold 13 out of 14 tested α-helical proteins and one small α + β-protein with correct topology; the average structures of 10 of them were predicted with the accuracy of about 5 ˚ A Cα -root-mean-square deviation or better. Test simulations with an additional set of 12 α-helical proteins demonstrated that this force field performed better on α-helical proteins than the previous parameterizations of UNRES. The proposed approach is applicable to any problem of maximum-likelihood parameter estimation when the contributions to the maximum-likelihood function cannot be evaluated at the experimental points and the dimension of the configurational space is too high to construct histograms of the experimental distributions. Key words: maximum likelihood principle, force-field calibration, protein folding, proteinstructure prediction, UNRES force field
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 4 of 70
4
INTRODUCTION The design of empirical energy functions for simulations of biologically-important system has a long history.1 Both atomic-detailed2–4 and coarse-grained5–8 approaches are used. The advantage of the first approach is greater accuracy, while the size of the systems and time-scale of simulations are limited, although a great progress was made in the field recently, especially by the Shaw group, owing to the construction of the ANTON supercomputer, which is dedicated to biomolecular simulations.9,10 On the other hand, coarse-grained approaches offer a tremendous extension of the size- and time-scale of simulations,11,12 at the expense of losing the details of the system under study. A long-sought goal of the development of empirical force fields for biomolecular simulations is predictivity understood as the ability to find the ‘native’ structure of a biomolecule. This goal is particularly pursued for proteins for two reasons. One reason is the prediction of protein structures, the other one are simulation studies of protein folding, functionally-important motions, and conformational changes, which cannot be trusted, if a force field cannot locate the native structure of any protein. Two types of approaches to protein-structure prediction are pursued. In the first one, the native structure is sought as the lowest-energy structure.13–15 In the second one, it is understood as a family of structures that appear below the foldingtransition temperature.5,16,17 The first approach, although more efficient in execution because efficient global-optimization methods of the target function such as, e.g., the ConformationalSpace Annealing (CSA) method18,19 or basin hopping20 can be used, does not take the conformational entropy into account, which is considered in the second approach. Moreover, the force fields designed to predict protein structures understood as ensembles are good to study protein folding. The predictive power of a force field is usually achieved by tuning its parameters, which process is termed force-field optimization or calibration. The first such approach was proposed by Crippen and coworkers21,22 and was based on maximization of the potential-energy gap between the lowest-energy native-like structure and the lowest-energy non-native structure. This principle was later utilized by Shakhnovich and coworkers23 to propose a foldability criterion;
ACS Paragon Plus Environment
Page 5 of 70
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
5
however, Thirumalai and coworkers subsequently demonstrated that energy-gap maximization was insufficient to produce a foldable energy landscape, because it does not guarantee that folding will occur before the protein is trapped in a glassy-like part of the landscape.24 Later, it was demonstrated that the ordering of the energies of the non-native structures according to native-likeness is critical for foldability.25 At about the same time that energy-gap-optimization approaches were being developed, Wolynes and coworkers26 designed a method based on the maximization of the difference between the folding temperature and the glassy-transition temperature, which can be approximated by Z-score, defined as the ratio of the difference between the mean energy of the native-like structures and the mean energy of non-native structures to the standard deviation of the energy of the non-native structures. These two principles: energy-gap and Z-score optimization have been the basis of force-field calibration for many years and are especially efficient for threading (fold recognition),27–29 in which case the nativelike structure is unique and the set of non-native decoys is usually fixed. For many years, we have been developing the physics-based coarse-grained UNRES force field for the simulation of protein structure and dynamics.12,17,30–46 The force field is based on the expansion of the potential of mean force of polypeptide chains in water into Kubo’s cluster-cumulant functions,47 which enabled us to identify the respective effective energy terms with potentials of mean force of small model systems, which are comparatively easy to handle. The different contributions to the effective potential energy are then multiplied by weighting factors, termed energy-term weights and added up to produce the complete effective energy function. These weights are primary targets of force-field calibration. Our first attempts at the UNRES force field calibration were made by using the Z-score-31 and energy-gap optimization.33,34 However, optimization was successful only for small simple training proteins, such as the three-helix-bundle staphylococcal protein A (PDB: 1BDD). The reason for this was that these approaches ignore the ordering of non-native structures with different degree of nativelikeliness. This lack of ordering produces a glassy-like rather than a funnel-like energy landscape and the native structure is generally difficult to locate, even though it has a distinctively lower energy than those of non-native structures.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 6 of 70
6
To overcome the problem of energy-landscape frustration, we developed the hierarchicaloptimization method.17,35,36 For each training protein, the conformations are grouped into structural levels. The conformations of each level contain the same native-like elements but no conformation contains more native-like elements than the other ones in the group. The levels are subsequently ordered following the thermal-unfolding pathway determined experimentally. Optimization is directed at ordering the free energies of the structural levels according to the native-likeness of the levels. At temperatures lower than the folding-transition temperature, the higher the native likeness of a level, the lower its free energy. This order is opposite at temperatures higher than the folding-transition temperature and, at the folding-transition temperature, the free energies of all levels are equal. This idea has been used to design the target function which is a sum of the penalties for violating the free-energy gaps between levels at different temperatures.17,35,36 By using this approach, we produced reasonably predictive versions of the UNRES force field for protein-structure prediction based on global optimization,35,36 as well as ensemblebased prediction and protein-folding and dynamics simulations.17 The resulting force fields scored well in CASP exercises,48,49 and was applied with success to protein-folding simulations, including the simulations of folding kinetics, studies of free-energy landscapes,11,50–53 and investigations of biologically-important processes such as, e.g., PICK1 to BAR binding,54 Hsp70 chaperone cycle55 and, recently, modeling the structure and stability of the complex between the iron-sulfur-binding protein 1 (Isu1) and the Jac1 Hsp40 co-chaperone.56 However, a disadvantage of the hierarchical-optimization method is that the decision as to whether a section of a simulated structure under consideration bears resemblance to the corresponding section of the experimental structure (which is required to assign the conformation to a particular structural level) is based on arbitrary cut-offs, which are to set on selected measures of nativelikeliness such as, e.g., the fraction of secondary structure, the fraction of native contacts, the root-mean-square deviation (rmsd) from the native structure, etc.35,36 For example, to decide that a conformation has a native α-helix, the fraction of pi · · · pi+3 contacts (where pi is the ith peptide group) in the section of the simulated conformation that corresponds to an α-helix
ACS Paragon Plus Environment
Page 7 of 70
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
7
in the experimental structure is calculated, and, if that fraction is greater than a pre-assigned cut-off, the conformation is considered to have the α-helix (and is, consequently, assigned to a higher level of the hierarchy), otherwise it is considered to be devoid of the helix (and, consequently, is assigned to a lower hierarchy level). Moreover, most of the target free-energy gaps between structural levels can be assigned only approximately, based on limited thermodynamic data of the fragments of the training proteins; only the free-energy gaps between the completely folded and completely unfolded structures are fully available from the experimental free energy of unfolding vs. temperature [∆Gf (T )] curves. Quite recently, by using the multiplexed replica exchange (MREMD) approach57,58 with UNRES,59 we carried out a simulation study of the folding of protein A at in a wide range of temperatures.52 A surprising conclusion of this study was that, at the folding transition temperature and about 50 K above it, the shape of the protein was largely preserved, except that about half of population had a mirror-image packing topology, with the native contacts or slightly shifted native contacts present but secondary structure largely melted. We used the variant of the UNRES force field calibrated by using our hierarchical method17 with another three-α-helix-bundle protein, 1GAB. Moreover, the experimental studies of the mechanism of the folding of the B1 domain of immunoglobulin-binding protein (IGG) have suggested that early structures of the folding pathway exhibit general shape of the protein, but the secondary structure is not formed and the native contacts are shifted by 1-2 residues.60 Furthermore, small-angle X-ray (SAXS) and neutron-scattering studies (SANS) of small proteins61–63 have demonstrated that, upon thermal denaturation, the radii of gyration of small proteins increase only very moderately near the folding-transition temperature62,63 or even shrink with increasing temperature near the folding-transition point,61 as opposed to chemical denaturation.61–63 All these observations are in agreement with the molten-globule concept of thermal denaturation of proteins64 and strongly suggest that proteins achieve “rough” shape at quite high temperatures. Such a picture of the folding process is most close to the “hydrophobic collapse” mechanism of protein folding, albeit guided by favorable local interactions that initiate chain reversals in the right places;60 these interactions can also have hydrophobic nature.65
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 8 of 70
8
For the above reasons, we concluded that rigid (cut-off-based) definition of structural levels does not reflect the folding pathway. Therefore, a better approach to force-field calibration than the hierarchical optimization seems to be tuning the force field so that the ensembles calculated at the temperatures of measurements fit the respective experimental ensembles. A natural method to tackle such a problem is the maximum-likelihood method.66 However, as opposed to the textbook version of this approach, the probability-density function (pdf) cannot be evaluated exactly in the conformational analysis of proteins in the continuous space, even at a coarse-grained level. Instead, it is only available indirectly in the form of the conformational ensembles simulated at different temperatures. Because of the high dimension of the conformational space, the construction of histograms to represent the pdf is not feasible. Therefore, in this work, in order to use the simulated conformations to compute the likelihood function, we take the contribution from each of them with a Gaussian weight depending on the distance from the respective experimental conformation. To assess the method, we use the experimental data from a recent experimental study67 of a small α-helical protein, the tryptophan cage, at three temperatures which cover the folding transition. We test the resulting force fields (obtained by using various sections of the data and optimizing fewer or more parameters) with two sets of mostly α-helical proteins that differ in size and topology. This paper is organized as follows. First, in the Theory section, we present the version of the maximum-likelihood fitting developed in this work and show that, with a partition of the set of simulated conformations into native-like and non-native-like conformations and given a low- or high-temperature limit, it reduces to the maximization of the energy-gap between the nativelike and non-native conformations. In the Methods section, we summarize the UNRES force field, the multiplexed replica-exchange molecular dynamics (MREMD) method57,58 adapted to UNRES,59,68 and the computational procedure implemented to execute the calibration of the UNRES force field by using the maximum-likelihood method. In the Results section, we describe the tests of the maximum-likelihood approach with single- and multidimensional Gaussian distribution first and, subsequently, the results of force-field calibrations with tryptophan cage and tests of the resulting variants of the UNRES force field. Finally, in the Conclusions sections,
ACS Paragon Plus Environment
Page 9 of 70
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
9
we point to further directions of the applications of the approach in force-field calibration and beyond.
THEORY Maximum likelihood approach In the classical implementation of the maximum-likelihood approach, the target function, log L, is maximized (or − log L is minimized) to fit the parameters of the fitted pdf to the distribution of experimental points.66 Assuming that n experimental points are given in an m-dimensional space, whose coordinates are Xi = [xi1 , xi2 , . . . xim ]T , i = 1, 2, . . . n, the maximum-likelihood problem can be formulated as given by eq 1.
max log L ≡ min − log L = −
z1 ,z2 ,...zν
Z
···
Z
z1 ,z2 ,...zν
n X
wi ln f (Xi ; z1 , z2 , . . . zν )
(1)
i=1
f (Y; z1 , z2 , . . . zν )dVY = 1
(2)
where wi is the weight of the ith experimental point, f is the normalized pdf, and dVY is the volume element in the space of system variables. The maximum-likelihood principle cannot be used for parameter estimation in a straightforward manner when f cannot be computed at the experimental points (analytically or numerically) exactly but, instead points can only be sampled according to f , a situation that takes place when the pdf is a function of a huge number of variables and there is no analytical expression for its integral over the configurational space. This is exactly the problem of the pdf of polymer conformations in the continuous 3D space, which is given by the Boltzmann law (equation 3):
¸ · U (Xi ) 1 dVX exp − P (Xi )dVX = Q RT · ¸ Z Z U (X) Q = · · · exp − dVX RT
(3) (4)
where X = [X1 , X2 , . . . Xm ]T denotes the vector of the coordinates of the molecule (m being the dimension of this vector), U (Xi ) denotes its energy, Q denotes the partition function, R ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 10 of 70
10
denotes the universal gas constant and T denotes absolute temperature and VY is the volume i h (X) , can be evaluated analytically for any conforelement. The Boltzmann factor, exp − URT mation but the partition function Q cannot be evaluated in a deterministic manner unless a simple lattice representation of the polymer under study is considered. Moreover, the experimental conformations are usually high in energy, because the refinement algorithms put much more emphasis on satisfying the experimental restraints than on finding a lower energy. This observation suggests that the neighborhoods of experimental conformations instead of the experimental conformations themselves should be considered in evaluating log L. A set of low-energy conformation can be obtained by molecular simulations, in which points from the configurational space are sampled according to the Boltzmann distribution with a given potential-energy function; a more robust approach involves the use of multicanonical sampling such as, e.g., the replica-exchange57,58 and entropic-sampling69 methods or their hybrids.70–72 All these approaches lead to points sampled from a given distribution, hereafter referred to as ρ(Y); in particular, this distribution might have the same functional form as the target distribution, ρ(Y) = f (Y; z1◦ , z2◦ , . . . zν◦ ), where z1◦ , z2◦ , . . . zν◦ are the initially guessed parameters of the distribution function. Obviously, except for the situation when the configurational space is discretized to a low-resolution lattice, simulations provide a set of points Yi , i = 1, 2, . . . N , where N is the number of snapshots from a simulation, none of which is identical with any of the experimental points (conformations). In what follows, we will consider the negative of the maximum-likelihood function, lN , which is normalized by dividing by the sum of the weights of all conformations, W , expressed by eq 5. It should be noted that the normalization factor is independent of force-field parameters and, therefore, minimization of lN of eq 5 gives the same results as minimization of l of eq 1. As will become clear in section “Gaussian representation of the δ function”, the purpose of normalization is to avoid the trivial zero minimum of the approximate lN ; the resulting equations are also simpler when lN is used instead of l.
ACS Paragon Plus Environment
Page 11 of 70
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
11
1 log L W n X W = wi
(5)
lN = −
(6)
i=1
Because the Boltzmann probabilities cannot be calculated directly at the experimental conformations, the simulated conformations have to be used. Let us assume that conformations identical to the experimental ones have been sampled (a situation which is, in principle, possible if a lattice representation of the conformational space is used). Because the conformations are drawn from a distribution ρ(X) (which is not necessarily the Boltzmann distribution because non-Boltzmann sampling techniques are usually more efficient than canonical sampling), the contribution to eq 5 from conformation i would be ∆lN (Xi ) = 1 ρ(Xi )wi W
(s)
ln f (Xi ), where Xi
1 w W i
(s)
ln f (Xi ) =
denotes the point obtained by sampling. Therefore, lN cannot
be computed directly from the contributions at the sampled points but each contribution has to be divided by ρ(Xi ). To handle the real cases in which simulations do not yield conformations identical to the experimental ones, let us express lN by using the integrals over the configurational space given the sampling probability density ρ(X), weighted by the Dirac δ functions centered at the experimental points; likewise, to normalize − log L, we express the sum of weights, W , in terms of integrals over the Dirac δ functions. The resulting expression is given by eq 7. It should be noted that the expression given by eq 7 gives an identical result to that of the parent expression given by eq 5 with − log L expressed by eq 1. 1 lN = A with
Z
···
Z X n i=1
A=W = ′
R
Z
δ (Xi − Y) ρ (Y) R wi ln f (Y; z1 , z2 , . . . zν )dVY · · · δ (Y − Y′ ) ρ(Y′ )dVY′
···
Z X n i=1
R
δ (Xi − Y) ρ (Y) R wi dVY · · · δ (Y − Y′ ) ρ(Y′ )dVY′
(7)
(8)
where δ(Y − Y ) is the Dirac δ function; δ(Y − Y′ ) = ∞ for Y = Y′ and zero elsewhere with
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Z
Z ···
Z
···
Z
¡ ¢ δ ξ − ξ ′ dVξ′ = 1
¢ ¡ δ ξ − ξ ′ f (ξξ′ )dVξ ′ = f (ξξ ),
Page 12 of 70
12
(9) (10)
In eq 7, the numerators of the components of the sum in the integrand represent sampling the points from distribution ρ, while the denominators represent the correction for the bias introduced by sampling the points from a non-uniform distribution. Eq 7 is the basis of the modified maximum-likelihood method introduced in this work, from which its various implementations can be constructed by using different approximate representations of the Dirac δ function. In the next section we will show that, with the hypercylinder representation of the δ function, the method becomes equivalent to the method of energy-gap maximization.
Relation between maximum-likelihood optimization and energy-gap optimization of force fields Let us consider the application of the maximum-likelihood method to a situation in which the simulated conformations are divided into two classes: native-like (nat) and non-native (non-nat), this assignment depending on, e.g., the rmsd from the crystal or average NMR structure. A conformation is considered native-like if its rmsd does not exceed a pre-assigned cut-off value, s, and is considered non-native otherwise. This dissection can be considered as application of a hypercylinder representation of the Dirac δ function in eq 7, as given by eq 11.
δ(Y − Y′ ) = lim
s→0
( Γ( m +1) π
0
2 m 2 sm
for ||Y − Y′ || < s otherwise
(11)
where m is the dimension of the conformational space and Γ(x) is Euler’s Γ function. When the representation of the δ function given by eq 11 with a finite radius s of the hypercylinder is substituted into eq 7 and it is assumed that all experimental conformations ∗ have weights equal to 1, the approximation lN to lN , is given by eq 12.
ACS Paragon Plus Environment
Page 13 of 70
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
∗ lN
′ N N′ X 1 X exp (−βUi ) − ln exp (−βUi ) + = − ′ ln N i=1 i=1 i∈{nat}
i∈{nat}
13
N −N ′
X i=1
i∈{non-nat}
exp (−βUi ) (12)
where β = 1/RT , N ′ is the number of native-like conformations and N is the total number of conformations. Without loss of generality, the conformations can be ordered so that the first native-like conformation is the lowest-energy native-like conformation and the first non-native conformation is the lowest-energy non-native conformation. ∗ At the low-temperature limit (large β) and assuming that U1nat < U1non−nat , lN is approxi-
mated by eq 13, which is obtained from eq 12 by expanding the ln terms in the Taylor series about 1 and neglecting all terms in the expansion except that with the energy of the lowestenergy non-native conformation.
∗ lN
· µ ¶¸ £ ¡ non-nat ¢¤ 1 1 nat − U1 min Ui − min Ui = ′ exp −β ≈ ′ exp −β U1 i∈{non-nat} i∈{nat} N N
(13)
∗ which implies that lN is minimized (or log L is maximized) when the energy gap between the
native structure and the lowest-energy non-native structure is maximized; this is a long-known approach to force-field optimization.21,33 When U1nat is always greater than U1non-nat (i.e., the energy function is not optimizable), the assumption that the temperature is very small results in eq 14.
∗ lN ≈
¢ β ¡ nat U1 − U1non-nat N
(14)
which means that the gap between the lowest-energy native-like and the lowest-energy nonnative structure should be minimized or the gap between the lowest-energy non-native and the lowest-energy native-like structure is to be maximized, which is the same result as that obtained by assuming that U1nat − U1non-nat < 0.
Gaussian representation of the δ function The hypercylinder representation of the δ function introduces a hard cut-off s on the distances of the simulated points from the experimental points. Therefore, to obtain the final form of the ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 14 of 70
14
working approximation to lN , we use the Gaussian representation of the δ function (eq 15).
′
δ (Y − Y ) = lim
s→0
³√
2πs
´−m
||Y − Y′ ||2 exp − 2s2 µ
¶
(15)
where m is the dimension of the Y space. By replacing the δ function into eq 7 by its Gaussian representation (eq 15), we obtain eq 16.
lN =
1 lim A s→0
Z
···
Z X n i=1
´ ³ 2 ρ (Y) exp − ||Xi2s−Y|| 2 ´ ³ × R R ′ || ′ )dV ′ ρ(Y lim · · · exp ||Y−Y Y 2 2s
s→0
×wi ln f (Y; z1 , z2 , . . . zν )dVY ³ ´ 2 Z Z X n exp − ||Xi2s−Y|| ρ (Y) 2 ³ ´ wi dVY A = lim · · · R R ||Y−Y ′ || s→0 ′ )dV ′ lim · · · exp ρ(Y i=1 Y 2s2
(16) (17)
s→0
In principle, the constants s in the numerator and in the denominator of the right-hand sides of eqs 16 and 17 could be different; however, we keep the same s for simplicity. By removing the condition s → 0, as in section “Relation between maximum-likelihood optimization and energygap optimization of force fields”, and replacing the integration over Y and Y′ with summation over the simulated points, we obtain eq 16, which is the desired approximate expression for lN , ∗ hereafter referred to as lN (eq 18). It should be noted that these points are drawn from the ρ
distribution function so that ρ is implicit in the obtained distribution of points and, therefore, does no longer appear in the equation.
∗ lN =
=
1 A
N X n X j=1 i=1
N X
´ ³ ||Xi −Yj ||2 exp − 2s2 ³ ´ wi ln f (Yj ; z1 , z2 , . . . zν ) N P ||Yj −Yk ||2 exp − 2s2
k=1
ai ln f (Yj ; z1 , z2 , . . . zν )
(18)
´
(19)
j=1
N X n X
||X −Y ||2 − i2s2 j
´ wi ³ ||Yj −Yk ||2 exp − 2s2 k=1 ³ ´ ||Xi −Yj ||2 n exp − 2s2 1X = ´ wi ³ N A i=1 P ||Yj −Yk ||2 exp − 2s2
A =
j=1 i=1
ai
exp
³
N P
k=1
ACS Paragon Plus Environment
(20)
Page 15 of 70
1 2 3 4 5 6 7 8 9 10 11 12 13 14Fig. 1 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
15
The denominators of the components of the sum over the simulated points in eq 18 are corrections of the bias resulting from drawing these points from a pre-defined distribution ρ(Y) and not the uniform distribution (which would be impossible in the case of conformational sampling). Together with the terms in the numerators, they provide weights with which the ∗ contributions from the simulated points are added to lN at a given experimental point. The
relation between the approximate and the original maximum-likelihood function is illustrated in Figure 1. The need to define the normalization factor by eq 17 and not simply as 1/W (eq 6) is now apparent because the un-normalized right-hand side of eq 16 can be made arbitrarily close to zero (which is the smallest value that can be achieved, given the fact that the pdf is always normalized to 1) if ρ(Y) overlaps poorly with the distribution of the experimental points. Therefore, minimization of the un-normalized maximum-likelihood function, − log L, could easily result in a pdf divergent from the experimental points. It should be noted that summation over the experimental points is on the top of the summation over the simulated point in eq 18, although the order of integration in eq 16 is opposite. This modification is correct because the summations are carried out over a fixed range of indices. With this modification, the sums over the experimental points corresponding to given simulated points (ai , i = 1, 2, . . . N of eq 20) can be computed at the beginning of the minimization procedure and used as constant weights when running a given iteration. Equation 18, together with equations 19 and 20, is a general formula of the variant of the maximum-likelihood-fitting approach proposed in this work except that it assumes that the Gaussian representation of the Dirac δ function is used. It can be applied to any type of fitted distribution function, f . Obviously, the Gaussian representation of the δ function can be replaced with any other representation, as implemented in section “Relation between maximumlikelihood optimization and energy-gap optimization of force fields” where the relationship between the proposed method and earlier approaches to force-field optimization was explored. The procedure of energy-function calibration based on eq 18, and on use of MREMD57–59,68 to generate the decoys, is introduced in subsection “Maximum-likelihood calibration of macro-
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Fig. 2 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 16 of 70
16
molecular energy functions” of the Methods section.
METHODS UNRES force field In the UNRES model30,38 a polypeptide chain is represented by a sequence of α-carbon atoms with united side chains attached to them and peptide groups positioned halfway between two consecutive α-carbons (Figure 2). The effective energy function originates from the potential of mean force (PMF) of the chain constrained to a given coarse-grained conformation plus the surrounding solvent, which is then expanded into the Kubo cluster-cumulant functions47 to obtain the effective energy terms, as described in our earlier work.32 The effective energy of a coarse-grained polypeptide chain is expressed by eq. (21).
U = wSC
X
USCi SCj + wSCp
V DW USCi pj + wpp
i6=j
i