How Molecular Size Impacts RMSD Applications in Molecular

Mar 7, 2017 - We also provide a mechanism for the emergence of the “curse of dimensionality” for RMSD from the law of large numbers by showing tha...
0 downloads 0 Views 2MB Size
Article pubs.acs.org/JCTC

How Molecular Size Impacts RMSD Applications in Molecular Dynamics Simulations Karen Sargsyan,*,† Cédric Grauffel,† and Carmay Lim*,†,‡ †

Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan Department of Chemistry, National Tsinghua University, Hsinchu 300, Taiwan



S Supporting Information *

ABSTRACT: The root-mean-square deviation (RMSD) is a similarity measure widely used in analysis of macromolecular structures and dynamics. As increasingly larger macromolecular systems are being studied, dimensionality effects such as the “curse of dimensionality” (a diminishing ability to discriminate pairwise differences between conformations with increasing system size) may exist and significantly impact RMSD-based analyses. For such large bimolecular systems, whether the RMSD or other alternative similarity measures might suffer from this “curse” and lose the ability to discriminate different macromolecular structures had not been explicitly addressed. Here, we show such dimensionality effects for both weighted and nonweighted RMSD schemes. We also provide a mechanism for the emergence of the “curse of dimensionality” for RMSD from the law of large numbers by showing that the conformational distributions from which RMSDs are calculated become increasingly similar as the system size increases. Our findings suggest the use of weighted RMSD schemes for small proteins (less than 200 residues) and nonweighted RMSD for larger proteins when analyzing molecular dynamics trajectories.



INTRODUCTION The root-mean-square deviation (RMSD) of atomic coordinates after optimal rigid body superposition of two structures is undoubtedly the most popular choice for quantifying differences in macromolecular structures and dynamics. It is commonly used in analyses of molecular dynamics (MD) trajectories to establish the equilibration period and quality of biomolecular simulations and to cluster similar conformations. It is also used in protein structure predictions to assess the match between the modeled and X-ray/NMR protein structure as well as in docking calculations to spot the near-native structure among the numerous docked conformations.1−4 The RMSD allows a comparison between the structure of the native folded protein and the corresponding partially or fully unfolded structure; hence, it has been used as a reaction coordinate in protein folding simulations.5 It also allows tracking dynamical changes following protein modification.6 In computing the RMSD, a least-squares fit is first performed to minimize the difference between two superimposed structures.7 This assumes the structures are rigid and the least-squares procedure finds the global minimum in a timeefficient manner. After superposition of the two structures, the RMSD (in length units) is calculated according to

In eq 1, di is the distance between atom i in the two structures, and N is the total number of equivalent atoms. Since the RMSD calculation requires the same number of atoms in both structures, it is often computed using only the Cα or backbone heavy atoms of each amino acid (aa) residue. Note that the superimposition procedure depends on eq 1, hence it should be changed if another similarity measure were used because the superimposition based on eq 1 may not minimize differences between structures using a new similarity measure. Although useful in practice, the RMSD has known shortcomings. For example, the RMSD can only compare structures with the same number of atoms. One solution for comparing structures with different number of atoms would be to compare local parts. Other approaches utilize clever mathematical techniques such as Wigner-D functions. 8 Furthermore, the RMSD does not reveal the local regions that are the most flexible; i.e., it does not distinguish between a molecule with some very rigid regions and some very flexible ones from a molecule where all regions are semiflexible.9,10 A popular approach to circumvent the problem of small flexible regions such as hinges leading to large RMSDs when the rest of the structure remains largely the same is to assign weights to the distances in eq 1.10,11 This procedure ensures that distance outliers affect the “weighted” RMSD to a lesser extent than the

N

RMSD =

∑i = 1 di 2 N

Received: January 11, 2017 Published: March 7, 2017

(1) © 2017 American Chemical Society

1518

DOI: 10.1021/acs.jctc.7b00028 J. Chem. Theory Comput. 2017, 13, 1518−1524

Article

Journal of Chemical Theory and Computation

Structures of various sizes were selected so as to diversify the protein lengths of the structures in the data set. The sizes and the PDB/CATH codes of these proteins are listed in Supporting Information Table S1. To assess the dependence of the results derived from the AMBER10 data set on the force field, reference frame, and simulation length, we created a second data set called CHARMM20 (Supporting Information Table S2), containing 18 20 ns trajectories (excluding equilibration) obtained from MD simulations with explicit water using the CHARMM program21 (version 37) with the CHARMM36 force field.22 Conformations were saved every picosecond for each AMBER or CHARMM trajectory. The RMSDs were computed using the Biopython module23 for the AMBER10 data set but using the MDTraj package24 for the CHARMM20 data set. Representative plots of the RMSD as a function of simulation time for proteins of various sizes show that the simulations are stable (see Supporting Information Figure S1). Testing RMSD for “Curse of Dimensionality” by Comparing Conformational Distributions. To test if the “curse of dimensionality” is inherent in the RMSD computed using eq 1, we first propose a path for its emergence: Let X denote a probability distribution of mean μ. If we draw samples Xi (i = 1,...,N) of various sizes (different N) from this probability distribution and calculate their sample averages, (X1 + ··· + XN)/N, the latter approximates μ better for larger N according to the law of large numbers. Hence, the sample averages vary wildly for small N but vary less with large N. Eq 1 effectively calculates an average (RMSD2), where the squared distances from a pair of conformations comprise a sample of Xi = di2 drawn from some distribution. This implies that as the number of atoms N increases, the RMSD2 (thus RMSD) variations of protein conformations would decrease according to the law of large numbers if the distances were drawn from the same probability distribution. Thus, an increase in the similarity of the sample (conformational) distributions from which RMSDs were calculated would lead to curse of dimensionality. To check if the conformational distributions of a MD trajectory are similar, we apply Kolomogorov-Smirnov (KS) two-sample statistic to measure the distance between cumulative probability distributions underlying any two samples of squared distances, as illustrated in Supporting Information Figure S2. The KS “distance” is sensitive to the overall shape of the distribution and detects even minor differences. For each of the 95 AMBER trajectories, we took every 100th frame of the trajectory and superimposed all (100 × 99/2) pairs of conformations. Each pair of superimposed conformations yielded a probability distribution of squared distances, di2. The KS “distances” were calculated for all 4,950 pairs of distributions and averaged. Testing RMSD and RMSD-Based Similarity Measures for “Curse of Dimensionality” by Seeing if Their Distribution Widths Narrow with Increasing Protein Size. In addition to assessing similarity of the conformational distributions for a given MD trajectory, we also directly tested the manifestation of the “curse of dimensionality” by seeing if the RMSD distribution narrows with increasing protein size. Relative to the reference (initial) frame of a given trajectory in the AMBER10 or CHARMM20 data set, the RMSDs of the other MD conformations were computed and averaged to yield a mean value and corresponding standard deviation. Since the absolute RMSD depends on the protein size,12 the standard deviation was divided by the mean RMSD to yield a relative variance in order to compare the standard deviations of RMSDs derived from the MD trajectories of different proteins. The above analysis was carried out using wRMSD for each trajectory in the AMBER10 and CHARMM20 data sets. wRMSD employs a Gaussian-weighting factor for each atom i, wi, such that atoms that barely move between the two conformations will have a greater weighting than those that have a large displacement. It is defined by

unweighted RMSD, but it changes the superimposition (see above). However, “weighted” RMSD methods often perform an RMSD fit as their first guess, thus making them slower than RMSD. Thus, although methods have been designed to overcome the limitations in RMSD, they are unfortunately slower, and sometimes their iterative superposition routines do not converge. Furthermore, they do not necessarily provide a global optimum for superposition of two structures. Consequently, RMSD remains the fastest way to measure structural similarity in practice, an especially important requirement for large data samples. Because of its widespread use, the RMSD properties are of great importance for interpreting results obtained using RMSD. Previous studies12−16 have addressed the significance of an absolute RMSD between two structures and whether a single, fixed RMSD threshold can be used to judge conformational similarity. Notably, Maiorov and Crippen12 found an intrinsic RMSD cutoff for conformational similarity corresponding to the minimal RMSD between a structure and its mirror-inverted one, which increases as n1/3 where n is the number of protein residues. To compensate for size effects, Maiorov and Crippen13 introduced a new RMSD-based measure of structural similarity denoted as ρ that is independent of the molecular size, enabling conformational similarity to be judged by a certain fixed threshold. The RMSD becomes a high-dimensional distance for large proteins where the x, y, z atomic coordinates comprise highdimensional data (i.e., data with many attributes). Such highdimensional data pose problems for clustering methods, as traditional clustering algorithms may fail to find meaningful clusters in high-dimensional data due to the so-called “curse of dimensionality” − the inability to differentiate similar and dissimilar objects.17 However, whether this “curse” affects the RMSD or other similarity measures in comparing MD structures and if it affects different similarity measures in different ways has not been explicitly addressed (to our knowledge). This is important as this affects the choice of clustering methods, which have different properties such as sensitivity to outliers.18 Here, we first show how the “curse of dimensionality” using RMSD could emerge. Next, we show its existence directly in MD trajectories by the narrowing of the RMSD distribution with increasing protein size. We further show that the distributions of the Gaussian-weighted RMSD10 (denoted as wRMSD) and the size-independent ρ13 (both defined in the Methods section) also become narrower with increasing protein size. To obtain these results, we created two data sets containing MD trajectories derived using different force fields and simulation lengths, as described in the next section. Our demonstration of how the “curse of dimensionality” could emerge allowed us to hypothesize and test the correlation between the level of mutual agreement in RMSDs and wRMSDs and the number of residues. The results suggest the use of wRMSD for proteins with