Article
Geometric Patterns for Neighboring Bases Near the Stacked State in Nucleic Acid Strands Ada Anna Sedova, and Nilesh K. Banavali Biochemistry, Just Accepted Manuscript • DOI: 10.1021/acs.biochem.6b01101 • Publication Date (Web): 10 Feb 2017 Downloaded from http://pubs.acs.org on February 12, 2017
Biochemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Biochemistry
Geometric Patterns for Neighboring Bases Near the Stacked State in Nucleic Acid Strands
Ada Sedova2,† and Nilesh K. Banavali1,2,∗
1: Laboratory of Computational and Structural Biology Division of Genetics, Wadsworth Center, NYS Department of Health 2: Department of Biomedical Sciences, School of Public Health State University of New York at Albany ∗ email:
[email protected] CMS 2008, Biggs Laboratory, Wadsworth Center, NYS Department of Health, Empire State Plaza, PO Box 509 Albany, NY 12201-0509 Tel: 518-474-0569 Fax: 518-402-4623 † Present address: Oak Ridge National Laboratory, Scientific Computing Group, National Center for Computational Sciences, Bldg. 5600, Rm I311, P.O. Box 2008, Oak Ridge, TN, USA 37830-6286
ACS Paragon Plus Environment
Abstract
Structural variation in base stacking has been analyzed frequently in isolated double helical contexts for nucleic acids, but not as often in non-helical geometries or in complex biomolecular environments. In this study, conformations of two neighboring bases near their stacked state in any environment are comprehensively characterized for single-strand dinucleotide (SSD) nucleic acid crystal structure conformations. An ensemble clustering method is used to identify a reduced set of representative stacking geometries based on pairwise distances between select atoms in consecutive bases, with multiple separable conformational clusters obtained for categories divided by nucleic acid type (DNA/RNA), SSD sequence, stacking face orientation, and presence or absence of a protein environment. For both DNA and RNA, SSD conformations are observed that are close to the A-form, close to the B-form, intermediate between the two forms, or further away from either form, illustrating the local structural heterogeneity near the stacked state. Amongst this large variety of distinct conformations, several common stacking patterns are observed between DNA and RNA, and between nucleic acids in isolation or in complex with proteins, suggesting that these might be stable stacking orientations. Non-canonical face:face orientations of the two bases are also observed for neighboring bases in the same strand, but their frequency is much lower, with multiple SSD sequences across categories showing no occurrences of such unusual stacked conformations. The resulting reduced set of stacking geometries is directly useful for stacking-energy comparisons between empirical force fields, prediction of plausible localized variations in single-strand structures near their canonical states, and identification of analogous stacking patterns in newly solved nucleic acid-containing structures.
Keywords: base stacking geometries, ensemble clustering, crystallographic database survey, k-means
Nucleic acid conformational motifs are understood to be stabilized in large part by base stacking interactions. 1 The precise geometric features of base stacking have mostly been analyzed in the context of dinucleotide base-step parameter variation in isolated duplex structures, 2,3 and to a lesser extent in duplex structures bound to proteins. 4 Recent structural analyses of nucleic acids have focused on the 3D motifs in RNA backbone conformations, 5 with base stacking orientations used primarily for identification of these larger conformational patterns. 6 The forces and lifetimes of DNA basepair stacking interactions have also been characterized using single-molecule optical-tweezer methods employing DNA origami blunt ends. 7 However, the range of stacking geometries favored by nearest-neighbor bases in non-duplex states or complex environments, and its variation due to the identity of the two bases, does not seem to be fully characterized. The underlying energetics of stacking and the precise nature of the forces involved have also been extensively investigated by computational methods. Quantum mechanical (QM) methods have been used to probe the stacking energies for all dinucleotide base steps at progressively higher levels of theory. 8–14 These calculations based on idealized duplex-form starting geometries suggest that correlation-based dispersive interactions confer greater contributions to nucleic acid base stacking than electrostatic interactions. 15,16 The dependence of the stacking interaction on the twist angle between the two bases was observed to be substantial in vacuum, but was almost entirely negated in solution. 17 Such analysis of the effect of geometric variation on stacking energetics usually relies on starting geometries derived from idealized or duplex states, so that only a small subset of the stacked geometries observable in crystal structures is covered.
These studies also do not generally include the influence of additional factors such as the sugar or phosphate moieties or associated water or ion molecules, which can influence conformation and energetics at the dinucleotide level. Molecular mechanics empirical energy functions have also been used to quantify the free energy profiles for dinucleotide base stacking in aqueous solution. 18–21 The experimentally observed preference in stacking nucleosides 22 (purine-purine > purine-pyrimidine, pyrimidine-purine > pyrimidine-pyrimidine) could be reproduced using such methods. 19,20 Discrepancies between experimental 23 and computational results 21 can potentially be resolved by taking into account small geometric variations and alternative populations. 24
The base stacking driving forces have also been deconstructed using simulations of stacking of individual bases in solvent. 25 These studies, however, tackle only the stacking behavior up to the solvated dinucleotide level, and do not address the change in patterns caused by more complex environments. The unstacking free energy was also analyzed for 10 double stranded DNA base pair steps with a nicked strand using umbrella sampling simulations, and the results were seen to depend on the water model used. 26 It is also not clear if existing force fields can properly represent nucleic acid stacking interactions, 27 although a lot of recent effort has been devoted to obtaining better agreement with experiment. 11,28–31 Such parameterization needs to disable unphysical transitions away from stable canonical structures, while at the same time accurately capturing transitions between less stable canonical structures and alternative non-canonical structures. 32,33 Accounting for all possible localized geometric variations known to occur in nucleic acids is required for maintaining such a balance, but such knowledge is presently incomplete. This is illustrated by our recent discovery of previously unreported B-form conformations in RNA single-strand dinucleotide (SSD) contexts, which are geometrically separable from the A-form population in individual RNA SSD dinucleotide sequences. 34 The precise pattern of SSD stacking can have functional consequences, for example, it influences the effect of radiation on nucleic acid bases. 35,36 DNA helices containing a covalent linkage between neighboring thymines in an SSD (thymine dimers) show large deviations from canonical helical geometries. 37,38 Ultrafast formation of cyclobutane dimers observed by femtosecond time-resolved spectroscopy implies that such deviations occur prior to dimer formation, 38–40 indicating that stacking conformations contribute to the propensity for lesions. 
These ultrafast studies can help probe structure and dynamics at very short time scales with high sensitivity, 41 with potential applications in nanotechnology. 42 Radiation-induced excitation and subsequent decay pathways are also influenced by multiple environmental factors such as solvent, polymerization states, neighboring bases and backbone configurations, and local conformations. 36,41 Single bases in solution have been found to decay from excited states in hundreds of femtoseconds via nonradiative decay mechanisms including non-planar base structures. 35 In single- and double-stranded DNA, however, stacking interactions are implicated in greatly increased persistence of these excited states despite the increased degrees of freedom offered by the polymer. The uncertain relationship between base
stacking conformational details and the relaxation of excess electronic energy in such polymers can therefore be clarified considerably through a more precise structural understanding of the variability of SSD stacking geometries. 35
The number of high-resolution crystal structures available for nucleic acids keeps increasing, and is much larger than just a decade earlier. While structural analysis of nucleic acid backbone torsions has a long history, 43 crystallographic determination of backbone atoms is more difficult due to a higher number of degrees of freedom, and base atoms are easier to distinguish due to their characteristic, planar π electron systems. 44,45 This suggests that stacking patterns should be better characterized than backbone torsions in structures with lower overall resolutions, and should be more representative of local nucleic acid structural motifs. Statistical analysis of stacking patterns in SSDs rather than duplex contexts enables a comprehensive analysis of all nucleic acid geometries, including multiple environments, such as pairing complementary strands, bound ligands, or protein complexes. In this study, nucleic acid crystal geometries in the RCSB Protein Data Bank 46 were split into individual single strands, filtered using a distance cutoff, and their SSD near-stacking patterns were analyzed using a 9-dimensional pairwise distance parameter for the resulting 45,405 SSD conformations. The distributions were clustered using a k-means-based ensemble method, which identified separable populations, and a reduced library of such geometries was created by assigning a representative structure for each separable cluster. This structural classification provides an expanded understanding of the common near-stacking geometric patterns and the heterogeneity of neighboring base-base conformations that stabilize nucleic acid structures.
Methods
The advanced search option in the RCSB Protein Data Bank 46 was used to extract crystal structures in four categories: DNA, DNA-protein, RNA, and RNA-protein with a minimum resolution of 3 Å deposited as of January 2013. Structures with any nucleic acid atoms within bonding distance of non-nucleic acid atoms and non-solvent atoms were excluded to reduce the analysis to unmodified nucleic acid structures. These structures were split into individual nucleic acid chains based on residue and chain identifiers. The individual
structures were then imported into the program CHARMM, 47,48 which was used to further split the chains into individual SSDs, and perform all subsequent geometric analysis. There is no consensus geometric definition of stacking, but one simple definition from previous work 27 classified bases as “stacked” using a C5-C5 distance cutoff of 5.6 Å. We used this definition, and filtered out “unstacked” nucleotides by choosing geometries with C5-C5 distances less than or equal to 5.6 Å. The face orientations of the SSDs were obtained by least-squares minimization alignment 49 of non-hydrogen base atoms onto a base geometry pre-oriented in the XY plane, and recording whether the average Z-coordinates of atoms for the other base in the SSD were positive or negative. Each base has two faces, thus there are four distinct possibilities for face-face orientations of the SSD: α−α, α−β, β−α, and β−β in the internally defined Rose et al. nomenclature. 50 These four categories would be respectively named 5′−5′ (“55”), 5′−3′ (“53”), 3′−5′ (“35”), and 3′−3′ (“33”) in the RNA ontology consortium nomenclature, 51 which relies on the base orientation seen in standard duplex geometries. We defined a Euclidean-distance-based feature vector, V = [d(N1, N1), d(N1, N3), d(N1, C5), d(N3, N1), d(N3, N3), d(N3, C5), d(C5, N1), d(C5, N3), d(C5, C5)], where d(X, Y) is the distance between atom X from the first base and atom Y from the second base. It consists of 9 inter-base pairwise distances between the N1, N3, and C5 six-membered ring atoms of each SSD base. In contrast to backbone torsions usually used as nucleic acid structural parameters, 44 it is non-periodic and has a reduced dimensionality compared to the set of all base non-hydrogen atoms.
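The stacking filter and the feature vector V described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the original analysis was performed in CHARMM, and the dictionary layout, the function names `feature_vector` and `is_near_stacked`, and the toy coordinates are hypothetical.

```python
import math

STACKING_CUTOFF = 5.6  # C5-C5 distance cutoff in angstroms, as quoted in the text
RING_ATOMS = ("N1", "N3", "C5")  # the three ring atoms used for each base


def feature_vector(base1, base2):
    """Return the 9 inter-base distances in the order listed for V.

    base1 and base2 are hypothetical dicts mapping atom name -> (x, y, z).
    The outer loop runs over atoms of the first base, the inner loop over
    atoms of the second base, giving d(N1,N1), d(N1,N3), ..., d(C5,C5).
    """
    return [math.dist(base1[a], base2[b]) for a in RING_ATOMS for b in RING_ATOMS]


def is_near_stacked(base1, base2, cutoff=STACKING_CUTOFF):
    """Apply the C5-C5 distance filter (less than or equal to the cutoff)."""
    return math.dist(base1["C5"], base2["C5"]) <= cutoff


# Toy coordinates (not taken from any crystal structure): the second base is
# a copy of the first, translated 3.4 angstroms along Z.
base1 = {"N1": (0.0, 0.0, 0.0), "N3": (2.3, 0.0, 0.0), "C5": (1.2, 2.0, 0.0)}
base2 = {name: (x, y, z + 3.4) for name, (x, y, z) in base1.items()}
v = feature_vector(base1, base2)
```

Because the vector contains only distances, it is unchanged by rigid-body rotation or translation of the dinucleotide, which is one reason no prior superposition is needed before comparing two SSDs of the same sequence.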
After separation of SSDs into the four face orientation categories, the feature vector V can also accurately and uniquely regenerate each original geometry, as long as bases are roughly in planes parallel to each other (further discussion on the uniqueness of V is provided in the Supplementary Information). Subtle geometric differences, such as the orientation between the two base planes, can be tracked by this vector without using angular descriptors. Its component distances are also comparable, except between SSDs that interchange their purine-pyrimidine components, as the three atoms chosen in each base are not the same between purines and pyrimidines. Redundancy in SSD structures was reduced by removing all but one of the geometries with 9-dimensional Euclidean distance less than 0.0005 Å from the feature vector of any other
geometry. Any DNA SSDs with uracil were also excluded, while retaining the other SSDs in that crystal structure. Finally, structures with missing atoms, atom naming variations, or other such inconsistencies were also manually removed. The remaining 20,710 DNA dinucleotides and 24,695 RNA dinucleotides were analyzed further.
MATLAB (version 7.12.0 R2011a, copyright The MathWorks, Inc.) was used to implement an algorithm to cluster the 9D vectors for all SSD geometries. In the first coarse-grained step of this k-means-based 52,53 ensemble clustering algorithm, 54–56 the value of k, the number of clusters, was chosen as the one that provided the lowest standard deviation for cluster centroids over 1000 separate clusterings. The range of possible k values explored was from 2 to 5. The most frequent clustering for that value of k from over 1000 further iterations was then chosen, and the results are shown in Figures S9-S11 of the Supplementary Information. While these clusters obtained using low values of k did yield stable clustering results with low standard deviations for centroids, they had a wide range of structures with distinct outlier conformations. To obtain tight clustering with a reasonable separation between clusters, the clusters were further refined using a second “trimming and splitting” k-means-based ensemble clustering step. 53 To ensure homogeneity and separation of the clusters, the maximum within-cluster pair-wise distance was set to 2.5 Å and the minimum between-representative pair-wise distance was set to 1 Å. Outlier conformations from clusters that did not meet the maximum within-cluster pair-wise distance criterion were first identified by using a confidence ellipsoid (Mahalanobis distance, multivariate Gaussian approximation) with a cutoff of 0.99, and then “trimmed” to be clustered separately. Clusters that met the adequacy test after trimming were stored.
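The two adequacy criteria stated above (maximum within-cluster pair-wise distance of 2.5 Å, minimum between-representative distance of 1 Å) can be expressed as simple checks on lists of 9D feature vectors. The sketch below is an assumption-laden illustration in Python rather than the MATLAB implementation used in the study; the function names and toy vectors are hypothetical, and the Mahalanobis trimming step is not reproduced here.

```python
import math
from itertools import combinations

MAX_WITHIN = 2.5   # maximum within-cluster pair-wise distance (angstroms)
MIN_BETWEEN = 1.0  # minimum between-representative pair-wise distance (angstroms)


def max_pairwise_distance(members):
    """Largest Euclidean distance between any two feature vectors in a cluster."""
    return max((math.dist(p, q) for p, q in combinations(members, 2)), default=0.0)


def is_homogeneous(members, limit=MAX_WITHIN):
    """True if no two cluster members are further apart than the limit."""
    return max_pairwise_distance(members) <= limit


def are_separated(representatives, limit=MIN_BETWEEN):
    """True if every pair of cluster representatives is at least `limit` apart."""
    return all(math.dist(p, q) >= limit for p, q in combinations(representatives, 2))


# Toy 9D vectors (hypothetical, for illustration only).
tight = [[0.0] * 9, [1.0] + [0.0] * 8, [0.0, 2.0] + [0.0] * 7]
loose = tight + [[3.0] + [0.0] * 8]
```

In this toy example `tight` passes the homogeneity check while `loose` fails it, so `loose` would be a candidate for trimming or splitting.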
Clusters that did not meet the adequacy test after trimming were re-clustered in their original form using the first coarse-grained step mentioned above, with a set k-value of 2. The entire process was repeated until all clusters met the two criteria. For clusters with fewer than 9 members, the confidence ellipsoid is not defined, so these were subjected simply to partitioning using the first step with k = 2. Most non-compliant clusters did not require multiple iterations for separation into compliant clusters. Finally, to prevent over-partitioning of continuous diffuse clusters, clusters whose representatives were separated by a distance less than 1 Å were merged. Molecular images were generated using Rasmol 2.7 57 or VMD, 58 plots were generated using gnuplot 4.0 59 or MATLAB, and composite figures were constructed using Gimp 1.2 (http://www.gimp.org).
Results and Discussion
Of the four different face:face orientations of the SSDs, the predominant orientation is the “35” orientation seen in canonical A- and B-form structures (45416 geometries in our dataset). In contrast, the “55” (489 geometries), “53” (449 geometries), and “33” (263 geometries) orientations are significantly less frequent. In the results and their discussion given below, all subsections report on “35” SSD stacking results, except for the penultimate one entitled ‘Stacking with alternate face orientations’, which reports and discusses results for the “55”, “53”, and “33” orientations.
Single-strand dinucleotide stacking in isolated DNA
In contrast to previous analysis of stacking patterns in isolated DNA that focused on duplex DNA, 2,3 the stacking patterns identified here are for neighboring bases within the same strand. Figure 1 shows these stacking patterns for 5,920 DNA SSDs that arrange into a total of 120 clusters across the 16 SSD sequences. In all figures showing stacking patterns, only the top 5 cluster representatives are shown in sticks with the rest shown in grayscale wireframe to visually distinguish the most populated clusters. Table 1 shows the numbers and populations of these clusters. Throughout this paper, the numbering for all clusters is based on their populations, with cluster 1 being the most populated, and subsequent numbers indicating decreasing population. In isolated DNA, the smallest number of SSDs (89) is seen for CT, and the largest (1159) for CG. The smallest number of clusters (4) occurs in the AC and AT SSDs, and the largest number (11) in the GG and CG SSDs. Clusters with only one member, of which there are 16, represent a single SSD structure that is sufficiently distinct geometrically from all other SSD structures of the same sequence to merit its own cluster classification. The largest cluster occurs for CG with 812 members that all possess very similar geometries. These results show that there is heterogeneity in the way in which neighboring bases can stack in isolated DNA, with the number of distinguishable geometries varying from 4 (AT) to 11 (GG, CG) for each SSD sequence. DNA is known to transform between the B-form and the A-form depending on its sequence and environment. 60 A distinction between the A- and B-forms is maintained even at the level of just two bases in
an SSD context. 34 Table 5 shows the proximity of each of these cluster representatives to analogous SSD base stacks in canonical A-form and B-form structures. The clusters are classified as “A” if the normalized difference in their 9D distances from the canonical A- and B-forms is between -1.0 and -0.5, and as “B” if it is between 0.5 and 1.0, provided that they also lie within a normalized distance of 1.5 from the interpolation line between the two forms. The clusters are classified as “AB” if they are within the normalized distance of 1.5 from the interpolation line, but not in the A or B categories. These represent a conformation that is intermediate between the two canonical structures. All other clusters are labeled “non-A,B,AB”, and further details about this analysis are given in a subsequent subsection entitled “Deviation from A- and B-form conformations.” Of the 120 clusters, 33 correspond to geometries close to the B-form, 30 correspond to geometries close to the A-form, and 12 correspond to intermediate geometries (AB clusters). Of these 12 AB clusters, 6 are also the most populated cluster (for AT, GG, GT, TA, TC, and TT), suggesting that these SSDs prefer a conformation intermediate between the A- and B-forms. The most populated cluster for the AG and CC SSD stacks is close to the A-form, suggesting that their most preferred neighboring base stacking orientation is close to the A-form. This leaves 45 clusters that are both distinct from the two canonical forms and not intermediate between them (i.e. non-A,B,AB clusters, represented with a “–” in Tables 5-8). At least one of these occurs in each SSD sequence, suggesting that neighboring base stacking geometries not within the range spanning the A- and B-forms are possible for any SSD in isolated DNA.
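One possible reading of the classification rule above is sketched below in Python. The text does not give explicit formulas, so the following are assumptions made for illustration: the normalized difference is taken as (d_A - d_B) / d_AB, where d_A and d_B are the 9D distances of a cluster representative from the canonical A- and B-form vectors and d_AB is the distance between those two vectors; the interpolation line is treated as an infinite line; and the perpendicular distance from it is also divided by d_AB. The function name `classify` and the toy vectors are hypothetical.

```python
import math


def classify(v, a_form, b_form, line_cutoff=1.5):
    """Label a 9D feature vector as 'A', 'B', 'AB', or 'non-A,B,AB'.

    Assumed definitions (not confirmed by the source):
      diff = (d_A - d_B) / d_AB       -> close to -1 near the A-form, +1 near the B-form
      perp = distance from the A-B line, divided by d_AB
    """
    d_ab = math.dist(a_form, b_form)
    if d_ab == 0.0:
        raise ValueError("canonical A- and B-form vectors must differ")
    d_a = math.dist(v, a_form)
    d_b = math.dist(v, b_form)
    # The triangle inequality bounds diff to [-1, 1]; clamp to absorb rounding.
    diff = max(-1.0, min(1.0, (d_a - d_b) / d_ab))

    # Project v onto the line through a_form and b_form to get the foot point.
    t = sum((vi - ai) * (bi - ai) for vi, ai, bi in zip(v, a_form, b_form)) / d_ab ** 2
    foot = [ai + t * (bi - ai) for ai, bi in zip(a_form, b_form)]
    perp = math.dist(v, foot) / d_ab

    if perp > line_cutoff:
        return "non-A,B,AB"
    if diff <= -0.5:
        return "A"
    if diff >= 0.5:
        return "B"
    return "AB"


# Toy canonical vectors (hypothetical): A at the origin, B one unit away.
a_form = [0.0] * 9
b_form = [1.0] + [0.0] * 8
```

With these toy vectors, the midpoint between the two forms is labeled "AB", and a point displaced two normalized units away from the line is labeled "non-A,B,AB".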
Single-strand dinucleotide stacking in isolated RNA
RNA differs from DNA in the absence of a methyl group in the uracil base, and the presence of a 2′ hydroxyl group in the furanose sugar, which can result in differences in intrinsic structural preference and in the propensity to form non-duplex structures. Figure 2 shows the stacking patterns in isolated RNA for 7,314 SSDs that can be collected into a total of 145 clusters. Table 2 lists the number and populations of these clusters for individual SSD sequences in isolated RNA. The smallest number of SSD structures is seen in the UA and UU SSD sequences (177) and the largest is seen for the GC SSD (797). The GC SSD also has the largest cluster with 797 geometries, and there are 23 clusters with only one member across all 16 SSDs. The largest number of clusters is seen for AA (16), and the smallest number of clusters is
for UA and UC (5). The A- or B-form nature of the cluster representatives is assessed in Table 6. There are 42 clusters that show stacking similar to the A-form, 9 intermediate AB clusters, and 76 non-A,B,AB clusters. There are also 18 clusters close to the B-form, and only the AU, GC, CC, UC, and UU SSDs show no clusters near the B-form. There are as many as 3 B-form-like clusters in an SSD (AA, GG, and CA), and the most populated cluster for AA is an intermediate AB cluster. Although this A- and B-form proximity assessment relies only on similarity between neighboring single-strand base stacking, the presence of B-form-like clusters suggests that the introduction of a 2′-hydroxyl does not completely eliminate B-form-like base stacking patterns in RNA.
Single-strand dinucleotide stacking in DNA complexed with proteins
Some of the stacking patterns in DNA complexed with proteins are expected to be similar to those seen for isolated DNA, as the protein does not necessarily contact or influence the structure of all DNA SSDs in any protein-DNA complex. Figure 3 shows these stacking patterns for 14,799 SSDs that coalesce into a total of 191 clusters. The number of DNA SSDs complexed to proteins is also 2.5 times larger than the number of isolated DNA SSDs, therefore some of the differences could be a result of the additional structures, and some the result of a different environment. The numbers and populations of the clusters for this category are shown in Table 3. There are 22 clusters with only one member, and the largest cluster is for AT with 1260 similar structures. Table 7 shows the A- and B-form proximity of the cluster representatives. There are 55 clusters near the B-form, 21 clusters near the A-form, and 18 AB clusters intermediate between the two forms, leaving 97 clusters that are not near either canonical form. All 16 SSDs have at least 3 of these non-A,B,AB clusters (TA) with the highest number being 11 (AG). Except for the AA, CG, and TG SSDs, whose top population cluster is close to the B-form, all the highest populated clusters in this category are intermediate AB clusters. Only the AT, CC, and CT SSDs do not show any cluster near the A-form; all other SSDs have at least one A-form proximal cluster. The percentage of B-form and AB clusters is similar in isolated DNA (28% and 10%) and in this category (29% and 9%), but a greater proportion of distinct clustered SSD geometries are near the A-form in the isolated DNA category (25%) than in this category (11%).
Single-strand dinucleotide stacking in RNA complexed with proteins
RNA-protein complexes differ from DNA-protein complexes in that multiple structures of large RNA-protein complexes (especially ribosomes) represent a significant amount of RNA-protein structural data at the SSD level. Figure 4 shows the stacking patterns in RNA complexed to proteins for 17,383 RNA SSDs forming a total of 239 clusters. This is 2.4 times the number of SSDs and 1.7 times the number of clusters found for isolated RNA. The populations and numbers of the clusters are shown for individual SSD sequences in RNA-protein complexes in Table 4. The smallest number of SSDs is for UA (632) and the largest is for UU (2418). The smallest number of clusters (6) occurs in UG and the largest number of clusters (24) occurs in the CC and UU SSD sequences. Table 8 classifies the A- or B-form characteristics of the representatives of each cluster. The most populated cluster for all SSDs is A-form-like, but there is at least one B-form-like cluster in 11 of 16 RNA SSDs complexed with proteins. Of the remaining 5 SSDs (AU, GC, CG, CC, CU), only GC has no intermediate AB cluster. There are a total of 48 A-form-like clusters, 15 B-form-like clusters, 12 intermediate AB clusters, and 164 non-A,B,AB clusters. The percentages of clusters for isolated RNA and RNA complexed to proteins are similar only for the intermediate AB clusters (6% and 5%), and are different for A-form (29% and 20%), B-form (12% and 6%), and non-A,B,AB (52% and 69%) clusters. As mentioned before, these differences could be caused by either the larger number of geometries or their different environment, but without accounting for this caveat, isolated RNA has a greater proportion of SSDs closer to the canonical forms than RNA in complex with proteins.
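The percentages quoted for the four categories follow directly from the cluster counts reported in the text, as the short Python check below illustrates. The counts are copied from the preceding subsections; the dictionary layout and the function name `class_percentages` are hypothetical, and rounding to the nearest integer is an assumption about how the reported values were obtained.

```python
# Cluster counts per class, as reported in the text for each category.
counts = {
    "isolated DNA": {"A": 30, "B": 33, "AB": 12, "non-A,B,AB": 45},
    "DNA-protein": {"A": 21, "B": 55, "AB": 18, "non-A,B,AB": 97},
    "isolated RNA": {"A": 42, "B": 18, "AB": 9, "non-A,B,AB": 76},
    "RNA-protein": {"A": 48, "B": 15, "AB": 12, "non-A,B,AB": 164},
}


def class_percentages(class_counts):
    """Convert per-class cluster counts to rounded percentages of the total."""
    total = sum(class_counts.values())
    return {label: round(100 * n / total) for label, n in class_counts.items()}


totals = {category: sum(c.values()) for category, c in counts.items()}
percentages = {category: class_percentages(c) for category, c in counts.items()}
```

The class totals reproduce the 120, 191, 145, and 239 clusters stated for the four categories, which is a useful consistency check on the classification tables.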
Deviation from A- and B-form conformations
As described above, many SSD clusters in all categories can be classified as non-A,B,AB, which indicates that they are not near either the A- or the B-form conformations, nor in the region intermediate between them. The 9D distances between the A- and B-form conformations are themselves not large at the local SSD stacking level and are shown in Supplementary Information Figure S1. The deviation of the cluster geometries away from both forms simultaneously can be assessed by the perpendicular distance of these conformations from the interpolation line connecting the two canonical forms in 9D distance space, 60 and
this criterion was used to classify the structures as either A, B, AB, or non-A,B,AB in Tables 5-8. Figure 5 shows the distribution of distances from the interpolation line for all cluster representatives in the four categories. These interpolation line distances are normalized by the difference in 9D distance between the canonical A- and B-form conformations to make all SSDs comparable to each other. Figure 5 illustrates the approach of using the proximity of a geometry to the interpolation line to classify it as either A-form, B-form, intermediate AB, or non-A,B,AB using green-line dividers and a normalized distance cutoff of 1.5. No matter which exact cutoff is used for this classification, it is clear that geometries near, in between, and further away from both forms are possible for both DNA and RNA. Even though this way to visualize the data does not provide the population of each cluster, a comparison of the distributions in this space reflects the clear preference for RNA to stay near the A-form. Nevertheless, RNA shows the ability to occupy B-form-like stacking geometries, and also seems to have more geometries further away from both forms than DNA. Figure 6 shows the individual unnormalized scatter plots of all cluster representative geometries in the 2D space of their 9D distances from both A- and B-form conformations. The isolated nucleic acid (red squares) and those complexed to proteins (green dots) show similar distances from both forms in some cases, and are distinct in others. This confirms that there are both common and distinct stacking patterns across environments. The interpolation line is shown as a black line in all panels. Significant perpendicular deviation from this line occurs for all 16 SSDs, although the exact extent varies depending on the SSD sequence, its nucleic acid type (DNA/RNA), and its environment.
For instance, AC clusters in isolated DNA deviate less than some clusters in protein-complexed DNA, while AC clusters in isolated RNA and RNA complexed to protein deviate similarly to a greater extent than both DNA counterparts. These results show that the range between and near the duplex canonical forms does not fully describe the conformational heterogeneity of single-strand stacking in nucleic acids.
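The interpolation-line distance described above can be illustrated with a minimal Python sketch. It assumes that the distance is the perpendicular distance, in the 9D vector space, from a geometry to the straight line through the canonical A- and B-form vectors, divided by the A-to-B separation. The function names are hypothetical, and the one-third and two-thirds dividers along the line are placeholders only, since the positions of the green-line dividers are not stated in this section. Only the normalized cutoff of 1.5 comes from the text.

```python
import numpy as np

def normalized_line_distance(v, v_a, v_b):
    """Return (d, t): d is the perpendicular distance of v from the line
    through v_a and v_b divided by |v_b - v_a|; t is the fractional
    position of the foot of the perpendicular along A -> B."""
    v, v_a, v_b = (np.asarray(x, dtype=float) for x in (v, v_a, v_b))
    ab = v_b - v_a
    ab_len = np.linalg.norm(ab)
    t = np.dot(v - v_a, ab) / ab_len**2
    foot = v_a + t * ab
    return np.linalg.norm(v - foot) / ab_len, t

def classify(v, v_a, v_b, cutoff=1.5, lo=1.0 / 3.0, hi=2.0 / 3.0):
    """Label a geometry as A, B, AB, or non-A,B,AB.
    ASSUMPTION: lo and hi are illustrative dividers, not the paper's."""
    d, t = normalized_line_distance(v, v_a, v_b)
    if d > cutoff:
        return "non-A,B,AB"
    if t < lo:
        return "A"
    if t > hi:
        return "B"
    return "AB"
```

With such a normalization, a value of 1.0 means that a geometry lies as far from the interpolation line as the A- and B-forms lie from each other, which is what makes the values comparable across SSD sequences.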
Effects of further restricting crystal structure resolution
The minimum resolution of the structures included in the analysis was 3 Å, which could be permissive to modeling errors. 61 To understand how the clustering would be affected by being more selective for resolution, resolution cutoffs of 2.0 Å and 2.5 Å were used to filter cluster members in each category. This shows the clusters that remain or disappear upon increasing stringency, and how each category and SSD sequence is affected differentially. It also illustrates how the results would be altered if different resolution choices were made. The resulting percentage change in cluster size is shown in Tables S1-S8 of the Supplementary Information. A very stringent 2.0 Å cutoff results in loss of 22 clusters (19%) for isolated DNA, 76 clusters (40%) for DNA complexed to proteins, 47 clusters (32%) for isolated RNA, and 145 clusters (61%) for RNA complexed to proteins. Of the top 5 most populated clusters from each of the 16 SSD sequences, 7 are lost for isolated DNA, 9 are lost for DNA complexed with proteins, 9 are lost for isolated RNA, and 18 are lost for RNA complexed with proteins. A less stringent cutoff of 2.5 Å causes the elimination of 10 clusters (8%) for isolated DNA, 31 clusters (16%) for DNA complexed with proteins, 21 clusters (15%) for isolated RNA, and 88 clusters (37%) for RNA complexed with proteins. Of the top 5 most populated clusters from each of the 16 SSD sequences, 1 is eliminated for isolated DNA, 1 is eliminated for DNA complexed with proteins, 2 are eliminated for isolated RNA, and 8 are eliminated for RNA complexed with proteins. The spread of the individual structures around the cluster representatives is not very large, which is shown visually in Supplementary Information Figures S2-S5. This suggests that loss of many, but not all, structures within a cluster may not affect the cluster representatives shown.
Overall, the order in which the four categories are affected by further restricting the resolution of the structures is RNA-protein > DNA-protein > RNA > DNA, with the DNA-protein and RNA categories showing somewhat similar effects, especially for the 2.5 Å resolution cutoff. The effects are, however, not large for the 5 most populated clusters, suggesting that these clusters usually contain at least some stacking conformations from high-resolution structures.
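The resolution filtering can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure actually used: the input layout (one pair of cluster identifier and resolution per cluster member) and the function name `resolution_filter_summary` are hypothetical, and a member is assumed to be kept when its resolution value is less than or equal to the cutoff.

```python
from collections import Counter

def resolution_filter_summary(members, cutoff):
    """members: iterable of (cluster_id, resolution_in_angstrom) pairs.
    Returns (lost_cluster_ids, percent_clusters_lost, percent_size_change),
    where percent_size_change maps each cluster id to its change in size."""
    members = list(members)
    before = Counter(cid for cid, _ in members)
    after = Counter(cid for cid, res in members if res <= cutoff)
    lost = sorted(cid for cid in before if after[cid] == 0)
    percent_lost = 100.0 * len(lost) / len(before)
    size_change = {
        cid: 100.0 * (after[cid] - before[cid]) / before[cid] for cid in before
    }
    return lost, percent_lost, size_change

# Toy data (not from the paper): three clusters with mixed resolutions.
toy = [(1, 1.8), (1, 2.9), (2, 2.4), (3, 2.8)]
lost_20, pct_20, change_20 = resolution_filter_summary(toy, 2.0)
lost_25, pct_25, change_25 = resolution_filter_summary(toy, 2.5)
```

A cluster counts as lost only when none of its members survive the cutoff, which is why a cluster can shrink considerably and still retain its representative.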
Identifying similar SSD stacking conformations across sequences and categories
Four similar groups can be formed within the 16 SSD sequences, each containing 4 SSD sequences. These are purine-purine or R-R (A-A, A-G, G-A, G-G), purine-pyrimidine or R-Y (A-C, A-T/U, G-C, G-T/U), pyrimidine-purine or Y-R (C-A, C-G, T/U-A, T/U-G), and pyrimidine-pyrimidine or Y-Y (C-C, C-T/U, T/U-C, T/U-T/U). The clusters in DNA or RNA can also be similar, as can clusters in isolated or protein-complexed nucleic acids. The 9-dimensional vector used to identify the clusters can also be used to judge their similarity across the different groups of non-identical but similar SSDs. This can be done using a distance cutoff in 9D space, the choice of which is explained in the Supplementary Information text and Supplementary Information Figure S6. Table 9 shows the similarity between clusters by grouping those that are within a cutoff of 0.4 Å from each other in this 9D space. The results obtained by using a 0.3 Å cutoff are shown in Supplementary Information Table S9. There are 14 common groups of clusters for R-R SSDs, 11 for R-Y SSDs, 15 for Y-R SSDs, and 13 for Y-Y SSDs. Of these, 3 R-R, 5 R-Y, 3 Y-R, and 2 Y-Y groups are similar between DNA and RNA. In addition, 11 R-R, 8 R-Y, 9 Y-R, and 9 Y-Y groups are similar for different environments, i.e. between isolated or protein-complexed nucleic acids. These results suggest that many clustered stacking geometries are common to DNA and RNA, and also occur in both isolated and protein-bound nucleic acids.
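One way to picture the grouping of cluster representatives by a 9D distance cutoff is the following Python sketch. It rests on two assumptions that are not confirmed by this section: the distance between two representatives is taken as the mean absolute difference over the nine dimensions, and groups are formed by single linkage, so that any two representatives within the cutoff end up in the same group. The function name `group_by_cutoff` is hypothetical, and the procedure documented in the Supplementary Information may differ.

```python
import numpy as np

def group_by_cutoff(vectors, cutoff=0.4):
    """Group 9D vectors so that any pair within `cutoff` shares a group.
    ASSUMPTION: distance = mean absolute difference over the 9 dimensions.
    Returns a sorted list of groups, each a list of input indices."""
    vecs = np.asarray(vectors, dtype=float)
    n = len(vecs)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.mean(np.abs(vecs[i] - vecs[j])) <= cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy vectors (not from the paper): the first two lie 0.3 apart, the third far away.
toy_vectors = [np.zeros(9), np.full(9, 0.3), np.full(9, 2.0)]
toy_groups = group_by_cutoff(toy_vectors, cutoff=0.4)
```

Lowering the cutoff from 0.4 to 0.3 in such a scheme can only split groups, never merge them, which is the behavior one would expect when comparing Table 9 with Table S9.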
SSD sequence frequencies
The relative population size of each of the 16 SSDs in the crystal structures analyzed could influence the clustering results. Table 10 shows the percentage frequencies of the 16 SSDs in the different categories, and their comparison to genomic SSD frequencies from over 1300 archaeal and bacterial genomes. 62 An ideal equal distribution of SSD frequencies would result in 6.25% occurrence of each SSD. It can be seen that the percentage deviates from this ideal value for many SSDs, and this deviation depends on the category. For instance, the AG and CT SSDs are underrepresented (1.5%) and the CG SSD is overrepresented (19.6%) in isolated DNA. The same SSDs have much more parity in the other three categories. Such frequency variations may well change with future increases in the structures deposited in the PDB, and can affect both the populations and numbers of the clusters. They might therefore be worth considering if the clustered conformations were to be used in some weighted manner for structure prediction.
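The percentages quoted for isolated DNA can be checked against the per-SSD geometry counts listed in Table 1. The short Python sketch below does this arithmetic. It assumes that the Table 10 percentages for isolated DNA are computed from these same counts, which the matching values suggest but this section does not state explicitly.

```python
# Geometry counts per SSD for isolated DNA, taken from Table 1.
counts = {
    "AA": 327, "AG": 91, "AC": 210, "AT": 396,
    "GA": 306, "GG": 693, "GC": 738, "GT": 248,
    "CA": 140, "CG": 1159, "CC": 485, "CT": 89,
    "TA": 259, "TG": 139, "TC": 305, "TT": 335,
}

total = sum(counts.values())
ideal_percent = 100.0 / len(counts)  # 6.25% for 16 equally frequent SSDs

# Percentage frequency of each SSD and its deviation from the ideal value.
percent = {ssd: 100.0 * n / total for ssd, n in counts.items()}
deviation = {ssd: p - ideal_percent for ssd, p in percent.items()}

for ssd in ("AG", "CT", "CG"):
    print(f"{ssd}: {percent[ssd]:.1f}% ({deviation[ssd]:+.1f} vs ideal)")
```

The counts sum to the 5,920 geometries reported for Table 1, and the AG, CT, and CG values round to 1.5%, 1.5%, and 19.6%.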
Stacking with alternate face orientations
The nearest neighbor base stacking reported in previous sections is between the 3′-face of the first nucleotide and the 5′-face of the second nucleotide in the SSD sequence, with the 3′-face and 5′-face defined based on orientation of the O5′ and O3′ atoms relative to the plane of the nucleotide base in the canonical A-form or B-form structures. This "35" stacking 5 represents an overwhelming majority of SSD stacking geometries for nucleic acids, and occurs in the helices resembling A- and B-form structures, but there are three other stacking orientations possible between two bases: "55", "53", and "33". These alternate stacking orientations can conceivably occur more readily through intercalating stacking of bases from different strands in non-helical nucleic acid structures, but the patterns of such alternate stacking in covalently connected neighboring nucleotide bases are less understood. These three alternate stacking orientations for SSDs were clustered and the cluster representatives are shown in Figure 7 for 55 stacking, Figure 8 for 53 stacking, and Figure 9 for 33 stacking. As expected for these alternate nearest neighbor SSD stacking patterns, the number of geometries observed is much smaller than for 35 SSD stacking. Not all SSD sequences show such stacking, and more RNA SSDs show such stacking as compared to DNA SSDs. There are no structures observed for 55 DNA stacking in GA, GC, CA, CC, CT, and TT SSDs; for 53 DNA stacking in AG, AT, GA, CC, CT, TC, and TT SSDs; and for 33 DNA stacking in GC, GT, CC, CT, and TC SSDs. In contrast, all 16 SSD sequences show 55 RNA stacking; there are no 53 RNA stacking structures only for the CU SSD; and there are no 33 RNA stacking structures only for the GC SSD.
Similar to the more common 35 stacking patterns, the single distance cutoff to identify stacked conformations results in multiple cluster representatives that would not be classified as stacked using more stringent criteria. 5 For 55 stacking, the TA SSD sequence in isolated DNA, the AG, AC, and CA SSD sequences in isolated RNA, and GA, CC, UA, and UG SSD sequences in protein-bound RNA do not seem to show a geometry with good stacking overlap between the two bases. For 53 stacking, the CA and CG SSDs in isolated DNA, the AC and GA SSDs in isolated RNA, the CG and TG SSDs in protein-bound DNA, and the AU, GU, and UC SSDs in protein-bound RNA show poor base-base overlap. For 33 stacking, the AC and AU SSDs in isolated RNA, the AG, GA, and TT SSDs
in protein-bound DNA, and the CC SSD in protein-bound RNA all demonstrate limited base-base overlap. The number of geometries in each cluster is also lower for these alternate stacking patterns, showing the difficulty for individual SSDs to invert their stacking orientation from the 35 orientation. Nevertheless, many SSD sequences do show 55, 53, and 33 stacking, suggesting that such inversion of stacking is not prohibitively difficult, even for neighboring, covalently connected bases. The numbers of clusters for 55, 53, and 33 stacking according to their SSD sequence are summarized in Supplementary Information Tables S10-S12. The number of members in these clusters and the cluster number (which indicates the rank of the cluster in terms of number of members) are shown in Supplementary Information Tables S13-S15.
Limitations of a stacking filter consisting of a single distance
The previously recommended 5.6 Å C5-C5 distance cutoff 27 was used to exclude unstacked structures, but a single distance cannot capture the complexity of all possible relative stacking orientations of two bases. Many SSD structures classified as stacked by this criterion do not have the two bases in roughly parallel planes. Whether this criterion filters out conformations not really distinct from stacked structures can be approximately assessed using Euclidean distances in the 9-dimensional vector space. An absolute difference was calculated between Vstd of the standard canonical stacked 3′-base conformation (B-form for DNA, A-form for RNA) and the Vunstacked for the cluster representatives initially classified as unstacked by the 5.6 Å C5-C5 distance cutoff (|Vstd − Vunstacked|, a positive 9-dimensional vector). An empirical cutoff of 1.6 Å for the mean distance over the nine dimensions was then used to identify unstacked cluster representatives and their associated cluster members that were close to these standard canonical geometries. While many clusters of SSD geometries classified as unstacked by the 5.6 Å C5-C5 distance cutoff were clearly separated from the stacked structures, there were some SSD structures identified through this ad hoc criterion that were continuous with stacked SSD structures. When overlaid together with the stacking geometries using the 5′-base, a significant density of these SSD structures also had C5-C5 distances within 6 Å. This suggests that if a single distance cutoff is desired for its simplicity, it can be made more inclusive by a slightly larger value of the cutoff distance.
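The two criteria in this section, the single C5-C5 distance filter and the mean of |Vstd − Vunstacked| over the nine dimensions, can be written as a minimal Python sketch. The function names are hypothetical, the coordinates and vectors are assumed to be in Å, and treating the cutoffs as inclusive is an assumption. Only the values 5.6 Å and 1.6 Å come from the text.

```python
import numpy as np

C5_CUTOFF = 5.6        # single C5-C5 distance cutoff (Å), from the text
NEAR_STACKED_CUTOFF = 1.6  # empirical cutoff on the mean 9D difference (Å)

def passes_c5_filter(c5_first, c5_second, cutoff=C5_CUTOFF):
    """True if the two C5 atoms lie within `cutoff` of each other.
    ASSUMPTION: the cutoff is inclusive."""
    delta = np.asarray(c5_first, dtype=float) - np.asarray(c5_second, dtype=float)
    return bool(np.linalg.norm(delta) <= cutoff)

def is_near_stacked(v_std, v_unstacked, cutoff=NEAR_STACKED_CUTOFF):
    """True if the mean of |v_std - v_unstacked| over the nine
    dimensions is within `cutoff`."""
    diff = np.abs(np.asarray(v_std, dtype=float) - np.asarray(v_unstacked, dtype=float))
    return bool(diff.mean() <= cutoff)
```

A representative that fails `passes_c5_filter` but satisfies `is_near_stacked` would correspond to the "continuous with stacked" cases described above. Raising `C5_CUTOFF` toward 6 Å is the simpler alternative that the text suggests.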
The correspondence of these additional “near-stacked” SSD clusters to the SSD clusters previously classified as stacked, and their percentage populations are shown in Tables S13-S16 of the Supplementary Information. A representative example of the placement of these additional SSD geometries with respect to SSD geometries previously identified as stacked is shown for the CG SSD in Figure S7 of the Supplementary Information. For isolated DNA, 8 SSD sequences contained such extension clusters, 4 of which contained over 25% of the structures for that SSD sequence. For DNA in complex with proteins, 10 extension clusters contained over 25% of the geometries for that SSD sequence, with only the AG and GA SSD sequences having no such clusters. For CA, CG, and CT SSD sequences, these extension clusters contained over 80% of the SSD structures. For both RNA alone and in complex with proteins, both the numbers of extension clusters and their percentage population were lower. Only 7 of the isolated RNA SSD sequences had such clusters, of which only CA, CG, and UG SSD sequences had over 25% of their overall population in such clusters. For RNA in complex with proteins, 12 SSD sequences contained these clusters, but 8 of these represented less than 5% of the population for that SSD sequence, with only CG and UG SSD sequences containing over 25% of their population in such clusters. This suggests that the variability of neighbor base geometric patterns near the stacked state is greater than that identifiable by the single 5.6 Å C5-C5 distance cutoff for both DNA and RNA, with a possibly greater propensity for populating this extended range in DNA, especially when complexed with proteins. A more sophisticated method for binary classification of stacked versus unstacked conformations, based on overlap area of the projection of one base onto the other along with a distance cutoff, is implemented in the widely used nucleic acid structure analysis program 3DNA. 63–65 All structures classified as stacked by the simpler 5.6 Å C5-C5 distance cutoff were assessed with this method. In all categories, and in all face:face orientations, multiple SSD cluster representatives that were previously classified as stacked were found to be unstacked by this method. The percentage of cluster representatives identified as unstacked varied between 5% and 50% (Table S20 in the Supplementary Information). This suggests that the 5.6 Å C5-C5 distance cutoff, which is much larger than the 4.3 Å distance cutoff used by 3DNA in its method, results in inclusion of SSD conformations that are near the stacked state, but not overlapping in their planar projections. The C5-C5 distance cluster representative relative probabilities for the alternative 55, 53, and
33 stacking orientations are shown in Figure S12. The PDB files of the cluster representatives provided in the Supplementary Material are separated into “overlap” and “nooverlap” subsets based on this definition of stacking. In addition to duplex structures, the low-dimensional standard helical parameters shift, slide, rise, tilt, roll, twist, X-displacement, Y-displacement, H-rise, inclination, tip, and H-twist can also describe SSD geometries. 43,65 These may not be numerically comparable to previously calculated duplex context counterparts, but can be used for qualitative comparison. They were calculated using 3DNA and are provided for each cluster representative SSD geometry in the Supplementary Material.
Conclusion
The variability of neighboring base stacking within nucleic acid single-strands is generally underappreciated. The presented results demonstrate clearly that substantial heterogeneity exists in SSD base stacking orientations. The first coarse-grained k-means clustering step with a small value of k could capture some features of the stacking distributions (explained in Supplementary Information text and shown in Supplementary Information Figures S8-S11), but not all their complexity. A second trimming-splitting and concatenation step could further refine the clustering to fully characterize the heterogeneity of the structural distribution. Although the 5.6 Å C5-C5 distance cutoff generally works well as a filter for identifying base stacking, it suffers from two problems. It can be satisfied even when the bases are not stacked, e.g. when one base approaches the other edgewise. Conversely, geometries closely clustered with stacked geometries may not satisfy this cutoff. Since the number of geometries in the PDB is not yet vast, and stacked geometries with non-neighbor bases were excluded, it is possible that the heterogeneity of stacking is larger than that reported here. The populations of the clusters provide some clues about energetic preferences of the stacking geometries, but these could be biased by the types of structures deposited, and are therefore subject to change as more structures are deposited. Even with these caveats, the clustering of stacking geometries performed here is expansive and substantially furthers the understanding of nucleic acid base stacking. The present results can directly benefit future studies related to base stacking. Although QM calculations have been extensively applied to base stacking, 8–14 they could be made more comprehensive by covering
all the SSD stacking clusters and also assessing their internal variations. These clustered geometries also provide a database of possible local structures that could be compared to specific metastable stacked or partially unstacked states identified in MD simulations. 66–68 Moreover, these clusters provide building blocks to predict nucleic acid geometries that have the assurance that each local SSD structure is a real conformation extracted from nucleic acid crystal structures in the PDB. We intend to pursue these applications in future studies for predicting local and global structural variability, especially for single-stranded regions of nucleic acids in complex environments.
Acknowledgements
Discussions with Drs. Janice Pata, Joachim Jaeger, Randall Morse, and Angel Garcia that considerably aided this work are gratefully acknowledged.
Supplementary Information
The Supplementary Information consists of 20 tables, 12 figures, and some text description. In addition, a compressed archive is provided that contains all stacked base SSD cluster representative PDB files, their SSD base step parameters analyzed using 3DNA, their 9D coordinates in the distance vector V, and a list of the PDB IDs of the crystallographic structures analyzed in the dataset.
References
[1] Turner, D. H. (2013) Fundamental interactions in RNA: Questions answered and remaining. Biopolymers 99, 1097–1104.
[2] Hunter, C. A. (1993) Sequence-dependent DNA structure: the role of base stacking interactions. J. Mol. Biol. 230, 1025–1054.
[3] Hunter, C. A., and Lu, X.-J. (1997) DNA base-stacking interactions: a comparison of theoretical calculations with oligonucleotide X-ray crystal structures. J. Mol. Biol. 265, 603–619.
[4] Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M., and Zhurkin, V. B. (1998) DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl. Acad. Sci. USA 95, 11163–11168.
[5] Sarver, M., Zirbel, C. L., Stombaugh, J., Mokdad, A., and Leontis, N. B. (2008) FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J. Math. Biol. 56, 215–252.
[6] Petrov, A. I., Zirbel, C. L., and Leontis, N. B. (2011) WebFR3D - a server for finding, aligning and analyzing recurrent RNA 3D motifs. Nucleic Acids Res. 39, W50–W55.
[7] Kilchherr, F., Wachauf, C., Pelz, B., Rief, M., Zacharias, M., and Dietz, H. (2016) Single-molecule dissection of stacking forces in DNA. Science 353, aaf5508.
[8] Šponer, J., Riley, K. E., and Hobza, P. (2008) Nature and magnitude of aromatic stacking of nucleic acid bases. Phys. Chem. Chem. Phys. 10, 2595–2610.
[9] Ringer, A. L., and Sherrill, C. D. (2009) Substituent effects in sandwich configurations of multiply substituted benzene dimers are not solely governed by electrostatic control. J. Am. Chem. Soc. 131, 4574–4575.
[10] Hunter, R. S., and Van Mourik, T. (2012) DNA base stacking: The stacked uracil/uracil and thymine/thymine minima. J. Comput. Chem. 33, 2161–2172.
[11] Šponer, J., Šponer, J. E., Mládek, A., Jurečka, P., Banáš, P., and Otyepka, M. (2013) Nature and magnitude of aromatic base stacking in DNA and RNA: Quantum chemistry, molecular mechanics, and experiment. Biopolymers 99, 978–988.
[12] McDonald, A. R., Denning, E. J., and MacKerell Jr, A. D. (2013) Impact of geometry optimization on base–base stacking interaction energies in the canonical A- and B-forms of DNA. J. Phys. Chem. A 117, 1560–1568.
[13] Parrish, R. M., and Sherrill, C. D. (2014) Quantum-Mechanical Evaluation of π–π versus Substituent-π Interactions in π Stacking: Direct Evidence for the Wheeler–Houk Picture. J. Am. Chem. Soc. 136, 17386–17389.
[14] van Mourik, T., and Hogan, S. W. (2016) DNA base stacking involving adenine and 2-aminopurine. Struct. Chem. 27, 145–158.
[15] Alhambra, C., Luque, F. J., Gago, F., and Orozco, M. (1997) Ab initio study of stacking interactions in A- and B-DNA. J. Phys. Chem. B 101, 3846–3853.
[16] Svozil, D., Hobza, P., and Šponer, J. (2009) Comparison of intrinsic stacking energies of ten unique dinucleotide steps in A-RNA and B-DNA duplexes. Can we determine correct order of stability by quantum-chemical calculations? J. Phys. Chem. B 114, 1191–1203.
[17] Florian, J., Šponer, J., and Warshel, A. (1999) Thermodynamic parameters for stacking and hydrogen bonding of nucleic acid bases in aqueous solution: ab initio/Langevin dipoles study. J. Phys. Chem. B 103, 884–892.
[18] Dang, L. X., and Kollman, P. A. (1990) Molecular dynamics simulations study of the free energy of association of 9-methyladenine and 1-methylthymine bases in water. J. Am. Chem. Soc. 112, 503–507.
[19] Norberg, J., and Nilsson, L. (1995) Potential of mean force calculations of the stacking-unstacking process in single-stranded deoxyribodinucleoside monophosphates. Biophys. J. 69, 2277–2285.
[20] Norberg, J., and Nilsson, L. (1995) Stacking free energy profiles for all 16 natural ribodinucleoside monophosphates in aqueous solution. J. Am. Chem. Soc. 117, 10832–10840.
[21] Friedman, R. A., and Honig, B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J. 69, 1528–1535.
[22] Ts’o, P. O., Melvin, I. S., and Olson, A. C. (1963) Interaction and association of bases and nucleosides in aqueous solutions. J. Am. Chem. Soc. 85, 1289–1296.
[23] Gellman, S., Haque, T., and Newcomb, L. (1996) New evidence that the hydrophobic effect and dispersion are not major driving forces for nucleotide base stacking. Biophys. J. 71, 3523–3525.
[24] Luo, R., Gilson, H. S., Potter, M. J., and Gilson, M. K. (2001) The physical basis of nucleic acid base stacking in water. Biophys. J. 80, 140–148.
[25] Mak, C. H. (2016) Unraveling Base Stacking Driving Forces in DNA. J. Phys. Chem. B 120, 6010–6020.
[26] Häse, F., and Zacharias, M. (2016) Free energy analysis and mechanism of base pair stacking in nicked DNA. Nucleic Acids Res. 44, 7100–7108.
[27] Chen, A. A., and García, A. E. (2013) High-resolution reversible folding of hyperstable RNA tetraloops using molecular dynamics simulations. Proc. Natl. Acad. Sci. USA 110, 16820–16825.
[28] Danilov, V. I., Dailidonis, V. V., van Mourik, T., and Früchtl, H. A. (2011) A study of nucleic acid base-stacking by the Monte Carlo method: Extended cluster approach. Central Eur. J. Chem. 9, 720–727.
[29] Jafilan, S., Klein, L., Hyun, C., and Florian, J. (2012) Intramolecular base stacking of dinucleoside monophosphate anions in aqueous solution. J. Phys. Chem. B 116, 3613–3618.
[30] Condon, D. E., Kennedy, S. D., Mort, B. C., Kierzek, R., Yildirim, I., and Turner, D. H. (2015) Stacking in RNA: NMR of Four Tetramers Benchmark Molecular Dynamics. J. Chem. Theory Comput. 11, 2729–2742.
[31] Brown, R. F., Andrews, C. T., and Elcock, A. H. (2015) Stacking free energies of all DNA and RNA nucleoside pairs and dinucleoside-monophosphates computed using recently revised AMBER parameters and compared with experiment. J. Chem. Theory Comput. 11, 2315–2328.
[32] Banáš, P., Hollas, D., Zgarbová, M., Jurečka, P., Orozco, M., Cheatham III, T. E., Šponer, J., and Otyepka, M. (2010) Performance of molecular mechanics force fields for RNA simulations: stability of UUCG and GNRA hairpins. J. Chem. Theory Comput. 6, 3836–3849.
[33] Zgarbová, M., Šponer, J., Otyepka, M., Cheatham III, T. E., Galindo-Murillo, R., and Jurečka, P. (2015) Refinement of the Sugar–Phosphate Backbone Torsion Beta for AMBER Force Fields Improves the Description of Z- and B-DNA. J. Chem. Theory Comput. 11, 5723–5736.
[34] Sedova, A. A., and Banavali, N. K. (2016) RNA approaches the B-form in stacked single strand dinucleotide contexts. Biopolymers 105, 65–82.
[35] Kohler, B. (2010) Nonradiative decay mechanisms in DNA model systems. J. Phys. Chem. Lett. 1, 2047–2053.
[36] Improta, R. (2012) Photophysics and photochemistry of thymine deoxy-dinucleotide in water: A PCM/TD-DFT quantum mechanical study. J. Phys. Chem. B 116, 14261–14274.
[37] Park, H., Zhang, K., Ren, Y., Nadji, S., Sinha, N., Taylor, J.-S., and Kang, C. (2002) Crystal Structure of a DNA decamer containing a cis-syn thymine dimer. Proc. Natl. Acad. Sci. USA 99, 15965–15970.
[38] Schreier, W. J., Schrader, T. E., Koller, F. O., Gilch, P., Crespo-Hernández, C. E., Swaminathan, V. N., Carell, T., Zinth, W., and Kohler, B. (2007) Thymine dimerization in DNA is an ultrafast photoreaction. Science 315, 625–629.
[39] Law, Y. K., Azadi, J., Crespo-Hernández, C. E., Olmon, E., and Kohler, B. (2008) Predicting thymine dimerization yields from molecular dynamics simulations. Biophys. J. 94, 3590–3600.
[40] McCullagh, M., Hariharan, M., Lewis, F. D., Markovitsi, D., Douki, T., and Schatz, G. C. (2010) Conformational control of TT dimerization in DNA conjugates. A molecular dynamics study. J. Phys. Chem. B 114, 5215–5221.
[41] Chen, J., and Kohler, B. (2014) Base stacking in adenosine dimers revealed by femtosecond transient absorption spectroscopy. J. Am. Chem. Soc. 136, 6362–6372.
[42] Banyasz, A., Gustavsson, T., Onidas, D., Changenet-Barret, P., Markovitsi, D., and Improta, R. (2013) Multi-pathway excited state relaxation of adenine oligomers in aqueous solution: a joint theoretical and experimental study. Chem. Eur. J. 19, 3762–3774.
[43] Saenger, W. Principles of Nucleic Acid Structure; Springer, 1984.
[44] Richardson, J. S., Schneider, B., Murray, L. W., Kapral, G. J., Immormino, R. M., Headd, J. J., Richardson, D. C., Ham, D., Hershkovits, E., Williams, L. D., Keating, K. S., Pyle, A. M., Micallef, D., Westbrook, J., and Berman, H. M. (2008) RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium contribution). RNA 14, 465–481.
[45] Chou, F.-C., Sripakdeevong, P., Dibrov, S. M., Hermann, T., and Das, R. (2013) Correcting pervasive errors in RNA crystallography through enumerative structure prediction. Nat. Methods 10, 74–76.
[46] Rose, P. W., Beran, B., Bi, C., Bluhm, W. F., Dimitropoulos, D., Goodsell, D. S., Prlić, A., Quesada, M., Quinn, G. B., Westbrook, J. D., Young, J., Yukich, B., Zardecki, C., Berman, H. M., and Bourne, P. E. (2011) The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. 39, D392–D401.
[47] Brooks, B., Bruccoleri, R., Olafson, B., Swaminathan, S., and Karplus, M. (1983) CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4, 187–217.
[48] Brooks, B. R., Brooks III, C. L., Mackerell Jr., A. D., Nilsson, L., Petrella, R. J., Roux, B., Won, Y., Archontis, G., Bartels, C., Boresch, S., et al. (2009) CHARMM: the biomolecular simulation program. J. Comput. Chem. 30, 1545–1614.
[49] Kabsch, W. (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr. Sect. A 34, 827–828.
[50] Rose, I. A., Hanson, K. R., Wilkinson, K. D., and Wimmer, M. J. (1980) A suggestion for naming faces of ring compounds. Proc. Natl. Acad. Sci. USA 77, 2439–2441.
[51] Hoehndorf, R., Batchelor, C., Bittner, T., Dumontier, M., Eilbeck, K., Knight, R., Mungall, C. J., Richardson, J. S., Stombaugh, J., Westhof, E., Zirbel, C. L., and Leontis, N. B. (2011) The RNA Ontology (RNAO): An ontology for integrating RNA sequence and structure data. Appl. Ontology 6, 53–89.
[52] Jain, A. K., and Dubes, R. C. Algorithms for clustering data; Prentice-Hall, Inc.: Englewood Cliffs, New Jersey, 1988.
[53] Gan, G., Ma, C., and Wu, J. Data Clustering: theory, algorithms, and applications, ASA-SIAM Series on Statistical and Applied Probability; ASA and SIAM: Philadelphia and Alexandria, VA, 2007.
[54] Kuncheva, L. I., and Vetrov, D. P. (2006) Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Trans. Pattern Anal. Mach. Intell. 28, 1798–1808.
[55] Greene, D., Tsymbal, A., Bolshakova, N., and Cunningham, P. (2004) Ensemble clustering in medical diagnostics. IEEE Xplore: 17th Symposium on Computer-Based Medical Systems, June 24-25 CBMS 2004 Proceedings, 576–581.
[56] Strehl, A., and Ghosh, J. (2002) Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617.
[57] Sayle, R. A., and Milner-White, E. J. (1995) RASMOL: biomolecular graphics for all. Trends Biochem. Sci. 20, 374–376.
[58] Humphrey, W., Dalke, A., and Schulten, K. (1996) VMD: Visual Molecular Dynamics. J. Mol. Graph. 14, 33–38.
[59] Racine, J. (2006) Gnuplot 4.0: a portable interactive plotting utility. J. Appl. Econ. 21, 133–141.
[60] Banavali, N. K., and Roux, B. (2005) Free energy landscape of A-DNA to B-DNA conversion in aqueous solution. J. Am. Chem. Soc. 127, 6866–6876.
[61] Bell, J. A., Ho, K. L., and Farid, R. (2012) Significant reduction in errors associated with nonbonded contacts in protein crystal structures: automated all-atom refinement with PrimeX. Acta Crystallogr. D Biol. Crystallogr. 68, 935–952.
[62] Zhang, H., Li, P., Zhong, H.-S., and Zhang, S.-H. (2013) Conservation vs. variation of dinucleotide frequencies across bacterial and archaeal genomes: evolutionary implications. Front. Microbiol. 4, 1–7.
[63] Zheng, G., Lu, X.-J., and Olson, W. K. (2009) Web 3DNA - a web server for the analysis, reconstruction, and visualization of three-dimensional nucleic-acid structures. Nucleic Acids Res. 37, W240–W246.
[64] Lu, X.-J., and Olson, W. K. (2008) 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nat. Protoc. 3, 1213–1227.
[65] Lu, X.-J., and Olson, W. K. (2003) 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res. 31, 5108–5121.
[66] Banavali, N. K. (2013) Partial base flipping is sufficient for strand slippage near DNA duplex termini. J. Am. Chem. Soc. 135, 8274–8282.
[67] Banavali, N. K. (2013) Analyzing the Relationship Between Single Base Flipping and Strand Slippage Near DNA Duplex Termini. J. Phys. Chem. B 117, 14320–14328.
[68] Manjari, S. R., Pata, J. D., and Banavali, N. K. (2014) Cytosine unstacking and strand slippage at an insertion-deletion mutation sequence in an overhang containing DNA duplex. Biochemistry 53, 3807–3816.
Table 1: Neighboring nucleotide base stacking cluster populations for isolated DNA.
SSD   Geometries   Clusters   Populations of clusters 1, 2, 3, ... (in order)
AA    327          9          292 8 6 6 6 4 3 1 1
AG    91           8          52 15 14 3 2 2 2 1
AC    210          4          129 73 5 3
AT    396          4          384 6 4 2
GA    306          7          194 47 45 13 4 2 1
GG    693          11         257 163 78 77 57 27 21 5 3 3 2
GC    738          9          285 258 177 6 5 3 2 1 1
GT    248          8          124 93 18 5 4 2 1 1
CA    140          6          67 40 20 11 1 1
CG    1159         11         812 143 85 67 25 8 7 6 3 2 1
CC    485          10         244 84 61 50 36 4 3 1 1 1
CT    89           6          32 21 13 13 7 3
TA    259          6          98 61 47 31 20 2
TG    139          6          47 24 22 21 16 9
TC    305          9          162 36 35 24 15 14 10 5 4
TT    335          6          321 7 3 2 1 1
Total: 5,920 SSD geometries; 120 clusters.
Table 2: Neighboring nucleotide base stacking cluster populations for isolated RNA
SSD   Geometries   Clusters   Populations of clusters 1, 2, 3, ... (in order)
AA    468          16         173 79 53 50 47 20 19 9 6 3 3 2 1 1 1 1
AG    415          12         163 58 54 44 24 19 19 8 8 7 7 4
AC    472          11         389 22 17 13 13 6 4 3 3 1 1
AU    241          11         114 66 19 11 8 8 5 4 4 1 1
GA    315          9          94 74 53 46 23 17 3 3 2
GG    760          12         597 45 38 32 21 11 6 4 2 2 1 1
GC    797          8          767 11 8 5 3 1 1 1
GU    505          12         370 83 18 12 8 3 3 3 2 1 1 1
CA    469          8          230 191 16 13 7 6 4 2
CG    640          6          575 24 22 12 6 1
CC    586          6          506 71 5 2 1 1
CU    419          8          190 135 59 19 10 2 2 2
UA    177          5          91 74 8 3 1
UG    490          8          214 137 74 32 25 5 2 1
UC    383          5          215 114 30 14 10
UU    177          8          153 11 3 3 3 2 1 1
Total: 7,314 SSD geometries; 145 clusters.
Table 3: Neighboring nucleotide base stacking cluster populations for DNA in complex with protein
SSD   Geometries   Clusters   Populations of clusters 1, 2, 3, ... (in order)
AA    1224         12         1119 45 13 11 9 6 6 4 4 3 2 2
AG    829          16         415 233 77 27 25 17 6 6 6 6 4 3 1 1 1 1
AC    780          10         720 24 11 8 6 4 4 1 1 1
AT    1334         10         1260 23 14 13 7 5 4 4 3 1
GA    904          10         640 128 85 18 13 6 6 3 3 2
GG    659          10         466 97 60 11 10 5 4 3 2 1
GC    974          12         472 200 174 83 12 7 7 6 5 4 3 1
GT    802          9          743 18 16 7 7 4 3 2 2
CA    1072         18         269 232 179 147 89 88 39 9 4 3 3 2 2 2 1 1 1 1
CG    646          14         153 108 106 85 60 59 33 20 7 6 5 2 1 1
CC    676          12         411 84 65 59 40 6 3 3 2 1 1 1
CT    813          10         478 89 80 57 52 31 10 9 5 2
TA    878          12         270 151 118 83 69 60 55 32 16 14 5 5
TG    1040         13         414 179 166 118 76 45 10 8 8 7 5 2 2
TC    887          9          724 74 58 16 5 4 3 2 1
TT    1281         14         1072 160 8 8 8 6 4 3 3 3 2 2 1 1
Total: 14,799 SSD geometries; 191 clusters.
Table 4: Neighboring nucleotide base stacking cluster populations for RNA in complex with proteins
SSD   Geometries   Clusters   Populations of clusters 1, 2, 3, ... (in order)
AA    667          12         491 68 42 22 20 6 5 5 3 2 2 1
AG    675          22         167 94 65 64 58 57 53 46 23 14 8 5 4 3 3 3 2 2 1 1 1 1
AC    730          15         543 76 29 14 12 10 9 8 8 7 4 4 3 2 1
AU    1049         18         746 132 57 31 23 16 14 6 5 4 4 3 2 2 1 1 1 1
GA    1073         9          683 245 86 32 12 6 5 2 2
GG    1920         12         1650 124 74 22 14 9 7 6 5 5 2 2
GC    1032         17         897 30 28 16 14 8 7 6 5 5 4 4 3 2 1 1 1
GU    807          14         493 216 25 21 13 9 7 5 5 4 4 2 2 1
CA    987          14         530 285 103 25 17 6 4 3 3 3 2 2 2 2
CG    975          10         852 69 22 12 9 5 3 1 1 1
CC    1884         24         1013 637 38 34 30 18 16 12 12 11 11 8 8 8 6 6 3 3 2 2 2 2 1 1
CU    716          17         214 206 117 115 12 8 8 7 5 5 4 4 3 3 2 2 1
UA    632          8          460 77 63 18 7 3 2 2
UG    872          6          830 29 6 3 2 2
UC    946          17         660 71 65 59 31 16 10 9 6 3 3 3 3 3 2 1 1
UU    2418         24         1289 338 324 170 109 76 15 15 13 10 8 8 8 7 7 5 4 3 2 2 2 1 1 1
Total: 17,383 SSD geometries; 239 clusters.
Table 5: Canonical form proximities of neighboring nucleotide base stacking clusters for DNA SSD Cluster 1 2 3 4 5 6 7 8 9 10 11 Geometries Clusters
AA
AC
AG
AT
GA
GG
GC
GT
CA
CG
CC
CT
TA
TG
TC
TT
B B B B A A -
B A B -
A AB A -
AB A B -
B B B A A -
B B A A A B AB -
AB A A A B -
B A B A -
B A B AB -
AB A B B A -
B AB A B B B
AB B A A -
AB A -
91 8
210 4
396 4
306 7
738 9
248 8
140 6
B B A A A B AB 1159 11
A AB B B A B -
327 9
AB A A B B 693 11
485 10
89 6
259 6
139 6
305 9
335 6
A = A-form, B = B-form, AB = between A and B forms
Table 6: Canonical form proximities of neighboring nucleotide base stacking clusters for RNA SSD Cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
AA
AG
AC
AU
GA
GG
GC
GU
CA
CG
CC
CU
UA
UG
UC
UU
AB A B A B A B -
A A A A B -
A B A AB A -
A A A A -
A A A AB B A -
A AB A B B B -
A A AB -
A B A -
A A B B B
A A B -
A A A -
A A B B
A A B AB -
A A AB B AB A A -
A A -
A AB -
A = A-form, B = B-form, AB = between A and B forms
Table 7: Canonical form proximities of neighboring nucleotide base stacking clusters for DNA in complex with proteins SSD Cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
AA
AG
AC
AT
GA
GG
GC
GT
CA
CG
CC
CT
TA
TG
TC
TT
B A B B -
AB B B A A -
AB B A A A B -
AB B B B -
AB A B B -
AB B B B A A -
AB AB B B A B B -
AB B A B -
AB AB B B B B A B B B -
B B B B A AB A B B B -
AB B B B B -
AB B B B -
AB B AB B A B B A A -
B AB B B A B B -
AB B A A B -
AB B A B -
A = A-form, B = B-form, AB = between A and B forms
Table 8: Canonical form proximities of neighboring nucleotide base stacking clusters for RNA in complex with proteins SSD Cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
AA
AG
AC
AU
GA
GG
GC
GU
CA
CG
CC
CU
UA
UG
UC
UU
A B A B A -
A A A B -
A B A -
A A A AB -
A A A AB A B
A AB AB B -
A A -
A A AB A A A AB B -
A A A B A B B -
A AB A A -
A A A AB A -
A A A AB AB -
A A A AB B -
A B A B
A A B -
A AB A A B A -
A = A-form, B = B-form, AB = between A and B forms
Table 9: Common SSD stacking patterns across categories using 0.4 Å distance cutoff SSD R-R
Pattern 1 2
DNA AG2, GA4, GA5 GG2, GG3 AA1, GA1
R-Y
3 4 5 6 7 8 9 10 11 12 13 14 1
Y-R
2 3 4 5 6 7 8 9 10 11 1
GC2 GT8 CA2, CG4, TA2
2
CA1, CG1
3 4 5
CG3, TG3
Y-Y
6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13
DNAP
AA1, AG1, GA1, GG1 AG7 AG5, GA3
RNA AA4, AG1, AG3, GA1, GG1
RNAP AA1, AG1, AG2, GA1, GG1
AG8, GA5 AA7 AA1, GA4 AG2 GA3
AG12, GG9 GA3 GG4 AA2, GG3 AA4, AG3
AG4 AG5, GG9 AG10
AG6
AG1, GG5
AG10
GA3, GG4 GA2, GG3 AT1, GT1
AC1, AT1, GC1, GC2, GT1
GC3, GT2 GT5 AC1 AC2, GC1
GC6 GC3 CG7, TA5, TG5
GG3 GC4
GA5
AU1 GC1, GU1 AC1
AU1, GU1 GC1 AC1
GU4 AU7 GU5 AU5
AC2 AC5 AC6
CA1, UA1, UA2, UG1
GU4 CA1, UA1, UG1
CA1, CG1, UA2, UG2 CA3, UG3 UG2
CA4, CG3, UG2 CA2, UA4
CA2, CG1 CG3, UG3
CG1 CA3 CA13, UA7, UG5
CA1, CG3, TA3, TG3 CA4, TA4, TG4 CA5, CG2, TG7
CG6, UG2 CG5 TA4, TG1 CC1, CT2, TC4 CC3, TC1, TT1 TC8
TC7 TC3
UG5 CA7
CG7
UA4 CC1, CU1, UC1
UA5 UU1
CC2, CU2, UC2 UU1 CU3, UC4
CC2, CU1, UC1 CC3
CG5
CC1, TC1, TT1 TT5
CT5, TC6 CC3 CC4, TC4 CC1, CU4 CC12, UC7
CT3
CT2 CU4
UC6
TC2, TT2
Numbers after individual SSD names within categories refer to their cluster number, e.g. CG3 refers to the third cluster for CG base stacking. R: Purine, Y: Pyrimidine.
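The naming convention in the Table 9 footnote can be illustrated with a short Python sketch. It is not part of the original analysis: the function names `parse_cluster_label` and `ssd_category` are hypothetical, and the sketch assumes only that a label is a two-letter SSD sequence followed by a cluster number, with A and G treated as purines (R) and C, T, and U as pyrimidines (Y).

```python
import re

PURINES = {"A", "G"}            # R
PYRIMIDINES = {"C", "T", "U"}   # Y


def parse_cluster_label(label):
    """Split a label such as 'CG3' into its SSD sequence and cluster number."""
    match = re.fullmatch(r"([ACGTU]{2})(\d+)", label)
    if match is None:
        raise ValueError(f"not an SSD cluster label: {label!r}")
    return match.group(1), int(match.group(2))


def base_class(base):
    """Return 'R' for a purine and 'Y' for a pyrimidine."""
    if base in PURINES:
        return "R"
    if base in PYRIMIDINES:
        return "Y"
    raise ValueError(f"unknown base: {base!r}")


def ssd_category(ssd):
    """Map an SSD sequence to its Table 9 category, e.g. 'CG' -> 'Y-R'."""
    return f"{base_class(ssd[0])}-{base_class(ssd[1])}"


ssd, cluster = parse_cluster_label("CG3")
print(ssd, cluster, ssd_category(ssd))  # CG 3 Y-R
```

Under these assumptions, CG3 is read as the third cluster of the CG stack and falls in the Y-R category.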
Table 10: SSD percentage frequencies in analyzed structures
       SSD geometries                               Genomes
SSD    DNA    RNA    DNA-Protein   RNA-Protein      Average   Maximum   Minimum
AA     5.5    6.4    8.3           3.8              7.9       22.0      1.0
AG     1.5    5.7    5.6           3.9              5.5       9.0       2.5
AC     3.5    6.5    5.3           4.2              5.0       7.1       2.1
AX     6.7    3.3    9.0           6.0              6.9       17.2      1.3
GA     5.2    4.3    6.1           6.2              6.1       9.0       2.8
GG     11.7   10.4   4.5           11.0             6.2       15.4      1.1
GC     12.5   10.9   6.6           5.9              7.4       15.9      0.6
GX     4.2    6.9    5.4           4.6              5.0       7.1       2.0
CA     2.4    6.4    7.2           5.7              6.1       8.5       2.6
CG     19.6   8.8    4.4           5.6              6.8       18.5      0.2
CC     8.2    8.0    4.6           10.8             6.2       15.3      1.2
CX     1.5    5.7    5.5           4.1              5.5       8.9       2.7
XA     4.4    2.4    5.9           3.6              5.2       16.7      0.6
XG     2.3    6.7    7.0           5.0              6.1       8.6       2.5
XC     5.2    5.2    6.0           5.4              6.1       9.0       2.9
XX     5.7    2.4    8.7           13.9             7.9       21.7      0.9
X = T for DNA, X = U for RNA. Genome SSD frequency statistics are from over 1300 archaeal and bacterial genomes [62]. If each SSD sequence had the same frequency, the percentage for each SSD sequence would be 6.25%.
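As a worked illustration of how the percentages in Table 10 follow from geometry counts, the sketch below recomputes the isolated-DNA frequencies from the Geometries row of Table 1 and the 6.25% uniform reference value. It is an illustrative Python sketch rather than the authors' code; the helper name `percent_frequencies` is hypothetical.

```python
def percent_frequencies(counts):
    """Convert per-SSD geometry counts into percentage frequencies."""
    total = sum(counts.values())
    return {ssd: 100.0 * n / total for ssd, n in counts.items()}


# Reference value if all 16 SSD sequences were equally frequent.
uniform = 100.0 / 16
print(uniform)  # 6.25

# Geometry counts for isolated DNA, taken from the Geometries row of Table 1.
dna_counts = {
    "AA": 327, "AG": 91, "AC": 210, "AT": 396,
    "GA": 306, "GG": 693, "GC": 738, "GT": 248,
    "CA": 140, "CG": 1159, "CC": 485, "CT": 89,
    "TA": 259, "TG": 139, "TC": 305, "TT": 335,
}

freqs = percent_frequencies(dna_counts)
print(sum(dna_counts.values()))   # 5920
print(round(freqs["CG"], 1))      # 19.6
print(round(freqs["AA"], 1))      # 5.5
```

The CG value of 19.6% is roughly three times the uniform reference, which matches the DNA entry for CG in Table 10.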
Figure 1: Stacking geometry representatives for all 16 DNA SSD sequences in crystal structures of resolution greater than 3 Å containing only DNA. Each stacking geometry is oriented using the 5′-base such that stacking variability is visualized through the relative orientations of the 3′-base. For each SSD sequence, the five most populated clusters are shown with thick lines, colored in order of decreasing population as follows: 1 - red, 2 - green, 3 - blue, 4 - yellow, 5 - orange. The other lower population clusters are shown with thinner lines in grayscale varying from dark grey (higher population) to light grey (lower population). Only geometries with a C5-C5 distance less than or equal to 5.6 Å are included in the clustering used to obtain the representatives.
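The selection and coloring rules stated in the Figure 1 caption can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' pipeline: the coordinates are made up, the function names are hypothetical, and the grayscale ramp used for lower population clusters is collapsed to a single "grey" label for brevity.

```python
import math

C5_CUTOFF = 5.6  # Angstrom, the C5-C5 cutoff given in the caption
TOP_COLORS = ["red", "green", "blue", "yellow", "orange"]


def c5_distance(c5_first, c5_second):
    """Euclidean distance between the C5 atoms of two neighboring bases."""
    return math.dist(c5_first, c5_second)


def is_included(c5_first, c5_second, cutoff=C5_CUTOFF):
    """True if the geometry passes the C5-C5 distance filter (<= cutoff)."""
    return c5_distance(c5_first, c5_second) <= cutoff


def cluster_color(rank):
    """Color for a cluster given its 1-based population rank."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    if rank <= len(TOP_COLORS):
        return TOP_COLORS[rank - 1]
    return "grey"  # simplification of the dark-to-light grayscale ramp


# Hypothetical C5 coordinates (Angstrom) for two neighboring bases.
print(is_included((0.0, 0.0, 0.0), (1.0, 2.0, 3.4)))  # True
print(is_included((0.0, 0.0, 0.0), (0.0, 0.0, 6.0)))  # False
print(cluster_color(1), cluster_color(6))             # red grey
```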
Figure 2: Stacking geometry representatives for all 16 RNA SSD sequences in crystal structures of resolution greater than 3 Å containing only RNA. The coloring and representative choice are the same as in Figure 1.
Figure 3: Stacking geometry representatives for all 16 DNA SSD sequences in crystal structures of resolution greater than 3 Å containing DNA in complex with protein. The coloring and representative choice are the same as in Figure 1.
Figure 4: Stacking geometry representatives for all 16 RNA SSD sequences in crystal structures of resolution greater than 3 Å containing RNA in complex with protein. The coloring and representative choice are the same as in Figure 1.
Figure 5: Scatter diagram of DNA or RNA SSD stack cluster representatives in the 2D space of a normalized 9-dimensional distance from the A-form and the B-form canonical conformations. The plots are oriented to show the normalized distance from the interpolation line between the A- and B-form vectors on the Y-axis with this line itself oriented on the X-axis.
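The axes described in the Figure 5 caption can be illustrated with a small projection sketch. It assumes, as one plausible reading of the caption, that each cluster representative is a 9-dimensional vector, that the X coordinate is its position along the line from the A-form vector to the B-form vector (0 at A, 1 at B), and that the Y coordinate is its perpendicular distance from that line, both normalized by the A-to-B separation. The function name and the example vectors are hypothetical, and the normalization choice is an assumption rather than a detail given in the caption.

```python
import math


def project_onto_ab_line(x, a_form, b_form):
    """Return (position along the A-B line, normalized distance from it)."""
    ab = [b - a for a, b in zip(a_form, b_form)]
    ax = [xi - a for a, xi in zip(a_form, x)]
    ab_len = math.sqrt(sum(c * c for c in ab))
    if ab_len == 0.0:
        raise ValueError("A-form and B-form vectors must differ")
    # Fractional position along the interpolation line (0 = A-form, 1 = B-form).
    t = sum(p * q for p, q in zip(ax, ab)) / (ab_len ** 2)
    # Component of x perpendicular to the line, in units of the A-B separation.
    perpendicular = [p - t * q for p, q in zip(ax, ab)]
    d = math.sqrt(sum(c * c for c in perpendicular)) / ab_len
    return t, d


# Hypothetical 9-dimensional vectors, chosen only to make the arithmetic easy.
a_form = [0.0] * 9
b_form = [2.0] + [0.0] * 8
point = [1.0, 1.0] + [0.0] * 7

print(project_onto_ab_line(point, a_form, b_form))  # (0.5, 0.5)
```

In this toy case the point sits halfway between the two canonical forms and half an A-to-B separation away from the interpolation line.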
Figure 6: Scatter diagram of cluster representatives in the 2D space of a 9-dimensional distance from the A-form and the B-form canonical conformations for DNA and RNA SSDs. Red squares - isolated DNA or RNA, green dots - DNA or RNA in complex with proteins.
Figure 7: Cluster representative geometries showing "55" stacking for all DNA and RNA SSD sequences in crystal structures of resolution greater than 3 Å. The coloring and representative choice are the same as in Figure 1. DP - DNA in complex with protein, RP - RNA in complex with protein.
Figure 8: Cluster representative geometries showing "53" stacking for all DNA and RNA SSD sequences in crystal structures of resolution greater than 3 Å. The coloring and representative choice are the same as in Figure 1. DP - DNA in complex with protein, RP - RNA in complex with protein.
Figure 9: Cluster representative geometries showing "33" stacking for all DNA and RNA SSD sequences in crystal structures of resolution greater than 3 Å. The coloring and representative choice are the same as in Figure 1. DP - DNA in complex with protein, RP - RNA in complex with protein.
Table of contents graphic 34x13mm (600 x 600 DPI)