
Computational Biochemistry

The TMB Library: A Library of Nucleosome Simulations of DNA Sequence Effects
Ran Sun, Zilong Li, and Thomas Connor Bishop
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.9b00252 • Publication Date (Web): 06 Sep 2019
Downloaded from pubs.acs.org on September 6, 2019



Journal of Chemical Information and Modeling

The TMB Library: A Library of Nucleosome Simulations of DNA Sequence Effects
Ran Sun,†,‡ Zilong Li,†,¶ and Thomas C. Bishop∗,†,§
†College of Engineering and Science, Louisiana Tech University
‡Engineering Physics Program
¶Computational Analysis and Modeling Program
§Chemistry and Physics Programs
E-mail: [email protected]
Phone: +1(318)-257-5209

Abstract

Nucleosomes are the fundamental building blocks of chromatin, the biomaterial that houses the genome in all higher organisms. A nucleosome consists of 145-147 base pairs of DNA wrapped 1.7 times around eight histones. Given a four-letter code (A, C, G, T), there are approximately 4^147 or 10^88 oligonucleotides that can form a nucleosome. Comparative, rather than comprehensive, studies are required. Here we introduce the TMB Library of nucleosome simulations and present a meta-analysis of over 20 microseconds of all atom molecular dynamics simulations representing 518 different realizations of the nucleosome. The TMB Library serves as a reference for future comparative, on-demand simulations of nucleosomes and a demonstration of iBIOMES Lite as a tool for managing a laboratory’s simulation library. For every simulation, dewatered trajectories, RMSD, and DNA helical parameter data are provided through iBIOMES Lite in a web browser and a file browser format. A novel view of nucleosomal DNA emerges from our meta-analysis. DNA conformation is restricted to a specific left-handed superhelix, but the range of conformations observed for individual bases and base pairs is not more restricted nor more highly deformed than DNA free in solution. With the exception of Roll, mean DNA helical parameter values obtained from simulations of nucleosomes are largely within the range of thermal motion of DNA free in solution. The library provides evidence of DNA kinking in the nucleosome and clearly demonstrates the effects of DNA sequence on the gross structure and dynamics of nucleosomes. These effects and mispositioning of the 601 super strong nucleosome positioning sequence can be detected in short simulations (10 ns). Collectively, the results provide a basis for comparative simulation studies of nucleosomes and extend our understanding of the binding of proteins and drugs to nucleosomal DNA. The TMB Library can be found at http://dna.engr.latech.edu/~tmbshare/

Introduction

Chromatin is the biomaterial that contains eukaryotic genomes. The fundamental building block of chromatin is the nucleosome. All genomic mechanisms are therefore either directly or indirectly affected by the structure and dynamics of nucleosomes. There are nearly 8,000 protein-DNA complexes in the Protein Data Bank (https://www.rcsb.org/stats/growth/protein_na_complex). In eukaryotes, the proteins in these complexes either complement or compete with nucleosomes to achieve DNA binding. Of the available protein-DNA complexes, approximately 160 contain nucleosomes. 1 Collectively they represent fewer than five of the 4^147 sequences of DNA that can be wrapped around a histone octamer. As a method, x-ray crystallographic studies of nucleosomes are fundamentally limited by the symmetry of the DNA sequence. 2

To explore structure-function relationships governing genomic processes, methods for docking proteins to nucleosomes, workflows for computing thermodynamic ensembles, and tools for analyzing nucleosomes within the context of chromatin must be developed. Given the nearly infinite variations in the nucleosome, arising just from DNA sequence, comparative methods are required. This approach differs from exhaustive studies such as the Ascona B-DNA Consortium (ABC) effort to characterize all possible tetranucleotide steps by molecular dynamics simulation. 3–6

We previously demonstrated 7 that, with sufficient resources, comparative molecular dynamics studies of nucleosomes can be computed overnight. We also employed similar comparative techniques to study DNA kinking. 8 A recent study of 601 provides six simulations of nucleosomes with different sequences of DNA. 9 Each simulation is 2 microseconds (µs) long. This time scale is required to capture the breathing of nucleosomal DNA but fails to capture complete unraveling of nucleosomal DNA. A much larger time scale is needed. With current computing resources and techniques, it is more efficient to compute ensembles or comparative simulations of nucleosomes rather than a single trajectory because parallel scaling provides limited returns beyond what can be achieved with one or two nodes. 7

As Next Generation Sequencing 10 and Hi-C 11 like methods enable genome-wide association of function at or near base pair resolution, including the positioning of individual nucleosomes, 12 there will be increasing demand to understand sequence-specific variations in the structure and dynamics of individual nucleosomes.

Here we present the TMB Library as both a reference for future comparative, on-demand simulations of nucleosomes and as a demonstration of iBIOMES Lite 13 as a tool for managing a laboratory’s simulation inventory. iBIOMES Lite enables users to curate a laboratory’s simulations into a library, to share raw and derived data in a manner that is accessible to experts and non-experts alike, to enable secondary use of simulation data by others, and to conduct meta-analyses of the library data. For the nucleosome experiments presented here, the meta-analysis provides insights that cannot be obtained from consideration of individual studies. The library provides evidence of DNA kinking in the nucleosome and clearly demonstrates the effects of DNA sequence on the gross structure and dynamics of nucleosomes. These effects and mispositioning of the 601 super strong nucleosome positioning sequence can be detected in short simulations (10 ns).

The TMB Library contains over 20 µs of all atom molecular dynamics simulations for 518 different realizations of the nucleosome. The library includes studies of positioning and mispositioning of nucleosomes for 23 unique sequences of DNA. The TMB Library is presented in two formats: a convenient web browser format for obtaining summary data and a file system tree for direct navigation and downloading. Below we provide a description of the TMB Library and a meta-analysis of the simulations.
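The size of the sequence space quoted above (approximately 4^147 or 10^88) can be checked with a few lines of Python. This is only an illustrative sketch; the function name `sequence_space` is hypothetical and not part of the TMB Library or iBIOMES Lite.

```python
import math


def sequence_space(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    return alphabet_size ** length


# 147 base pairs over the four-letter code (A, C, G, T).
n_sequences = sequence_space(147)

# Base-10 exponent of 4^147, computed without forming the huge integer.
exponent = 147 * math.log10(4)

print(f"4^147 has {len(str(n_sequences))} decimal digits (about 10^{exponent:.1f})")
```

Python integers have arbitrary precision, so the exact count is available as well as its order of magnitude, which is what makes exhaustive sampling impractical and comparative studies necessary.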

Methods

Modeling

The simulations in the TMB Library, experiments in iBIOMES Lite nomenclature, 13,17,18 represent eight studies as indicated in Table 1. Each study was designed to address a specific question. Here, we present a meta-analysis of the studies to identify common and novel features. All models except systems in the Nuc146 and Nuc147 studies were based on Protein Data Bank entry 1KX5. 19 This structure is the highest resolution and most complete model of the nucleosome. Except for the studies labeled SIN, Nuc146 and Nuc147, the DNA fragment in 1KX5 has been replaced with DNA corresponding to a different sequence. For some systems, the histones are modeled with tails (WT), and for some systems, there are no tails (NT). The studies labeled SIN, Nuc146 and Nuc147 included one or more point mutations or modifications to the histone core. Details regarding the choice of DNA sequence, the histone modifications, and the motivation are described below. The sequences utilized are graphically summarized in Figure 1.

Table 1: Summary of Studies

| Study | Model | System Size (atoms) | Number of Unique Systems | Force Field | Water Model | Solvent (layer) |
|---|---|---|---|---|---|---|
| ACGT (NT) | 1KX5 | 206,097 to 206,097 | 16 | ff99SB (ref. 14), ff99bsc0 (ref. 15) | TIP3P | NaCl (12 Å) |
| NFR (NT/T) | 1KX5 | 157,550 to 158,506 | 105 | ff14SB (ref. 16), ff99bsc0 | TIP3P | NaCl (12 Å) |
| Pos (NT/T) | 1KX5 | 157,550 to 157,550 | 336 | ff14SB, ff99bsc0 | TIP3P | NaCl (12 Å) |
| NucA (WT/T) | 1KX5 | 254,394 to 289,518 | 21 | ff14SB, ff99bsc0 | SPC/E | KCl (15 Å) |
| 601 (WT/T) | 1KX5 | 254,335 to 297,274 | 21 | ff14SB, ff99bsc0 | SPC/E | KCl (15 Å) |
| 601 (NT) | 1KX5 | 214,211 | 1 | ff14SB, ff99bsc0 | SPC/E | KCl (15 Å) |
| Nuc146 (FMH/WT) | 2CV5 | 266,038 to 266,083 | 3 | ff14SB, ff99bsc0 | SPC/E | KCl (15 Å) |
| Nuc147 (FMH/WT) | 1KX5/2CV5 | 255,833 to 260,216 | 9 | ff14SB, ff99bsc0 | SPC/E | KCl (15 Å) |
| SIN (M/WT) | 1KX5 | 387,061 to 387,089 | 6 | ff14SB, ff99bsc0 | TIP3P | NaCl (10 Å) |

The TMB Library comprises eight studies. The abbreviations accompanying each study label are: "NT" no histone tails, "WT" with histone tails, "FMH" Frog, Mouse, and Human histone variants, "M" histone mutants, "T" DNA threading.

ACGT: This set of 16 experiments was designed to represent all possible DNA sequence variants at each of the 147 step pair locations in an octasome. 8 This study serves as an experimental control that locates each base pair and each base pair step at every location in the nucleosome. The initial structure for each experiment was determined by removing the DNA from 1KX5, folding the desired sequence (AA)n, (AC)n, ... (TG)n, or (TT)n into the 1KX5 superhelix and docking it back onto the histone core by aligning phosphate atoms of the modeled DNA superhelix to the phosphate atoms of the 1KX5 superhelix. All systems were solvated using copies of the same initial solvent
box and simulated for 16 ns; thus differences can be attributed almost entirely to DNA.

NFR and Pos: This set of threading experiments was designed to study DNA sequence effects using sequences from S. cerevisiae. 7 The 21 sequences correspond to the most highly occupied and most well-defined nucleosome for each of the 16 chromosomes, denoted Pos, along with sequences corresponding to nucleosome free regions in chromosomes 2, 4, 5, 8 and 16, denoted NFR. In both studies, a sequence corresponding to 147 base pairs was identified based on experimental data. 20 Each sequence was extended by 10 nucleotides in both the upstream and downstream directions. For each of the 21 sequences, 21 nucleosomes corresponding to 10 upstream, 10 downstream, and the central position of the selected sequence (Figure 2) were created using the DNA folding and docking scheme developed for ACGT. For these simulations, two additional base pairs, GG and CC, were added to each nucleosome to stabilize the DNA ends. These bases are not included in our helix parameter analysis. Each of the 441 unique nucleosomes was sampled for at least 20 ns by distributing the simulations across XSEDE resources.

Figure 1: Sequences used in each study. Graphical representation of the sequences used for each study. For the NFR and Pos studies, only 147 nucleotide subsequences of the indicated sequence were used. For all other studies, the entire sequence indicated was used for model building. The ACGT and SIN studies included 147 base pairs. The NucA and 601 studies included 177 base pairs. Three of the five NFR simulations included poly A tracts.

NucA: The experiments labeled NucA were designed to explore mispositioning of nucleosome A of the Mouse Mammary Tumor Virus promoter (MMTV). These simulations have not been reported elsewhere. The MMTV positions six nucleosomes referred to as nucleosomes A to F. Protein Data Bank entry 5F99 21 was released in 2016 and is a 2.63 Å resolution structure of nucleosome A of the MMTV. It is the only x-ray structure available that includes a naturally occurring DNA sequence. This structure was not available when our study was initiated. The NucA systems were modeled by docking the relevant MMTV sequence to 1KX5 as described above. For NucA, 21 systems were created corresponding to 10 upstream, 10 downstream, and ideal positioning of NucA from the MMTV (Figure 2). The 5F99 x-ray structure
thus provides a means of validating these simulations for the ideally positioned NucA system; however, x-ray crystallography cannot provide insights into the alternate positionings of NucA that are detected by base pair resolution nucleosome positioning assays. 22

601: These experiments were designed to explore the mispositioning of the super strong nucleosome positioning sequence known as 601. 23 These simulations have not been reported elsewhere. Two x-ray structures containing this sequence exist, 3LZ0 and 3LZ1. 24 Both are 2.5 Å resolution. Models of the ideally positioned 601 nucleosome and 20 mispositioned nucleosomes (10 upstream and 10 downstream), Figure 2, were built using the Nuc147 mouse model described below with the sequence known as 601−177. 23 The systems were constructed with histone tails (WT) and with no histone tails (NT). Unlike the Pos and NFR simulations, which had 151 (147 + 4) base pairs regardless of positioning, all 601 models included the entire length of 177 base pairs of DNA without any additional base pairs added for stability. The ideally positioned 601 contained two equal length linkers of 15 base pairs. Threading from positions −1 to −10 incrementally increased one linker up to 25 base pairs and decreased the other down to 5 base pairs. Threading to positions +1 to +10 did the opposite: it incrementally decreased one linker to 5 base pairs and increased the other to 25 base pairs, see Figure 2. For the ideal position, 1000 ns long simulations with and without tails are included in the library. The mispositioned nucleosomes were simulated for 100 ns.

Table 2: Simulation Details

| Study | Resource | CPUs | Simulations (number x ns) | Average time (ns/day) | Total duration (ns) | Volume of data (GB) |
|---|---|---|---|---|---|---|
| ACGT (NT) | LONI HP, NAMD 2.6 mpi | 64 | 16 x 16 | 1.6 | 256 | 126 |
| NFR (NT/T) | Lonestar, NAMD 2.9 mpich | 240 | 105 x 20 | 25.9 | 2,205 | 689 |
| Pos (NT/T) | Kraken, NAMD2.9/mpi | 64 | 336 x 20 | 5.9 | 6,786 | 3,000 |
| NucA (WT/T) | QB2, NAMD 2.10 ibverb cuda | 180 | 21 x 20; 10* x 20 | 13.0 | 510 | 192 |
| 601 (WT/T) | Blue Waters, NAMD 2.10 cuda | 98 | 21 x 20; 10* x 100; 1 x 1000 | 15.1 | 4,078 | 1,500 |
| 601 (NT) | QB2, NAMD 2.10 ibverb cuda | 180 | 1 x 1000 | 18.8 | 1,000 | 316 |
| Nuc146 (FMH/WT) | QB2, NAMD 2.10 ibverb cuda | 180 | 3 x 499 | 12.8 | 1,497 | 506 |
| Nuc147 (FMH/WT) | QB2, NAMD 2.10 ibverb cuda | 180 | 6 x 499 | 13.5 | 2,994 | 1,015 |
| SIN (M/WT) | QB1/Kraken, NAMD2.9 ibverb/mpi | 256 | 6 x 20; 10* x 15 | 3.8 | 1,020 | 357 |
| Summary | | | 518 | | 20,346 | 7,701 |

Simulation Details summarizes the resources utilized and the computations performed for each study. This information can be found in the TMB Library under the Execution info and Protocol tabs. The abbreviations are: "QB1" and "QB2" denote the Louisiana Optical Network Infrastructure supercomputers QueenBee 1 and QueenBee 2. A "*" denotes the number of replicas simulated. The length of each simulation was determined by the resources available at the time of the study.

Nuc147 and Nuc146: The experiments labeled Nuc147 and Nuc146 were designed to study the convergence of simulations initiated from two different x-ray structures, 1KX5 and
2CV5. 25 These simulations have not been reported elsewhere. 2CV5 contains 146 base pairs of DNA and 1KX5 contains 147. Protein Data Bank entry 1KX5 provides almost complete data for the unstructured histone tails; however, the sequences do not correspond to wild type. 2CV5 does not have histone tails. 2CV5 contains histones from Homo sapiens. 1KX5 contains Xenopus laevis (frog) histones. Using 1KX5 and 2CV5 as templates, nine initial structures were generated. First, 1KX5 was modified to be 100% consistent with the frog sequence by adding missing amino acids to the tails as needed. The modified 1KX5 histone core was aligned with 2CV5’s histone core, then the histone tails and 147 base pair superhelix from 1KX5 were transferred onto the histone core of 2CV5. Docking a 147 base pair superhelix onto 2CV5 was a deliberate mismatch since 2CV5 contains only 146 base pairs. For both of the 147 base pair frog structures, amino acid substitutions transforming the frog to human and the frog to mouse were introduced. The variants are denoted collectively as the FMH variants of Nuc147. The final set of systems in this study are the FMH variants of 2CV5 with the 146 base pair superhelix as in the x-ray structure. These systems are denoted as FMH variants of Nuc146. For each of the 9 systems, 499 ns of continuous molecular dynamics were computed.

Figure 2: Nucleosome Structures. Left: Four nucleosome structures are found in the TMB Library: (i) contains eight histones with no tails (NT) and no linker DNA; (ii) contains eight histones with tails (WT) and no linker DNA; (iii) contains eight histones with no tails (NT) and linker DNA; (iv) contains eight histones with tails (WT) and linker DNA. Right: For nucleosomes with linker DNA, threading studies are conducted by incrementally mispositioning the DNA by up to 10 base pairs in the upstream and downstream directions. Positions are numbered relative to ideal positioning at 0, indicated by the red rectangle. Positions -10 through +10 represent mispositioning from 10 base pairs upstream through 10 base pairs downstream of ideal positioning.

SIN: The experiments labeled SIN were designed to study single-point mutations of the histones associated with disruption of the SWI/SNF-independent (SIN) complex. 26 The mutations correspond to histone H3 E105K, histone H3 R116H, histone H3 T118I, histone H4 R45H, and histone H4 V43I. The systems are labeled accordingly. An additional unmodified structure represents the wild type nucleosome. All nucleosomes were modeled on 1KX5. As indicated in Table 2, an initial sampling of 20 consecutive nanoseconds was obtained for each system, and then 10 independent replicas for each system were utilized to accumulate an additional 150 ns (10* x 15 ns) for each system; here and in Table 2 a "*" denotes the number of replicas.
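The threading scheme used in the NFR, Pos, NucA and 601 studies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' model-building code: the function names are hypothetical, negative positions are assumed to shift the 147 base pair window toward the start of the extended sequence, and the assignment of which 601 linker grows for negative positions is an arbitrary labeling choice.

```python
def threading_windows(extended_seq: str, core_len: int = 147, flank: int = 10):
    """Yield (position, subsequence) for threading positions -flank..+flank.

    extended_seq is the selected core sequence extended by `flank`
    nucleotides on each side (147 + 2 * 10 = 167 nucleotides by default).
    """
    expected = core_len + 2 * flank
    if len(extended_seq) != expected:
        raise ValueError(f"expected {expected} nucleotides, got {len(extended_seq)}")
    for position in range(-flank, flank + 1):
        start = flank + position  # position 0 is the ideally positioned window
        yield position, extended_seq[start:start + core_len]


def linker_lengths_601(position: int, ideal_linker: int = 15, max_shift: int = 10):
    """Return the two linker lengths of the 177 bp 601 model at a threading position.

    At position 0 both linkers are 15 bp; shifting by one base pair moves
    one base pair from one linker to the other (assumed sign convention).
    """
    if not -max_shift <= position <= max_shift:
        raise ValueError("position must lie between -10 and +10")
    return ideal_linker - position, ideal_linker + position


# Toy extended sequence: 10 flanking + 147 core + 10 flanking nucleotides.
toy = "A" * 10 + "C" * 147 + "G" * 10
windows = dict(threading_windows(toy))
print(len(windows), linker_lengths_601(-10), linker_lengths_601(10))
```

Each extended sequence yields 21 windows, and for 601 the two linkers always sum to 30 base pairs, so every model keeps the full 177 base pairs regardless of position.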

Simulations

Parameters: All of the simulations include explicit solvent (TIP3P 27 or SPC/E 28) and ions (NaCl or KCl) with an approximate molar concentration of 150 mM and an initial solvent layer ranging from 10 Å to 15 Å depending on the study. The systems were all built with Amber’s tleap 29 module using the force field parameters indicated in Table 1. In all cases NAMD2 30 was used as the compute engine with periodic boundary conditions, PME based long range electrostatics, 31 a 2 fs time step, rigid constraints for all hydrogen atoms, Berendsen pressure regulation 32 at 1.0 atm and Langevin temperature control 33 at 300 K. Full simulation details are available from the Protocol tab in the web view of the TMB Library or by downloading the NAMD log or configuration files from the filesystem view of the library. Continuous simulation trajectories range from 20 ns to over 1 µs. All trajectories are computed in 1 ns simulation tasks. Several sets of independent parallel replicas are included for statistical comparison of single versus multiple trajectory approaches. Each replica trajectory was computed as a series of 1 ns simulation tasks independent of all other replicas.

Workflows: Simulations were run on widely varying architectures including some with and some without GPU acceleration. Benchmarks indicate that parallel scaling efficiency dropped off significantly beyond 100-200 processors depending on the computer system. 34 In most cases, this was significantly less than the total number of processors available. The number of processors chosen for each computer system represented a balance between parallel scaling efficiency and throughput. Texas Advanced Computing Center’s Lonestar system provided the shortest time to completion of any single task. However, the simulations run on Blue Waters utilized only a small fraction of available resources. Thus, running multiple concurrent simulations on Blue Waters provided significantly higher daily throughput than any other system. The timings reported by Smith 7 clearly demonstrate that simulation time is deterministic once resources are allocated. The variation between predicted time to completion and actual run time provides a simple metric for task success or failure. Jobs taking longer than 110% of the predicted time to completion can be proactively killed and restarted with a high degree of confidence that the delay reflects a hardware glitch rather than an intrinsic failure of the modeling. Caution must be exercised here since a successful computation is not necessarily a successful simulation. The next section provides sufficient, but minimal, criteria for determining simulation success.
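The 110% run-time criterion described above can be expressed as a small check. The sketch below is an assumption about how such a rule might be coded, not the authors' workflow scripts; the function names are hypothetical, and the 15.1 ns/day figure is taken from Table 2 purely as an example input.

```python
def predicted_hours(task_ns: float, ns_per_day: float) -> float:
    """Predicted wall-clock hours for a task, given the benchmarked rate."""
    if ns_per_day <= 0:
        raise ValueError("ns_per_day must be positive")
    return 24.0 * task_ns / ns_per_day


def should_restart(elapsed_hours: float, predicted: float, tolerance: float = 1.10) -> bool:
    """True when a job has run longer than `tolerance` times its predicted time."""
    return elapsed_hours > tolerance * predicted


# A 1 ns simulation task at the 15.1 ns/day average rate listed in Table 2.
estimate = predicted_hours(task_ns=1.0, ns_per_day=15.1)
print(f"predicted {estimate:.2f} h, kill after {1.10 * estimate:.2f} h")
print(should_restart(elapsed_hours=1.8, predicted=estimate))
```

Because every trajectory is computed as a series of 1 ns tasks, a restart under this rule costs at most one task's worth of work.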

Analysis

Post Processing: All trajectories were dewatered, and the coordinates were wrapped to the unit cell before root mean square (RMS) fitting all frames for a given system to the initial conformation. All trajectories in the TMB Library include periodic box information, ions, histones, and DNA. All are available for downloading in DCD format. An Amber formatted parmtop file, nowat.parm, compatible with the dewatered trajectories is included in each directory. Automated tools are recommended for downloading individual trajectories from the library. Each nanosecond trajectory is approximately 300 MB. The entire collection of trajectories is over 6 TB.

Conformational Analysis: Two types of analyses were applied to all experiments to determine the validity of the simulations. The metrics were chosen such that they can be used to compare any two simulations or any group of simulations regardless of which experiment they are associated with. The two methods of comparison employed are the root mean square deviation (RMSD) and DNA helical parameter (HP) analysis.

RMSD of the histone core, the DNA core, and the nucleosome core were computed to analyze the structural stability of each system. For this purpose, the histone core is defined using residue numbering as in PDB id 1KX5: H3 (T45 to A135), H4 (N25 to G122), H2A (V27 to L96) and H2B (Y37 to K122). All heavy atoms (non-hydrogen) in the histone core were used for histone core RMSD analysis; thus FMH and SIN vary slightly from the others in terms of the atoms selected. This selection is chosen specifically to avoid the mobility of the unstructured histone tails. The inclusion of tails raises the RMSD in a manner that is unpredictable and non-informative. 35 The DNA core is defined as the 146 or 147 base pairs in direct contact with the histone core of each system. All heavy atoms (non-hydrogen) were selected for DNA core RMSD analysis. The atom count differs from simulation to simulation for the DNA core because of sequence differences. Nonetheless, the number of base pairs and their relative positioning on the histone octamer is the same. This allows a comparison of two systems regardless of the length of linker DNA included with each. The nucleosome core is defined as the union of atoms contained in the histone core and the DNA core.

To determine RMSD values for the histone core, the DNA core and the nucleosome core, all frames of a trajectory were fit to the initial structure using the three atom selections defined above for each fitting. This allows comparing RMSD values between any two simulations with the caveat that there are subtle differences in the atom selection as described above.

Figure 3: Workflow used to publish the TMB Library. Left to Right: Data Validation: Collect and organize all trajectory data using a systematic naming convention based on the idea of an experiment (folder) as a collection of simulations (subfolders). Simulation Validation: Compute desired analyses within each subfolder. Meta-analysis data resides in a top level folder named Summary. Library Generation: Run iBIOMES Lite to generate static HTML pages that categorize all data. Data Publication: Configure data sharing as public or private.

The DNA helical parameters provide a description of DNA stacking (inter base pair parameters) and pairing (intra base pair parameters). This metric is a local measure of structure. 36,37 Values from different simulations can be compared if the data is aligned to represent the same location along the nucleosome superhelix. For the analysis presented here, only the 146 or 147 base pairs in direct contact with the histone octamer are considered. Thus, helix parameter data from any system containing 146 or 147 base pairs can be compared to any other system containing 146 or 147 base pairs, respectively. The NASTRUCT utility, part of the AmberTools package, 29 is used to calculate the DNA helix parameter values. NASTRUCT provides the same values as 3DNA 38 and is based on El Hassan’s algorithm. 39 The output is stored in self-describing Hierarchical Data Format version 5 (HDF5) 40 files that include all inter and intra base pair helix parameter values for each simulation. In all cases, analysis began with an initial structure that allowed NASTRUCT to identify proper base pairing. The NASTRUCT outputs and HDF5 formatted data files are available for download from the TMB Library’s Browse files tab.
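The histone core selection used for the RMSD analysis can be written down as a small lookup. The sketch below is illustrative only: the residue ranges are copied from the text (1KX5 numbering), while the dictionary, function names, and the `(histone, residue number, element)` tuple layout for atoms are hypothetical and do not correspond to any particular analysis package.

```python
# Residue ranges of the histone core as quoted in the text (1KX5 numbering).
HISTONE_CORE = {
    "H3": (45, 135),
    "H4": (25, 122),
    "H2A": (27, 96),
    "H2B": (37, 122),
}


def in_histone_core(histone: str, resid: int) -> bool:
    """True if the residue lies inside the core range for that histone type."""
    first, last = HISTONE_CORE[histone]
    return first <= resid <= last


def select_core_heavy_atoms(atoms):
    """Keep non-hydrogen atoms that belong to histone core residues.

    `atoms` is an iterable of (histone, resid, element) tuples.
    """
    return [
        atom for atom in atoms
        if in_histone_core(atom[0], atom[1]) and atom[2] != "H"
    ]


# Toy atom list: one tail atom, one core hydrogen, two core heavy atoms.
toy_atoms = [("H3", 10, "N"), ("H3", 45, "H"), ("H3", 45, "CA"), ("H2A", 96, "C")]
print(select_core_heavy_atoms(toy_atoms))
```

Excluding the tails in this way keeps the selection comparable between systems built with tails (WT) and without them (NT).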

Publication The workflow used to generate the TMB Library is a four-step process, Figure 3. Data Validation: Regardless of where the simulations were computed, the raw or the dewatered trajectory files for each experiment must be organized on disk in a tree structure using a systematic naming convention. For every simulation the following raw trajectory files are available: initial Amber formatted parameter and topology files (*.parm) and coordinate files (*.crd) with and without solvent (sys.* and nowat.*, respectively), NAMD2 formatted configuration (*.conf) and output files (*.log), dewatered DCD trajectory files (1000 frames at 1 frame/ns in nowat.*.dcd). Once processed by iBIOMES Lite the web browser view of the library provides an organized summary of simulations that can be easily checked by nonexperts for missing files, file consistency, and successful job completion. Simulation Validation: For the TMB Library, two types of analyses were computed to validate the simulations: RMSD analysis and helical parameter analysis. Other libraries may require different data sets to verify simulation success. The RMSD trajectory data is pro-

ACS Paragon Plus Environment

8

Page 9 of 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

vided as a plot (rmsd*.png) and in CSV format (rsmd*.dat). The inter and intra base pair stacking and step parameters are provided as NASTRUCT output (*.nastruct.dat) and in HDF5 files (hps.hd5). Mean helical parameter values are provided as images(six-plot*.png). A summary folder exists for each study with the meta-analysis of DNA helical parameter values and Cartesian coordinate RMSD. Here, the RMSD values are plotted as violin plots. This data provides users the ability to compare simulations in a study and identify anomalies. Library Generation: The ibiomes-litepublish function is used to parse all the trajectory and analysis files into static HTML files. The XML and about.html templates provided by iBIOMES Lite must be modified to link to the user’s laboratory and logo rather than iBIOMES. We also modified the HTML to include URLs for the direct download of raw data. Data Publication: All data in the library is available in a web browser and file browser format. The library enables novice and expert users to determine if the data is valid (has a consistent set of successful computations) and if the simulations are valid (characteristic properties are within acceptable limits). The library is shared publicly by exposing the files to an HTTP server. It can be accessed locally without the HTTP server by merely directing a web or file browser to the local file system. For each simulation, the web browser interface organizes the data as Summary, Browse files, Execution info, and Protocol. The Summary tab provides a sample structure that is displayed with JSmol, analysis data, a summary of dynamics, details of the molecular system and details of the computational tasks. The Browse files tab provides a collapsible view of files organized by type. The Execution info tab provides a summary of all computational tasks. The Protocol tab summarizes and displays all simulation variables. 
iBIOMES Lite provides additional tools for managing data consistency and processing tasks, including tools to update the analysis when new data is entered into the library. All data in the library can be accessed directly by remote users with various tools, e.g. Python’s urllib.request functions or command-line tools such as wget. Because the library provides organized read access to our raw and derived data, experienced modelers can extract whatever analysis they desire from our library. Direct access to raw data can be more powerful than a database.
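A minimal sketch of such remote access with Python’s urllib.request is given below. The base address is the file browser URL listed under Supporting Information; the helper names library_url and fetch_text are hypothetical, and any relative path passed to them is an assumption, since the directory layout should be read from the file browser view itself.

```python
import urllib.request
from urllib.parse import urljoin

# File browser address listed under Supporting Information.
BASE_URL = "http://dna.engr.latech.edu/~tmbshare/sims/"


def library_url(relative_path, base=BASE_URL):
    """Build an absolute URL for a file below the library's file browser root."""
    # A leading slash would make urljoin discard the base path, so strip it.
    return urljoin(base, relative_path.lstrip("/"))


def fetch_text(url, timeout=30):
    """Download a text file over HTTP and return its contents as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call such as fetch_text(library_url("some_study/some_simulation/some_file.dat")) would then return the raw text of that file; the path shown here is purely illustrative.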

Results

All RMSD and helix parameter data in the TMB Library is reduced to three summary graphs. Histone core and DNA core RMSD data are plotted in Figure 4. The range and mean of helix parameter values across all simulations are plotted in Figure 5 and Figure 6, respectively.

Histone Core RMSD: All simulations except SIN and 601 NT resulted in mean RMSD values for the histone core below 2.5 Å. The highest histone core RMSD values for any system were observed for 601 NT (1000 ns), with a mean RMSD of 3.3 Å. This system is discussed in greater detail in the Novel Observations section. For the SIN study, the RMSD values of the individual replicas range from 2.4 to 3.0 Å, consistently higher than for all other simulations. H4-V43I exhibits the lowest cluster of RMSD values (2.4 to 2.5 Å) and H3-E105K the highest (2.9 to 3.0 Å). The wild type nucleosome in the SIN study exhibits intermediate RMSD values (2.6 to 2.7 Å). The large deviations observed for these simulations may reflect the known destabilizing effect of the SIN mutations or differences in assembly or simulation protocols. SIN is the only study assembled outside of our lab. Its simulation protocol utilized a 20 ns continuous trajectory followed by 10 × 15 ns independent replicas. The fact that the RMSD values of the SIN wild type nucleosome are higher than those of Nuc147, an identical system in theory, suggests that the deviations reflect differences in build or simulation protocols. Caution should be exercised in comparing the SIN simulations to others in the library, but we expect that comparisons within the SIN study are valid.
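The kind of summary described above (range and mean of RMSD per simulation, compared against a 2.5 Å reference value) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summarize_rmsd helper is hypothetical, the input is assumed to be a mapping from simulation name to a list of per-frame RMSD values in Å, and it is not the analysis code used to build the library.

```python
from statistics import mean


def summarize_rmsd(rmsd_by_simulation, cutoff=2.5):
    """Summarize per-frame RMSD values (in Å) for each simulation.

    Returns a dict mapping each simulation name to a tuple
    (minimum, mean, maximum, mean_exceeds_cutoff).
    """
    summary = {}
    for name, values in rmsd_by_simulation.items():
        average = mean(values)
        summary[name] = (min(values), average, max(values), average > cutoff)
    return summary
```

Simulations flagged by the last element of the tuple are candidates for closer inspection, in the same way that SIN and 601 NT stand out in the text.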


Figure 4: RMSD Analysis. Left: Solid dots represent average RMSD values for DNA cores and crosses represent average RMSD values for histone cores observed during the simulations. Simulations are labeled as in Tables 1 and 2. ACGT, NFR, and Pos (group A) used similar build and simulation protocols. NucA and 601 (group B) used similar build and simulation protocols. The build and simulation protocols for the SIN, 146 and 147 experiments (group U) were unique. All simulations except ACGT used the same force field. Groups A and B include mispositioned nucleosomes. Group U does not. Data should be compared accordingly. Right: The average RMSD values of the DNA core for the mispositioned 601 WT simulations (from position −10 to +10) using different time intervals. The blue square line represents the mean RMSD values obtained using the full trajectory (100 ns) for each position. The red diamond line used only the first 20 ns. The yellow line used only the first 10 ns. The overall trend is independent of sampling time. Lower DNA core RMSD values are an indication of positioning.

ACGT, Pos, and NFR (group A) used identical build and simulation protocols. The force fields and total simulation times differed. The only differences between the 336 Pos simulations and the 105 NFR simulations are the DNA fragments docked to the histone core. The histone RMSD values for NFR are consistently lower than for Pos, indicating that variations in DNA material properties alone can alter the structure of the histone core even on the 20 ns timescale. The magnitude of the sequence effect on the histone core is related to the level of sequence-specific positioning associated with the fragment. The NucA and 601 studies (group B) utilized identical build and simulation protocols and provide further support for this hypothesis. The NucA sequence is not a super strong positioning sequence; 601 is. The RMSD values observed for the histone core in NucA are lower than for the 601 simulations.
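The time-interval comparison shown in the right panel of Figure 4 (mean DNA core RMSD over the first 10 ns, the first 20 ns, and the full 100 ns) can be illustrated with the following Python sketch. The windowed_mean_rmsd helper is hypothetical, and the input format (parallel lists of time stamps in ns and RMSD values in Å) is an assumption rather than the format of the library’s data files.

```python
def windowed_mean_rmsd(times_ns, rmsd, windows_ns=(10.0, 20.0, 100.0)):
    """Mean RMSD over the interval [0, w] ns for each window length w.

    times_ns and rmsd are parallel sequences; a window that contains no
    frames yields NaN.
    """
    result = {}
    for window in windows_ns:
        selected = [r for t, r in zip(times_ns, rmsd) if t <= window]
        result[window] = sum(selected) / len(selected) if selected else float("nan")
    return result
```

Applying this function to each mispositioned 601 WT trajectory and plotting the three means against position would give curves of the kind compared in the figure.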
Figure 5: Range of the DNA Helix Parameters. Left: the inter base pair helical parameters. Right: the intra base pair helical parameters. The blue line represents the maximum and the green line the minimum value of each helical parameter over all simulations in the library. The three red dotted lines represent the maximum, minimum, and mean values of the helical parameters obtained from simulations of DNA free in solution. 6

DNA Core RMSD: In all cases, the RMSD values for the DNA core are higher than for the histone core. Since DNA wraps around the outside of the histones, it can move independently of the histone core. If structural transitions of the histone core occur, then the DNA must also adapt. Thus, we expect, and indeed observe, DNA core RMSD values to be higher than those observed for the histone core in a given simulation. DNA RMSD is used as a measure of nucleosome superhelix stability as a function of the DNA sequence. The expectation in the threading simulations is that DNA RMSD correlates with mispositioning. This is the general trend. DNA RMSD for Pos (336 simulations) is higher than for NFR (105 simulations), and NFR is higher than ACGT (16 simulations). These studies (group A) used nearly identical build and simulation protocols. In terms of sequence complexity, ACGT is based on dinucleotide repeats while NFR and Pos are of biologic origin. Group B simulations show the same trend. DNA core RMSD values for 601 (21 simulations), a super strong positioning sequence, are higher than for NucA (21 simulations), a natural positioning sequence that supports multiple positions. Collectively, these results suggest that long-range sequence complexity plays a role in the stability of DNA on the histone octamer, i.e. the material properties of DNA affecting nucleosome positioning extend beyond the dinucleotide length scale.

In the case of 601, mispositioning is strongly related to DNA core RMSD, see also Figure 4. Violin plots are available in the 601 summary folder. A plot of DNA RMSD values versus mispositioning distance indicates that ideal positioning corresponds to a local maximum that appears within a broader minimum. This trend can be observed using the first 10 ns, the first 20 ns, or the entire 100 ns of sampling. This simple pattern is not observed for the Pos or NucA data. An inspection of the violin plots or raw data for NucA indicates that an RMSD of 2.5 Å can be used to cluster the simulations into two groups. All NucA replicas near the ideal positioning fall under the 2.5 Å cut-off. Only replicas representing mispositioned nucleosomes are in the group above 2.5 Å. There are at least two possible reasons why the biologic results are not as simple as those for 601. First, the experimental data used to determine the ideal positioning for Pos is not necessarily accurate to base pair resolution, and NucA is known to support multiple positions.

Second, the positioning effect for the biologic sequences is not as strong as it is for 601. Simulation time alone is not a distinguishing factor, as evidenced by Figure 4. Moreover, the RMSD for the DNA in the 500 ns long Nuc147 simulations spans the same range as the 20 ns ACGT simulations. The highest DNA RMSD observed for Nuc147 is for 2CV5Human. In this system, the structure of DNA from 1KX5 was docked onto 2CV5. The 2CV5 structure contains only 146 base pairs. Given this mismatch, the DNA RMSD is expected to be high, but it is not higher than the values associated with the mispositioned NFR or Pos simulations. Pos and NFR used similar assembly and simulation protocols but contain different DNA sequences. For the Nuc146 systems (500 ns) the DNA RMSD varies significantly, considering that differences in these structures are limited to modifications of the histones (FMH variants) rather than the DNA. The range is similar to the range observed in SIN (150 ns), in which single point mutations were introduced into the histone core.

Figure 6: Mean value of the DNA Helix Parameters. Left: the inter base pair helical parameters. Right: the intra base pair helical parameters. The cyan line represents the mean values over all the simulations. The red dotted line represents the mean values of free DNA, and the two black dotted lines (from top to bottom) represent +/- 1 standard deviation from the mean values of free DNA. 6

Helix Parameters: We consider the range of helix parameter values observed at any location along the nucleosome superhelix regardless of the sequence and regardless of the simulation from which the values were obtained, see Figure 5 and Figure 6. If the histones are solely responsible for DNA structure and dynamics, the values from different simulations will agree. To the extent DNA structure and dynamics are determined by DNA sequence, the values will differ. Data for individual simulations can be found on the Summary tab of each simulation in the TMB Library and compared to Figure 5 and Figure 6.

In Figure 5, intra and inter base pair helical parameter values are graphed as two sets of six plots with location reported relative to the nucleosome dyad. The reported range of values represents over 20 µs of nucleosome simulations and is compared to the range of values observed for DNA free in solution (dotted red lines) from a similarly large set of simulations. 6 What emerges from this comparison is that, rather than being tightly constrained, the pairing and stacking of bases in the nucleosome appear to explore a range of conformational space similar to that of DNA free in solution. In many instances, the range of values observed for nucleosomal DNA is actually larger than for free DNA. The exceptions are the inter base pair helical parameters Slide and Rise and the intra base pair helical parameters Shear, Stretch and Stagger. These parameters tend to be more restricted than in free DNA, but only at some locations along the superhelix. There are no easily discernible patterns as to where these restrictions occur. These observations are consistent with a stress release mechanism in which transient kinks or other large scale deformations at one or more sites allow other sites to relax or adhere to relatively strict conformational tolerances. 41 For the ACGT simulations, we demonstrated that such kinking is dynamic on the nanosecond timescale. 8

In Figure 6, the average of the mean values obtained from all 518 simulations is presented to identify patterns indicative of superhelix conformation. As a summary statistic, regardless of sequence and state of positioning or mispositioning, the intra and inter helix parameter mean values tend to remain within one standard deviation of the values obtained during simulations of free DNA. Caution must be employed not to over-interpret this summary statistic, since individual data sets may vary considerably. The variation of mean Roll values along the nucleosome superhelix maintains a sinusoidal pattern. Tilt and Propeller also exhibit regular patterns. Such regular patterns are not as pronounced for any of the other parameters. Roll and Opening are the only parameters to exhibit average values on the nucleosome that differ by more than one standard deviation from the average values obtained for free DNA. One standard deviation is the range expected just from the thermal motion of DNA.
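The one-standard-deviation comparison used for Figure 6 can be expressed compactly. The Python sketch below is illustrative only: the outside_one_sigma helper is hypothetical, and it assumes that the free DNA reference for a given helical parameter is a single mean and a single standard deviation, with the nucleosome means supplied as one value per location along the superhelix.

```python
def outside_one_sigma(nucleosome_means, free_mean, free_sd):
    """Indices of superhelix locations where the nucleosome mean deviates from
    the free DNA mean by more than one free DNA standard deviation."""
    return [
        index
        for index, value in enumerate(nucleosome_means)
        if abs(value - free_mean) > free_sd
    ]
```

An empty result for a parameter corresponds to mean values that stay within the range expected from thermal motion of free DNA; a non-empty result marks locations of the kind reported in the text for Roll and Opening.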


The data presented in Figures 5 and 6 support further development of metrics for DNA kinking, as proposed by Mukherjee 8 and Ponomarev, 42 to investigate how the material properties of DNA encoded in its sequence alter the structure and dynamics of nucleosomal DNA.

Figure 7: Self-looping and DNA linking observation. Left: 601 NT. At 178 ns of the 1000 ns trajectory, the DNA linkers interact across periodic images. Right: 601 WT at posm05. Snapshots at t = 0 ns and t = 64 ns of the 100 ns trajectory show the DNA linker looping back on itself (self-closure).

Novel Observations: An unusual event occurred during the 601 NT simulation (1000 ns). An inspection of the trajectory shows that within 200 ns the DNA linkers on opposite sides of the nucleosome line up between periodic cells to yield near-native inter-nucleosome base pair stacking, Figure 7. These images were obtained from the trajectories before RMSD fitting. The anomalously large RMSD for the histone core observed in this simulation appears to arise from linker-linker interactions between periodic images. The linker-linker interactions are dynamic on a 100 ns timescale. They form during the first 200 ns and explore near-native stacking conformations until the linkers eventually repel each other from 500 ns to 1000 ns. We also observe linker-linker interactions in the 601 WT simulations (100 ns) at positions +5 and −5. However, rather than interacting across periodic boundaries, the long linker arm loops back for self-interaction with the shorter linker arm within the periodic cell. Both simulations contain 177 base pairs, thus 30 base pairs of linker DNA are sufficient for self-linkage interactions. These linker-linker interactions depend not only on the length of the DNA but also on sequence-dependent bending of DNA and, in the case of 601 NT, the periodic boundary conditions.

We propose that systems can be constructed with bonds that pass across the periodic boundaries as a novel means of simulating regular conformations of nucleosome arrays. This is similar in spirit to the infinitely long segment of DNA simulated by Bishop. 43 For systems constructed in this manner, rescaling the unit cell introduces artificial forces into the system, but otherwise the magnitude of the forces and their temporal variations are consistent with all other forces present in the simulation. The simulations demonstrate that the inter-nucleosome forces propagated through the linker DNA are sufficient to disrupt the histone core structure. The forces in our simulations are fundamentally different from the external forces that arise in pulling simulations 44 or atomic force microscopy experiments. 45,46
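One simple way to distinguish a contact made across a periodic boundary from a contact inside the primary cell is to compare the direct distance between two atoms with their minimum-image distance. The Python sketch below illustrates the idea under a strong assumption: it treats the periodic cell as orthorhombic, which is not stated in this section and may not match the cell used in the simulations. The function names, coordinates, and cutoff are hypothetical.

```python
import math


def minimum_image_distance(a, b, box):
    """Distance between point a and the nearest periodic image of point b.

    Assumes an orthorhombic cell with edge lengths given by box (an
    assumption made for illustration only).
    """
    total = 0.0
    for a_i, b_i, length in zip(a, b, box):
        delta = a_i - b_i
        # Shift the separation into the interval [-length/2, length/2].
        delta -= length * round(delta / length)
        total += delta * delta
    return math.sqrt(total)


def contact_across_boundary(a, b, box, cutoff):
    """True if a and b are within cutoff only through a periodic image."""
    return minimum_image_distance(a, b, box) <= cutoff < math.dist(a, b)
```

Applied to atoms at the ends of the two linker arms in frames taken before RMSD fitting, a test of this kind would separate interactions between periodic images, as reported for 601 NT, from self-interaction within the cell, as reported for 601 WT.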

Conclusion

The TMB Library: The resources available at today’s supercomputing centers far exceed the parallel scaling limits of medium size systems (∼500,000 atoms). For systems of this size, computing resources are more efficiently utilized for comparative studies. However, this introduces the additional burden of managing not just the simulations but also the analysis and data sharing. We demonstrated that such simulations can be managed across widely distributed computing resources. 7 Here, a method for managing, publishing, and sharing the trajectories and analysis data using iBIOMES Lite is presented. Using iBIOMES Lite to manage a laboratory’s simulations produces a web-based library accessible to local and, if desired, remote users. It requires no system level configuration or services if shared locally, and only HTTP access to the library to enable public sharing. The library itself is portable. It can be generated on one system and moved to another (e.g. generated on a remote supercomputer and moved to a local resource). Because the HTML pages generated are static, backup and restore strategies are also simple. No additional software or resource configuration is required to maintain the library. However, the library is not as feature-rich as a database that provides search and data entry functions. Other tools and databases exist for managing molecular dynamics simulation data, such as DynOmics, 47 iBIOMES, 17 Bookshelf, 48 BIGNASim, 49 Sidekick, 50 WholeCellSimDB 51 and BioSimGrid. 52 Each of these provides functionalities not available in an iBIOMES Lite library, but maintaining or altering these functionalities may require a significant investment of time or technology. The TMB Library has no such requirements. As presented here, an iBIOMES Lite library enables minimally trained lab assistants to validate computations, novice researchers to validate simulations, and advanced users or lab managers to curate the laboratory’s simulation studies.
We believe such libraries benefit not only individual labs but also the broader research community without significant investment of laboratory time or resources. Such a library makes all simulation data accessible for validation and secondary use by others. The TMB Library demonstrates that comparative studies of mononucleosomes on the nanosecond to microsecond time scale can be achieved with widely available computing resources and shared as a resource for the community. Even a desktop workstation can now compute over 10 ns of nucleosome dynamics in a single day. Atomic details of the variations in the stacking and pairing of bases that affect the binding of proteins and drugs to DNA can be observed on this time scale, but not breathing or unraveling. From this point of view, comparative molecular modeling and dynamics is an ideal complement to Next Generation Sequencing and nucleosome positioning techniques that can rapidly home in on a limited region of interest of any genome. 12,53 Coupled with advanced sampling protocols, simulation and analysis workflows, and access to medium scale supercomputing resources, comparative modeling can be integrated into bioinformatics pipelines to directly probe structure-function relationships in atomic detail. The data in the TMB Library serves as a validation suite for the development of more efficient sampling protocols, and our genome dashboard, G-Dash, 54 provides a missing link to bioinformatics workflows that can be used to inform or to leverage the modeling studies in such workflows.

Biology: Our simulations of vastly different sequences of DNA docked onto a histone octamer suggest a novel view of nucleosomal DNA. Even within the limitations of our modeling approach, namely inaccuracies due to sampling and force field deficiencies, it is clear that nucleosomal DNA is remarkably similar to DNA free in solution. By this we mean that the variations in structure required for superhelix formation are well within the range of conformations associated with thermal fluctuations of free DNA. Even though nucleosomal DNA is restricted to a characteristic superhelix conformation, the DNA helical parameters, a measure of structure at the level of individual base pairs, indicate that DNA is not highly deformed or tightly attached to the histone core. A kinking mechanism that is dynamic on the 1–10 ns time scale and that is not restricted to specific locations along the superhelix reconciles this apparent contradiction.


Such kinks provide a stress release mechanism as proposed by Kornberg and Lorch 41 and explored by Mukherjee and Bishop. 8 The meta-analysis presented here indicates that DNA sequence complexity plays an important role in the stability of nucleosomal DNA, that mispositioning of the 601 sequence can be identified in simulations by simple RMSD analysis of the DNA, and that the dynamics of nucleosomal DNA can be significantly affected by even a single point mutation to the histones. All of these results can be observed with relatively short simulations (10–100 ns). Refinement of these techniques will enable comparative studies of the effects of nucleosome positioning and mispositioning on the structure and dynamics of DNA and histones to be introduced into informatics pipelines. We conclude that the histone core is not a rigid three-dimensional object to which DNA must passively conform. Rather, the material properties of DNA, encoded in its sequence, affect the conformation and dynamics of nucleosomal DNA and even the structure and dynamics of the histone octamer. The dynamic response of nucleosome structure to DNA sequence is one reason that rules for sequence-based nucleosome positioning, beyond the gross rules proposed by Widom, 55 have proven so elusive. The results presented here affect our understanding of protein-DNA and drug-DNA interactions. It matters not only that the DNA to which a protein or drug binds is part of a nucleosome but also exactly where that DNA is positioned on the nucleosome.

Acknowledgement

The TMB library web server is part of the Bioinformatics, Biostatistics, and Computational Biology Core at Louisiana Tech University supported in part by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant numbers 5-P20-GM103424-15 and 3P20-GM103424-15S1 for the Louisiana Biomedical Research Network. Sun, Li, and Bishop received support from the NSF and the Louisiana Board of Regents through cooperative agreement OIA-1541079. We thank the reviewers for their comments and suggestions.

Supporting Information Available

All simulation and analysis data is available in a web browser view of the TMB Library at http://dna.engr.latech.edu/~tmbshare/ and in a file browser view of the TMB Library at http://dna.engr.latech.edu/~tmbshare/sims/. The file browser address can also be used to directly access all data with tools such as Python’s urllib.request functions or command-line wget.

References

(1) Zhou, K.; Gaullier, G.; Luger, K. Nucleosome Structure and Dynamics are Coming of Age. Nat. Struct. Mol. Biol. 2019, 26, 3–13.

(2) Tan, S.; Davey, C. A. Nucleosome Structural Studies. Curr. Opin. Struct. Biol. 2011, 21, 128–136.

(3) Beveridge, D. L.; Barreiro, G.; Byun, K. S.; Case, D. A.; Cheatham, T. E.; Dixit, S. B.; Giudice, E.; Lankas, F.; Lavery, R.; Maddocks, J. H.; Osman, R.; Seibert, E.; Sklenar, H.; Stoll, G.; Thayer, K. M.; Varnai, P.; Young, M. A. Molecular Dynamics Simulations of the 136 Unique Tetranucleotide Sequences of DNA Oligonucleotides. I. Research Design and Results on d(CpG) Steps. Biophys. J. 2004, 87, 3799–3813.

(4) Dixit, S. B.; Beveridge, D. L.; Case, D. A.; Cheatham, T. E.; Giudice, E.; Lankas, F.; Lavery, R.; Maddocks, J. H.; Osman, R.; Sklenar, H.; Thayer, K. M.; Varnai, P. Molecular Dynamics Simulations of the 136 Unique Tetranucleotide Sequences of DNA Oligonucleotides. II: Sequence Context Effects on the Dynamical Structures of the 10 Unique Dinucleotide Steps. Biophys. J. 2005, 89, 3721–3740.


(5) Lavery, R.; Zakrzewska, K.; Beveridge, D.; Bishop, T. C.; Case, D. A.; Cheatham, T.; Dixit, S.; Jayaram, B.; Lankas, F.; Laughton, C.; Maddocks, J. H.; Michon, A.; Osman, R.; Orozco, M.; Perez, A.; Singh, T.; Spackova, N.; Sponer, J. A Systematic Molecular Dynamics Study of Nearest-Neighbor Effects on Base Pair and Base Pair Step Conformations and Fluctuations in B-DNA. Nucleic Acids Res. 2010, 38, 299–313.

(6) Pasi, M.; Maddocks, J. H.; Beveridge, D.; Bishop, T. C.; Case, D. A.; Cheatham, T.; Dans, P. D.; Jayaram, B.; Lankas, F.; Laughton, C.; Mitchell, J.; Osman, R.; Orozco, M.; Pérez, A.; Petkevičiūtė, D.; Spackova, N.; Sponer, J.; Zakrzewska, K.; Lavery, R. µABC: a Systematic Microsecond Molecular Dynamics Study of Tetranucleotide Sequence Effects in B-DNA. Nucleic Acids Res. 2014, 42, 12272–12283.

(7) Smith, J. A.; Romanus, M.; Mantha, P. K.; El Khamra, Y.; Bishop, T. C.; Jha, S. Scalable Online Comparative Genomics of Mononucleosomes: A BigJob. Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery. 2013; p 23.

(8) Mukherjee, R.; Bishop, T. Nucleosomal DNA: Kinked, Not Kinked, or Self-Healing Material? ACS Symp. Ser. 2011, 1082, 69–92.

(9) Winogradoff, D.; Aksimentiev, A. Molecular Mechanism of Spontaneous Nucleosome Unraveling. J. Mol. Biol. 2019, 431, 323–335.

(10) Metzker, M. L. Sequencing Technologies - the Next Generation. Nat. Rev. Genet. 2010, 11, 31–46.

(11) Belton, J.-M.; McCord, R. P.; Gibcus, J. H.; Naumova, N.; Zhan, Y.; Dekker, J. Hi-C: a Comprehensive Technique to Capture the Conformation of Genomes. Methods (San Diego, Calif.) 2012, 58, 268–276.

(12) Chereji, R. V.; Ramachandran, S.; Bryson, T. D.; Henikoff, S. Precise Genome-Wide Mapping of Single Nucleosomes and Linkers in vivo. Genome Biol. 2018, 19, 19.

(13) Thibault, J. C.; Cheatham, T. E.; Facelli, J. C. iBIOMES Lite: Summarizing Biomolecular Simulation Data in Limited Settings. J. Chem. Inf. Model. 2014, 54, 1810–1819.

(14) Hornak, V.; Abel, R.; Okur, A.; Strockbine, B.; Roitberg, A.; Simmerling, C. Comparison of Multiple Amber Force Fields and Development of Improved Protein Backbone Parameters. Proteins 2006, 65, 712–725.

(15) Zgarbova, M.; Sponer, J.; Otyepka, M.; Cheatham, T. E.; Galindo-Murillo, R.; Jurečka, P. Refinement of the Sugar-Phosphate Backbone Torsion Beta for AMBER Force Fields Improves the Description of Z- and B-DNA. J. Chem. Theory Comput. 2015, 11, 5723–5736.

(16) Maier, J. A.; Martinez, C.; Kasavajhala, K.; Wickstrom, L.; Hauser, K. E.; Simmerling, C. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB. J. Chem. Theory Comput. 2015, 11, 3696–3713.

(17) Thibault, J. C.; Facelli, J. C.; Cheatham, T. E. iBIOMES: Managing and Sharing Biomolecular Simulation Data in a Distributed Environment. J. Chem. Inf. Model. 2013, 53, 726–736.

(18) Thibault, J. C.; Roe, D. R.; Facelli, J. C.; Cheatham, T. E. Data Model, Dictionaries, and Desiderata for Biomolecular Simulation Data Indexing and Sharing. J. Cheminf. 2014, 6, 4.

(19) Davey, C. A.; Sargent, D. F.; Luger, K.; Maeder, A. W.; Richmond, T. J. Solvent Mediated Interactions in the Structure of the Nucleosome Core Particle at 1.9 Å Resolution. J. Mol. Biol. 2002, 319, 1097–1113.

(20) Jiang, C.; Pugh, B. F. Nucleosome Positioning and Gene Regulation: Advances Through Genomics. Nat. Rev. Genet. 2009, 10, 161–172.

(21) Frouws, T. D.; Duda, S. C.; Richmond, T. J. X-ray Structure of the MMTV-A Nucleosome Core. Proc. Natl. Acad. Sci. U. S. A. 2016, 113, 1214–1219.

(22) Flaus, A.; Richmond, T. J. Positioning and Stability of Nucleosomes on MMTV 3’LTR Sequences. J. Mol. Biol. 1998, 275, 427–441.

(23) Lowary, P. T.; Widom, J. New DNA Sequence Rules for High Affinity Binding to Histone Octamer and Sequence-Directed Nucleosome Positioning. J. Mol. Biol. 1998, 276, 19–42.

(24) Vasudevan, D.; Chua, E. Y. D.; Davey, C. A. Crystal Structures of Nucleosome Core Particles Containing the ’601’ Strong Positioning Sequence. J. Mol. Biol. 2010, 403, 1–10.

(25) Tsunaka, Y.; Kajimura, N.; Tate, S.-i.; Morikawa, K. Alteration of the Nucleosomal DNA Path in the Crystal Structure of a Human Nucleosome Core Particle. Nucleic Acids Res. 2005, 33, 3424–3434.

(26) Vijayalakshmi, M.; Shivashankar, G. V.; Sowdhamini, R. Simulations of SIN Mutations and Histone Variants in Human Nucleosomes Reveal Altered Protein-DNA and Core Histone Interactions. J. Biomol. Struct. Dyn. 2007, 25, 207–218.

(27) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. Comparison of Simple Potential Functions for Simulating Liquid Water. J. Chem. Phys. 1983, 79, 926–935.

(28) Berendsen, H.; Grigera, J.; Straatsma, T. The Missing Term in Effective Pair Potentials. J. Phys. Chem. 1987, 91, 6269–6271.

(29) Case, D.; Ben-Shalom, I.; Brozell, S.; Cerutti, D.; Cheatham, T. E., III; Cruzeiro, V.; Darden, T.; Duke, R.; Ghoreishi, D.; Gilson, M.; Gohlke, H.; Goetz, A.; Greene, D.; Harris, R.; Homeyer, N.; Izadi, S.; Kovalenko, A.; Kurtzman, T.; Lee, T.; LeGrand, S.; Li, P.; Lin, C.; Liu, J.; Luchko, T.; Luo, R.; Mermelstein, D.; Merz, K.; Miao, Y.; Monard, G.; Nguyen, C.; Nguyen, H.; Omelyan, I.; Onufriev, A.; Pan, F.; Qi, R.; Roe, D.; Roitberg, A.; Sagui, C.; Schott-Verdugo, S.; Shen, J.; Simmerling, C.; Smith, J.; Salomon-Ferrer, R.; Swails, J.; Walker, R.; Wang, J.; Wei, H.; Wolf, R.; Wu, X.; Xiao, L.; York, D.; Kollman, P. AMBER 2018. University of California, San Francisco, 2018.

(30) Phillips, J. C.; Braun, R.; Wang, W.; Gumbart, J.; Tajkhorshid, E.; Villa, E.; Chipot, C.; Skeel, R. D.; Kalé, L.; Schulten, K. Scalable Molecular Dynamics with NAMD. J. Comput. Chem. 2005, 26, 1781–1802.

(31) Darden, T.; York, D.; Pedersen, L. Particle mesh Ewald: An N·log(N) Method for Ewald Sums in Large Systems. J. Chem. Phys. 1993, 98, 10089–10092.

(32) Berendsen, H. J. C.; Postma, J. P. M.; van Gunsteren, W. F.; DiNola, A.; Haak, J. R. Molecular Dynamics with Coupling to an External Bath. J. Chem. Phys. 1984, 81, 3684–3690.

(33) Huenenberger, P. H. Advanced Computer Simulation; Springer Berlin Heidelberg, 2005; pp 105–149.

(34) Mukherjee, R.; Thota, A.; Fujioka, H.; Bishop, T. C.; Jha, S. Running Many Molecular Dynamics Simulations on Many Supercomputers. Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment on Bridging from the eXtreme to the campus and beyond - XSEDE '12. 2012.

(43) Bishop, T. C. Geometry of the Nucleosomal DNA Superhelix. Biophys. J. 2008, 95, 1007–1017.

(35) Biswas, M.; Langowski, J.; Bishop, T. C. Atomistic Simulations of Nucleosomes. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2013, 3, 378–392.

(44) Langowski, J.; Heermann, D. W. Computational Modeling of the Chromatin Fiber. Semin. Cell Dev. Biol. 2007, 18, 659–667. (45) Leuba, S. H.; Yang, G.; Robert, C.; Samori, B.; van Holde, K.; Zlatanova, J.; Bustamante, C. Three-Dimensional Structure of Extended Chromatin Fibers as Revealed by Tapping-mode Scanning Force Microscopy. Proc. Natl. Acad. Sci. U. S. A. 1994, 91, 11621–11625.

(36) Dickerson, R. E. Definitions and Nomenclature of Nucleic Acid Structure Components. Nucleic Acids Res. 1989, 17, 1797– 1803. (37) Olson, W. K.; Bansal, M.; Burley, S. K.; Dickerson, R. E.; Gerstein, M.; Harvey, S. C.; Heinemann, U.; Lu, X. J.; Neidle, S.; Shakked, Z.; Sklenar, H.; Suzuki, M.; Tung, C. S.; Westhof, E.; Wolberger, C.; Berman, H. M. A Standard Reference Frame for the Description of Nucleic Acid Base-Pair Geometry. J. Mol. Biol. 2001, 313, 229–237.

(46) Leuba, S. H.; Karymov, M. A.; Tomschik, M.; Ramjit, R.; Smith, P.; Zlatanova, J. Assembly of Single Chromatin Fibers Depends on the Tension in the DNA Molecule: Magnetic Tweezers Study. Proc. Natl. Acad. Sci. U. S. A. 2003, 100, 495–500.

(38) Lu, X.-J.; Olson, W. K. 3DNA: a Software Package for the Analysis, Rebuilding and Visualization of Three-Dimensional Nucleic Acid Structures. Nucleic Acids Res. 2003, 31, 5108–5121.

(47) Li, H.; Chang, Y.-Y.; Lee, J. Y.; Bahar, I.; Yang, L.-W. DynOmics: Dynamics of Structural Proteome and Beyond. Nucleic Acids Res. 2017, 45, W374–W380.

(48) Vohra, S.; Hall, B. A.; Holdbrook, D. A.; Khalid, S.; Biggin, P. C. Bookshelf: a Simple Curation System for the Storage of Biomolecular Simulation Data. Database 2010, 2010, baq033.

(39) El Hassan, M. A.; Calladine, C. R. The Assessment of the Geometry of Dinucleotide Steps in Double-Helical DNA; a New Local Calculation Scheme. J. Mol. Biol. 1995, 251, 648–664.

(49) Hospital, A.; Andrio, P.; Cugnasco, C.; Codo, L.; Becerra, Y.; Dans, P. D.; Battistini, F.; Torres, J.; Goñi, R.; Orozco, M.; Gelpí, J. L. BIGNASim: a NoSQL Database Structure and Analysis Portal for Nucleic Acids Simulation Data. Nucleic Acids Res. 2016, 44, D272–D278.

(40) Koziol, Q.; Robinson, D. HDF5. [Computer Software] https://bitbucket.hdfgroup.org/scm/hdffv/hdf5.git., 2018; https://doi.org/10.11578/dc.20180330.1.

(41) Kornberg, R. D.; Lorch, Y. Twenty-Five Years of the Nucleosome, Fundamental Particle of the Eukaryote Chromosome. Cell 1999, 98, 285–294.

(50) Hall, B. A.; Halim, K. B. A.; Buyan, A.; Emmanouil, B.; Sansom, M. S. P. Sidekick for Membrane Simulations: Automated Ensemble Molecular Dynamics Simulations of Transmembrane Helices. J. Chem. Theory Comput. 2014, 10, 2165–2175.

(42) Ponomarev, S. Y.; Putkaradze, V.; Bishop, T. C. Relaxation Dynamics of Nucleosomal DNA. Phys. Chem. Chem. Phys. 2009, 11, 10633–10643.


(51) Karr, J. R.; Phillips, N. C.; Covert, M. W. WholeCellSimDB: a Hybrid Relational/HDF Database for Whole-cell Model Predictions. Database 2014, 2014.

(52) Tai, K.; Murdock, S.; Wu, B.; Ng, M. H.; Johnston, S.; Fangohr, H.; Cox, S. J.; Jeffreys, P.; Essex, J. W.; Sansom, M. S. P. BioSimGrid: Towards a Worldwide Repository for Biomolecular Simulations. Org. Biomol. Chem. 2004, 2, 3219–3221.

(53) Flaus, A.; Richmond, T. J. Base-pair Resolution Mapping of Nucleosome Positions Using Site-directed Hydroxy Radicals. Methods Enzymol. 1999, 304, 251–263.

(54) Li, Z.; Sun, R.; Bishop, T. C. G-Dash: A Genome Dashboard Integrating Modeling and Informatics. bioRxiv 2018.

(55) Widom, J. Role of DNA Sequence in Nucleosome Stability and Dynamics. Q. Rev. Biophys. 2001, 34, 269–324.


Graphical TOC Entry
