RNA Folding Dynamics: Computer Simulations by a Genetic Algorithm

Nov 26, 1997 - Genetic algorithms exploit principles of natural evolution for optimization procedures. It is shown that RNA folding pathways may be si...
4 downloads 9 Views 2MB Size
Chapter 14

RNA Folding Dynamics: Computer Simulations by a Genetic Algorithm Downloaded by UNIV OF CALIFORNIA SAN DIEGO on March 30, 2016 | http://pubs.acs.org Publication Date: November 26, 1997 | doi: 10.1021/bk-1998-0682.ch014

1,2

2

1

A . P. Gultyaev , F. H. D. van Batenburg , and C. W. A. Pleij 1

Leiden Institute of Chemistry, Department of Biochemistry, Leiden University, P.O. Box 9502, 2300 RA Leiden, Netherlands Theoretical Biology Section, Institute of Evolutionary and Ecological Sciences, Leiden University, Kaiserstraat 63, 2311 GP Leiden, Netherlands

2

Genetic algorithms exploit principles of natural evolution for optimization procedures. It is shown that RNA folding pathways may be simulated by a genetic algorithm. In addition to improvements of RNA structure predictions, an implementation of genetic algorithm allows one to simulate essential features of RNA dynamics, like e.g. the folding of a molecule during its synthesis, pseudoknot formation, functional metastable structures and conformational transitions. Thus it is shown that such an approach can also serve as a tool to study RNA dynamics.

Physical-Chemical Features of RNA Folding Process RNA Folding Dynamics. For more than 30 years, it is known that the nucleotide sequence is not the only factor to determine R N A structure formation. In 1966, alternative conformations were revealed for a tRNA molecule, yeast tRNA ^ (i). Furthermore, it was shown that the alternative non-functional structure was not in thermodynamic equilibrium with the more stable functional folding, i.e. the RNA was captured in a metastable state during the folding process. Thus at physiological conditions the energy barrier between the metastable structure and the stable one resulted in a very long time (2 days) of transition to the energy minimum. Changes in temperature and ionic strength lowered the barrier so that the transition occurred faster (2 minutes). The conclusion was made that such a transition was not a simple superposition of tertiary structure on an already formed secondary structure, but rather some refolding that required disruption of a misfolded conformation (7). Actually this means that the process of the functional structure formation may depend on a sequence of folding events and on their kinetics. In other words, some folding pathway seems to guide a molecule to the native state through a conformational space which may have an astronomic number of structures. This problem is extensively studied for protein folding, but similarities in folding dynamics of R N A and proteins are only starting to be recognized (2-4). A n interest in RNA 1

© 1998 American Chemical Society

Leontis and SantaLucia; Molecular Modeling of Nucleic Acids ACS Symposium Series; American Chemical Society: Washington, DC, 1997.

229

MOLECULAR MODELING OF NUCLEIC ACIDS

230

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on March 30, 2016 | http://pubs.acs.org Publication Date: November 26, 1997 | doi: 10.1021/bk-1998-0682.ch014

folding kinetics has been revived recently, mostly because of data on kinetics of tertiary structure formation in catalytic RNAs (4-10). The functional activity of ribozymes turned out to be dependent on specific structures which are formed stepwise with some steps relatively slow (on the timescale of minutes), and the rates of intermediate transitions can be measured by various methods. Kinetics of RNA Secondary Structure Formation. At the level of secondary structure, RNA folding also occurs through intermediate conformations. Even in such a short molecule as tRNA (about 75 nucleotides), intermediate stages of the folding pathway can be detected and kinetic parameters for some separate steps were measured by N M R experiments with E.coli tRNA™* ( 11). In this case folding is rather fast, with relaxation times in the range between a few microseconds and ten milliseconds. Therefore it can be concluded that the folding and thermal unfolding are reversible processes so that the final structure is the lowest energy state which is reached by a relatively fast pathway. In general, there is an hierarchy in the R N A folding process with initial formation of secondary structure followed by the folding of the less stable tertiary structure (77,72). However, in relatively long RNAs the folding may be more complicated. For example, a pseudoknotted mRNA fragment of 127 nucleotides from the regulatory region in a ribosomal protein operon was shown to fold via a pathway where a simple model of sequential formation of secondary and tertiary structures does not hold (73). Furthermore, RNA folding is not necessarily a reversible process even at the step of secondary structure formation and dynamic factors are very important, especially for large molecules. Functional secondary structures of 16S-like ribosomal RNAs (more than 700 nucleotides) deviate from the lowest energy states, predicted by current methods for minimizing the free energy of secondary structure. In many cases the functional folding and the minimum energy structure share less than 50% of their helices (14). Although a correlation between such a deviation and RNA length is not straightforward, a contribution of kinetic factors seems to increase with the length, whereas relatively small domains still fold into structures of minimal free energy (75). There is some hierarchy in the formation of RNA secondary structure. A rate of single helix (stem) formation depends mostly on the energy barrier of a loop which is formed while the reverse reaction of stem melting is determined by its stacking energy (16-18): kfo,din -exp(-AG /RT) g

loop

^ddng-expiAG^/RT). This approximation assumes that the rate-limiting step for stem folding is the formation of the first basepair, followed by rather fast "zippering" of the stem by formation of adjacent pairs. The destabilizing loop contribution has mainly entropie character and increases with loop size. Thus, short-range hairpins are formed much faster than long-range interactions and the formation of some of the long-range pairings would require the disruption of the previously folded structures. Even if these structures are not the most stable ones, the disruption may be rather slow, being mostly determined by stacking energies, often with rather high values. Furthermore, if structural rearrangement requires disruption of several helices, the effective barrier is the sum of stacking values and may result in a very long time of disruption of all stems. This effect is seen in deviations from the lowest energy states in large domains of rRNAs, in contrast to small domains in the same molecules, which have rather low free energies ( 75). The analysis of the favourable structures of long mRNA transcripts

Leontis and SantaLucia; Molecular Modeling of Nucleic Acids ACS Symposium Series; American Chemical Society: Washington, DC, 1997.

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on March 30, 2016 | http://pubs.acs.org Publication Date: November 26, 1997 | doi: 10.1021/bk-1998-0682.ch014

14.

GULTYAEV ETAL.

RNA Folding Dynamics

231

also indicates that they seem to fold sequentially, keeping the stable hairpins and rearranging the long-range stems (79). Therefore initially formed hairpins guide a subsequent folding so that the final structure may strongly depend on the first nucleation steps. A change in conditions of the RNA folding process, e.g. in renaturation experiments, may capture many RNA molecules in misfolded conformations ("alternative conformer hell") for a long time (20), A very important factor in RNA folding is the transcription rate. The folding process starts during the synthesis, when R N A transcription is not yet completed. Therefore some structures may be folded that are favourable only at an intermediate RNA length. In some cases they are disrupted when more stable structures become possible upon subsequent RNA elongation. However, many RNAs are trapped in such conformations for a considerable time. Furthermore, these metastable structures may be functional and sometimes they represent the more active conformation in contrast to the less active final folding. One of the best studied examples is represented by viroid RNAs which require metastable hairpins for efficient replication. Multibranched hairpin-containing conformations of a viroid R N A are formed during transcription and can be detected right after R N A synthesis (27). Although at physiological conditions viroid RNAs are folded into very stable rod-like structures, the alternative metastable hairpins were shown to be very essential for replication intermediates (22). A similar situation was observed for SV11 (23), a small R N A species (115 nucleotides), efficiently replicated into a metastable hairpin-rich conformation by a phage replicase while the final folding into a single stable stemloop structure was inactive (see also below). Monte Carlo simulations of the SV11 folding kinetics indicate that the metastable structure formation depends on the folding during the RNA synthesis, because this structure is predicted for the growing chain, but the stable conformation can be yielded when folding is simulated for the entire molecule (24). It seems that an accurate formation of many functional RNA structures require a coordination between the rates of synthesis and folding. This is well illustrated by experiments with molecules synthesized by T7 RNA polymerase, which transcribes RNA at a high speed: about 250 nucleotides per second in vitro compared with 50 for E.coli RNA polymerase (25). Such a high transcription rate might capture an RNA in a metastable structure if the molecule did not have enough time to refold during transcription. This is observed for various RNAs. For example, misfolded conformations were suggested for T7 transcripts of Tetrahymena pre-rRNA, which were relatively slowly processed (6). The rate of splicing could be raised by a thermal denaturation/renaturation treatment, indicating the presence of non-equilibrated foldings after transcription. Similarly, inactive ribosomal particles were obtained when T7 RNA polymerase was used for rRNA synthesis (26). A n elevated transcription rate was shown to be responsible for the effect, because an almost normal pattern was observed upon slower transcription at a lowered temperature by the same enzyme. Another example is overproduction of non-functional molecules of plasmid R N A primer, when T7 RNA polymerase is used (27). In some cases the rate of R N A folding may play a functional role and be exploited by natural systems. Such kinetic models were proposed for the control of translation by R N A secondary structure (28, 29). In these cases the translation is inhibited by a structure in the mRNA that sequesters the translation initiation region. Therefore the level of protein synthesis is very sensitive to the time of folding which actually regulates translation. For example, a long-distance interaction (LDI) was suggested to repress the synthesis of one of the proteins in phage MS2 (29). However,

Leontis and SantaLucia; Molecular Modeling of Nucleic Acids ACS Symposium Series; American Chemical Society: Washington, DC, 1997.

232

MOLECULAR MODELING OF NUCLEIC ACIDS

changing the thermodynamic stability of the LDI did not alter the repression. On the other hand, the mutations thought to change the folding rate did result in different levels of protein, indicating that the main regulatory factor is the time required to form the final structure rather than the stability of this folding at equilibrium (29). Such effects are not unique for translational control: a metastable structure of ColEl plasmid RNA primer was suggested to regulate the plasmid copy number by a delay in the formation of folding that is sensitive to the antisense RNA inhibition (30).

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on March 30, 2016 | http://pubs.acs.org Publication Date: November 26, 1997 | doi: 10.1021/bk-1998-0682.ch014

Predictions of RNA Structure by Simulation of Folding Pathways Optimal and Suboptimal RNA Structures. Upon implementation of recursive, or dynamic programming, algorithms, the calculation of the secondary structure with the minimal free energy has become possible (31,32). Such algorithms minimize free energy, using known or approximated energy parameters for elementary structural elements (helical stacks and loops). Thanks to the improvements in these parameters (33, 34) and to the progress in hardware development, prediction of the lowest free energy structure is performed in reasonable time even for relatively long R N A sequences and is widely used in laboratory practice. However, it turns out that for a given sequence numerous foldings exist with energy values which are very close to the optimal one. S uch suboptimal solutions can also be computed and finding structures within some vicinity of the energy minimum can partly compensate for uncertainties in the thermodynamic data (35). Nevertheless, kinetic effects in R N A folding may trap a molecule in a metastable structure with an energy that is essentially higher than the global minimum. This suggests that a proper simulation of the folding process may predict the functional structure when it is not in the lowest energy state. Recently, it has been shown that predictions of suboptimal structures can be improved considerably if the hierarchy of RNA folding is mimicked interactively using folding constraints known from experiments or theoretical analysis, e.g. phylogeny (36). Simulations of Folding Pathways. The algorithms for R N A folding simulation usually consider the folding as a stepwise process with elementary steps of formation and disruption of base pairs. The nucleation of a double helical region (stem) is much slower than its extension, therefore folding can be approximated with stems as elementary units. The simplest way is to start simulation from the most stable stem and to continue adding at every step the most stable stem from those compatible with the previously formed structure (37-39). Such an approach simulates some folding pathway, starting mostly from the most favourable short-range pairings followed by long-range interactions. A caveat of these algorithms is that sometimes they trap R N A in a wrong structure because of preference for single, relatively strong, stems instead of more stable combinations of stems that have smaller individual energy contributions. The favourable stem combinations can be considered by using a Monte Carlo procedure, which is less deterministic and allows to evaluate local structures prior to stem addition (38, 40). However, the main advantage of stochastic approaches is the opportunity to implement kinetic features of elementary folding steps (17, 24). As mentioned above, a rate-limiting step for single stem folding is probably the formation of the first basepair, therefore the energy barrier may be approximated by the energy of the loop. In this approximation, the rate of the reverse reaction, i.e. stem melting, is determined by the stacking energy. If elementary folding steps are defined as reactions of formation or disruption of single stems, RNA folding may be simulated by a Monte

Leontis and SantaLucia; Molecular Modeling of Nucleic Acids ACS Symposium Series; American Chemical Society: Washington, DC, 1997.

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on March 30, 2016 | http://pubs.acs.org Publication Date: November 26, 1997 | doi: 10.1021/bk-1998-0682.ch014

14.

GULTYAEV

ETAL.

RNA Folding Dynamics

233

Carlo routine with probabilities of single events proportional to the calculated rates. Such a procedure can predict structures of relatively short RNAs and estimate kinetic ensembles of molecules (77, 24). One of the essential problems in Monte Carlo simulations is a very broad range of rate constants in RNA folding, which vary by several orders of magnitude. This results in a serious technical difficulty of repeated iterations that follow the fastest transitions back and forwards without essential improvement of free energy that may be achieved by a slower reaction. Therefore an accurate simulation would require an enormous computation time. One of the ways to solve the problem is to define groups of structures that are in local equilibrium due to very fast transitions between them, and to represent the folding process as a network of slower reactions between these clusters (41). It remains to be seen how to implement this idea for successful predictions in relatively large RNAs. A promising approach to circumvent a tendency of Monte Carlo simulations to halt in local minima is the procedure of "simulated annealing" (42). Simulations of RNA folding allow one to implement sequential folding during transcription, because a routine usually has some iterations of folding calculations which may be repeated for successively longer transcripts (see e.g. 24, 40, 42, 43). Another important advantage of folding pathway simulations, compared to the search for the lowest energy, is an opportunity to include the formation of RNA pseudoknots, the simplest elements of tertiary structure (39, 40). We have recently proposed to simulate RNA folding pathways by using a socalled genetic algorithm, which is also based on a stochastic procedure (44, 45). Below we describe possible ways of implementation and show how the main features of RNA dynamics can be simulated by the algorithm. It can be suggested that the principles of the genetic algorithm make it one of the most powerful approaches for R N A structure prediction and a very flexible routine to implement the data and ideas about the RNA folding process. Implementation of the Principles of Genetic Algorithm for RNA Folding Simulation Principles of Genetic Algorithms (GA). Genetic algorithms are a class of algorithms that have got their name because they exploit the basic principles of natural genetic evolution for optimization procedures (see e.g. 46, 47). GA's are a very powerful technique, mostly in combinatorial problems that require a lot of computation for an exhaustive search. Details of the algorithms may vary, but basically all of them mimic Darwinian evolution of problem solutions in the computer, dealing with a population of these solutions that are mutated and selected according to some criterion. A n individual solution ("chromosome") is stored as a combination of some elements ("genes"). Chromosomes are reproduced by random changes of genes ("mutations") and recombinations ("crossover") of subsets of genes from parent chromosomes. One of the most simple examples is the population of bit strings with random changes of ones to zeros and vice versa. If a combination of elements in every solution can be evaluated according to some "fitness", mutational operations can be followed by selection of the fittest solutions, and the next iteration of G A starts with a new population. There is a lot of freedom for modifications of the algorithm to improve its effectiveness, e.g. in deviations from an uniform random process, maintaining the diversity of the population etc. In an ideal case, repeated iteration of G A converges all

Leontis and SantaLucia; Molecular Modeling of Nucleic Acids ACS Symposium Series; American Chemical Society: Washington, DC, 1997.

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on March 30, 2016 | http://pubs.acs.org Publication Date: November 26, 1997 | doi: 10.1021/bk-1998-0682.ch014

234

MOLECULAR MODELING OF NUCLEIC ACIDS

solutions to the optimal combination, but the convergence can also result in a suboptimal solution that is relatively good in terms of problem definition. GA's have received much attention in recent years. In chemistry, they can be used for a search in conformational space which very often involves combinations of many parameters. In particular, there were several attempts to use them for protein structure prediction (reviewed in 48). Recently, GA's were suggested to use in three rather different aspects of R N A structure:conformational search for stem-loop structures (49), prediction of optimal and suboptimal secondary structures (50) and simulation of RNA folding pathways (44,45). In the case of RNA folding simulation, a GA is also very attractive because it allows to simulate the process, in addition to obtaining a final solution. GA for RNA Folding. The main principles of GA for simulation of RNA folding (44, 45) are rather simple (Fig. 1). Every solution (structure) can be described as a set of stems. The operation of mutation is then disruption of some of these stems followed by formation of new ones. Crossover is the construction of a new structure that contains stems from parent solutions. The free energy of a given structure is the measure of fitness. Apparently the folding during transcription can be easily included, if structures are initially built for the 5-proximal region with a gradual increase of chain length in subsequent G A iterations. Apparently, the core of the procedure is the mutation operation, because this routine folds new structures, derived from those previously formed. Also, it is clear that this is the part of the algorithm, which allows to implement kinetic effects into the simulation. As mentioned above, disruption and formation of stems may be considered as elementary reactions in multistep R N A folding. Similar to the Monte Carlo procedure, at any given step of simulation a choice between different possible reactions can be made using their probabilities that depend on rate constants. Therefore addition and removal of stems with non-uniform probability distributions can mimic dynamic R N A folding, and different approximations may increase the efficiency of the algorithm. We consider separately some of these approximations, because they represent essential features of RNA dynamics. Formation of Stems. Many algorithms for RNA folding simulation contain a routine that adds a new stem to the previously formed structure. Essentially this is done by calculating a list of stems that are compatible with the structure, together with free energies that could be gained (or lost) if particular stems are formed. The next step, to choose a particular stem, actually determines the differences between algorithms. For example, in the "greedy" algorithm (39) the most energetic stem is formed, in Monte Carlo procedure the selection is done using probabilities that depend on energy gains and/or energy barriers of loops (17, 24, 38, 40). Repeated additions of new stems represent folding steps. We tested different ways to calculate probabilities of such steps. When stem selection followed a uniform probability distribution, that is, at every step the chances were equal for all stems from a list, the free energies of the structures did not improve very much from iteration to iteration. This is explained by the low probability of encountering an improving stem among others. However, a mere preference for the strongest (at a given step) stems did not produce correct structures either: a procedure to add stems with probabilities proportional to individual energy gains usually yielded incorrect long-range interactions that evolved just after the formation of the most stable hairpins.

Leontis and SantaLucia; Molecular Modeling of Nucleic Acids ACS Symposium Series; American Chemical Society: Washington, DC, 1997.

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on March 30, 2016 | http://pubs.acs.org Publication Date: November 26, 1997 | doi: 10.1021/bk-1998-0682.ch014

GULTYAEV ET AL.

RNA Folding Dynamics

Generate the initial population of structures representing unfolded RNA starting from the 5' end ι

Perform a GA simulation I Mutate every structure of the population Remove some stems from a structure

ι Add new stems to the structure

i GA crossover Generate new structure(s) using the stems from parental foldings

ι Sort structures according to fitness (AG) and select new population

ι