Foldability and the Amino Acid Compositions of Exons and Introns

Saul G. Jacchieri
Fundação Antônio Prudente, Rua Prof. Antônio Prudente 211, São Paulo SP 01509-090, Brazil
Received April 25, 2002

Various procedures are employed to relate the structural tendencies of polypeptide chain fragments to amino acid residues that in general have low background frequencies. A numerical evaluation of the content of these amino acids, named amino acid diversity, is defined. Distributions of the amino acid diversity parameter in databases containing exons, introns, and randomized exons show that there is a small difference between exons and shuffled exons, a detectable difference between exons and introns, and a large difference between exons and totally randomized exons. Keywords: foldability • structural motif • amino acid diversity • exons • introns

Introduction

It is an established fact that the overall (also named background) frequencies (ref 1) of amino acid residues in proteins are not similar. As shown in Table 1, the proportions between these frequencies vary from about 0.0142 for tryptophan to 0.0843 for leucine. The causes and consequences of these diverse amino acid frequencies are not completely clarified. We are concerned with how this diversity relates to the interactions that determine protein structure and how it is reflected in the distribution of low-frequency amino acids in protein sequences. Assuming that the diversity of frequencies seen in Table 1 has important implications for the folding of protein structures, this distribution should differ between random protein sequences and the original ones. It should also be possible to find a difference between exons and introns regarding the distribution of the amino acids that have low background frequencies in Table 1. We also want to know whether low-frequency amino acids are uniformly dispersed in protein sequences or whether there are regions where their content is higher.

All these questions are approached with the hypothesis that some peptide fragments behave as folding units, or foldons (ref 2), and that this property is affected by the presence of low-frequency amino acids. The term foldon does not imply that the short peptide fragments in the study exclusively assume certain conformations. There is a gradation of structural tendency, or foldability, and in some cases it may be relatively high, although in no case is it absolute.

In the present work, distinct procedures have been employed to investigate the dependence of foldability on the presence of low-frequency amino acids and its consequences for the distribution of low-frequency amino acids in exons and introns. A theoretical calculation of the foldabilities (ref 3) of all components of a library of tripeptide fragments has been accomplished, and the results have been classified in accordance with the expected frequencies (listed in Table 1) of the amino acids that form each peptide fragment. These same peptide fragments have been searched in a database of protein structures in order to determine their probabilities of formation of structural motifs. Finally, distributions of protein fragments in large sequence databases have been determined.

The discrimination of exons and introns is of great practical importance, and presently there are a number of methodologies (ref 4) designed with this intent. Here, we investigate a fundamental difference based on the assumption that exons contain structural information, whereas in introns this information does not exist or has been progressively lost. As will be shown, although this difference is clearly demonstrated and statistically significant, it is too small to reliably distinguish isolated exons from isolated introns.

Table 1. Proportions (ei) between the Background Frequencies of the Natural Amino Acids (a)

      W       C       M       H       Y       F       Q       N       P       R
ei    0.0142  0.0164  0.0229  0.0230  0.0356  0.0395  0.0374  0.0447  0.0451  0.0498

      D       I       K       T       S       V       G       A       E       L
ei    0.0580  0.0589  0.0599  0.0582  0.0587  0.0705  0.0755  0.0820  0.0833  0.0843

(a) The following relations apply: frequency = ei × (database size); Σ ei = 1. Adapted from ref 1.

10.1021/pr0255295 CCC: $22.00 © 2002 American Chemical Society

Journal of Proteome Research 2002, 1, 515-519. Published on Web 09/21/2002

Theoretical Basis

The calculations presently described are more related to fragments of protein structure than to whole structures, since we want to identify protein fragments that behave as foldons, to relate this property to the presence of low-frequency amino acids, and to find a difference between exons, introns, and random sequences that is based on these principles.

Evaluation of the Content of Low-Frequency Amino Acids. To evaluate quantitatively the participation of these amino acids in the composition of peptide fragments, we have defined the amino acid diversity parameter

$$\delta_a = \log\left[\, l \Big/ \prod_{i=1}^{l} e_i \right] \qquad (1)$$

where l is the fragment length (or window size) and ei are the factors listed in Table 1. δa is uniquely related to the amino acid composition of each peptide fragment and is an increasing function of the content of low-frequency amino acids.

Calculation of Foldabilities. Peptide chains and, more generally, polymer chains may have a tendency to predominantly adopt a certain chain conformation. This property, namely foldability, has been studied in detail (ref 3) for protein chains. Although short peptide chains are not expected to have strong foldabilities, the present calculations have shown that for a few peptide fragments the foldability may be relatively high. The following statistical mechanics definition of foldability is adopted

$$F_c = \frac{1}{\sum_{i>0} \dfrac{\eta_i}{\eta_0}\, \exp\left(-(E_i - E_0)/RT\right)} \qquad (2)$$

where the Ei's are the energies of conformational states, ηi is the density of states corresponding to Ei, and Fc is the foldability. Fc is an evaluation of the predominance of a certain chain conformation or group of conformations. It is a theoretically obtained physical chemical parameter that is not based on the biological properties of peptide chains. To evaluate eq 2, it is necessary to know the energies Ei of conformational states. In the present work, we make use of force field (ref 5) calculations and a conformational analysis algorithm (ref 6) to obtain the Ei's. As will be shown, the observation that a certain foldability is high, or low, is easily explained, since the energy distribution of chain conformations is known from the beginning.

It does not suffice to know the foldabilities of a few peptide fragments. To investigate the dependency of Fc on δa, it is necessary to calculate the foldabilities of all components of a peptide library. The calculation of δa is easily accomplished for every library component, since we only need to know its sequence. Thus, the usefulness of eq 2 in this work is that it may be used to establish a relation between the foldability Fc and the amino acid diversity δa. It also allows an explanation of the causes of this relation. The disadvantage is that although the conclusions obtained with eq 2 are applied to protein fragments, they are directly related to peptide chains.

Knowledge-Based Structural Propensities. There is another parameter that does not have the same disadvantage. We have previously obtained (refs 7 and 8) the knowledge-based probabilities Pχ of adoption of the structural motif χ by protein fragments. The Pχ's are calculated with a database (ref 9) of protein sequences and structures and are closely related to biology, since they are based on structural data about protein chains.

Distribution of the δa Parameter in Polypeptide Sequences. Comparisons between plots of Fc × δa and Pχ × δa enable an interpretation of the role played by low background frequency amino acids in protein structures. This role is also revealed in the distribution of δa in protein sequences. Let f(δa) be the frequency, in a sequence database, of peptide fragments corresponding to δa. f(δa) is evaluated simply by counting the δa's corresponding to the fragments 1...l, 2...l + 1, ..., where l is the window size. f(δa) × δa plots have been obtained for exon and intron sequence databases and for databases in which these sequences have been randomized. δa may also be considered a positional function. The variation of δa along a protein sequence indicates the regions where there is a high content of amino acids corresponding to low overall frequencies. Following the argument described below, these regions should have high foldabilities.

Randomization of Exon Sequences. Two procedures have been employed to randomize polypeptide sequences. Considering that a1a2a3a4a5... is the original sequence, this sequence is shuffled by making ai ↔ aj exchanges. The shuffled and the original sequences have, of course, exactly the same amino acid composition, and a shuffled database of exon sequences agrees with Table 1. A different procedure is employed to obtain totally randomized sequences. Each amino acid in the original sequence is randomly replaced by one of the natural amino acids. The resulting sequence is not related to the original one, and a randomized database is not in accordance with Table 1.
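The two randomization procedures can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the exchanges are organized as a Fisher-Yates pass (the text only says that ai ↔ aj exchanges are made), and the totally randomized variant draws each replacement uniformly from the 20 natural amino acids, which the text does not state explicitly.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 natural amino acids


def shuffle_sequence(seq, rng):
    """Shuffle a sequence through a_i <-> a_j exchanges (Fisher-Yates pass).

    The amino acid composition of the result equals that of the input.
    """
    residues = list(seq)
    for i in range(len(residues) - 1, 0, -1):
        j = rng.randrange(i + 1)  # pick a partner position in 0..i
        residues[i], residues[j] = residues[j], residues[i]
    return "".join(residues)


def randomize_sequence(seq, rng):
    """Replace every residue by a randomly drawn amino acid.

    Assumption: the draw is uniform over the 20 amino acids, so the
    composition no longer follows the background proportions.
    """
    return "".join(rng.choice(AMINO_ACIDS) for _ in seq)


rng = random.Random(2002)          # fixed seed, only for reproducibility
original = "MKWLCYPWCGAELLV"       # hypothetical toy sequence
shuffled = shuffle_sequence(original, rng)
randomized = randomize_sequence(original, rng)
print(Counter(shuffled) == Counter(original))  # composition is preserved
```

Comparing `Counter(shuffled)` with `Counter(original)` makes the difference between the two procedures explicit: only the shuffled sequence is guaranteed to keep the original composition.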

Journal of Proteome Research • Vol. 1, No. 6, 2002

Results and Discussion

A comparison between plots representing the foldability parameter versus the amino acid diversity (Fc × δa) and the structural motif probability versus the amino acid diversity (Pχ × δa) is shown in Figure 1. An important feature of Pχ × δa plots is that, for a given value of δa, Pχ varies within the interval Pχmin ≤ Pχ ≤ Pχmax, and on the right-hand side of these plots, the limits Pχmin and Pχmax tend to increase with δa. The increasing tendency of Pχ with δa is even clearer for reverse turns I, II, and IV, as shown in Figure 2. A similar conclusion is obtained by examining the Fc × δa plot, although the increasing tendency of Fcmax (Fcmin ≤ Fc ≤ Fcmax) is verified for δa values below 7.7, and there is in Figure 1 one point among 20³ (δa = 6.4, Fc = 0.36) that disagrees with the general conclusion. It is concluded that the plots depicted in Figures 1 and 2 show that foldabilities and probabilities of adoption of structural motifs have a tendency to increase with δa.

Figure 1. (a) Foldability (Fc) versus amino acid diversity (δa) plot representing the tripeptide fragment library NH2G-X1X2X3-GCOOH (Xi = W, C, M...); (b) α-helix probability (Pα) versus amino acid diversity (δa) plot representing the X1X2X3 (Xi = W, C, M...) library of tripeptide chain fragments; (c) β-sheet probability (Pβ) versus amino acid diversity (δa) plot representing the same library. δa and Fc are defined in eqs 1 and 2. Pα and Pβ are defined in ref 7. In (b) and (c), 2.25 has been added to the δa scale.

* E-mail: [email protected].

As already stated, Fc, which is derived from physical-chemical and statistical mechanics principles, does not depend on biological facts, whereas Pχ is the result of data mining (refs 7 and 8) in protein sequences and structures and is, therefore, closely related to biology. The point that we wish to stress with the comparison shown in Figure 1 is that the theoretical principles employed in the calculation of Fc are important for an understanding of the dependence of Pχ on δa and, consequently, of how δa is distributed in protein sequences. The results are illustrated for the NH2G-LCY-GCOOH and NH2G-PWC-GCOOH tripeptide fragments that have, respectively, low (Fc = 0.0072) and high (Fc = 0.283) foldabilities, as well as low (Pα = 0.22, Pβ = 0.33) and high (Pα = 0.25, Preverse turn VIII = 0.50) Pχ probabilities.

It is seen in Figure 3 that in the energy distribution of NH2G-LCY-GCOOH the energies of conformational states are closely spaced and that in the energy distribution of NH2G-PWC-GCOOH there is a large gap between the lowest energy states and the next states. This is in agreement with proposals intended (ref 10) to explain structural tendencies in the α-helix-random coil transition and the acquisition (refs 11 and 12) of structure by polypeptide chains. These energy distributions have been generated for all 20³ components of the NH2G-X1X2X3-GCOOH (Xi = W, C, M...) library, and the sole explanation for the dependence of Fc on δa is based on them. These are properties of short peptide fragments, and indeed, only short-range interactions are taken into account in the calculation of Fc.

Presumably, peptide fragments corresponding to a high δa are located along protein sequences in places where they contribute to the protein-folding process due to their higher foldabilities. Independent evidence (ref 13) in support of this assumption will be discussed below. In that case, there should exist a detectable difference between shuffled protein sequences and the original ones regarding the distribution of δa. This difference should exist even between introns and exons. A comparison of the variation of δa along polypeptide sequences is shown in Figure 4 for original and shuffled exon sequences. Although there is a clear difference between the plots corresponding to original and shuffled sequences, detecting a pattern in this difference requires the analysis of many thousands of sequences. This is achieved with the frequency of δa occurrences (f(δa)).
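The amino acid diversity of eq 1, and its use as a positional function along a sequence, can be sketched as follows. The function names are hypothetical, and the base of the logarithm is an assumption: the text does not state it, and base 10 is used here only because the 2.25 offset quoted in the captions of Figures 1 and 2 is close to −2·log10(0.0755) ≈ 2.24, the contribution of the two flanking glycines. The ei values are those listed in Table 1.

```python
import math

# Background proportions e_i, as listed in Table 1 (adapted from ref 1).
E = dict(zip(
    "WCMHYFQNPRDIKTSVGAEL",
    [0.0142, 0.0164, 0.0229, 0.0230, 0.0356, 0.0395, 0.0374, 0.0447,
     0.0451, 0.0498, 0.0580, 0.0589, 0.0599, 0.0582, 0.0587, 0.0705,
     0.0755, 0.0820, 0.0833, 0.0843],
))


def delta_a(fragment, log=math.log10):
    """Eq 1 as read here: delta_a = log(l / prod_i e_i) for a fragment of length l.

    Assumption: base-10 logarithm; pass log=math.log for natural logarithms.
    The product is accumulated as a sum of logarithms to avoid underflow.
    """
    l = len(fragment)
    return log(l) - sum(log(E[aa]) for aa in fragment)


def delta_profile(seq, window=9):
    """delta_a of the fragments 1..l, 2..l+1, ... (sliding window of size l)."""
    return [delta_a(seq[k:k + window]) for k in range(len(seq) - window + 1)]


# Fragments rich in low-frequency residues receive larger values.
print(delta_a("PWC") > delta_a("LCY"))
print(len(delta_profile("MKWLCYPWCGAELLV", window=9)))
```

Because eq 1 depends only on composition, any permutation of a fragment gives the same δa. Shuffling a whole sequence can therefore change the positional profile while leaving the overall composition intact.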


Figure 2. Structural motif probability (Pχ) versus amino acid diversity (δa) plots representing the X1X2X3 (Xi = W, C, M...) library of tripeptide fragments formed by the 20 natural amino acids. Structural motifs: (a) reverse turn I, (b) reverse turn II, (c) reverse turn IV. Pχ, obtained by data mining in protein sequences and structures, is defined in ref 7. δa is defined in eq 1. 2.25 has been added to the δa scale to make it comparable to the plots in Figure 1.

Figure 3. Energy distributions of two tripeptide fragments: (a) low-foldability fragment NH2G-LCY-GCOOH, δa = 6.6, Fc = 0.0072, α-helix probability Pα = 0.22, β-sheet probability Pβ = 0.33; (b) high-foldability fragment NH2G-PWC-GCOOH, δa = 7.7, Fc = 0.28, α-helix probability Pα = 0.25, reverse turn VIII probability Pχ = 0.5. δa, Fc, and Pχ are, respectively, defined in eqs 1 and 2 and in ref 7. The energy was rescaled to make the lowest energy coincide with 0 cal/mol.
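The contrast between closely spaced and gapped energy distributions can be made concrete with a small numerical sketch of eq 2. Several things here are assumptions rather than details from the study: the function name, the temperature of 298.15 K, the reading of index 0 as the lowest-energy state, and unit densities of states when none are given. The two toy spectra (in cal/mol) are invented for illustration and are not the calculated distributions of Figure 3.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol K)


def foldability(energies, densities=None, temperature=298.15):
    """Eq 2 as read here: F_c = 1 / sum_{i>0} (eta_i/eta_0) exp(-(E_i - E_0)/RT).

    Assumptions: state 0 is the lowest-energy state, energies are in cal/mol,
    and every density of states eta_i defaults to 1 when not supplied.
    """
    if densities is None:
        densities = [1.0] * len(energies)
    if len(energies) < 2 or len(energies) != len(densities):
        raise ValueError("need at least two states and matching densities")
    states = sorted(zip(energies, densities))  # lowest energy first
    (e0, eta0), excited = states[0], states[1:]
    rt = R_CAL * temperature
    denominator = sum((eta / eta0) * math.exp(-(e - e0) / rt) for e, eta in excited)
    return 1.0 / denominator


# Invented toy spectra: closely spaced states versus a large gap above the ground state.
closely_spaced = [0.0, 100.0, 200.0, 300.0, 400.0]
gapped = [0.0, 3000.0, 3100.0, 3200.0, 3300.0]
print(foldability(gapped) > foldability(closely_spaced))
```

With a large gap the Boltzmann factors of the excited states become small, the denominator shrinks, and Fc grows. This is the qualitative behavior the text attributes to the high-foldability fragment.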

Figure 4. Amino acid diversity δa depicted as a positional function of the sequence for an exon sequence downloaded from GenBank (http://www.ncbi.nlm.nih.gov). The same exon sequence was shuffled, as described in the Theoretical Basis section. The window size is nine amino acid residues. GenBank identification no. 17986268.

The distribution of δa in protein sequences was investigated with a database of 8566 exons downloaded from the GenBank web site (http://www.ncbi.nlm.nih.gov), containing a total of 3 998 459 amino acid residues. The results are represented in Figure 5, where f(δa) × δa plots are shown (see the Theoretical Basis section for a definition of f(δa)). The intron and randomized exon databases contain the same number of amino acid residues.

There is in Figure 5 a small difference between the exons curve and the shuffled exons curve. As shown in Figure 4, as long as the proportions listed in Table 1 are maintained when a protein sequence is shuffled, some δa maxima disappear and other δa maxima are created. The overall δa distribution is, nevertheless, maintained, apart from a small variation on the right-hand side of the δa distribution, where the curve corresponding to shuffled exons is above the curve corresponding to exons.

It has been argued in the above discussion that maximum δa regions tend to carry a greater foldability. In fact, Figures 1 and 2 demonstrate this tendency for short peptide fragments. The location of these regions along a protein sequence is, however, very important. Thus, even if an exon and a shuffled exon have approximately the same number of δa maxima, in one case the distribution of δa maxima imparts structure to the exon sequence, and in the other case it does not.

Figure 5 also shows that the distributions corresponding to exons and to totally randomized exons are the most different. The maximum of the totally randomized exons curve is shifted to the right, a consequence of the proportions in the upper part of Table 1 being increased. Although this is only a test case, it shows that the intron distribution of δa lies midway between the exon and totally randomized exon distributions. Exons and introns are, therefore, different with respect to the amino acid diversity function defined in eq 1. The average value of δa for exons is 14.4, and the standard deviation of δa in the exon distribution is 0.413. In Figure 5, the difference between the f(δa) maxima for exons and introns is above one standard deviation of δa. The difference between exons and introns is clear and statistically significant. It is not large enough, however, to distinguish isolated exons from isolated introns with a reliability close to 75%.
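The comparison of f(δa) maxima in units of the standard deviation can be sketched as below. This is an illustrative reading only: the function names, the bin width, the use of the population standard deviation, and the two short lists of δa values are all assumptions made for the example and are not data from the exon or intron databases.

```python
import math
from collections import Counter
from statistics import pstdev


def frequency_distribution(values, bin_width=0.5):
    """f(delta_a): number of window values falling in each bin (keyed by bin index)."""
    return Counter(math.floor(v / bin_width) for v in values)


def mode_position(values, bin_width=0.5):
    """Center of the most populated bin, i.e. the position of the f(delta_a) maximum."""
    counts = frequency_distribution(values, bin_width)
    best_bin = max(counts, key=counts.get)
    return (best_bin + 0.5) * bin_width


def mode_shift_in_sd(reference, other, bin_width=0.5):
    """Distance between the two maxima, in standard deviations of the reference set."""
    shift = mode_position(other, bin_width) - mode_position(reference, bin_width)
    return shift / pstdev(reference)


# Invented toy values standing in for per-window delta_a results.
reference_values = [13.9, 14.3, 14.4, 14.4, 14.9]
other_values = [14.4, 14.8, 14.9, 14.9, 15.3]
print(mode_shift_in_sd(reference_values, other_values) > 1.0)
```

A shift of the maximum by more than one standard deviation separates two distributions clearly in aggregate, yet the distributions still overlap substantially. This is consistent with the statement that single exons and introns cannot be told apart reliably on this basis.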

Figure 5. Frequency of 9mer peptide fragments f(δa) corresponding to the amino acid diversity δa versus δa: black, exons; blue, introns; red, shuffled exons; brown, totally randomized exons. Na, the database size in 9mer peptide fragments, is 3 998 972 for exons, shuffled exons, and totally randomized exons and 3 117 778 for introns.


Conclusion

Figure 1 is ultimately a comparison between experimental and theoretical data that reveals an increasing tendency of the foldability with the content of low background frequency amino acids. Independent evidence in support of this tendency is given by the finding (ref 13) of persistently conserved positions in sequentially dissimilar and structurally similar proteins. In ref 13, the relative importance of persistently conserved amino acids is evaluated by their log odds ratios, which follow the order W > C > G > Y > F > P > D > I > V, N, L > M > E > R > H, K, T, A > Q > S. A comparison with Table 1 shows that there is a partial agreement with the order of background frequencies. Furthermore, it has been shown that persistently conserved positions contribute to the stabilization of secondary structure, the same conclusion obtained from Figures 1 and 2. As part of the theoretical argument, the comparison between random and original exon sequences is used to demonstrate a deviation from randomness that has also been found in previous (ref 14) investigations.

References

(1) Barrai, I.; Volinia, S.; Scapoli, C. Int. J. Pept. Protein Res. 1995, 45, 326-331.
(2) Panchenko, A. R.; Luthey-Schulten, Z.; Cole, R.; Wolynes, P. G. J. Mol. Biol. 1997, 272, 95-105.
(3) Klimov, D. K.; Thirumalai, D. Proteins 1996, 26, 411-441.
(4) Burset, M.; Guigo, R. Genomics 1996, 34, 353-367.
(5) Zimmerman, S. S.; Pottle, M. S.; Nemethy, G.; Scheraga, H. A. Macromolecules 1977, 10, 1-9.
(6) Jacchieri, S. G.; Jernigan, R. L. Biopolymers 1992, 32, 1327-1338.
(7) Jacchieri, S. G. In Combinatorial Materials Development; Malhotra, R., Ed.; ACS Publications: Washington, DC, 2002; Chapter 6.
(8) Jacchieri, S. G. Mol. Diversity 2000, 5, 145-152.
(9) Hobohm, U.; Sander, C. Protein Sci. 1994, 3, 522-524.
(10) Jacchieri, S. G. Int. J. Quantum Chem: Quantum Biol. Symp. 1992, 19, 255-272.
(11) Sali, A.; Shakhnovich, E.; Karplus, M. J. Mol. Biol. 1994, 235, 1614-1636.
(12) Wilbur, W. J.; Major, F.; Spouge, J.; Bryant, S. Biopolymers 1996, 38, 447-459.
(13) Friedberg, I.; Margalit, H. Protein Sci. 2002, 11, 350-360.
(14) Holmquist, R. J. Mol. Evol. 1978, 11, 349-360.
