Latent Periodicity of Protein Families, Identified with the Indel-Aware Algorithm

Andrew A. Laskin, Konstantin G. Skryabin, and Eugene V. Korotkov*

Bioengineering Center of Russian Academy of Sciences, Prospect 60-tya Oktyabrya, 7/1, 117312, Moscow, Russia

Received June 30, 2006

Abstract: Latent amino acid repeats seem to be widespread in genetic sequences and to reflect their structure, function, and evolution. We have recently identified latent periodicity in more than 150 protein families, including protein kinases and various nucleotide-binding proteins. The latent repeats in these families were correlated with their structure and evolution. However, in a majority of known protein families, our latent periodicity search algorithm found no periodicity. The presumable main reason for this was the inability of our techniques to identify periodicities interspersed with insertions and deletions. We designed a new latent periodicity search algorithm, which is capable of taking insertions and deletions into account. As a result, we identified many novel cases of latent periodicity peculiar to protein families. Possible origins of the periodic structure of these families are discussed. In summary, we presume that latent periodicity is present in a substantial portion of known protein families. The latent periodicity matrices and the results of Swiss-Prot scans are available from http://bioinf.narod.ru/del/.

Keywords: periodicity • repeats • functional annotation • information decomposition • cyclic alignment • profile • protein families

* To whom correspondence should be addressed. Tel: +7-495-1352161. E-mail: [email protected].

Journal of Proteome Research 2007, 6, 862-868. Published on Web 01/03/2007.

Introduction

Functional annotation of amino acid sequences is a pressing problem, for two reasons. First, high-throughput genome projects have produced complete genomes for a large number of organisms. To make full use of these data, we need to identify the functions of the annotated protein sequences and assign them to functional families. Second, our current ability to predict protein function is limited; for example, only 44% of proteins in the current version of Swiss-Prot1,2 are assigned to families using the PIR tools.3 In addition, many prediction errors are detected, and more are expected to appear,4 when only one method of prediction (usually sequence homology or motif searching) is used. The reliability of annotation may be greatly increased if two or more algorithmically independent techniques give the same functional prediction for a protein. Thus, there is a strong need for investigations of function-specific sequence features and properties that cannot be reduced to homology or motif searching.

Periodicities in amino acid sequences were recently investigated in detail.5-13 The techniques used in most of those investigations were based on Fourier transformations or dynamic programming. However, these techniques were shown to miss many biologically important cases of periodicity, which we called latent.14 Latent periodicity of genetic sequences is one such function-specific sequence feature, since we have previously shown that amino acid sequences with identical biological function often have identical types of latent periodicity.15-17 We propose that latent periodicity originated from the evolution of genetic sequences via multiple tandem duplications18,19 and may therefore be specifically correlated with biological function. If function-specific latent periodicity is fairly common among protein families, it may be of great use, since one would be able to identify the enzymatic function of an amino acid sequence using a set of function-specific periodicity profiles and the Cyclic Alignment (CA) technique.15

For this to be realized, we have to demonstrate that functionally similar proteins share common types of periodicity; this was partly demonstrated in our recent investigations (cited above). Two novel techniques were used in these investigations, namely, the Information Decomposition (ID) and the Noise Decomposition (ND) techniques. We combined these two techniques because the ID technique, in its original form, is not able to identify periodicity in an amino acid sequence when it is interrupted by insertions and deletions. On its own, ID therefore identified latent periodicity only in a minor fraction of the amino acid sequences in a family. But with the identified latent periodicity cases, we are able to generate the initial periodicity matrix, and then we may use the CA and ND techniques to adjust the matrix and to identify periodicity in a major fraction of the family. These techniques, used in conjunction, allowed us to find about 150 latently periodic protein families, ranging from tens to thousands of protein sequences. In all cases, the periodicity was identified in more than 70% of proteins in the family. Each of these periodicity cases has its own period length, periodicity matrix, and the position-weight matrix derived from it. These position-weight matrices may be used to identify amino acid sequences with identical biological function in Swiss-Prot or any other sequence database with high (R

where M is some amino acid affinity matrix, such as PAM or BLOSUM. This expression may be rewritten in the form of a sum over amino acid types:

$$W_i = \frac{1}{2} \sum_{j,k} n_{i,j} \left( n_{i,k} - \delta_{kj} \right) M(j,k) \qquad (3)$$

where j and k are amino acid types and $n_{i,j}$ are amino acid frequencies, that is, the numbers of amino acids of type j at position i. We proposed14,24 another measure of similarity, which is based on concepts of information theory and called “information content”:

$$W_i = \sum_{j} \frac{n_{i,j}}{N f_j} \ln \frac{n_{i,j}}{N f_j} \qquad (4)$$

where $f_j$ is the frequency of occurrence of amino acids of type j in the whole set of sequences. These measures are clearly different; thus, an alignment may score high under the information-theoretic measure while scoring low under the homology-based measure, and vice versa. However, a “high” score means nothing in itself, especially when comparing weights calculated with different measures. We have to make sure that the obtained value of W is far above the values calculated for sets of random, unrelated sequences. To do this, we shuffle the initial sequences and calculate either the p-value or the Z-value of the obtained alignment; low p-values, or high Z-values, show significant similarity between sequences S1, S2, ..., SN. One usually sets some threshold value which, if exceeded, is taken to mean that the similarity has not arisen by chance. When S1, S2, ..., SN are consecutive equal-length slices of the sequence under investigation, significant similarity between them means significant periodicity in this sequence. As noted above, different similarity measures result in different weights and different significance values; in some cases, the periodicity of a sequence may be apparent from the information-theoretic viewpoint while being missed by homology searches. We call this effect latent periodicity. In our previous studies14,24-26 we have shown that such latent periodicity is present in many sequences of biological importance.

Let us illustrate this with an example. Assume that the latent period is 7 symbols long and that, at each position of the period, the amino acids listed in Table 1 are encountered with equal probability; for example, the letters R, M, I, E, H, T, P, N, D, A, Q, G, Y, C, K, and L are equiprobable at positions 1 + 7m (m = 0, 1, 2, ...), and other letters are absent at these positions. Table 1 shows the artificial matrix used to represent the concept of latent periodicity.
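The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the residue sets are copied from Table 1, while the function names `generate_latent_periodic` and `satisfies_table`, the sequence length, and the random seed are assumptions introduced here.

```python
import random

# Allowed residues at each of the 7 period positions, copied from Table 1.
TABLE1_SETS = [
    "RMIEHTPNDAQGYCKL",
    "RTKMFLGWVACDPNSH",
    "RMTSAPDLVQIFKEYN",
    "QAHKPLCYDWTSVEIG",
    "RDTWVIAKNQPMFCGY",
    "PCGTAYHSMQDIRKEF",
    "LDEHNTFVQWYRSPMI",
]


def generate_latent_periodic(length, sets=TABLE1_SETS, seed=0):
    """Draw each residue uniformly from the set allowed at its period position."""
    rng = random.Random(seed)  # hypothetical fixed seed, for reproducibility only
    period = len(sets)
    return "".join(rng.choice(sets[i % period]) for i in range(length))


def satisfies_table(seq, sets=TABLE1_SETS):
    """Check that every residue belongs to the set of its period position."""
    period = len(sets)
    return all(ch in sets[i % period] for i, ch in enumerate(seq))


seq = generate_latent_periodic(210)  # 30 periods of length 7 (arbitrary choice)
```

Because each position admits 16 of the 20 amino acids, a sequence generated this way obeys the table by construction, yet its consecutive 7-residue slices need not resemble each other residue by residue.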
One possible sequence satisfying the conditions in the table is the following:

GGLPPCDPTSWDSTLKVVPFQMHYYMQTCNFHRFSTVPTACNHRTIPAIPGMYRRYKPEIRCFRVQIFAIRVLQKRDHANCAYRGAYKFAYIAPIDAMHHYYRQQTALHTHVHVYTFILGHTYPTQDKDSMGQDDKVIKNTNRVAGTTTTGTPRYSAQPHTKCAGCMHGMKERIVTGPGRDPPKNATKQDCMQISYDAPQDADNHSSYYE

Table 1. The Set of Amino Acids Used for Generation of a Latently Periodic Sequence with a 7 aa Long Period(a)

Position in the period    Set of the amino acids at the position
1                         RMIE HTPN DAQG YCKL
2                         RTKM FLGW VACD PNSH
3                         RMTS APDL VQIF KEYN
4                         QAHK PLCY DWTS VEIG
5                         RDTW VIAK NQPM FCGY
6                         PCGT AYHS MQDI RKEF
7                         LDEH NTFV QWYR SPMI

(a) At each position, the frequencies of mentioned amino acids were chosen equal.

Let us assess the probabilities α (of meeting such, or better, homology between repeats by chance) and β (of obtaining the observed, or higher, value of mutual information by chance) for this sequence using the Monte Carlo technique. For this purpose, we calculated the alignment score W from eqs 1-3 using the BLOSUM50 affinity matrix, as well as the mutual information of the sequence24 (eq 4). Then we generated 200 random shuffles of the sequence with the same amino acid composition. For each of these sequences, the similarity score W from eq 1 and the mutual information were obtained. Then we calculated the means and the variances of both W and mutual information and obtained Z-scores using the following formula:

$$Z = \frac{S_{\mathrm{real}} - E(S)}{\sqrt{D(S)}} \qquad (5)$$

where $S_{\mathrm{real}}$ is the value obtained for the original sequence, and $E(S)$ and $D(S)$ are the mean and the variance over the shuffled sequences. For our sample sequence, the calculated Z-values were equal to 1.2 and 3.4, respectively. That is, the probability α is roughly estimated as 0.12, and the probability β is roughly estimated as 0.00087. These values of α and β illustrate that homology-based algorithms of searching for repeats are unable to identify such periodicities at a statistically significant level.

Figure 1. The framework of searching for latent periodicity using the iterated profile analysis.

Modified Information Decomposition. The outline of the combination of techniques we used in this study to search for latent periodicities is presented in Figure 1. We searched for initial cases of latent periodicity in the set of consensus sequences of protein domains taken from the ProDom database.27 Because of limitations of calculation time, we randomly chose only 200 domain consensus sequences. Only sequences ranging from 50 to 300 aa in length were used for analysis. Our exhaustive search covered up to 3 possible indels of up to 13 amino acids in these sequences (we did not consider larger numbers of indels because of the vast amount of computational time required for straightforward enumeration). Since the actual number of identified ProDom consensus sequences is about 1000 times higher, we believe that the actual number of periodicities still hidden in the domain databases is enormous; uncovering them is a question of both improvements to the algorithm and extension of processing capabilities.

The obtained sequences with indels were analyzed for the presence of latent periodicity as follows: for all possible period lengths from 2 aa to 1/3 of the sequence length, we performed the ID procedure.14,16,24 The basic idea of the Information Decomposition of a sequence $a \equiv \{a_k\}$ of length $L_{seq}$ over an alphabet $A \equiv \{A_i\}$ of cardinality n is to compare this sequence with an artificial periodic sequence. If we are searching for periodicity of length L, the artificial sequence is defined over the formal alphabet $S \equiv \{S_j\}$ of cardinality L as follows: $s = S_1 S_2 \ldots S_L S_1 S_2 \ldots S_L S_1 \ldots$; its length is equal to the length of the analyzed sequence, that is, $L_{seq}$. To determine the quantitative measure of similarity, we fill the coincidence matrix M with dimensions n × L, its (i,j)-th element being equal to the number of co-occurrences of symbols $A_i$ and $S_j$ at the same positions in sequences a and s:

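$$
M = \begin{array}{c|cccc}
 & S_1 & S_2 & \cdots & S_L \\ \hline
A_1 & m_{11} & m_{12} & \cdots & m_{1L} \\
A_2 & m_{21} & m_{22} & \cdots & m_{2L} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A_n & m_{n1} & m_{n2} & \cdots & m_{nL}
\end{array}
$$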

Using this matrix we calculate the mutual information

$$I = \sum_{i=1}^{n} \sum_{j=1}^{L} m_{ij} \ln m_{ij} - \sum_{i=1}^{n} x_i \ln x_i - \sum_{j=1}^{L} y_j \ln y_j + L_{seq} \ln L_{seq}$$

where $x_i$ and $y_j$ are the frequencies of occurrence of different symbols in sequences a and s:

$$x_i = \sum_{j=1}^{L} m_{ij}; \qquad y_j = \sum_{i=1}^{n} m_{ij}$$
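The mutual information above, together with the shuffle-based Z-value of eq 5, can be sketched as follows. This is a minimal, indel-unaware illustration under stated assumptions, not the authors' C++ implementation: the function names `mutual_information` and `z_value` are hypothetical, the default of 200 shuffles is borrowed from the worked example earlier in the text, terms of the form 0 ln 0 are taken as 0, and the enumeration of indels performed by the modified technique is not attempted.

```python
import math
import random
from collections import Counter


def mutual_information(seq, period):
    """I = sum m ln m - sum x ln x - sum y ln y + Lseq ln Lseq (0 ln 0 := 0)."""
    def xlogx(v):
        return v * math.log(v) if v > 0 else 0.0

    # m[(symbol, j)]: co-occurrences of a symbol with period position j.
    m = Counter((ch, pos % period) for pos, ch in enumerate(seq))
    x = Counter(seq)                                    # row sums x_i
    y = Counter(pos % period for pos in range(len(seq)))  # column sums y_j
    return (sum(xlogx(v) for v in m.values())
            - sum(xlogx(v) for v in x.values())
            - sum(xlogx(v) for v in y.values())
            + xlogx(len(seq)))


def z_value(seq, period, shuffles=200, seed=0):
    """Z = (I_real - mean) / sqrt(variance), statistics taken over shuffles."""
    rng = random.Random(seed)  # hypothetical fixed seed, for reproducibility only
    real = mutual_information(seq, period)
    letters = list(seq)
    samples = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # preserves the amino acid composition
        samples.append(mutual_information(letters, period))
    mean = sum(samples) / shuffles
    variance = sum((s - mean) ** 2 for s in samples) / (shuffles - 1)
    return (real - mean) / math.sqrt(variance)


demo = "ACDEFGH" * 20  # a perfectly periodic toy sequence with period 7
```

For the toy sequence, every symbol occurs at exactly one period position, so the expression reduces to Lseq ln 7; scanning `period` from 2 to one-third of the sequence length would give a decomposition spectrum of the kind described in the text.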

To estimate the statistical significance of the obtained value of mutual information, the sequence a is shuffled, and the mutual information values for the shuffled sequences are calculated. The mean and the variance of mutual information and the Z-value are calculated as usual. High values of Z (Z > 6.0) were considered as evidence for the presence of latent periodicity in the analyzed sequence.

Nearly 20% of the found amino acid sequences contained homologous periodicities; that is, homology search techniques were also able to identify repeats in them. We excluded those cases from our analysis because RADAR22 and REPRO28 are good tools for investigating them. We also excluded the cases where latent periodicity was identifiable without indels because these cases were investigated in our earlier studies.15-17,24 The identified periodicities then provided the initial data for searching for similar periodicities.

Cyclic Alignment and Scanning of the Swiss-Prot Data Bank. The results of Modified Information Decomposition were presented in the form of matrix M, which contains the numbers of occurrences of different amino acids at different positions in a period of certain length. These matrices were converted to position-weight matrices (also called profiles) using the log-odds formula;29 the profiles were then used for scanning of the Swiss-Prot data bank to find sequences with similar types of latent periodicity. The scanning utilized a substantially faster technique called locally optimal cyclic alignment.15,16

The cyclic alignment may be considered as matching of a sequence to a virtual periodically elongated pattern such as “...QWERTYQWERTYQWERTYQWERTY...”; of course, in most cases, only a part of the sequence will be matched (i.e., the cyclic alignment will be local). Our main idea is to present cyclic alignment in the form of a path that connects the nodes of a two-dimensional cylindrical lattice, where one of the coordinates (the linear one) corresponds to position in the sequence, and the other (the cyclic one) corresponds to position in the cyclic profile (compare to conventional sequence alignment, which can be presented in the form of a path between the nodes of a flat two-dimensional lattice, the coordinates being the positions in the compared sequences). This path contains diagonal steps, which describe matching of a symbol from the sequence and a position of the profile, as well as steps along the axes, which describe indels. Every such path has a total score, which is the sum of gap costs and weights of symbol-to-position matches. The optimal cyclic alignment is the path with the highest possible total score. We have shown15 that it can be found by means of cell-by-cell filling of the similarity matrix $S_{i,j}$, in which one of the indices (for instance, i) is cyclic or wrapped, namely, $S_{i-L,j} \equiv S_{i+L,j} \equiv S_{i+2L,j} \equiv \ldots \equiv S_{i,j}$, where L is the period length, just as we find the best linear alignment using the Smith-Waterman formula.30 The formulas for recursive filling of $S_{i,j}$ are

$$S_{i,j} = \max\left\{ S'_{i,j},\ \max_{1 \le k \le L-1} \left[ S'_{(i-k),j} - d_k \right] \right\}$$
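A minimal sketch of this cell-by-cell filling is given below. It combines the recursion for S with the companion formula for S′ of eq 6 and uses the affine gap penalties quoted in this section (3.8 for opening, 0.7 for extension). Several details are assumptions made here rather than statements about the authors' software: the gap cost is taken as d_k = open + (k - 1) × extension, the profile is represented as a list of residue-to-weight dictionaries, the toy match and mismatch weights are arbitrary, and the names `toy_profile` and `cyclic_local_score` are hypothetical. The sketch returns only the best local score and omits the traceback.

```python
def toy_profile(pattern, match=2.0, mismatch=-1.0):
    """Toy position-weight matrix: one dict of weights per period position."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    return [{a: (match if a == p else mismatch) for a in alphabet}
            for p in pattern]


def cyclic_local_score(seq, profile, gap_open=3.8, gap_ext=0.7):
    """Best local score of seq against a cyclically repeated profile."""
    L, n = len(profile), len(seq)
    # d[k]: assumed affine cost of a gap of k successive symbols.
    d = [0.0] + [gap_open + gap_ext * (k - 1) for k in range(1, max(L, n) + 1)]
    # S[j][i]: j symbols of seq consumed, i = cyclic profile position.
    S = [[0.0] * L for _ in range(n + 1)]
    best = 0.0
    for j in range(1, n + 1):
        sym = seq[j - 1]
        s_prime = [0.0] * L
        for i in range(L):
            # Diagonal step; index i - 1 wraps around the cylinder.
            # Residues missing from the profile get weight 0 (assumption).
            match = S[j - 1][(i - 1) % L] + profile[i].get(sym, 0.0)
            # Step along the sequence axis: k symbols of seq left unmatched.
            seq_gap = max(S[j - k][i] - d[k] for k in range(1, j + 1))
            s_prime[i] = max(0.0, match, seq_gap)
        for i in range(L):
            # Step along the cyclic axis: k profile positions skipped.
            prof_gap = max((s_prime[(i - k) % L] - d[k] for k in range(1, L)),
                           default=float("-inf"))
            S[j][i] = max(s_prime[i], prof_gap)
        best = max(best, max(S[j]))
    return best


profile = toy_profile("QWERTY")
perfect = cyclic_local_score("QWERTY" * 3, profile)
one_deletion = cyclic_local_score("QWERTY" + "QWRTY" + "QWERTY", profile)
```

With the toy weights, three exact repeats score 2 per matched residue, while a repeat that lacks one residue costs a single gap opening and still aligns in phase across the period boundary, which is the behavior the wrapped index is meant to provide. The direct maximization over k keeps the sketch close to the formulas; a production implementation would use the usual running-maximum trick for affine gaps.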

Here the auxiliary matrix $S'_{i,j}$ is

$$S'_{i,j} = \max\left\{ 0,\ S_{i-1,j-1} + w_{i,j},\ \max_{1 \le k \le j} \left[ S_{i,j-k} - d_k \right] \right\} \qquad (6)$$

where $w_{i,j}$ is the weight of the j-th symbol in the sequence at the i-th position in the cyclic profile, and $d_k$ is the gap penalty for insertion/deletion of k successive symbols (we used affine penalties with values of 3.8 for gap opening and 0.7 for gap extension). As usual, to find the optimal local alignment, we have to identify the highest element of the S-matrix and trace the path back from it to the first zero element. The value of the highest element is the total score of the optimal local alignment; this is the value we used to check whether the alignment is statistically significant.

The Monte Carlo technique was used to assess the statistical significance of alignments. The assessment was performed independently for each sequence, taking into account its length and composition. To assess the statistical significance of an alignment, we aligned shuffled sequences against the same cyclic profile. The mean and the variance of the alignment score and the Z-value are calculated as usual. High values of Z (Z > 6.0) were considered as evidence for the presence of latent periodicity in the analyzed sequence. Our numerical experiments showed that we are unlikely to observe Z-values greater than 6.0 in a random test sequence set with the total number of symbols equal to the total number of amino acids in Swiss-Prot.

Noise Decomposition and Profile Refinement. Iterative searches were performed to improve the sensitivity and specificity of Swiss-Prot scanning. We also applied the Noise Decomposition technique16 where it was required to separate two (or more) types of latent periodicity that correspond to separate protein families but have the same period length. The effectiveness of this technique was demonstrated in the above-mentioned paper, where it allowed us to separate periodicity profiles for serine-threonine and tyrosine protein kinases with 97% specificity. The formulas for Noise Decomposition are

$$\pi_{i,j} = c_0 f_i + \sum_{k=1}^{N} c_k q^{k}_{i,j}, \qquad W_{i,j} = \ln \frac{r_{i,j}}{\pi_{i,j}} \qquad (7)$$

where $r_{i,j}$ are weighted frequencies of occurrence of different amino acids (enumerated with i) at different positions in the period (j) in the protein family of interest, $q^{k}_{i,j}$ are weighted frequencies of occurrence of different amino acids at different positions in the k-th family having latent periodicity with the same period length, $f_i$ are amino acid frequencies in Swiss-Prot, $W_{i,j}$ are the elements of the new position-weight matrix (profile), and $c_k$ are weighting coefficients summing to 1.0 (usually we chose $c_0$ = 0.75/0.8, and the other coefficients were proportional to the share of the corresponding family in the results of scanning). Then, we used the new profile for the next iteration of Swiss-Prot scanning. We stopped iterations when we had identified as many family members as possible (usually 3-5 iterations were enough). We considered the periodicity to be family specific if we achieved at least 90% specificity, that is, if 90% or more of the results of scanning belonged to the same family (the affiliation of a protein with a family was determined from InterPro31 identifiers and Swiss-Prot descriptions2).

All the algorithms were implemented in C++ for a cluster-type supercomputer (Modified Information Decomposition) and for a single PC (cyclic alignment and Noise Decomposition). The software is available upon request.

Results and Discussion

Table 2 presents the newly identified families with family-specific latent periodicity. The data show that we have identified latent periodicity in more than 80% of members of each family. The presented results reveal that the “direct”, indel-unaware ID technique actually misses many cases of latent periodicity, which may be function-specific. In these cases, the initial periodicity matrix16 remains uncalculated. The data shown in Table 2 demonstrate that the property of latent periodicity may be rather prevalent in protein families.

In earlier papers15,16,24 we noticed that, in many cases, latent periodicity is associated with secondary structure regularities (as is the case for nucleotide-binding α/β domains and parallel β-helices) or with peculiarities of the inner substructure of domains (as is the case for protein kinases). This is also true for some of the new periodicities found in this study. For example, in the family of cyclins, which control the cell division cycle and are α-proteins, the period of 23 amino acids comprises a helix and a loop. Examples of cyclic alignments of four latently periodic protein sequences, belonging to different periodic families with various period lengths, are presented in Figure 2. The presence of latent periodicity in these sequences after Cyclic Alignment is confirmed by Information Decomposition, and the decomposition spectra are shown in Figure 3.

Table 2. The Protein Families with Found Latent Periodicities(a)

Family                                                    Period L   Identified   Total   InterPro
Cytochrome c oxidase, subunit 1                           10         147          193     IPR000883
Chaperone, Hsp70-like                                     10         411          474     IPR001023
Tubulin                                                   11         331          361     IPR000217
Maturase K (Intron maturase ???)                          12         458          499     IPR002866
Trans-activating protein X                                13         18           18      IPR000236
Guanine nucleotide-binding protein G(f), alpha subunit    15         161          169     IPR001019
NADH-plastoquinone oxidoreductase, chain 5                15         137          149     IPR003945
MHC class I, alpha chain, alpha1 and alpha2               16         140          151     IPR001039
ATP synthase F1, beta subunit                             18         147          151     IPR005722
FK506 binding protein                                     19         65           65      IPR001179
Retroviral nucleocapsid protein Gag                       21         82           87      IPR000721
Myosin head, motor region                                 21         112          119     IPR001609
Cyclin                                                    23         134          160     IPR006670
Response regulator receiver                               26         395          448     IPR001789
Myosin heavy chain                                        27         107          119     IPR001609
E1-E2 transporting ATPase                                 27         249          282     IPR008250
UMUC-like DNA-repair protein                              29         81           87      IPR001126
Kinesin, motor region                                     31         113          118     IPR001752
Glycoprotein VP7                                          31         48           48      IPR001963
Chaperonin Cpn60                                          34         503          530     IPR002423

(a) The families are defined with their Swiss-Prot descriptions and InterPro identifiers. ‘Total’ represents the count of family members in Swiss-Prot release 47, and ‘identified’ represents the count of identified family members with latent periodicity. Together with previously discovered protein families with latent periodicity, we now have about 200 family-specific periodicity cases.

Figure 2. Examples of cyclic alignments of four latently periodic protein sequences, belonging to different periodic families presented in Table 1. Positions in periods are denoted by numbers and small letters (‘a’ denotes position 10, etc.). Information decompositions of these cases are shown in Figure 3.

Figure 3. Information decomposition spectra for the cases presented in Figure 2. Peaks are clearly seen in these spectra, and they correspond to period lengths (highest peaks) and their multiples.

In this and previous16,17 studies of latent periodicities, we have shown that members of over 150 protein families may be recognized with our software. The probability of correct identification of function ranges from 80% to 100%, and the probability of misrecognition is roughly estimated by us as 10^-6. When the list of investigated latently periodic protein families is extended, our techniques could serve as a useful tool for annotation of various genomes.

In summary, we have shown in this study that the Modified Information Decomposition technique has extended our abilities to identify family-specific latent periodicities in protein databases. We may conclude that repeats and periodicity are far more common properties of protein sequences than was known before; we just have limited abilities to identify them. Our up-to-date results show that more than 200 different protein families have latent periodicity, and the number of these families is increasing each week as our investigations continue.17

Gatherer and McEwan32 investigated the periodicity in proteomes of various species and found that eukaryotic proteins are significantly richer in repeats than prokaryotic and archaeal ones, and that periodicity is unlikely to be a common phenomenon in proteomes. Indeed, it is easy to find that 9.6% of human proteins in Swiss-Prot contain explicit repeats (marked with the FT REPEAT tag), while only 0.8% of Escherichia coli proteins contain them. But the prevalence of weakly marked periodicity is almost the same in the E. coli, yeast, and human proteomes, and it stays at a similar level for all the species they investigated (even in archaea). Since the evolution of proteins in higher eukaryotes seems to be more rapid than that of lower organisms (with the human line at the peak), we suggest that they are richer in recent repeat events, especially in eukaryote-specific families; but the “background” latent periodicity seems to be almost equally expressed among the kingdoms. Hence, latent periodicity is a common feature of all living organisms, duplications have played a great role in the creation of contemporary protein families, and our ability to identify their traces depends strongly on the techniques we use.

At this time, the exact origin of the observed latent periodicities is not clear; however, we can propose a few possible explanations of this phenomenon. First, we may suppose that catalytic domains were initially much smaller than what we observe now.33,34 However, they were able to duplicate, and the duplications were properly arranged to form even more catalytically active domains. It is known that DNA sequence repeats facilitate replication errors at their location, thereby promoting new tandem repeats. We suppose that, as the number of repeats grew, the ancestor protein benefited; that is, its catalytic activity and structural stability increased. Subsequent mutations produced an even better packed structure of these domains and fine-tuned the functionality; at the same time, the mutations blurred the periodicity and erased the homology between distinct repeats.

The periodicities we revealed may also be significant for protein functioning from the physical viewpoint. The presence of periodicity in an amino acid sequence may lead to the formation of a resonant oscillation spectrum determined by the type of periodicity. Thermal energy may be accumulated at the peak frequencies of this spectrum, and the effective temperature at these peaks may be considerably higher than the temperature of the environment.35,36 Computation of these spectra is a part of the Fermi-Pasta-Ulam (FPU) problem.37 If these resonant frequencies are also substrate-specific, room temperature will be sufficient for reactions that usually require high temperatures to be carried out in vitro. If this hypothesis is correct, enzymes with new activities are easy to create: it is sufficient to try out a certain number of tandemly duplicated DNA chains as gene candidates in order to obtain the desired enzymatic activity in reasonable time. This hypothesis does not look self-contradictory; it accommodates the origin of genes by tandem duplication18 as well as the possibility of emergence of essential enzymatic activities within a finite amount of time.

Latent periodicity may also be involved in the stabilization of protein structure and in its proper folding. It is well-known that protein folding is supervised by chaperone proteins that bind to growing polypeptide chains.38,39 This binding is not strictly specific, but there are certain binding preferences, the main factors being the charge and hydrophobicity of amino acid sequence sites.40,41 We suppose that the periodic distribution of these properties along the sequence facilitates a uniform distribution of chaperones, and that such uniformity is required (or desirable) for fast and proper folding.

References

(1) Wu, C. H.; Apweiler, R.; Bairoch, A.; Natale, D. A.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Mazumder, R.; O’Donovan, C.; Redaschi, N.; Suzek, B. Nucleic Acids Res. 2006, 34, D187-D191.
(2) Junker, V. L.; Apweiler, R.; Bairoch, A. Bioinformatics 1999, 15, 1066-1067.
(3) Wu, C. H.; Huang, H.; Yeh, L. S.; Barker, W. C. Comput. Biol. Chem. 2003, 27, 37-47.
(4) Bork, P.; Koonin, E. V. Nat. Genet. 1998, 18, 313-318.
(5) Dodin, G.; Vandergheynst, P.; Levoir, P.; Cordier, C.; Marcourt, L. J. Theor. Biol. 2000, 206, 323-326.
(6) Jackson, J. H.; George, R.; Herring, P. A. Biochem. Biophys. Res. Commun. 2000, 268, 289-292.
(7) Rackovsky, S. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 8580-8584.
(8) Coward, E.; Drablos, F. Bioinformatics 1998, 14, 498-507.
(9) McLachlan, A. D. J. Phys. Chem. 1993, 97, 3000-3006.
(10) Herzel, H.; Trifonov, E. N.; Weiss, O.; Grosse, I. Physica A 1998, 249, 449-459.
(11) Weiss, O.; Herzel, H. J. Theor. Biol. 1998, 190, 341-353.
(12) Conway, J. F.; Parry, D. A. Int. J. Biol. Macromol. 1990, 12, 328-334.
(13) Heringa, J. Curr. Opin. Struct. Biol. 1998, 8, 338-345.
(14) Korotkov, E. V.; Korotkova, M. A.; Kudryashov, N. A. Phys. Lett. A 2003, 312, 198-210.
(15) Laskin, A. A.; Chaley, M. B.; Korotkov, E. V.; Kudryashov, N. A. Mol. Biol. 2003, 37, 663-673.
(16) Laskin, A. A.; Kudryashov, N. A.; Skryabin, K. G.; Korotkov, E. V. Comput. Biol. Chem. 2005, 29, 229-243.
(17) Turutina, V. P.; Laskin, A. A.; Skryabin, K. G.; Kudryashov, N. A.; Korotkov, E. V. J. Comput. Biol. 2006, 13, 946-964.
(18) Ohno, S. Evolution by Gene Duplication; Springer-Verlag: Berlin, 1970.
(19) Ohno, S. J. Mol. Evol. 1984, 20, 313-321.
(20) Benson, G. J. Comput. Biol. 1997, 4, 351-367.
(21) Benson, G. Nucleic Acids Res. 1999, 27, 573-580.
(22) Heger, A.; Holm, L. Proteins 2000, 41, 224-237.
(23) Andrade, M. A.; Ponting, C. P.; Gibson, T. J.; Bork, P. J. Mol. Biol. 2000, 298, 521-537.
(24) Korotkova, M. A.; Korotkov, E. V.; Rudenko, V. M. J. Mol. Model. 1999, 5, 103-115.
(25) Korotkov, E. V.; Korotkova, M. A. DNA Sequence 1995, 5, 353-358.
(26) Korotkov, E. V.; Korotkova, M. A.; Tulko, J. S. CABIOS, Comput. Appl. Biosci. 1997, 13, 37-44.
(27) Servant, F.; Bru, C.; Carrere, S.; Courcelle, E.; Gouzy, J.; Peyruc, D.; Kahn, D. Briefings Bioinf. 2002, 3, 246-251.
(28) George, R. A.; Heringa, J. Trends Biochem. Sci. 2000, 25, 515-517.
(29) Gribskov, M.; McLachlan, A. D.; Eisenberg, D. B. Proc. Natl. Acad. Sci. U.S.A. 1987, 84, 4355-4358.
(30) Smith, T. F.; Waterman, M. S. J. Mol. Biol. 1981, 147, 195-197.
(31) Mulder, N. J.; Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Binns, D.; Bradley, P.; Bork, P.; Bucher, P.; Cerutti, L.; Copley, R.; Courcelle, E.; Das, U.; Durbin, R.; Fleischmann, W.; Gough, J.; Haft, D.; Harte, N.; Hulo, N.; Kahn, D.; Kanapin, A.; Krestyaninova, M.; Lonsdale, D.; Lopez, R.; Letunic, I.; Madera, M.; Maslen, J.; McDowall, J.; Mitchell, A.; Nikolskaya, A. N.; Orchard, S.; Pagni, M.; Ponting, C. P.; Quevillon, E.; Selengut, J.; Sigrist, C. J.; Silventoinen, V.; Studholme, D. J.; Vaughan, R.; Wu, C. H. Nucleic Acids Res. 2005, 33, D201-205.
(32) Gatherer, D.; McEwan, N. R. J. Mol. Evol. 2005, 60, 447-461.
(33) Trifonov, E. N.; Berezovsky, I. N. FEBS Lett. 2002, 527, 1-4.
(34) Trifonov, E. N.; Kirzhner, A.; Kirzhner, V. M.; Berezovsky, I. N. J. Mol. Evol. 2001, 53, 394-401.
(35) Mirnov, V. V.; Lichtenberg, A. J.; Guclu, H. Physica D 2001, 157, 251-282.
(36) Fujisaki, H.; Bu, L.; Straub, L. E. In Normal Mode Analysis: Theory and Applications to Biological and Chemical Systems; Cui, Q., Bahar, I., Eds.; Chapman and Hall/CRC Press: Boca Raton, FL, 2005; pp 301-323.
(37) Ulam, S. Collected Papers of E. Fermi, Vol. 2; University of Chicago Press: Chicago, IL, 1965.
(38) Ruddon, R. W.; Bedows, E. J. Biol. Chem. 1997, 272, 3125-3128.
(39) Thulasiraman, V.; Yang, C. F.; Frydman, J. EMBO J. 1999, 18, 85-95.
(40) Takenaka, I. M.; Leung, S. M.; McAndrew, S. J.; Brown, J. P.; Hightower, L. E. J. Biol. Chem. 1995, 270, 19839-19844.
(41) Knarr, G.; Modrow, S.; Todd, A.; Gething, M. J.; Buchner, J. J. Biol. Chem. 1999, 274, 29850-29857.

PR0603203