Classification of Proteins Based on Minimal Modular Repeats: Lessons from Nature in Protein Design Brett M. Barney* Department of Chemistry and Biochemistry, 0300 Old Main Hill, Utah State University, Logan, Utah 84322 Received April 14, 2005
Proteins containing internal repeats within their primary sequence have received increased attention recently, as the extent of their presence in various organisms is recognized more fully, and their role in evolution is more thoroughly studied. Presented here is a technique used to detect and classify proteins based on a modular evolutionary phenomenon that results in a series of small internal repeats. The parameters chosen are based on a minimum segment of seven residues that result in simple functional scaffolds. The genomes and corresponding proteomes of a variety of eubacteria and archaea have been analyzed using an algorithm that searches prokaryotic genomes for proteins containing small conserved repeats assembled in a modular fashion similar to a recently characterized protein from the organism Nitrosomonas europaea. This analysis has revealed additional proteins present in N. europaea with similar modular characteristics. A further survey of a variety of organisms demonstrates that this evolutionary pathway has been utilized in other organisms as well, to yield a broad assortment of small modular proteins. A thorough description of the sequential characteristics of these modular proteins follows, along with a selection and discussion of the various proteins uncovered through this expanded search and analysis. Several databases of the proteins uncovered from this work and the program used to perform the search are available. Keywords: internal protein repeats • modular design • algorithm • small proteins • database
Introduction Evolution is a process of selective improvement based on inheritance, mutation, and reorganization of genes to meet changing needs. In the microbial world, genes are readily transferred between species through a variety of processes,1,2 and transferred genes that become beneficial to the survival of a rapidly evolving organism are often incorporated directly into the genome. Many genes evolve from previous functional scaffolds, which already exist as another protein within the genome. The diversity of this process is broad, and enhances the ability of organisms to adapt to a variety of environments. The incidence of genes that have borrowed significant sequence similarity from other genes is illustrated by the extensive number of proteins that have high similarity to other proteins with separate functions within a single genome. These can occur either through duplications of entire genes as protein paralogs,3 or as repeated sequences of amino acids within the same gene (internal repeat proteins)2-4 as discussed here. In contrast to the many families of proteins with similarities in either related or unrelated species, many of the genes that are uncovered with each new genome sequencing project are unique hypothetical proteins, or singletons, that have no known homologues, even in closely related species.2 In some cases, as described previously and addressed here, these unique * To whom correspondence should be addressed. Tel: (435) 797-7392. Fax: (435) 797-3390. E-mail:
[email protected]. 10.1021/pr050103m CCC: $33.50
2006 American Chemical Society
hypothetical proteins also contain internal repeats,5 resulting in a new class of internal repeat protein. Protein internal repeats come in a variety of forms, ranging from single residue repeats, through small tandem repeats, up to large domain repeats of hundreds of residues.6 The concept of protein evolution occurring by duplication of specific sequences within genes goes back several decades7 and cellular mechanisms resulting in repetitive proteins are well established.1,2,4,8 The topic of sequence repeats within single proteins has been reviewed in recent years,2,9 and a variety of algorithms3,10-13 have been developed to search for regions of high similarity, not only between different proteins,14 but also within individual proteins.3,10,13 Classes of protein repeats such as Leucine rich and ankyrin repeats are well established in the literature,9,15,16 as well as the relationships of many repeat proteins to different cellular functions.8 Repeating sequences have also been implicated as a major feature in certain classes of proteins, such as intrinsically unstructured proteins.17 In recent years, the topic of repeating sequences has become relevant for proteins related to various diseases,4,6 including the prion proteins.17-19 Estimates based on previous assessments show that the existence of repeats in the primary structure of proteins from several species is significant (about 14%).3,8 With the advent of modern whole genome sequencing projects, a great deal of information is now available for large scale data mining, and has been used to make many general and specific observations Journal of Proteome Research 2006, 5, 473-482
Published on Web 02/08/2006
research articles
Barney
about the processes used throughout evolution to transfer information and develop new proteins. Previous work with a small metal binding protein (SmbP) (NCBI protein accession number NP_842452) revealed a pattern of repeats based on a seven-residue motif with high conservation of specific residues.5 As a model, the seven-residue repeat unit is a common feature of coiled-coils,17,20,21 though coiled-coils are not believed to be a feature of SmbP. Seven-residue repeats have also been shown to be a feature of the transmembrane regions from a number of viral proteins.22,23 In this work, the modular features of SmbP were used to define the criteria for a simple modular protein, and to construct an algorithm which would analyze the primary sequences of small proteins to identify other proteins sharing similar features of modular design. This effort has resulted in a simple and easily amenable algorithm to search for similar modular proteins within the host organism, Nitrosomonas europaea. The search was further expanded to include a broad selection of prokaryotic genomes. A summary of these findings along with a database of proteins identified is provided. The program used to perform the searches is also freely available at http://cc.usu.edu/∼bbarney/homepage.htm.
Materials and Methods Protein Identification and Analysis Software. All proteins are cross-referenced in this work according to the National Center for Biotechnology Information (NCBI) accession numbers. The translated protein sequences in FASTA format for individual prokaryotic genomes were obtained through the NCBI microbial genomes website (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi), and were subsequently used as the query database for each genome. In addition, sequences from the Research Collaboratory for Structural Bioinformatics (RCSB) protein data bank24 were obtained as the raw FASTA protein primary sequence data, and modified to remove redundant sequences. Individual structures for a selection of the proteins identified were then obtained and viewed for comparison and determination of the roles repeats played in each structure. Multiple alignments of protein sequences were done using Multalin,25 while BLAST14 searches were accomplished using the NCBI web-based search engines. Identification of Proteins with Simple Modular Characteristics. The aim of this study was to search the large database of genome sequences for proteins with a primary sequence containing modular characteristics similar to those revealed in the protein SmbP.5 To perform such a search, the characteristics that describe the features of the primary sequence of this modular protein had to be defined. Four characteristics were selected as search criteria which define the modular nature of this protein. First, SmbP is a small protein of 117 residues (93 following removal of an N-terminus leader sequence which directs it to the periplasm).5 Size was also selected as a search constraint because increased size leads to an increased probability that additional sequences with similar compositions might occur simply by chance. The size requirement was thus a factor of the search, since small proteins with multiple repeats of high similarity were the target of this work.
So as not to bias the search unintentionally, and still include the parameter of size as a restriction, a limit of 250 residues was selected in this initial assessment to meet the requirement of the protein being small (300 residues was used as the criterion for proteins from
Journal of Proteome Research • Vol. 5, No. 3, 2006
Figure 1. Description of Program Algorithm. Above is an example of the calculations made using the Internal Repeat Search Program for the hypothetical protein NP_870412.1 from Rhodopirellula baltica. The top shows the protein sequence with specific residues labeled. The algorithm is performed using the specific parameters of score (based on the BLOSUM62 matrix in this example) using the seven residue sequence starting at residue 74 as the parent sequence (red), and listing any other seven residue segments before or after this sequence that meet a minimum score (here selected as 13). In addition, sequences followed immediately by an additional sequence meeting the same score requirement, get an additional score (here 10, given to segments beginning at 11, 94, and 74, the parent segment, which does not get scored against itself, but in this instance, receives the additional score for the segment at 81). Segments such as CCRTCEQ starting at 97 with a score of 17 are thrown out due to overlap of a higher scoring sequence (starting at residue 94 with a score of 34). The algorithm performs this calculation for each segment of seven residues, starting at 1, and ending here at 104, and generates the highest total scoring window of seven residues. Proteins meeting a minimal total score (150 using BLOSUM45 and 120 using BLOSUM62 in this work), were selected for further analysis.
the RCSB protein data bank to expand slightly the selection of proteins identified). The next characteristic of this modular protein was the seven-residue repeat. The Internal Repeat Search Program is built around a simple algorithm that analyzes the primary sequence of a protein in increments of seven tandem residue windows, each of which could serve as the parent sequence. The rest of the protein is then compared to the parent seven-residue sequence, searching for other sets of seven tandem residues within the protein with a high degree of similarity to the parent sequence, and quantifying a similarity score using the values obtained from either the BLOSUM45 or BLOSUM62 Scoring Matrix.26,27 Following each analysis, a score is generated for the specific parent seven-residue sequence window, and the parent sequence window shifts down one residue to repeat the entire analysis, so that the highest scoring window of seven residues for each protein can be determined (see example in Figure 1). In addition to SmbP containing a seven-residue repeat, it was also evident that some of these repeats occurred in tandem with the next repeat, and could include two or three sets of
Small Modular Proteins
Figure 2. Proteins from Nitrosomonas europaea with features similar to SmbP. This figure shows SmbP (A), and a selection of three other proteins uncovered from Nitrosomonas europaea using the algorithm described in this work. Shown are SmbP (NP_842452.1) (A), a hypothetical protein (NP_842243.1) (B), a hypothetical protein (NP_842232.1) (C), and a hypothetical protein (NP_842440.1) (D). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence which other segments were scored against are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.
seven-residue repeats, along with linker regions which also contain similarities to other linker regions in the protein (Figure 2A). On the basis of this attribute, an additional score was given to each repeat followed immediately by an additional seven-residue repeat to bias the selection of repeats which were arranged in a tandem manner. The algorithm may be biased in this manner, which adds to the score of the current set of seven-tandem residues. The program allows this feature to be selected, as was done in this work. The final criteria were chosen to eliminate overlap and multiple scoring arising from proteins with long sequences of low complexity (polyalanine regions or incremental repeats such as GAGAGAGA), which do not fit the desired unique seven residue repeats found in SmbP. This problem was a common occurrence within many of the proteins found based on the initial three selection criteria, and included the additional problem associated with single-residue repeating regions, or proteins consisting primarily of only two or three residues. While the algorithm does not ignore this possibility, it does eliminate the multiple-scoring problem that is a consequence of this problem. This is accomplished by scoring the highest frame first, and then eliminating any score for the six frames of seven-residues prior to, and following each scored seven-residue segment, and retaining only those sequences which do not overlap one another. An example of how a protein is scored for an actual protein sequence using this technique is illustrated in Figure 1. Software and Platform. The algorithm was designed for use on a single PC computer as a stand-alone application using Visual Basic (Microsoft) and the .net framework. Text files of the translated protein sequences from each of the genomes analyzed in this work were obtained through the NCBI genomes
site (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Specific genome sequences analyzed in this work are presented in the supporting information. Sequences in the FASTA protein format were first processed by a format correction algorithm, to correct for any misuse of the “]” symbol, which is used by the search algorithm to denote the end of the protein title, and the beginning of the coding sequence. Once processed and formatted, the entire proteome was analyzed based on the selection criteria described above. The software used to perform this analysis is available at http://cc.usu.edu/∼bbarney/homepage.htm. In addition, all proteins uncovered in the current database arranged in FASTA format are also available at the same site. Statistical Analysis. The statistical significance of protein similarity can be calculated by a number of techniques.28-30 One preferred technique is to generate multiple random sequences of the appropriate length and composition. This method was selected here, to determine the likelihood that multiple sets of sequences with similarity to the sequence identified as the parent seven-residue sequence from each protein, would occur simply by random events. To analyze the likelihood that this is the case, the Z-score for each identified parent sequence was calculated from a specific number of randomizations of the initial protein sequence. The search for sequences with a similarity to the parent sequence determined from the actual high score of the original sequence was done for each of the randomized sequences. In this manner, the composition and size of each analysis was maintained. The Z-score was calculated using the common statistical equation and the parent sequence found for the actual sequence. The same parent sequence was used to calculate a score for each of the randomizations, which were then used to calculate a final Z-score.
For the statistics calculated by the program, 100 randomizations were performed, and the data were plotted so that the percent of proteins above a specific Z-score in the two databases generated here could be compared for the analysis using either BLOSUM45 or BLOSUM62.
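The randomization procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Internal Repeat Search Program itself: the function names (`repeat_score`, `z_score`) are hypothetical, a simple identity score (+4 for a match, -1 for a mismatch) stands in for the BLOSUM matrices, the tandem bonus is omitted for brevity, and the sample standard deviation is assumed for the "common statistical equation", which the text does not spell out.

```python
import random
import statistics

WINDOW = 7  # repeat length used throughout this work


def repeat_score(parent, seq, min_score=13, parent_start=None):
    """Sum the scores of non-overlapping windows of `seq` that resemble `parent`.

    Assumption: +4 per identical residue and -1 per mismatch is a placeholder
    for a BLOSUM matrix. If `parent_start` is given, the parent window itself
    (and any window overlapping it) is not scored.
    """
    hits = []
    for q in range(len(seq) - WINDOW + 1):
        s = sum(4 if a == b else -1 for a, b in zip(parent, seq[q:q + WINDOW]))
        if s >= min_score:
            hits.append((s, q))
    kept = [] if parent_start is None else [parent_start]
    total = 0
    for s, q in sorted(hits, key=lambda h: (-h[0], h[1])):  # highest score first
        if all(abs(q - k) >= WINDOW for k in kept):
            kept.append(q)
            total += s
    return total


def z_score(seq, parent_start, n_random=100, seed=0):
    """Z-score of the real repeat score against composition-preserving shuffles."""
    parent = seq[parent_start:parent_start + WINDOW]
    actual = repeat_score(parent, seq, parent_start=parent_start)
    rng = random.Random(seed)
    residues = list(seq)
    random_scores = []
    for _ in range(n_random):
        rng.shuffle(residues)  # same length and composition as the original
        random_scores.append(repeat_score(parent, "".join(residues)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:  # every shuffle gave the same score
        return float("inf") if actual > mean else 0.0
    return (actual - mean) / sd
```

Shuffling the residues of the protein itself, rather than drawing from a background distribution, is what keeps the size and composition of each comparison fixed, as the text requires.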
Results and Discussion Algorithm Parameters and Development. The aim of this work was to analyze the features of the primary sequence of a recently characterized modular protein, and use these features as a model to develop a search algorithm to uncover other proteins with similar modular characteristics. The physical properties of the modular protein used as the model for this search have been reported,5 and characterization of this unique protein is continuing, while the interesting features of the primary sequence offer further avenues of study. To search for additional proteins with small modular characteristics, the characteristics were first defined in terms of general criteria, to develop an algorithm to search the extensive protein databases for other proteins with similar features. At first glance, the model protein SmbP can be described as a small protein with a unique seven-residue motif with an absolutely conserved histidine in the fourth position (Figure 2A). This motif is found 10 times in the primary sequence, existing in tandem in several locations. A closer examination of the repeats, aligning each repeat below the next and considering residues outside the seven-residue repeat region, reveals similarities between these “linker regions” as well. Viewed in the context of modular design, SmbP might rather be defined as a simple protein made up of seven or twelve residue modules, with each module containing the same level
of conservation within the first seven residues (Figure 2A). The algorithm used in this work was designed to search for repeats of seven-residue length, based on this minimal sized module from the model sequence on which it was based. As described under the materials and methods section, the selection criteria in this work are based on four specific requirements: (1) a maximum size of 250 residues for proteins surveyed in this work from various genome sequencing projects (300 residues from the RCSB protein data bank24), (2) a similarity comparison of seven-residue segments based on either the BLOSUM45 or BLOSUM62 matrix26,27 to calculate a simple score for each seven-residue segment, (3) an added score for sequences meeting the minimal score that are followed immediately by another set of seven residues that meets the same minimal score requirement, and (4) selection of only the highest scoring sequence of seven residues when multiple frames meet the minimum score requirement and overlap one another. This final requirement selects the highest scoring repeat within each thirteen-residue frame, and eliminates excessive scores for regions of low complexity (e.g., AAAGGGGGGAA). These selection rules allow the algorithm to generate a score for each protein in a genome, and narrow the analysis down to the few proteins which fit the numerical scoring requirements in this classification. A more extensive analysis as well as statistical comparisons for the significance of the repeat identified may then proceed on each individual basis. Database Construction From Prokaryotic Genomes. Once the algorithm was developed and demonstrated to be capable of identifying the repeats in the model sequence, the first aim was to search for additional proteins from Nitrosomonas europaea with similar modular features to those found in the model sequence of SmbP, as specified by the criteria stated above.
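Criteria (2) through (4) can be condensed into a short Python sketch. It is offered only as an illustration of the scoring logic as described in the text and in Figure 1, and several details are assumptions: the names `score_parent` and `best_parent` are hypothetical, an identity score (+4 match, -1 mismatch) is a placeholder for BLOSUM45 or BLOSUM62, the tandem bonus is added before overlapping frames are discarded, a segment immediately followed by the parent window is treated as tandem, and frames overlapping the parent window are not scored. The size limit of criterion (1) would be applied before calling these functions.

```python
WINDOW = 7  # length of the repeat unit


def pair_score(a, b):
    """Placeholder substitution score (an assumption, not a BLOSUM matrix)."""
    return 4 if a == b else -1


def score_parent(seq, p, min_score=13, tandem_bonus=10):
    """Total score for the parent window starting at index p (0-based)."""
    parent = seq[p:p + WINDOW]
    # Criterion 2: score every other seven-residue frame against the parent.
    raw = {}
    for q in range(len(seq) - WINDOW + 1):
        if q == p:
            continue  # the parent is not scored against itself
        s = sum(pair_score(a, b) for a, b in zip(parent, seq[q:q + WINDOW]))
        if s >= min_score:
            raw[q] = s

    def qualifies(q):
        return q == p or q in raw

    # Criterion 3: bonus for a frame immediately followed by a qualifying frame.
    scored = {q: s + (tandem_bonus if qualifies(q + WINDOW) else 0)
              for q, s in raw.items()}
    total = tandem_bonus if (p + WINDOW) in raw else 0  # parent may earn the bonus

    # Criterion 4: take the highest frame first, drop anything overlapping it.
    kept = [p]
    for q in sorted(scored, key=lambda q: (-scored[q], q)):
        if all(abs(q - k) >= WINDOW for k in kept):
            kept.append(q)
            total += scored[q]
    return total, sorted(kept)


def best_parent(seq, min_score=13, tandem_bonus=10):
    """Slide the parent window along the protein and keep the highest total."""
    best = (0, None, [])
    for p in range(len(seq) - WINDOW + 1):
        total, kept = score_parent(seq, p, min_score, tandem_bonus)
        if total > best[0]:
            best = (total, p, kept)
    return best
```

With these placeholder scores, a run of fourteen alanines is credited with only one non-overlapping frame besides the parent, which is the behavior the fourth criterion is meant to enforce for low-complexity regions.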
Though the algorithm is biased for repeats of seven residues, it is also capable of uncovering proteins with larger repeating units, though the scoring in this work is based only on the first seven. The specifics of the seven-residue segment analysis make it inferior in determining segments larger than 10, where other programs such as RADAR or TRUST10,13 perform much better, though in some instances, it will find these as well. An analysis of known and putative proteins from the genome of N. europaea revealed a number of proteins with small modular features. Two of these hypothetical proteins (Figure 2B,C), are similar in nature, with the former containing a highly conserved DQRF sequence within the repeat, and the latter containing a highly conserved DKRF sequence, and both are assembled in modules of seven or eleven residues. The other hypothetical protein (Figure 2D) is an assembly of 9, 10, or 11 residues, with varied degrees of conservation at each of the residues. In total, nine proteins were found in the N. europaea genome that met the score requirements used for the initial scan of all genomes, based on similarities as defined by the BLOSUM45 Matrix.26,27 Following the analysis of the N. europaea genome, a broadened search of other completed genomes was instigated to develop a more extensive database of these modular proteins containing similar features for further characterization and comparison. The complete database compiled from the analysis of more than one hundred bacterial and archaeal genomes is available online (http://cc.usu.edu/∼bbarney/homepage.htm), and a list of the genomes searched in this work is available as supplementary information. This work yielded a large variety
of proteins with vastly different characteristics, and further classification could be handled in a variety of ways. Determination of Repeat Significance. There are a number of different techniques that are used by BLAST searches for determining the significance of alignments.28-30 Several different approaches were applied to the databases generated here to assess the significance of the repeats found. The problem of bias is of great concern, especially for a protein sequence consisting primarily of repeating units, where the complexity of the sequence (here defined as the percent of various amino acid residues in the specific protein versus relative percents in all proteins as a whole), can in many cases be weighted very heavily in favor of only a few specific residues. The approach used here to assess the significance of repeats was to first determine the seven-residue “parent sequence” from each protein with the highest similarity score based on either the BLOSUM45 or BLOSUM62 similarity matrix, and then randomize the sequence and determine the similarity score of the same “parent sequence” again for the new sequence following randomization. In each case, the score was calculated using all the same parameters. Based on this analysis, performed 100 times for each sequence meeting the minimum score requirements, a Z-score was calculated. One flaw in this calculation is that the original “parent sequence” does not get scored against itself (though additional absolutely conserved seven-residue segments are scored). However, in the randomizations, the entire sequence is scored against the parental sequence. This statistical check was favored as a means of investigating sequences which might arise by random events due to low sequence diversity. The Z-score allows one to estimate the probability that a value is significantly different from another value.
In this work, the value obtained based on the parental sequence for the actual protein, and the mean and standard deviation for the scores obtained from 100 randomizations scored again versus the parental sequence are used to give a numeric value which can be used to approximate probabilities of the same score being obtained by random chance. In this manner, higher Z scores indicate a higher confidence for the significance of the repeat. From the plot shown in Figure 3, it is clear that the majority of proteins from the database generated using BLOSUM62 have a much higher Z-score than those obtained using BLOSUM45. This is largely related to the lower overall score given to similar, though not identical residues from the two matrixes, resulting in a large number of the proteins from the BLOSUM45 database being discarded following reanalysis using BLOSUM62. Though both databases contain some proteins with Z-scores below 1.0, many of these contained larger repeats, for which residues beyond the first seven were not accounted for in the statistical calculation. For this reason, the statistical test was not applied here to screen and filter out such proteins, in favor of retaining the database for further analysis once the algorithm is modified and improved upon to include other sizes between 5 and 10 residues, and once more extensive statistical tests are developed based on these improved algorithms. As an example of this point, the hypothetical protein from Pseudomonas aeruginosa (NP_250880.1, not shown) yielded a very low Z-score based only on these seven residues of the repeat segment. A simple visual inspection indicated that this protein contained larger repeats, as identified by both the RADAR13 and TRUST10 programs. When the same calculation was performed based on a 10 residue segment, the Z-score improved dramatically.
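The step from a Z-score to an approximate probability, mentioned earlier in this section, can be illustrated as follows. The text does not state which distribution was used, so this sketch assumes that the randomized scores are roughly normally distributed, which may not hold for repeat scores; the function name `upper_tail_probability` is hypothetical.

```python
import math


def upper_tail_probability(z):
    """One-sided probability that a standard normal value is at least z.

    Assumption: scores from the randomized sequences are approximately
    normal; the source does not specify the distribution used.
    """
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# A Z-score of 1.0 leaves roughly a 16% chance of an equal or higher
# score arising at random under this assumption; a Z-score of 3.0
# leaves roughly 0.1%.
for z in (1.0, 3.0):
    print(f"Z = {z:.1f}: p = {upper_tail_probability(z):.4f}")
```

Under this reading, proteins with Z-scores below 1.0 cannot be distinguished from random arrangements by the seven-residue score alone, which is consistent with the decision above to retain them for reanalysis with other repeat lengths rather than to filter them out.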
Figure 3. Statistical Z-Scores. Shown above is a curve representing the percent of proteins from the databases generated using either the BLOSUM45 (Blue) or the BLOSUM62 (Red) similarity matrix and the corresponding Z-Score calculated (for the identified seven-residue parent sequence) from 100 randomizations of the same primary sequence. The number of sequences in the current database is shown for each on the graph.
One of the primary problems associated with defining repeats, is the determination of sequence complexity. Though techniques exist to deal with such complexity, these may be highly biased and subjective. This problem is circumvented by some web-based BLAST searches by ignoring regions of low complexity through the use of filters. Attempts were made to remove proteins from these databases based on low complexity (here defined as proteins consisting of 50% or greater content of three or less specific amino acids), but visual inspection and the Z-Score determined for many proteins with low complexity supported the belief that for many of these proteins, the repeats are significant enough as to not have been formed by simple random events, and are instead the result of a modular mechanism. Examples of low complexity proteins are shown in Figure 4, and include the glycine rich RNA binding region of RNP-1 (Figure 4A), which has a total content as translated of almost 40% glycine, and nearly 58% glycine in the region shown. Many of the glycine residues in the protein occur in tandem with other runs of glycine, resulting in a protein that lacks significant complexity, and could easily be aligned in many ways to appear as though it contains repeats. The properties would make such a protein a prime target for removal by use of a filter. However, these low-complexity proteins also include the morphogenetic protein from Bacillus subtilis (Figure 4C), which contains a series of thirteen-residue and seven-residue modular repeats with very high conservation, even though nearly 59% of the total protein consists of lysine, histidine and serine residues. Histone protein H1 from Xanthomonas axonopodis (Figure 4D) and a hypothetical protein from Mycobacterium leprae (Figure 4B) are also shown, which also exhibit modular characteristics for proteins with low-complexity. 
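The low-complexity definition used above (50% or greater content of three or fewer specific amino acids) is simple to state in code. The sketch below is only an illustration of that definition; the function names are hypothetical, and the example sequences are invented for the demonstration rather than taken from the databases.

```python
from collections import Counter


def top_three_fraction(seq):
    """Fraction of the sequence made up of its three most common residues."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    top = sum(n for _, n in counts.most_common(3))
    return top / len(seq)


def is_low_complexity(seq, threshold=0.5):
    """True if three or fewer residue types account for >= threshold of seq."""
    return top_three_fraction(seq) >= threshold


# Invented examples: a glycine-rich stretch versus ten distinct residues.
print(is_low_complexity("GGGGSSAAKL"))  # 8 of 10 residues are G, S, or A
print(is_low_complexity("ACDEFGHIKL"))  # top three cover only 3 of 10
```

As the discussion above points out, a filter of this kind would also remove proteins whose repeats appear significant, so it is better treated as a flag for closer inspection than as an automatic exclusion.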
All of these repeats had high Z-scores, and would likely be much higher if scored as nine, six, or eight residue repeats. They were also identified by either the RADAR13 or TRUST10 programs, though RADAR always
Figure 4. Proteins with Low Complexity. Shown above are proteins with a high percentage of only three residues. The proteins presented are the RNA-binding region RNP-1 (RNA recognition motif) from Synechococcus sp. (NP_896112.1) (A), a hypothetical protein from Mycobacterium leprae (NP_301480.1) (B), a morphogenetic protein from Bacillus subtilis (NP_391488.1) (C), and histone H1 from Xanthomonas axonopodis (NP_643367.1) (D). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence which other segments were scored against are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.
identified these as larger repeats composed of several tandem segments and TRUST only identified a portion of the repeats identified by this algorithm. These results should not be surprising, if one again considers SmbP, the model used to develop this algorithm. Following cleavage of the leader sequence,5 more than 47% of SmbP is composed of only alanine, glutamate and histidine, and is completely devoid of six of the 20 standard residues. This simplicity is an overall theme of such a modular design for repeating sequences in proteins. The Seven Residue Repeat Unit. An initial goal of this work was to specifically investigate the utility of the seven-residue unit, and the prominence of this specifically sized repeat unit in various proteins. The simplest modular protein of seven residues would be a tandem repeat of seven residues, for which the algorithm may be, and in this work was, purposely biased. Though there were many proteins containing sets of tandem repeats, several found exhibited a sequence predominantly of seven residue repeats (Figure 5A-C), illustrating the utility of the simple seven-residue repeat. Larger repeats of 8 (Figure 5D), 9, 10, and 11 residues were also found (not shown, but easily located within the database generated here). One disappointing feature of these extensive tandem repeats is that all are
Figure 5. Extensive Tandem Proteins. Shown above are small proteins dominated by a simple tandem repeated sequence. The proteins presented are a hypothetical protein from Xylella fastidiosa (NP_297916.1) (A), a conserved hypothetical protein from Thermoanaerobacter tengcongensis (AAM23913.1) (B), a hypothetical protein from Yersinia pestis (NP_670964.1) (C), and a hypothetical protein from Xylella fastidiosa (NP_297344.1) (D). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence which other segments were scored against are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.
classified as hypothetical proteins, indicating that no known functions for these have been established, and confirmation that such proteins are actually expressed by the cells is as yet uncertain. The seven residue repeat has been found in coiled-coils17,20,21 and also in transmembrane regions from some viral proteins,22,23 and was recently shown to be utilized in a metal binding protein.5 It is likely that further functions for this repeat size will eventually be determined as well. Predominance of Repeats in Certain Species. The broadened search of genomes was performed to build a database of proteins with modular properties, and examine the extent of small modular proteins in various organisms. This analysis yielded many interesting features of these modular proteins with regard to their predominance throughout the eubacterial and archaeal worlds. In many organisms, such as Escherichia coli, proteins meeting these scoring requirements represented less than 1% of the total proteins smaller than 250 residues (16 of 1852 for Escherichia coli K-12), whereas in others, such as several species of Mycoplasma, these proteins account for 5-8% of the proteins smaller than 250 residues (16 of 287 in Mycoplasma pneumoniae, and 34 of 405 in Mycoplasma penetrans), and yet, in other species of Mycoplasma, these are far less prevalent.
For species such as Mycoplasma, which are among the smallest genome sizes and physical dimensions for any known cells, and are of great interest from the perspective of the minimum genome concept,31 the role of internal-repeat proteins is of considerable interest. Several lipoproteins with internal-repeat sequences from Mycoplasma have been shown to play a role in ciliary binding,32 and have been linked to the virulence of these strains. Virulence in some Mycoplasmas can be attenuated by successive passage in laboratory media,33-35 and extensive passes can lead to significant reductions in cilia adherence.36 Other findings for a region of tandem repeats from a 94 kDa antigen from Mycoplasma hyopneumoniae, which is purportedly not necessary for ciliary binding but will raise antibodies, have led to speculation that such repeats might serve as immune decoys.32 More recent work has linked variation in lengths of repeat regions of Mycoplasma proteins to virulence and resistance to complement killing,37,38 though it has been stated that the functional consequences of the variation in numbers of repeats remain relatively unclear,39 and the study of the role that internal-repeat regions play in pathogenicity remains a significant avenue for future study. Part of the intrigue of small genomes is linked to the evolution of Mycoplasma and similar organisms, which are believed to have evolved from more complex bacteria through the loss of genes of unnecessary functions, resulting in a minimal set of essential genes.31 Genome sequencing projects have become an invaluable source of information for understanding the required components to sustain life, and a large number of genome sequencing projects have been completed, including 10 for Mycoplasmas, or related organisms. Yet, it is obvious that the raw genomic information alone is not sufficient to fully characterize a complete genome.
This is perhaps best illustrated by examining the recently sequenced Mycoplasma hyopneumoniae, where only 44% of the predicted proteins could be assigned functions, and 18% of the hypothetical proteins found are unique to this specific species.40 Thus, unique hypothetical and conserved hypothetical proteins, including those containing internal repeats, are likely to play an important role in a genome specialized to function on a minimal set of genes. Several examples of small proteins containing internal repeats from Mycoplasma species are shown in Figure 6. All of these proteins contain extensive internal repeats, similar to the tandem repeats reported for various lipoproteins discussed above. All of the proteins presented in Figure 6 fit the requirements of a modular design using either 7 or 14 residue modules in various degrees to arrive at the final primary sequence. The specific cellular location or functions of these hypothetical proteins are difficult to ascertain with any confidence using standard signaling motifs, as many species of Mycoplasma contain abbreviated membrane protein secretory systems,40 and may contain additional, as yet uncharacterized systems of protein trafficking. Further characterization of the internal repeat proteins described here will likely require biochemical techniques. However, in view of the needs of an organism which must evade or fight off an immunologic attack from the host in which it resides, the presence of such repeats is an intriguing feature.
Figure 6. Proteins from Mycoplasma with modular features. Shown above are small proteins dominated by a simple tandem repeated sequence. The proteins presented are a hypothetical protein from Mycoplasma pneumoniae (NP_109825.1) (A), a hypothetical protein from Mycoplasma pneumoniae (NP_109788.1) (B), a conserved hypothetical protein from Mycoplasma penetrans (NP_758139.1) (C), a hypothetical protein from Mycoplasma pneumoniae (NP_109826.1) (D), a hypothetical protein from Mycoplasma pneumoniae (NP_110212.1) (E), and a conserved hypothetical protein from Mycoplasma pneumoniae (NP_110153.1) (F). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence against which other segments were scored are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.
Dilemma of Proteins with Unknown Function. A significant number of the proteins uncovered during this work fit within the category of protein singletons (proteins with no other homologous protein in the current databases), and are often labeled “unique hypothetical” or simply “hypothetical” versus “conserved hypothetical” proteins, where homologues have been found in other species. While these findings are always susceptible to change, especially in light of the many genome sequence projects underway, it is likely that many of the proteins evolved within specific species, indicating that the mechanism of modular protein design based on repeats is utilized throughout the microbial world, and can arrive at unique and functional proteins within individual species. Many of the proteins (slightly more than half of those included in the database using the BLOSUM45 matrix for similarity determinations) uncovered during this analysis are hypothetical or conserved hypothetical proteins, illustrating how little is known about the functions of proteins containing modular repeats. Like many of the proteins highlighted in this work, the model protein used to develop the algorithm had been characterized as a hypothetical protein prior to its isolation and characterization.5 It is likely that many of the hypothetical proteins described here are also actual functional proteins with a unique role in the life of the organism from which they originate. Further analysis of the proteins found here based on this modular design, focusing on the highly conserved residues of various repeats, could assist in assigning probable functions for individual proteins. This is well illustrated by the hypothetical protein YhjQ (NP_388941.1) from Bacillus subtilis, which contains a repeat with conserved cysteine or methionine residues at positions one and five (not shown). Many examples of proteins containing this repeating set of sulfur residues were found, and align together with YhjQ, including a proposed ferredoxin from Clostridium acetobutylicum (NP_346720.1) (Figure 7H). Though the level of conservation in many of these proteins was fairly low for residues besides these cysteines, the multiple alignment showed many similarities, indicating divergence from a common ancestor, while also providing a possible function for proteins such as YhjQ. The investigation of proteins from the RCSB protein data bank24 found an example of cysteines spaced in a similar manner in a modular protein, the trypsin inhibitor from barley (RCSB protein data bank file 1C2A41), though these cysteines are not involved in the binding of any small molecules, but instead are part of an extensive set of disulfide bridges.
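The idea of focusing on highly conserved residues at fixed positions of a repeat, as described for the sulfur-containing residues at positions one and five, can be illustrated with a small Python sketch. The heptads below are invented for illustration and are not the YhjQ sequence; the function name `conserved_positions` and the default residue set are likewise assumptions.

```python
def conserved_positions(segments, allowed=frozenset("CM")):
    """Return the 1-based positions at which every segment carries a residue
    from `allowed` (here cysteine or methionine)."""
    length = len(segments[0])
    if any(len(seg) != length for seg in segments):
        raise ValueError("all segments must have the same length")
    return [
        i + 1
        for i in range(length)
        if all(seg[i] in allowed for seg in segments)
    ]


# Invented seven-residue modules (NOT taken from any real protein).
heptads = ["CAKDCEG", "MGRECDA", "CSKDMQG"]
positions = conserved_positions(heptads)
print(positions)
```

For these invented modules the function reports positions 1 and 5, the same spacing pattern discussed above. A real analysis would start from the aligned repeat segments produced by the search rather than from hand-written strings.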
Known Proteins with Modular Features. The broad variety of proteins with primary sequences found to contain this modular design of small component pieces (available within the database) consists predominantly of hypothetical or conserved hypothetical proteins. Though functions are not assigned to the majority of these proteins, other proteins with known roles in the cell were also uncovered. In addition to proteins from the microbial species selected in this work, the search was also performed on nonredundant sequences from the RCSB protein data bank.24 It should be clarified that this search includes eukaryotes as well as prokaryotes, and that the search parameters for this database were broadened to include proteins of a slightly larger size (300 residues). The analysis of proteins with known functions provides information on the variety of roles for proteins containing modular repeats. In addition, for proteins where a structure is available, these repeats can provide an indication of whether they serve a structural or functional role in the individual proteins. Table 1 provides examples of proteins with known function that were identified as containing small repeats using this algorithm, including some proteins for which a structure has been solved. A review of these proteins, highlighting the repeats, illustrates that these repeats can serve a role in binding, as found for the choline binding proteins,42 or a structural role, such as in transmembrane regions, as is found in reaction centers.43-45 The repeats showed no specific preference for secondary structural elements among the many structures surveyed as part of this analysis, and were found in both α-helices and β-sheets. A more thorough discussion of all the proteins found is beyond the scope of this paper.
Versatility and Implications of Modular Proteins and Small Repeat Units.
There are a number of classifications of repeat containing proteins,2,8,9,17,46 and new families of repeats are being discovered routinely.47 In many cases, these internal repeats are large, and often referred to as domain repeats, though smaller tandem repeats of four or five residues have also been described.47 Repeat proteins have received greater attention in recent years,3,4,8,10 particularly in the role of small molecule binding.9 The search criteria described and utilized here are direct, based on several key parameters that define this classification with regard to the modular nature of simple repeating sequence proteins, and allow one to search through an entire genome for proteins satisfying these requirements. The algorithm uses the BLOSUM45 or BLOSUM62 scoring matrices,26 which were selected based on their ability to identify the repeats of the model protein. While this yielded the repeats in the desired manner, future efforts may be aimed at applying or developing other matrices that would be more specific for small modular repeat searches, or for more specific sequences (e.g., repeats containing tyrosine). The program is easily amenable to such an analysis, as it allows the user to bias the matrix for specific residues. One resulting bias of the algorithm developed in this work is that, for the score requirement to be met, the search must typically find several regions of high similarity (usually at least 4-5), whereas other algorithms might overlook this level of similarity for smaller repeating modules in favor of larger domain repeats. A survey of proteins identified using this algorithm, when analyzed using several established algorithms for analyzing repeat containing proteins,3,10,12,13 revealed some of the differences and benefits of applying multiple algorithms for the analysis of these repeat containing proteins. The primary benefit of this algorithm was in identifying the smallest basic repeat unit, whereas other algorithms often combined these smaller units into several larger single repeat units (14 versus the 7 used here).

Table 1. Examples of Proteins Identified with Known Functions

| Role | BLOSUM45 similarities26 | RCSB Protein Data Bank24 |
| --- | --- | --- |
| Proteins Binding Small Molecules | Secreted Calcium Binding Protein; Internal Calcium Binding Protein; Divalent Heavy-Metal Cation Transporter; Chromate Transport Protein; Zinc Uptake Protein; Glycerol Uptake Facilitator Protein; Choline Binding Protein; Ferric Siderophore Transport System Periplasmic Binding Protein; Various DNA Binding Proteins; Aquaporin; Permeases | Choline Binding Protein42 (RCSB pdb file 1HCX) |
| Structural Proteins | Photosynthetic Reaction Centers; Lipoproteins; Inner and Outer Spore Coat Proteins | Photosynthetic Reaction Centers43-45 (RCSB pdb file 1YST, 1DXR and 1EYS) |
| Nucleotide Associated Proteins | Histones; Ribosomal Proteins; Subunits of ATP Synthase; RNA Polymerase Subunits and Factors; ATPase; DNA Topoisomerase | |
| Proteins Associated with Environmental Responses | Antifreeze Proteins; Morphogenetic Protein; General Stress Proteins | Antifreeze Protein48 (RCSB pdb file 1L1I) |
| Catabolism and Pathogenicity | Proteases; Hemolysins; Xylanase; Urease Accessory Protein | |
| Redox Proteins | Cytochromes; Formate-Dependent Nitrite Reductase; NADH-Ubiquinone Oxidoreductase; Thiol:Disulfide Interchange Proteins | Nine-Heme Cytochrome49 (RCSB pdb file 19HC) |

For proteins to reach function in a rapid manner, even over evolutionary time scales, is a daunting task. Evidence in support of a modular mechanism in protein evolution versus simple sets of highly conserved tandem repeats (Figure 5) is provided by the existence of proteins with interchanged sets of modules of different sizes. In Figure 7, a number of proteins with various degrees and combinations of modular segments are presented. Given the vast variety of proteins that could be assembled from a sequence of 100 residues with a possibility of 20 (or more) different residues at each position, it seems logical that nature would have chosen a pathway where sequences build and expand based on templates or segments which function well for one reason or another. The small, highly conserved fragments might then be replicated throughout the protein sequence as it evolves, reaching functionality far faster than could be achieved by simple trial and error approaches, or single residue modifications. The concept of modular protein design discussed here is certainly not novel.7,10,15-17 It has been stated previously that the existence of repeating segments of residues should not be surprising.4 Though this duplication of sequences within a protein is proposed as a likely route to achieving rapid protein evolution, the extent to which this occurs in nature is difficult to assess. A goal of this work is to devise techniques to search for, and then determine the significance of, such modular features in proteins. Once identified, such modular design features could serve as templates for the design of new de novo proteins, or possibly reveal more fundamental features of the early events in evolution. Many of the proteins shown in Figure 7 reveal patterns by which modules might be combined. As the databases containing such examples expand, further lessons in how modular proteins are assembled might be revealed. The algorithm and databases described here highlight specific proteins utilizing this design structure, and could have further utility in the search for functions in unknown or hypothetical proteins, as well as in the development and design of techniques for directed evolution studies.
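To give a sense of scale for the argument about a 100-residue sequence with 20 possible residues at each position, the following Python sketch compares the unconstrained sequence space with the space of a single seven-residue module. This is an idealized comparison under a strong assumption (a protein built only from copies of one module, with no divergence between copies), and the constant names are illustrative.

```python
import math

ALPHABET_SIZE = 20    # standard amino acids
PROTEIN_LENGTH = 100  # residues, as in the example above
MODULE_LENGTH = 7     # minimal module size used in this work

# Every position chosen independently.
unconstrained = ALPHABET_SIZE ** PROTEIN_LENGTH

# Idealized modular case: only the module itself is chosen freely.
single_module = ALPHABET_SIZE ** MODULE_LENGTH

print(f"unconstrained: about 10^{math.log10(unconstrained):.1f} sequences")
print(f"one {MODULE_LENGTH}-residue module: {single_module:,} sequences")
```

The unconstrained space is on the order of 10^130 sequences, whereas a single heptad spans 1.28 billion possibilities, which is why building from a functional template shrinks the search so dramatically in this simplified picture.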
Figure 7. Modular Proteins from Various Organisms. Shown above are small proteins with a modular architecture. The proteins presented are a hypothetical protein from Clostridium tetani (NP_782132.1) (A), a hypothetical protein from Archaeoglobus fulgidus (NP_068842.1) (B), a hypothetical protein from Deinococcus radiodurans (NP_293831.1) (C), a hypothetical protein from Photorhabdus luminescens (NP_929061.1) (D), a hypothetical protein from Synechocystis (NP_440405.1) (E), a hypothetical protein from Yersinia pestis (NP_670964.1) (F), a hypothetical protein from Chromobacterium violaceum (NP_900902.1) (G), ferredoxin from Clostridium acetobutylicum (NP_346720.1) (H), a hypothetical protein from Xylella fastidiosa (NP_297971.1) (I), and a hypothetical protein from Bacillus licheniformis (YP_091925.1) (J). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence which other segments were scored against are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.
It is clear that the number of repeats or frequency of repeats that are found within a set of proteins or a full genome is highly biased by the technique used to define the repeat. A variety of algorithms are already available for fast analysis of protein sequences to find regions of repeating sequences.3,10,12,13 In this work, the extent of small repeating segments based on a minimal size of seven residues has been analyzed in a variety of eubacterial and archaeal species. Seven residue repeats have been highlighted as a feature of coiled-coils,17 and also for their role in the transmembrane regions from a number of viral proteins.22,23 The seven residue similarity comparison used here demonstrates some utility for larger repeats as well, though obvious improvements could be made. Future work will be directed at differentiating between other sizes, between 5 and 10 residue segments, to improve the scoring for these varied sizes of
repeats, and increase the confidence in the reported results, while also broadening the database of proteins further. The maximum size of proteins analyzed might also be expanded, and more rigorous testing of significance implemented. Finally, new matrices designed around improving the confidence in the significance of the repeat might also be developed. The database of proteins obtained from this work provides a valuable first step in such efforts.
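The kind of window-based similarity comparison discussed in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used in this work: the residue groupings and score values in `toy_score` are a hypothetical stand-in for the BLOSUM45 or BLOSUM62 matrices, the `bias` argument is one possible way to favor specific residues, the threshold of 25 is arbitrary, and the sequences are invented.

```python
# Hypothetical residue groupings used by the toy scorer (an assumption for
# illustration; these are NOT BLOSUM45 or BLOSUM62 values).
SIMILAR_GROUPS = ("ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG")


def toy_score(a, b, bias=None):
    """Score one residue pair: identity 5, same group 2, otherwise -1.

    `bias` optionally maps a residue to a bonus added when it is matched
    exactly, mimicking a user-supplied bias toward specific residues.
    """
    if a == b:
        return 5 + (bias.get(a, 0) if bias else 0)
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 2
    return -1


def segment_score(parent, segment, bias=None):
    """Sum the pairwise scores of a segment against the parent module."""
    return sum(toy_score(p, s, bias) for p, s in zip(parent, segment))


def find_repeats(sequence, parent, min_score, bias=None):
    """Return 0-based starts of non-overlapping windows scoring >= min_score."""
    size = len(parent)  # window size follows the parent (e.g. 5 to 10 residues)
    starts = []
    i = 0
    while i + size <= len(sequence):
        if segment_score(parent, sequence[i:i + size], bias) >= min_score:
            starts.append(i)
            i += size  # skip past the accepted module
        else:
            i += 1
    return starts


# Invented example: four diverged copies of an invented heptad.
parent = "CAKDMEG"
sequence = "MS" + "CAKDMEG" + "CGKDMEA" + "TT" + "CAKEMEG" + "CARDMDG" + "LL"
starts = find_repeats(sequence, parent, min_score=25)
print(starts, len(starts) >= 4)
```

Because the window size is taken from the length of the parent segment, the same sketch can be run with modules of 5 to 10 residues, although the score threshold would need to be rescaled for each size. Requiring at least four accepted windows mirrors the observation that several regions of high similarity are typically needed before a protein meets the score requirement.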
Acknowledgment. I wish to thank Robert Y. Igarashi and Wilson A. Francisco for reviews of this manuscript, Wilson A. Francisco for support in the preceding project that directed me to this work, Michael C. Minnotte for helpful discussions regarding statistical tests of the data generated by this algorithm, and Nita Deshpande for assistance obtaining sequences from the RCSB protein data bank.
Supporting Information Available: Specific genome sequences analyzed in this work. Databases of sequences meeting the internal repeat requirements based on the BLOSUM45 or BLOSUM62 scoring matrices, as obtained from this analysis. This material is available free of charge via the Internet at http://pubs.acs.org.
References
(1) Li, Y.-C.; Korol, A. B.; Fahima, T.; Beiles, A.; Nevo, E. Mol. Ecol. 2002, 11, 2453-2465.
(2) Soding, J.; Lupas, A. N. Bioessays 2003, 25, 837-846.
(3) Pellegrini, M.; Marcotte, E. M.; Yeates, T. O. Proteins: Struct., Funct., Genet. 1999, 35, 440-446.
(4) Wootton, J. C. Curr. Opin. Struct. Biol. 1994, 4, 413-421.
(5) Barney, B. M.; LoBrutto, R.; Francisco, W. A. Biochemistry 2004, 43, 11206-11213.
(6) Heringa, J. Curr. Opin. Struct. Biol. 1998, 8, 338-345.
(7) McLachlan, A. D. J. Mol. Biol. 1972, 64, 417-437.
(8) Marcotte, E. M.; Pellegrini, M.; Yeates, T. O.; Eisenberg, D. J. Mol. Biol. 1999, 293, 151-160.
(9) Andrade, M. A.; Perez-Iratxeta, C.; Ponting, C. P. J. Struct. Biol. 2001, 134, 117-131.
(10) Szklarczyk, R.; Heringa, J. Bioinformatics 2004, 20 Suppl 1, I311-I317.
(11) Marcotte, E. M.; Pellegrini, M.; Thompson, M. J.; Yeates, T. O.; Eisenberg, D. Nature 1999, 402, 83-86.
(12) Kreil, D. P.; Ouzounis, C. A. Bioinformatics 2003, 19, 1672-1681.
(13) Heger, A.; Holm, L. Proteins: Struct., Funct., Genet. 2000, 41, 224-237.
(14) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J. H.; Zhang, Z.; Miller, W.; Lipman, D. J. Nucleic Acids Res. 1997, 25, 3389-3402.
(15) Tripp, K. W.; Barrick, D. J. Mol. Biol. 2004, 344, 169-178.
(16) Stumpp, M. T.; Forrer, P.; Binz, H. K.; Pluckthun, A. J. Mol. Biol. 2003, 332, 471-487.
(17) Tompa, P. Bioessays 2003, 25, 847-855.
(18) Marcotte, E. M.; Eisenberg, D. Biochemistry 1999, 38, 667-676.
(19) Garnett, A. P.; Viles, J. H. J. Biol. Chem. 2003, 278, 6795-6802.
(20) Groves, M. R.; Barford, D. Curr. Opin. Struct. Biol. 1999, 9, 383-389.
(21) Lupas, A. N.; Ponting, C. P.; Russell, R. B. J. Struct. Biol. 2001, 134, 191-203.
(22) Kingsley, D. H.; Behbahani, A.; Rashtian, A.; Blissard, G. W.; Zimmerberg, J. Mol. Biol. Cell 1999, 10, 4191-4200.
(23) Netter, R. C.; Amberg, S. M.; Balliet, J. W.; Biscone, M. J.; Vermeulen, A.; Earp, L. J.; White, J. M.; Bates, P. J. Virol. 2004, 78, 13430-13439.
(24) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res. 2000, 28, 235-242.
(25) Corpet, F. Nucleic Acids Res. 1988, 16, 10881-10890.
(26) Henikoff, S.; Henikoff, J. G. Proc. Nat’l Acad. Sci. U.S.A. 1992, 89, 10915-10919.
(27) Luthy, R.; Xenarios, I.; Bucher, P. Protein Sci. 1994, 3, 139-146.
(28) Altschul, S. F.; Erickson, B. W. Mol. Biol. Evol. 1985, 2, 526-538.
(29) Lipman, D. J.; Wilbur, W. J.; Smith, T. F.; Waterman, M. S. Nucleic Acids Res. 1984, 12, 215-226.
(30) Fitch, W. M. J. Mol. Biol. 1983, 163, 171-176.
(31) Hutchison, C. A., III; Montague, M. G. Mycoplasmas and the Minimal Genome Concept. In Molecular Biology and Pathogenicity of Mycoplasmas; Razin, S., Herrmann, R., Eds.; Kluwer Academic/Plenum Publishers: New York, 2002.
(32) Wilton, J. L.; Scarman, A. L.; Walker, M. J.; Djordjevic, S. P. Microbiology 1998, 144 (Pt 7), 1931-1943.
(33) Collier, A. M.; Hu, P. C.; Clyde, W. A., Jr. Diagn. Microbiol. Infect. Dis. 1985, 3, 321-328.
(34) Zielinski, G. C.; Ross, R. F. Am. J. Vet. Res. 1993, 54, 1262-1269.
(35) Zielinski, G. C.; Ross, R. F. Am. J. Vet. Res. 1990, 51, 344-348.
(36) Zhang, Q.; Young, T. F.; Ross, R. F. Infect. Immun. 1994, 62, 1616-1622.
(37) Tu, A. H.; Clapper, B.; Schoeb, T. R.; Elgavish, A.; Zhang, J.; Liu, L.; Yu, H.; Dybvig, K. Infect. Immun. 2005, 73, 245-249.
(38) Simmons, W. L.; Denison, A. M.; Dybvig, K. Infect. Immun. 2004, 72, 6846-6851.
(39) Yogev, D.; Browning, G. F.; Wise, K. S. Genetic Mechanisms of Surface Variation. In Molecular Biology and Pathogenicity of Mycoplasmas; Razin, S., Herrmann, R., Eds.; Kluwer Academic/Plenum Publishers: New York, 2002.
(40) Minion, F. C.; Lefkowitz, E. J.; Madsen, M. L.; Cleary, B. J.; Swartzell, S. M.; Mahairas, G. G. J. Bacteriol. 2004, 186, 7123-7133.
(41) Song, H. K.; Kim, Y. S.; Yang, J. K.; Moon, J.; Lee, J. Y.; Suh, S. W. J. Mol. Biol. 1999, 293, 1133-1144.
(42) Fernandez-Tornero, C.; Lopez, R.; Garcia, E.; Gimenez-Gallego, G.; Romero, A. Nat. Struct. Biol. 2001, 8, 1020-1024.
(43) Arnoux, B.; Gaucher, J. F.; Ducruix, A.; Reiss-Husson, F. Acta Crystallogr. Sect. D Biol. Crystallogr. 1995, 51, 368-379.
(44) Lancaster, C. R.; Bibikova, M. V.; Sabatino, P.; Oesterhelt, D.; Michel, H. J. Biol. Chem. 2000, 275, 39364-39368.
(45) Nogi, T.; Fathir, I.; Kobayashi, M.; Nozawa, T.; Miki, K. Proc. Nat’l Acad. Sci. U.S.A. 2000, 97, 13561-13566.
(46) Katti, M. V.; Sami-Subbu, R.; Ranjekar, P. K.; Gupta, V. S. Protein Sci. 2000, 9, 1203-1209.
(47) Adindla, S.; Inampudi, K. K.; Guruprasad, K.; Guruprasad, L. Comp. Funct. Genomics 2004, 5, 2-16.
(48) Daley, M. E.; Spyracopoulos, L.; Jia, Z.; Davies, P. L.; Sykes, B. D. Biochemistry 2002, 41, 5515-5525.
(49) Matias, P. M.; Coelho, R.; Pereira, I. A.; Coelho, A. V.; Thompson, A. W.; Sieker, L. C.; Gall, J. L.; Carrondo, M. A. Struct. Folding Design 1999, 7, 119-130.
PR050103M