Domain Graph of Arabidopsis Proteome by Comparative Analysis

Song Liu,† Chi Zhang,† and Yaoqi Zhou*

Howard Hughes Medical Institute Center for Single Molecule Biophysics, Department of Physiology & Biophysics, State University of New York at Buffalo, 124 Sherman Hall, Buffalo, New York 14214

Received October 28, 2004

The domain graph of domains and domain combinations of Arabidopsis thaliana is established based on the Pfam 14.0 database and analyzed via comparison with 10 eukaryotic, 30 bacterial, and 16 archaeal proteomes. The comparative analysis of the domain graphs provides a useful platform for revealing global insights into the evolution of the plant kingdom. More importantly, it is a powerful tool for searching not only for possible new functions of both plant-specific and nonspecific domains via specific domain combinations in Arabidopsis thaliana but also for the functional role of unknown domains. As an example, we present the functional link between ubiquitin and Myb_DNA-binding domains via Bromodomain as plant-specific evidence for the association between transcription and ubiquitin. We further show that pentatricopeptide repeat (PPR) proteins have plant-specific links with a wide variety of domains responsible for RNA binding/metabolism, modulation of protein-protein interactions, ubiquitin conjugation, cell growth/maintenance, catalysis, and others. This further supports the recently proposed association of PPR proteins with specific RNA transcripts and defined effector proteins. Moreover, the domain graph built from tissue-specific genes is frequently associated with DNA binding domains, suggesting that the differentiation of tissue cell types is contributed mostly by tissue-specific transcriptional processes. DOGMA (DOmain Graph via coMparative analysis for Arabidopsis thaliana) is available on-line with a variety of search tools at http://theory.med.buffalo.edu/DOGMA. The database, which allows user-specified searches for plant-specific domains and their combinations, will be useful as an additional tool for annotation of the proteins that play specific roles in plants and other organisms.

Keywords: domain graph • domain combination • Arabidopsis thaliana • pentatricopeptide repeats

Introduction

Plants, one of the major eukaryotic kingdoms, have been a source of answers to many challenging environmental, food-, and health-related problems.1 This calls for a better understanding of plant biology, which, in turn, requires a detailed analysis of plant genes and their function. A widely adopted model system for plant gene identification and function annotation is the flowering weed Arabidopsis thaliana,2,3,4 owing to its small genome and rapid life cycle. The complete sequencing of the Arabidopsis thaliana genome3 found that only 10% of its genes have been characterized experimentally. Efforts are underway to better understand the rest of the genes, with an ambitious goal of establishing the functional annotation of most Arabidopsis genes by the year 2010.5 In recent work, transcription factors and the transcription process have been analyzed by genome-wide comparative analysis among eukaryotes6 and by massively parallel signature sequencing,7 respectively. These studies revealed the unexpected complexity and uniqueness of transcription regulation in Arabidopsis thaliana. Many plant-specific proteins have also been identified by phylogenetic profiling of the Arabidopsis thaliana proteome.8

* To whom correspondence should be addressed. Phone: (716) 829-2985. Fax: (716) 829-2344. E-mail: [email protected]. † These two authors contributed equally to this work.

10.1021/pr049805m CCC: $30.25

© 2005 American Chemical Society

Journal of Proteome Research 2005, 4, 435-444. Published on Web 02/05/2005

Proteins are made of structural and functional building blocks called domains. Domains are characterized as spatially distinct and compact structural units, as protein regions with assigned biological functions, or as closely similar sequence homologues.9 Together, they can be viewed as evolutionary units, and how they interact with each other determines the function of proteins.10 One effective method to uncover the function of proteins on a genome scale is to analyze the network graph of domain-domain (intraprotein and/or interprotein) interactions.11,12 For example, Wuchty showed that the interaction network has a scale-free and small-world topology for yeast and other sources.13,14 Teichmann and co-workers performed a detailed study of the architectures and combinations of domains using a structure-based domain assignment.15,16 Ye and Godzik pioneered a method called CADO (Comparative Analysis of Domain Organization) to perform comparative analysis of domain organization between proteomes.17 Structure-based or statistical learning methods for identifying domain-domain interactions have also been developed.18,19,20,21 Most of these studies, however, focus on proteomes other than those of plants.

Analysis of combinations between functional domains (i.e., functional links) could lead to testable hypotheses regarding the biological function of a gene and/or gene family. In a recent study, Goring and co-workers surveyed Arabidopsis proteins that contain armadillo (ARM) repeats.22 They discovered that a large proportion of ARM repeat proteins combine with the U-box domain,23 a domain present in a family of E3 ligases. This led to the hypothesis that ARM repeat proteins are involved in ubiquitination. The hypothesis was subsequently confirmed by a series of experiments performed by the same group. The success in identifying the function of ARM repeat proteins raises interest in a more comprehensive analysis of domains and their combinations in Arabidopsis thaliana.

In this study, we establish the domain graph of Arabidopsis thaliana and perform a comparative genomics analysis against 10 eukaryotic, 30 bacterial, and 16 archaeal proteomes. The graphs of Arabidopsis-specific and/or unknown domains and of tissue-specific genes are analyzed in detail. These analyses quantify the global view of the evolution of Arabidopsis domains and domain combinations, and suggest testable predictions for several functional links that are specific to Arabidopsis. To stimulate and facilitate further study of the Arabidopsis domain graph, we also created a database called DOGMA (DOmain Graph from coMparative analysis for Arabidopsis) with associated tools, which can be explored further either on-line or locally.

Theory, Method, and Material

Proteome-Data Source. We used 10 eukaryotic, 30 bacterial, and 16 archaeal proteomes in addition to that of Arabidopsis thaliana in this work. They were obtained from the Arabidopsis Information Resource (TAIR)24 for Arabidopsis thaliana, FlyBase25 for Drosophila melanogaster, WormBase26 for Caenorhabditis elegans, the Saccharomyces Genome Database (SGD)27 for Saccharomyces cerevisiae, the Broad Institute28 for Neurospora crassa, CandidaDB (http://genolizt.pasteur.fr/CandidaDB/) for Candida albicans, and the DOE Joint Genome Institute (http://genome.jgipsf.org/) for Fugu rubripes and Ciona intestinalis. All other proteomes were obtained from the NCBI database. The list of proteomes used here is an enlarged version of that used by Ye and Godzik17 in that we added four eukaryotic proteomes (Mus musculus, Candida albicans, Schizosaccharomyces pombe, and Neurospora crassa). For a complete list of proteomes studied here, please visit the DOGMA website (http://theory.med.buffalo.edu/DOGMA).

Protein-Domain Assignment. We use the Pfam 14.0 database, which contains 7459 domain types.29 The domains in the Arabidopsis thaliana proteome (and other proteomes) were assigned by using HMMER 2.3.2 (ref 30) to locate the proteins that are homologous to the profiles of Pfam families. We keep only the hits with an E-value less than 0.05 and a score higher than the Pfam curated cutoffs to ensure the quality of assignment.17 For domains that overlap, the less reliable one (higher E-value) was removed.

Domain-Graph Construction and Analysis. A domain graph is formally defined as an undirected graph consisting of all domains found within a given proteome set.13,17 In such a graph, each vertex (or node) represents a distinct domain, and two vertexes are linked by an edge if they occur together in at least one protein. The degree k of a vertex (domain) is the number of edges linking it to other domains, i.e., the number of its adjacent domains.
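
The graph definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation (which the paper describes as C++ code): the function names and the toy proteome are hypothetical, and each protein is assumed to be already reduced to its list of assigned Pfam domains.

```python
from itertools import combinations


def build_domain_graph(protein_domains):
    """Build an undirected domain graph as {domain: set of adjacent domains}.

    Two domains are adjacent if they co-occur in at least one protein.
    Repeated copies of a domain within one protein are counted once.
    """
    graph = {}
    for domains in protein_domains.values():
        distinct = set(domains)
        for domain in distinct:
            graph.setdefault(domain, set())
        for a, b in combinations(sorted(distinct), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree(graph, domain):
    """Degree k of a domain: the number of distinct adjacent domains."""
    return len(graph[domain])


# Hypothetical toy proteome: protein ID -> Pfam domains assigned to it.
toy_proteome = {
    "prot1": ["Pkinase", "LRR"],
    "prot2": ["Pkinase", "UBA", "LRR"],
    "prot3": ["PPR"],
    "prot4": ["F-box", "LRR", "F-box"],
}
graph = build_domain_graph(toy_proteome)
print({d: degree(graph, d) for d in sorted(graph)})
```

A single-domain protein such as the hypothetical `prot3` contributes a vertex with degree zero, so it appears in the graph without any edge.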
The clustering coefficient, $C_v$, of vertex v is defined as $C_v = 2e_v/[k(k-1)]$, where $e_v$ is the number of edges among the k neighboring domains of v. A protein domain graph shows scale-free behavior if the distribution of degrees, P(k), decays as a power law, $P(k) \approx k^{-\gamma}$.13 The shortest path between two vertexes (or domains) is the path with the smallest number of edges between the two vertexes. The diameter of a graph is the maximal length of the shortest paths over all pairs of domains in the graph. Dijkstra’s algorithm and Floyd’s algorithm31 are used to identify the shortest paths from a source vertex to all other vertexes and the shortest paths between all pairs of vertexes, respectively. Two domains are defined as disconnected if the shortest path between them is infinite, directly connected if the shortest path equals one, or indirectly connected if the shortest path is greater than one.17 In general, a domain graph is made of unconnected components. Each component is a maximally connected subgraph of the domain graph (obtained with a depth-first search algorithm). We use the CADO (Comparative Analysis of Domain Organization) methodology17 (coded by us in C++) to compare the domain graphs of different proteomes. More specifically, the method is used to identify common domain combinations (those occurring in all studied proteome sets) and specific domain combinations (those occurring in a specific proteome set but absent in all other sets).

Extraction of Tissue Specific Genes. Tissue-specific genes were extracted from the 17-base signature data (Version 09/13/2004) from the Arabidopsis thaliana MPSS project (downloaded from the Arabidopsis MPSS database32). The extraction method is that of Meyers et al.7,33 Briefly, the expressed abundance value of a gene is the sum of the normalized abundances for that gene in the class 1/2/5/7 signatures (i.e., sense strand signatures). We consider five tissues of Arabidopsis thaliana: callus (CAF), inflorescence (INF), leaves (LEF), root (ROF), and silique (SIF). A gene is tissue-specific if its expressed abundance value in one tissue is 100-fold higher than that in any of the other four. A gene whose expressed abundance value is less than 100 TPM (transcripts per million) in a given tissue is tissue-specific only if its abundance values in the other four tissues are zero TPM.
We obtained two sets of tissue-specific genes. The first set is obtained by using all signatures. This set contains 751, 717, 523, 802, and 509 distinct genes in the CAF, INF, LEF, ROF, and SIF tissues, respectively. The second set is based on significant and reliable signatures only. Moreover, each tissue-specific gene in this set must have an expressed abundance greater than 3 TPM. The second set contains 347, 401, 201, 477, and 276 distinct genes in the five respective tissues described above.
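
The tissue-specificity rule can be written down compactly. The sketch below is a hedged reading of the criteria stated above, not the procedure of Meyers et al.: it assumes that "100-fold higher" means at least 100 times the abundance in every other tissue, and the function name, parameter names, and example abundances are hypothetical.

```python
TISSUES = ("CAF", "INF", "LEF", "ROF", "SIF")


def specific_tissue(abundance, fold=100.0, low_cutoff=100.0, min_abundance=0.0):
    """Return the tissue in which a gene is tissue-specific, or None.

    abundance maps tissue code -> expressed abundance in TPM (missing = 0).
    Assumed reading of the rule:
      * value below low_cutoff: all other tissues must be exactly zero;
      * otherwise: value must be at least fold times every other tissue.
    min_abundance models the extra "> 3 TPM" filter of the second gene set.
    """
    for tissue in TISSUES:
        value = abundance.get(tissue, 0.0)
        if value <= min_abundance:
            continue
        others = [abundance.get(t, 0.0) for t in TISSUES if t != tissue]
        if value < low_cutoff:
            if all(o == 0.0 for o in others):
                return tissue
        elif all(value >= fold * o for o in others):
            return tissue
    return None


# Hypothetical abundances (TPM) for three illustrative genes.
print(specific_tissue({"ROF": 500.0, "LEF": 4.0}))
print(specific_tissue({"ROF": 500.0, "LEF": 6.0}))
print(specific_tissue({"INF": 50.0}))
```

Passing `min_abundance=3.0` mimics the additional abundance threshold used for the filtered set; the restriction to significant and reliable signatures would have to be applied upstream when the abundances are summed.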

Results

Global Properties of Domain Graph. The Arabidopsis domain graph consists of 2454 distinct domain types (vertexes) and 1277 domain combinations (edges). The graph has 10 components that have six or more vertexes. The largest component is made of 807 vertexes connected by 457 distinct edges, with a diameter of 13 edges. The top 15 most connected domains have 17 to 38 edges (Table 1). Most of them are involved in DNA/RNA binding activity.3,6 Figure 1 displays the distributions of degrees in the domain graphs of yeast, Arabidopsis, and human. The distributions of degrees for all three domain graphs follow a power-law behavior. Thus, the domain graphs for yeast, Arabidopsis, and human are scale free.13,17 Moreover, it is clear that the complexity of domain graphs (as described by degrees) increases from yeast to human. In fact, we found that the complexity of the Arabidopsis domain graph, defined by the degrees, the number of edges, and the size of the largest component, is higher than that of archaea, bacteria, and fungi but lower than that of animals, as expected (data not shown).

Evolution of Domain and Domain Combinations. Figure 2 shows the domains and domain combinations (edges) shared


Table 1. Specific Protein Domains with Top 15 Highest Degrees in the Whole Domain Graph (left) and in the Signature Domain Graph (right) of Arabidopsis thaliana Genome

whole domain graph                      signature domain graph
Pfam ID       degree   percent(a)       Pfam ID            degree   percent(a)
Pkinase       38       71.1             Pkinase            27       71.1
Helicase_C    33       15.2             PPR                21       100.0
PHD           28       39.3             F-box              15       68.2
zf-C3HC4      28       35.7             AT_hook            12       66.7
WD40          23       26.1             LRR                12       63.2
F-box         22       68.2             zf-CCCH            11       61.1
RRM_1         22       36.4             PHD                11       39.3
PPR           21       100.0            zf-C3HC4           10       35.7
LRR           19       63.2             Myb_DNA-binding    9        69.2
DEAD          19       15.8             NB-ARC(b)          8        100.0
AT_hook       18       66.7             RRM_1              8        36.4
zf-CCCH       18       61.1             Homeobox           7        100.0
zf-C2H2       17       41.2             Arm                7        87.5
UBA           17       29.4             zf-C2H2            7        41.2
Ank           17       23.5             WD40               6        26.1

(a) The percentage of domain combinations specific to Arabidopsis thaliana. (b) The current version of Pfam (14.0) treats NB-ARC and NACHT as separate domains. However, according to Inohara,62 they might not be separable using computational methods.
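
The two halves of Table 1 are linked by simple arithmetic: the degree in the signature graph is the whole-graph degree multiplied by the specific percentage (for Pkinase, 38 × 0.711 ≈ 27). A minimal sketch of how such a table could be derived is given below; the function name `hub_table` and the toy edge lists are hypothetical, and edges are assumed to be stored as unordered pairs of Pfam IDs.

```python
def hub_table(edges, specific_edges, top=15):
    """Rank domains by degree and report how many of their edges are specific.

    edges and specific_edges are sets of frozenset({domain_a, domain_b}).
    Returns rows of (domain, degree, specific degree, percent specific).
    """
    degree, specific = {}, {}
    for edge in edges:
        for domain in edge:
            degree[domain] = degree.get(domain, 0) + 1
            if edge in specific_edges:
                specific[domain] = specific.get(domain, 0) + 1
    rows = []
    for domain, k in degree.items():
        s = specific.get(domain, 0)
        rows.append((domain, k, s, round(100.0 * s / k, 1)))
    # Highest degree first; ties broken alphabetically for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:top]


# Hypothetical toy data: all edges of one proteome and its specific subset.
all_edges = {frozenset(p) for p in [
    ("PPR", "DEAD"), ("PPR", "LRR"), ("Pkinase", "LRR"),
    ("Pkinase", "UBA"), ("Pkinase", "DEAD"),
]}
specific = {frozenset(p) for p in [
    ("PPR", "DEAD"), ("PPR", "LRR"), ("Pkinase", "UBA"),
]}
rows = hub_table(all_edges, specific)
for row in rows:
    print(row)
```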

Figure 1. Distribution of degrees in the domain graph of Yeast, Arabidopsis, and Human as labeled. The near linear curve of the log-log plot indicates a power-law distribution.
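
One rough way to quantify the near-linear log-log behavior is an ordinary least-squares fit of log P(k) against log k, whose negative slope estimates γ. The sketch below is an illustration only: the paper does not state how (or whether) γ was fitted, the function names are hypothetical, and the degree list is synthetic.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Empirical P(k) over vertexes with k > 0, as {k: fraction of vertexes}."""
    counts = Counter(k for k in degrees if k > 0)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


def fit_power_law(pk):
    """Least-squares line through (log10 k, log10 P(k)).

    Returns (gamma, intercept), where gamma is minus the fitted slope.
    Needs at least two distinct degrees.
    """
    xs = [math.log10(k) for k in pk]
    ys = [math.log10(p) for p in pk.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x


# Synthetic degrees whose counts follow 400 / k**2 exactly.
synthetic = [1] * 400 + [2] * 100 + [4] * 25 + [5] * 16 + [10] * 4 + [20] * 1
gamma, intercept = fit_power_law(degree_distribution(synthetic))
print(round(gamma, 3))
```

A straight-line fit on binned log-log data is known to be a crude estimator for real degree sequences, so the result should be read as a descriptive slope rather than a rigorous test of scale-free behavior.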

Figure 2. Percentage of the domains (left bars) and domain combinations (right bars) that are shared (black) and specific (white) between Arabidopsis thaliana and the kingdoms as labeled (Arch - Archaeal kingdom, Bact - Bacterial kingdom, Fungi - Fungus kingdom, Animal - Animal kingdom, Euka - Eukaryotic kingdom, All - all genomes except Arabidopsis thaliana). The calculated percentage is based on Arabidopsis thaliana.
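
The shared and specific percentages in Figure 2 reduce to set operations once each proteome is represented by its set of domains or domain combinations. The following sketch assumes that an item counts as shared with a kingdom if it occurs in at least one proteome of that kingdom; that reading, the function names, and the toy sets are assumptions, and this is not the CADO code itself.

```python
def shared_and_specific(reference, others):
    """Split reference items into (shared, specific).

    reference: set of domains or domain combinations of the reference proteome.
    others: list of such sets, one per compared proteome.
    An item is shared if it occurs in at least one compared proteome.
    """
    union = set().union(*others)
    return reference & union, reference - union


def percent(part, whole):
    """Percentage of `whole` covered by `part` (0.0 for an empty whole)."""
    return 100.0 * len(part) / len(whole) if whole else 0.0


# Hypothetical domain combinations, stored as alphabetically ordered pairs.
arabidopsis = {("Bromodomain", "ubiquitin"), ("LRR", "Pkinase"),
               ("DEAD", "Helicase_C"), ("DEAD", "PPR")}
fungi = [
    {("LRR", "Pkinase")},
    {("DEAD", "Helicase_C"), ("Ank", "WD40")},
]
shared, specific = shared_and_specific(arabidopsis, fungi)
print(percent(shared, arabidopsis), percent(specific, arabidopsis))
```

The same two functions work unchanged on sets of single domains, which is how the left and right bars of the figure could both be produced from one routine.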

Figure 3. Percentage of all domain combinations (left bars) and Arabidopsis-specific domain combinations (right bars) that are between shared domains of Arabidopsis thaliana and the kingdom (black), between shared domains and Arabidopsis-specific domains (striped bar), and between Arabidopsis-specific domains (white), as labeled. The calculated percentage is based on Arabidopsis thaliana.

between Arabidopsis and other organisms. The fraction of shared domains and domain combinations increases as the biological complexity increases from archaea and bacteria to fungi and animals. It is of interest to note that the fraction of shared domains changes by only 4.7% from fungi to animals, whereas the corresponding change in domain combinations (11.3%) is significantly greater. This probably reflects the fact that the total number of types of domain combinations increases faster than the total number of domain types from fungi to plants/animals. There are 10.1% of Arabidopsis domains and 30.3% of domain combinations that are not found in any of the other proteomes studied here. The fractions of domain combinations between shared domains, between shared domains and Arabidopsis-specific domains, and between Arabidopsis-specific domains for a given compared organism type are shown in Figure 3. The fraction of all domain combinations in Arabidopsis that are between Arabidopsis-specific domains decreases drastically from 31.9% when compared to archaea to 5.5% when compared to animals. The fraction of Arabidopsis-specific domain combinations that are between Arabidopsis-specific domains also decreases drastically, from 41.3% (archaea) to 13% (animals). In contrast, the fractions of domain combinations (either all combinations or Arabidopsis-specific combinations) involving the domains shared by Arabidopsis and fungi and the domains shared by Arabidopsis and animals are both high and in a similar range. Thus, the overwhelming majority of the Arabidopsis domain


Figure 4. Largest component of the Arabidopsis thaliana “signature” domain graph. Only the domain names of the 15 highest-degree domains are shown. The ubiquitin domain and Bromodomain are also shown in dashed frames for clarity (see text for details). The graph was drawn with the dot program in the Graphviz package.
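
Extracting the largest component and its diameter can be sketched as follows. The components are found with an iterative depth-first search, as in the Methods; for the diameter, this sketch substitutes a breadth-first search from every vertex for the Dijkstra and Floyd algorithms named in the paper, which gives the same shortest-path lengths when every edge has unit length. The function names and the toy graph are hypothetical.

```python
from collections import deque


def components(graph):
    """Connected components of {vertex: set(neighbors)}, largest first."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:  # iterative depth-first search
            vertex = stack.pop()
            component.add(vertex)
            for neighbor in graph[vertex]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        result.append(component)
    return sorted(result, key=len, reverse=True)


def eccentricity(graph, source):
    """Longest shortest-path length (in edges) from source, via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph[vertex]:
            if neighbor not in dist:
                dist[neighbor] = dist[vertex] + 1
                queue.append(neighbor)
    return max(dist.values())


def diameter(graph, component):
    """Maximal shortest-path length over all vertex pairs of one component."""
    return max(eccentricity(graph, vertex) for vertex in component)


# Hypothetical toy graph: a path A-B-C-D, a pair E-F, and an isolated vertex G.
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"},
       "E": {"F"}, "F": {"E"}, "G": set()}
parts = components(toy)
print([sorted(p) for p in parts], diameter(toy, parts[0]))
```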

combinations (even those specific to Arabidopsis) are a result of recombination of domains already present in the eukaryotic kingdoms. Quantitatively, only 6.2% of the 387 domain combinations specific to Arabidopsis are combinations between domains that are not present in the other proteomes studied here. This presents further evidence that recombination between common families (domains), rather than the invention of new families and their recombination, plays a major role in kingdom- and/or species-specific function.16

Hub domains (or multidomain connectors) are defined as protein domains with a high degree of connection to other protein domains.10,34 Table 1 (left) shows the 15 protein domains of Arabidopsis that have the highest degrees. While most of these high-degree domains also exist in other proteomes, many of their connections to other domains (domain combinations) are specific to Arabidopsis. The percentage of Arabidopsis-specific domain combinations involving the top 15 high-degree domains ranges from 15.2% for Helicase_C to 100% for pentatricopeptide repeat (PPR). This suggests that it is the connections that are retooled for specific functional needs in plants. In other words, the dominant mechanism of plant protein repertoire evolution is the duplication and recombination of the limited domain families available.35,10 Because domain combinations are less “conserved” than domains themselves, we will focus next on the 387 domain combinations specific to Arabidopsis.

Arabidopsis “Signature” Graph. The 387 Arabidopsis-specific domain combinations (absent in all other proteomes studied here) involve 363 domain types. They can be called the “signature” domain organization17 of Arabidopsis. The giant component (the largest subgraph unconnected to other parts of the domain graph) of this “signature” graph contains 305 domain combinations and 230 domains, as shown in Figure 4. The domains with the highest number of domain combinations in the “signature” domain organization (hub domains with a high “signature percentage”) are shown in Table 1 (right). Most of these domains are related to DNA/RNA binding activity and are involved in processes such as signal transduction and transcriptional regulation. That is, plants have developed a specific transcriptional regulation system to inhabit a complex and diverse environment.3,6 For example, the combinations involving Homeobox, a well-conserved domain with transcription factor activity in eukaryotes,36 are 100% (7/7) Arabidopsis-specific. The “signature” domain organization has many interesting domain combinations deserving further study. For example, there is a plant-specific connection between the ubiquitin system and transcriptional regulation via Bromodomain. The details are displayed in Figure 5. The ubiquitin system is responsible for the


Figure 5. Domain combinations involved in Myb_DNA-binding-Bromodomain-ubiquitin in the largest component of the Arabidopsis thaliana “signature” domain graph. Only domains with a direct connection to the above three domains in the “signature” graph are shown.

Table 2. Function Classification of the Domains Directly Linked to PPR

functional classes: RNA(a), organelle(b), PPI(c), catalytic(d), ubiquitin(e), others(f)

Pfam ID: RRM_1, DEAD, Helicase_C, zf-C3HC4, Cu_bind_like, LRR, CRAL_TRIO, UDPGT, Gb3_synth, CBS, P450, Pkinase, PGAM, Gly_transf_sug, Ppx-GppA, ubiquitin, F-box, Exostosin, BRCT, LAGLIDADG, DUF247

+ + + +

+ + + + + + + + + + + + + + + + + +

(a) RNA binding and/or RNA metabolism. (b) Organelle specific process. (c) Modulate protein-protein interactions. (d) Involved in catalytic process. (e) Ubiquitin-conjugation system. (f) BRCT and LAGLIDADG are DNA binding related and Exostosin is involved in cell growth and/or maintenance; the function of DUF247 is unknown.

selective and regulated degradation of diverse proteins in the cell.37 The Myb-like DNA-binding domain, a member of the SANT domain family that specifically recognizes the motif YAAC(G/T)G, plays a role in transcriptional regulation.38,39 Bromodomain, a domain 110 amino acids long, is found in many chromatin-associated proteins.40 While the precise function of Bromodomain remains unclear, it may play a role in the assembly or activity of multicomponent complexes involved in transcriptional activation.41 This signature graph not only provides plant-specific evidence for the connection between the ubiquitin system and transcriptional regulation, but also suggests a specific new role played by Bromodomain in plants (see Discussion).

In another example, the signature graph presents a new way to analyze the functional role played by Arabidopsis PPR (pentatricopeptide repeat) proteins. The 21 domain combinations in which PPR is involved are all Arabidopsis-specific (Table 1). A “tabulated” domain graph for PPR and its immediate neighbors is shown in Table 2. On the basis of the graph, the PPR domain combines with a series of RNA binding domains, which include the helicase domains Helicase_C and DEAD involved in RNA metabolism,42,43 the zinc finger zf-C3HC4 involved in nucleic acid binding,17 and the RNA recognition motif RRM_1, which is the diagnostic domain of an RNA binding protein.44 The PPR domain also couples with the Cu_bind_like domain (plant chloroplastic plastocyanins45), the LRR domain (involved in modulating protein-protein interactions46), and many catalytic domains such as UDPGT,47 Gb3_synth,48 and CBS.49 These domain combinations confirm the established association (direct or indirect) of the PPR domain with specific RNA sequences and with other effector proteins in plant organelles from a genome-wide survey.50 The combinations of the PPR domain with ubiquitin and F-box, however, suggest a possible additional role


Figure 6. Largest component of Arabidopsis thaliana domain graph involving “unknown” domains. Only nearest neighbors to “unknown” domains and their connections are shown for clarity.
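
The neighbor lookup behind this kind of annotation can be sketched as a filter over the domain graph. The sketch assumes that unknown domains can be recognized by a DUF or UPF prefix followed by digits in the Pfam ID, which is a naming-convention assumption rather than something the paper specifies; the function name and every identifier in the toy graph are hypothetical.

```python
import re

# Assumed naming convention: DUF or UPF followed by digits marks an unknown domain.
UNKNOWN = re.compile(r"^(DUF|UPF)\d+")


def unknown_domain_links(graph):
    """Map each unknown domain to its sorted neighbors of known function.

    graph: {domain: set of adjacent domains}. Unknown domains with no
    known-function neighbor are omitted, since they offer no functional hint.
    """
    links = {}
    for domain, neighbors in graph.items():
        if UNKNOWN.match(domain):
            known = sorted(n for n in neighbors if not UNKNOWN.match(n))
            if known:
                links[domain] = known
    return links


# Hypothetical toy graph; the DUF/UPF numbers are invented for illustration.
toy_graph = {
    "DUF9001": {"RRM_1", "DEAD"},
    "RRM_1": {"DUF9001"},
    "DEAD": {"DUF9001"},
    "UPF9002": {"DUF9003"},
    "DUF9003": {"UPF9002"},
    "DUF9004": set(),
}
links = unknown_domain_links(toy_graph)
print(links)
```

In this toy case the only candidate annotation is for the invented `DUF9001`, whose neighbors are both RNA-related; a pair of unknown domains linked only to each other yields no hint.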

in ubiquitin-conjugating systems.51 Other combinations of PPR involve cell growth/maintenance domains such as Exostosin.52

Annotation of Arabidopsis “Unknown” Domains Using Domain Graph. While the domain graph was shown above to reveal plant-specific new functions of known domains, it may also be useful for discovering the functional role of unknown domains.17 We found that there are 2202 Arabidopsis genes containing at least one “unknown” domain [domain of unknown function (DUF) or uncharacterized protein family (UPF)]. Most of these genes belong to gene families with unknown function. The unknown domains comprise 319 different types of Pfam “unknown” domains, 46 of which have at least one nearest neighbor (i.e., functional link) in the domain graph of Arabidopsis. The functional links provide hints about the possible function of the domains and their corresponding genes. Figure 6 shows the largest component of the Arabidopsis domain graph that involves unknown domains. In this figure, domain DUF283 is connected exclusively to DNA/RNA binding domains. Thus, the domain is most likely involved in the transcriptional process. One can also infer the functional role of other Arabidopsis unknown domains via the domain combinations that exist in other species. Indeed, there are 31 additional unknown domain types of Arabidopsis that have at least one functional link in the other proteomes studied here (data not shown but available on DOGMA online).

Domain and Domain Graph of Tissue-Specific Genes. The differences between tissues can be analyzed based on the differences in the Pfam domains and domain combinations occurring in different tissues (in both tissue-specific and nonspecific genes). By surveying expressed genes (from the MPSS project,32 data not shown), we found that less than 1% of domains and/or combinations are unique to each tissue. Thus, there is no significant difference in transcriptional complexity among plant tissues, as already suggested by Meyers et al., who determined the overlap in transcript abundance among these tissue libraries.7 Tissue-specific genes, although few, do exist. They were extracted following the procedure described in the Methods


Figure 7. Percentage of the domains (left bars) and domain combinations (right bars) of tissue-specific genes that are shared by more than one tissue (black) and specific to the corresponding tissue (white). CAF - callus, INF - inflorescence, LEF - leaves, ROF - roots, SIF - silique. The percentages are calculated for each tissue. The two graphs are based on different criteria for tissue-specific genes (the full set on the left and the filtered set on the right; see Methods for details).

Table 3. Protein Domains that Are Shared by All of the Five Tissue-Specific Gene Sets of Arabidopsis thaliana(a)

Pfam ID            no. genes(b)   functional annotation(c)
Pkinase            73             signal transduction
LRR                54             modulate protein-protein interaction
F-box              34             ubiquitin-conjugating process
p450               31             oxidative degradation process
Myb_DNA-binding    28             DNA binding
PPR                22             organelle biogenesis
HLH                20             DNA binding
zf-C3HC4           19             DNA and/or RNA binding
AP2                15             DNA binding
Lipase_GDSL        13             lipolytic enzymes
zf-C2H2            12             DNA and/or RNA binding
efhand             11             signal transduction; buffering/transport
MATH               10             tissue-specific metalloendopeptidases; receptor interaction
XET_C              9              cell wall formation
RRM_1              9              RNA binding
Glyco_hydro_16     9              carbohydrate metabolism
B_lectin           7              sugar binding
SRF-TF             6              DNA binding

(a) The more reliable filtered sets are used (see Methods section for details). (b) The total number of tissue-specific genes with these “enriched” domains. (c) Functional annotations were extracted from the Pfam database.
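
A table of this kind can be derived by intersecting the domain sets of the tissue-specific gene sets and then counting genes per shared domain. The sketch below is a plausible reconstruction under assumptions, not the authors' procedure: it assumes each tissue-specific gene belongs to exactly one tissue set, and the function name, gene IDs, and domain assignments are hypothetical.

```python
from collections import Counter


def shared_domains(tissue_genes):
    """Domains present in every tissue-specific gene set, with gene counts.

    tissue_genes: {tissue: {gene ID: [Pfam domains]}}.
    Returns [(domain, number of tissue-specific genes carrying it)],
    sorted by decreasing count and then alphabetically.
    """
    per_tissue = [
        {domain for domains in genes.values() for domain in domains}
        for genes in tissue_genes.values()
    ]
    common = set.intersection(*per_tissue) if per_tissue else set()
    counts = Counter()
    for genes in tissue_genes.values():
        for domains in genes.values():
            counts.update(set(domains) & common)  # count each gene once
    return sorted(((d, counts[d]) for d in common), key=lambda r: (-r[1], r[0]))


# Hypothetical tissue-specific gene sets for the five tissues.
toy_sets = {
    "CAF": {"g1": ["Pkinase", "LRR"]},
    "INF": {"g2": ["Pkinase"], "g3": ["LRR", "MATH"]},
    "LEF": {"g4": ["Pkinase", "LRR"]},
    "ROF": {"g5": ["LRR"], "g6": ["Pkinase"]},
    "SIF": {"g7": ["Pkinase", "LRR", "PPR"], "g8": ["Pkinase"]},
}
table = shared_domains(toy_sets)
print(table)
```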

section. Figure 7 shows the domains and domain combinations for a given tissue that are shared with one or more other tissues. Depending on the criteria used for defining tissue-specific genes, 64-73% (or 54-66%) of domains and 45-58% (or 34-52%) of domain combinations are shared by more than one tissue. Thus, tissue specificity results not only from the specific domains and combinations occurring in a specific tissue, but also from the shared domains and their combinations. To better understand the role played by domains shared by all five tissue-specific gene sets, the shared domains are listed in Table 3. A major portion of these domains are transcription factors (6/18). An interesting example is the SRF-TF domain (i.e., MADS box). Various experiments have shown that human serum response factor (SRF) is a ubiquitous nuclear protein important for cell proliferation and differentiation.53 The same transcription factor family was also found to be predominantly involved in plant development.54 Recent genome-scale analysis suggested that this transcription factor family can be further divided into five subgroups, most of which (84%) are of unknown function.55 Here, the tissue-specific expression of different members (genes) of the same transcription factor family (also observed in a recent large-scale ORFeome cloning and analysis of Arabidopsis transcription factors56) indicates that members of the plant SRF-TF family might play different roles in the proliferation and differentiation of the different tissue cell types in which they are specifically expressed. Another possibility is that different cell types have employed different members of the same transcription factor family for proliferation and differentiation. The types of common domains shown in Table 3 are consistent with genetics experiments and mammalian analyses showing that DNA-binding transcription factors play a key role in specifying different cell types.57,58 Other common functional “tools” used by plant tissue-specific genes are domains for signal transduction, protein-protein interaction, cell-wall formation, protein degradation, and others.

The “signature” domain graph of each tissue-specific gene set is shown in Figure 8. For each tissue, both the number of tissue-specific domain combinations and the degrees in the largest components are small owing to the small number of tissue-specific genes. This suggests that tissue specificity results mostly from the interplay between tissue-specific genes and commonly expressed genes. Some “signature” combinations are related to their special role in the specific tissue in which they are expressed. For example, the tissue-specific link between Chal_sti_synt_N and Chal_sti_synt_C in flower tissue is consistent with its role in the biosynthesis of anthocyanin pigments in flowering plants.59


Figure 8. “Signature” domain graph for each of the tissue-specific gene sets (filtered set) of Arabidopsis thaliana. Only connected domains and their connections are shown.

However, the major components of the “signature” domain combinations are related to general DNA/RNA binding processes, indicating again the essential role of transcriptional control in the differentiation of specific tissues.


Discussion

Functional domains and domain combinations of Arabidopsis are obtained and compared to those of other proteomes. Our analysis found that 10.1% and 30.3% of Arabidopsis domains and domain combinations, respectively, are absent in all other proteomes studied here. Only 7.6% of all Arabidopsis domain combinations, and 25.1% of Arabidopsis-specific domain combinations, involve Arabidopsis-specific domain types. Thus, the major evolutionary mechanism of the Arabidopsis proteome, as in other species, is recombination, duplication, and divergence of the limited domain families available. A similar conclusion was also obtained from structure-based analysis of structural domain combinations in archaeal, bacterial, and eukaryotic proteomes.15,16

Recent studies suggested that the ubiquitin system and gene regulation, although seemingly unrelated, might be connected with each other.60 This hypothesis was supported by Ye and Godzik,17 who obtained the functional linkage between (Zinc/Ring) finger domains and ubiquitination domains via the “signature” graph common to all eukaryotes. Our analysis provides further evidence for this hypothesis. We found that for Arabidopsis, there is a special link between the ubiquitin system and gene regulation; i.e., in the Arabidopsis “signature” domain graph, Myb_DNA-binding is linked with ubiquitin via Bromodomain. This also suggests that Bromodomain might play a role in the protein-protein interactions related to ubiquitin conjugation, in addition to its role in transcriptional regulation.

Genome-wide analysis estimated that around one-third of Arabidopsis genes have nothing in common with those in other kingdoms.3 The largest and most mysterious family of such genes is the PPR gene family.50,61 Thus, it is difficult to gain functional insight into this family from other kingdoms.
A recent genome-wide functional study of Arabidopsis PPR genes (based on gene expression, gfp/rfp fusion, insertion mutants, RNA binding assays, subcellular location, and sequence analysis) indicated that they might play an essential and constitutive role in organelle biogenesis via binding to organellar transcripts.50 Notably, the domain graph for PPR not only confirms this basic role found in experiments, but also suggests a possible association with ubiquitin-conjugating systems and with other processes such as cell growth and/or maintenance.

A high proportion of domains and their combinations occur in more than one tissue-specific gene set. This indicates that tissue-specific genes use common functional tools to perform cell-specific tasks. A major portion of these domains are transcription factors. Other common functional “tools” used by tissue-specific genes are domains for signal transduction, protein-protein interaction, cell-wall formation, protein degradation, and others. On the other hand, the number of tissue-specific domain combinations for tissue-specific genes is small. The tissue-specific domain graphs are made of many small components with a few domains and domain combinations, owing to the small number of tissue-specific genes. This suggests that the differentiation of tissue cell types is contributed mostly by the interplay between tissue-specific genes and commonly expressed genes.

Acknowledgment. We gratefully thank Professor Blake C. Meyers and Dr. Tam H. Vu for discussion of the usage of tissue-specific genes resulting from the Arabidopsis MPSS project, and Dr. Margarita Garcia for information on the usage of the TAIR database. This work was supported by NIH (R01 GM 966049 and R01 GM 068530), a grant from HHMI to SUNY Buffalo, and by the Center for Computational Research and the Keck Center for Computational Biology at SUNY Buffalo. Y.Z. was also supported in part by a two-base grant from the National Science Foundation of China.

References
(1) Minorsky, P. V. Achieving the in silico plant. Systems biology and the future of plant biological research. Plant Physiol. 2003, 132, 404-409.
(2) Meinke, D. W.; Cherry, J. M.; Dean, C.; Rounsley, S. D.; Koornneef, M. Arabidopsis thaliana: A model plant for genome analysis. Science 1998, 282, 662-665.
(3) Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408, 796-815.
(4) Poethig, R. S. Life with 25 000 genes. Genome Res. 2001, 11, 313-316.
(5) Ausubel, F. M. Summaries of National Science Foundation-sponsored Arabidopsis 2010 projects and National Science Foundation-sponsored plant genome projects that are generating Arabidopsis resources for the community. Plant Physiol. 2002, 129, 394-437.
(6) Riechmann, J. L.; Heard, J.; Martin, G.; Reuber, L.; Jiang, C. et al. Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 2000, 290, 2105-2110.
(7) Meyers, B. C.; Vu, T. H.; Tej, S. S.; Ghazal, H.; Matvienko, M.; Agrawal, V.; Ning, J. C.; Haudenschild, C. D. Analysis of the transcriptional complexity of Arabidopsis by massively parallel signature sequencing. Nat. Biotechnol. 2004, 22, 1006-1011.
(8) Gutiérrez, R. A.; Green, P. J.; Keegstra, K.; Ohlrogge, J. B. Phylogenetic profiling of the Arabidopsis thaliana proteome: What proteins distinguish plants from other organisms? Genome Biol. 2004, 5, R53.
(9) Ponting, C. P.; Russell, R. R. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 2002, 31, 45-71.
(10) Vogel, C.; Bashton, M.; Kerrison, N. D.; Chothia, C.; Teichmann, S. A. Structure, function and evolution of multidomain proteins. Curr. Opin. Struct. Biol. 2004, 14, 208-216.
(11) Diestel, R. Graph Theory; Springer-Verlag: New York, 2000.
(12) Barabási, A. L.; Oltvai, Z. N. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. 2004, 5, 101-113.
(13) Wuchty, S. Scale-free behavior in protein domain networks. Mol. Biol. Evol. 2001, 18, 1694-1702.
(14) Wuchty, S. Interaction and domain networks of yeast. Proteomics 2002, 2, 1715-1723.
(15) Apic, G.; Gough, J.; Teichmann, S. A. An insight into domain combinations. Bioinformatics 2001, 17, S83-89.
(16) Apic, G.; Gough, J.; Teichmann, S. A. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 2001, 310, 311-325.
(17) Ye, Y. Z.; Godzik, A. Comparative analysis of protein domain organization. Genome Res. 2004, 14, 343-353.
(18) Aloy, P.; Russell, R. B. Interrogating protein interaction networks through structural biology. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 5896-5901.
(19) Aloy, P.; Russell, R. B. The third dimension for protein interactions and complexes. Trends Biochem. Sci. 2002, 27, 633-638.
(20) Deng, M.; Mehta, S.; Sun, F.; Chen, T. Inferring domain-domain interactions from protein-protein interactions. Genome Res. 2002, 12, 1540-1548.
(21) Aloy, P.; Ceulemans, H.; Stark, A.; Russell, R. B. The relationship between sequence and interaction divergence in proteins. J. Mol. Biol. 2003, 332, 989-998.
(22) Mudgil, Y.; Shiu, S.-H.; Stone, S. L.; Salt, J. N.; Goring, D. R. A large complement of the predicted Arabidopsis ARM repeat proteins are members of the U-Box E3 ubiquitin ligase family. Plant Physiol. 2004, 134, 59-66.
(23) Aravind, L.; Koonin, E. V. The U box is a modified RING finger - a common domain in ubiquitination. Curr. Biol. 2000, 10, 132-134.
(24) Rhee, S. Y.; Beavis, W.; Berardini, T. Z.; Chen, G.; Dixon, D. et al. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003, 31, 224-228.
(25) The FlyBase Consortium. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 2003, 31, 172-175.
(26) Harris, T. W.; Chen, N. S.; Cunningham, F.; Tello-Ruiz, M.; Antoshechkin, I. et al. WormBase: a multi-species resource for nematode biology and genomics. Nucleic Acids Res. 2004, 32, D411-D417.

Journal of Proteome Research • Vol. 4, No. 2, 2005 443

(27) Dolinski, K.; Balakrishnan, R.; Christie, K. R.; Costanzo, M. C.; Dwight, S. S. et al. “Saccharomyces Genome Database” ftp://ftp.yeastgenome.org/yeast/, 2004.
(28) Galagan, J. E. et al. The genome sequence of the filamentous fungus Neurospora crassa. Nature 2003, 422, 859-868.
(29) Bateman, A.; Coin, L.; Durbin, R.; Finn, R. D.; Hollich, V.; Griffiths-Jones, S.; Khanna, A.; Marshall, M.; Moxon, S.; Sonnhammer, E. L. L.; Studholme, D. J.; Yeats, C.; Eddy, S. R. The Pfam protein families database. Nucleic Acids Res. 2004, 32, D138-D141.
(30) Eddy, S. R. Profile hidden Markov models. Bioinformatics 1998, 14, 755-763.
(31) Cormen, T. H.; Leiserson, C. E.; Rivest, R. L. Introduction to Algorithms; MIT Press: Cambridge, 1990.
(32) Meyers, B. C.; Lee, D. K.; Vu, T. H.; Tej, S. S.; Edberg, S. B.; Matvienko, M.; Tindell, L. D. Arabidopsis MPSS: an online resource for quantitative expression analysis. Plant Physiol. 2004, 135, 801-813.
(33) Meyers, B. C.; Tej, S. S.; Vu, T. H.; Haudenschild, C. D.; Agrawal, V.; Edberg, S. B.; Ghazal, H.; Decola, S. The use of MPSS for whole-genome transcriptional analysis in Arabidopsis. Genome Res. 2004, 14, 1641-1653.
(34) Koonin, E. V.; Wolf, Y. I.; Karev, G. P. The structure of the protein universe and genome evolution. Nature 2002, 420, 218-223.
(35) Chothia, C.; Gough, J.; Vogel, C.; Teichmann, S. A. Evolution of the protein repertoire. Science 2004, 300, 1701-1703.
(36) Gehring, W. J. The homeobox in perspective. Trends Biochem. Sci. 1992, 17, 277-280.
(37) Hershko, A.; Ciechanover, A.; Varshavsky, A. The ubiquitin system. Nat. Med. 2000, 6, 1073-1081.
(38) Biedenkapp, H.; Borgmeyer, U.; Sippel, A. E.; Klempnauer, K. H. Viral myb oncogene encodes a sequence-specific DNA-binding activity. Nature 1988, 335, 835-837.
(39) Aasland, R.; Stewart, A. F.; Gibson, T. The SANT domain: a putative DNA-binding domain in the SWI-SNF and ADA complexes, the transcriptional co-repressor N-CoR and TFIIIB. Trends Biochem. Sci. 1996, 21, 87-88.
(40) Haynes, S. R.; Dollard, C.; Winston, F.; Beck, S.; Trowsdale, J.; Dawid, I. B. The bromodomain: a conserved sequence found in human, Drosophila and yeast proteins. Nucleic Acids Res. 1992, 20, 2063-2063.
(41) Tamkun, J. W. The role of brahma and related proteins in transcription and development. Curr. Opin. Genet. Dev. 1995, 5, 473-477.
(42) de la Cruz, J.; Kressler, D.; Linder, P. Unwinding RNA in Saccharomyces cerevisiae: DEAD-box proteins and related families. Trends Biochem. Sci. 1999, 24, 192-198.
(43) Aubourg, S.; Kreis, M.; Lecharny, A. The DEAD box RNA helicase family in Arabidopsis thaliana. Nucleic Acids Res. 1999, 27, 628-636.
(44) Bandziulis, R. J.; Swanson, M. S.; Dreyfuss, G. RNA-binding proteins as developmental regulators. Genes Dev. 1989, 3, 431-437.
(45) Greene, E. A.; Erard, M.; Dedieu, A.; Barker, D. G. MtENOD 16 and 20 are members of a family of phytocyanin-related early nodulins. Plant Mol. Biol. 1998, 36, 775-783.
(46) Kobe, B.; Deisenhofer, J. The leucine-rich repeat: a versatile binding motif. Trends Biochem. Sci. 1994, 19, 415-421.


(47) Sutter, A.; Grisebach, H. UDP-glucose: flavonol 3-O-glucosyltransferase from cell suspension cultures of parsley. Biochim. Biophys. Acta 1973, 309, 289-295.
(48) Keusch, J. J.; Manzella, S. M.; Nyame, K. A.; Cummings, R. D.; Baenziger, J. U. Cloning of Gb3 synthase, the key enzyme in globoseries glycosphingolipid synthesis, predicts a family of alpha 1,4-glycosyltransferases conserved in plants, insects, and mammals. J. Biol. Chem. 1997, 275, 25315-25321.
(49) Bateman, A. The structure of a domain common to archaebacteria and the homocystinuria disease protein. Trends Biochem. Sci. 1997, 22, 12-13.
(50) Lurin, C.; Andrés, C.; Aubourg, S.; Bellaoui, M.; Bitton, F. et al. Genome-wide analysis of Arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis. Plant Cell 2004, 16, 2089-2103.
(51) Bai, C.; Sen, P.; Hofmann, K.; Ma, L.; Goebl, M.; Harper, J. W.; Elledge, S. J. SKP1 connects cell cycle regulators to the ubiquitin proteolysis machinery through a novel motif, the F-box. Cell 1996, 86, 263-274.
(52) Saito, T.; Seki, M.; Yamauchi, N.; Tsuji, S.; Hayashi, A.; Kozuma, S.; Hori, T. Structure, chromosomal location, and expression profile of EXTR1 and EXTR2, new members of the multiple exostoses gene family. Biochem. Biophys. Res. Commun. 1998, 243, 61-66.
(53) Pellegrini, L.; Tan, S.; Richmond, T. J. Structure of serum response factor core bound to DNA. Nature 1995, 376, 490-498.
(54) Coen, E. S.; Meyerowitz, E. M. The war of the whorls: Genetic interactions controlling flower development. Nature 1991, 353, 31-37.
(55) Pařenicová, L.; de Folter, S.; Kieffer, M.; Horner, D. S. et al. Molecular and phylogenetic analyses of the complete MADS-box transcription factor family in Arabidopsis: New openings to the MADS world. Plant Cell 2003, 15, 1538-1551.
(56) Gong, W.; Shen, Y. P.; Ma, L. G.; Pan, Y. et al. Genome-wide ORFeome cloning and analysis of Arabidopsis transcription factor genes. Plant Physiol. 2004, 135, 773-782.
(57) Davidson, E. H. Genomic Regulatory Systems: Development and Evolution; Academic Press: San Diego, 2001.
(58) Lehner, B.; Fraser, A. G. Protein domains enriched in mammalian tissue-specific or widely expressed genes. Trends Genet. 2004, 20, 468-472.
(59) Ferrer, J. L.; Jez, J. M.; Bowman, M. E.; Dixon, R. A.; Noel, J. P. Structure of chalcone synthase and the molecular basis of plant polyketide biosynthesis. Nat. Struct. Biol. 1999, 6, 775-784.
(60) Muratani, M.; Tansey, W. P. How the ubiquitin-proteasome system controls transcription. Nat. Rev. Mol. Cell. Biol. 2003, 4, 192-201.
(61) Small, I. D.; Peeters, N. The PPR motif - a TPR-related motif prevalent in plant organellar proteins. Trends Biochem. Sci. 2000, 25, 46-47.
(62) Inohara, N.; Nunez, G. NODs: intracellular proteins involved in inflammation and apoptosis. Nat. Rev. Immunol. 2003, 3, 371-382.

PR049805M