
Predicted Roles for Hypothetical Proteins in the Low-Temperature Expressed Proteome of the Antarctic Archaeon Methanococcoides burtonii

Neil F. W. Saunders,† Amber Goodchild,† Mark Raftery,‡ Michael Guilhaus,‡ Paul M. G. Curmi,§ and Ricardo Cavicchioli*,†

School of Biotechnology and Biomolecular Sciences, The University of New South Wales, Sydney, 2052, NSW, Australia, Bioanalytical Mass Spectrometry Facility, The University of New South Wales, Sydney, 2052, NSW, Australia, and School of Physics, The University of New South Wales, Sydney, NSW, 2052, Australia and Centre for Immunology, St Vincent's Hospital, Sydney, NSW, 2010, Australia

Received November 10, 2004

Using liquid chromatography-mass spectrometry, 528 proteins were identified that are expressed during growth at 4 °C in the cold-adapted archaeon Methanococcoides burtonii. Of those, 135 were annotated previously as unique or conserved hypothetical proteins. We have performed a comprehensive, integrated analysis of the latter proteins using threading, InterProScan, predicted subcellular localization and visualization of conserved gene context across multiple prokaryotic genomes. Functional information was obtained for 55 proteins, providing new insight into the physiology of M. burtonii. Many of the proteins were predicted to be involved in DNA/RNA binding or modification and cell signaling, suggesting a complex, uncharacterized regulatory network controlling cellular processes during growth at low temperature. Novel enzymatic functions were predicted for several proteins, including a candidate gene for the posttranslational modification of the key methanogenesis enzyme methyl coenzyme M reductase. A bacterial-like CRISPR locus was identified as a strong candidate for archaeal-bacterial lateral gene transfer. Gene context analysis proved a valuable augmentation to the other predictive methods in several cases, by revealing conserved gene associations and annotations in other microbial genomes. Our results underscore the importance of addressing the “hypothetical protein problem” for a complete understanding of cell physiology.

Keywords: proteome • LC/LC-MS/MS • archaea • methanogenesis • psychrophile • CRISPR locus • hypothetical proteins • conserved gene context

Introduction

Accurate annotation of gene function is vital to understanding the wealth of information that is contained in microbial genome sequences. Typically, the first step in functional annotation uses BLAST to identify targets that are similar to the query sequence and to extract descriptive information about the target. This approach is fast, but suffers from two major problems with regard to annotation: widespread misannotation of genes in sequence databases and the many instances for which no significant or informative BLAST hit exists. The second problem is particularly acute. Between 10 and 50% of the predicted genes in a newly sequenced genome are either conserved hypothetical or unique hypothetical genes, the latter often referred to as ORFans.1 However, many of these genes are apparently expressed and are presumably physiologically relevant.2

Recently, attempts have been made to annotate genomes more comprehensively by integrating multiple computational analyses and by using methods to detect both sequence and structural signatures.3,4 In cases where a function is not suggested by sequence or structural analysis, studies have shown that the conserved arrangement of genes can often be used to imply roles for a gene. The conservation of gene order around a gene of interest has been described using a number of terms including conserved gene neighborhood,5 gene synteny6 and genomic context.7 Several computational techniques have been developed to study conserved gene arrangement, including phylogenomic profiling8 and algorithms based on graph theory.5,9 In the most successful cases, missing genes in metabolic pathways have been identified and verified experimentally.10 However, there are difficulties for researchers wishing to deploy a suitable computational approach to investigate this problem. These include different definitions of what constitutes conserved gene order, ease of access to appropriate software and a lack of suitable tools to visualize the result in a clear, meaningful way.

* To whom correspondence should be addressed. Tel: +61-2-93853516. Fax: +61-2-93852742. E-mail. [email protected].
† School of Biotechnology and Biomolecular Sciences, The University of New South Wales.
‡ Bioanalytical Mass Spectrometry Facility, The University of New South Wales.
§ School of Physics, The University of New South Wales, Sydney, NSW, 2052, Australia and Centre for Immunology, St Vincent's Hospital.

Journal of Proteome Research 2005, 4, 464-472

Published on Web 03/23/2005

10.1021/pr049797+ CCC: $30.25

© 2005 American Chemical Society

research articles

Expressed Hypothetical Proteins of M. burtonii

We have recently used liquid chromatography-mass spectrometry (LC-MS) approaches combined with genome sequence data to identify 528 proteins that are expressed in the Antarctic archaeon Methanococcoides burtonii when grown at 4 °C.11 135 of these proteins had no functional classification, being annotated only as unique/conserved hypothetical. However, their consistent identification using LC-MS indicates that they play important roles during growth at low temperature. In the present study, we have applied a suite of computational tools to the analysis of these proteins with the aim of providing functional annotation in as many cases as possible and improving our understanding of cold adaptation in M. burtonii.

Experimental Procedures

Organisms and Culture Conditions. M. burtonii, a cold-adapted obligate methylotroph (Topt 23 °C, Tmin < 4 °C), was grown in liquid modified methanogen growth medium with trimethylamine at 4 °C and harvested as previously described.11,12

Sample Preparation, Liquid Chromatography and Mass Spectrometry. Protein extraction, digestion and MS analysis were performed as previously described.11 Briefly, total proteins from M. burtonii were digested with trypsin in a 1:100 (trypsin:protein) ratio overnight at 37 °C. Digested peptides were separated by online strong cation exchange (SCX) and nano C18 LC using an Ultimate HPLC, Switchos and Famos autosampler system (LC-Packings). Peptides (∼500 ng) were dissolved in formic acid (0.1%, 25 µL) and loaded onto a SCX micro trap (1 × 8 mm, Michrom Bioresources). Peptides were eluted sequentially using 5, 10, 15, 20, 25, 30, 40, 50, 75, 150, 300, and 1000 mM ammonium acetate (20 µL). The unbound load fraction and each salt step were concentrated and desalted onto a micro C18 precolumn (500 µm × 2 mm, Michrom Bioresources) using H2O:CH3CN (98:2, 0.1% formic acid, buffer A) at 20 µL/min. After a 10 min wash the precolumn was switched (Switchos) into line with an analytical column containing C18 RP silica (PEPMAP, 75 µm × 15 cm, LC-Packings) or a fritless C18 column (75 µm × ∼12 cm). Peptides were eluted using a linear gradient of buffer A to H2O:CH3CN (40:60, 0.1% formic acid, buffer B) at 200 nL/min over 60 min. The column was connected via a fused silica capillary to a low volume tee (Upchurch Scientific) where high voltage (2300 V) was applied and a nano electrospray needle (New Objective) or fritless column outlet was positioned ∼1 cm from the orifice of an API QStar Pulsar i hybrid tandem quadrupole (Q) time-of-flight mass spectrometer (TOF-MS) (Applied Biosystems). The QStar was operated in information-dependent acquisition mode.
A TOF-MS survey scan was acquired (m/z 350-1700, 0.5 s) and the 2 largest precursors (counts > 10) were sequentially selected by Q for MS/MS analysis (m/z 50-2000, 2.5 s). A processing script generated data suitable for submission to the database search programs. Extracted spectra were also analyzed using DTASelect to simplify interpretation.13 Collision induced dissociation (CID) spectra were analyzed using SEQUEST software with the following parameters: peptide mass tolerance of 1.5 Da, strict trypsin enzyme digestion with the modification +16 Methionine. Searches were performed on a local database of M. burtonii translated sequences. Proteins were considered identified if they matched set criteria. For SEQUEST the criteria were: fragments were tryptic, the Xcorr score was >2 for [M+2H]2+, and a distinct ladder sequence was visible. For the peptides analyzed a SEQUEST Xcorr > 2 indicated identity. Mascot MS/MS ion search (Matrix Science) criteria were: trypsin digestion allowing up to 1 missed cleavage, oxidation of methionine, peptide tolerance of 1.0 Da and MS/MS tolerance of 0.15 Da. A Mascot score >18 indicated identity. All SEQUEST and Mascot scores were manually verified.

Computational Analyses. Draft genome data are based on the JGI assembly of 12-Nov-03 and the ORNL Genome Channel analysis of 11-Dec-03 (see Additional Information). The assembly contained 70 contigs (estimated genome size ∼2.8 Mbp) and 2782 putative gene models. Theoretical molecular weight and isoelectric point of proteins were calculated using the EMBOSS sequence analysis package14 (v. 2.8.0). The subcellular location of proteins was predicted using TMHMM and SignalP at the CBS Prediction Servers (http://www.cbs.dtu.dk/services/). The InterProScan package3 was used to search protein sequences for motifs in the InterPro database. The Prospect suite15 (v. β-2) was used for protein threading. Information from secondary structure prediction and PSI-BLAST profiles was incorporated into threading predictions using 4199 nonredundant structures from the SCOP database as templates. Bioperl-based Perl scripts16 were used for parsing all program outputs. To detect and visualize conserved gene clusters, NCBI BLAST (tblastx) was used to search all available chromosome/plasmid sequences from complete microbial genomes (currently 255 sequences), using the draft genome assembly of M. burtonii (70 contigs) as the query. The BLAST report, query and genome sequences (the latter in GenBank format) were converted to GFF format and stored in MySQL databases. We defined a contig segment as containing a conserved gene region if a region extending 5000 bp either side of the gene of interest contained significant tblastx matches (E < 1e-5) to a target genome sequence and those matches were to 2 or more genes. A Perl script was written to identify these regions and display query, matches and target using the Bioperl Bio::Graphics module.
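The conserved gene region definition given above can be illustrated with a short sketch. This is not the authors' Perl/Bioperl script, which is not reproduced in the paper; it is a minimal Python rendering of the three stated criteria (a window of 5000 bp either side of the gene, tblastx matches with E < 1e-5, and matches to 2 or more genes in one target genome). The `Hit` record layout, the function name `conserved_genomes` and the example coordinates are assumptions made for illustration only.

```python
from dataclasses import dataclass

FLANK_BP = 5000        # window either side of the gene of interest (from the text)
MAX_EVALUE = 1e-5      # tblastx significance cutoff, E < 1e-5 (from the text)
MIN_TARGET_GENES = 2   # matches must be to 2 or more target genes (from the text)


@dataclass(frozen=True)
class Hit:
    """One tblastx match placed on a query contig (hypothetical record layout)."""
    contig: str
    start: int           # match start on the contig
    end: int             # match end on the contig, inclusive
    evalue: float
    target_genome: str
    target_gene: str     # target gene overlapped by the match


def conserved_genomes(contig, gene_start, gene_end, hits):
    """Return {target genome: set of matched target genes} for every genome
    in which the region around the gene meets the stated criteria."""
    window_lo = gene_start - FLANK_BP
    window_hi = gene_end + FLANK_BP
    genes_by_genome = {}
    for hit in hits:
        if hit.contig != contig or hit.evalue >= MAX_EVALUE:
            continue  # wrong contig or not significant
        if hit.end < window_lo or hit.start > window_hi:
            continue  # match lies entirely outside the flanking window
        genes_by_genome.setdefault(hit.target_genome, set()).add(hit.target_gene)
    return {
        genome: genes
        for genome, genes in genes_by_genome.items()
        if len(genes) >= MIN_TARGET_GENES
    }


if __name__ == "__main__":
    # Invented example data, for illustration only.
    example_hits = [
        Hit("contig1", 1000, 1300, 1e-20, "genome A", "geneA1"),
        Hit("contig1", 7000, 7300, 1e-8, "genome A", "geneA2"),
        Hit("contig1", 2500, 2700, 1e-10, "genome B", "geneB1"),
    ]
    print(conserved_genomes("contig1", 3000, 4000, example_hits))
```

Grouping matches by target genome before applying the two-gene threshold keeps the test per genome, which is how the definition reads; whether the original script counted overlapping matches in the same way is not stated.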

Results and Discussion

Characteristics of Hypothetical Proteins Identified in the Low-Temperature Expressed Proteome. The number of LC/LC-MS/MS runs in which a protein is identified correlates positively with its cellular abundance.11 Of the hypothetical proteins, 52% were identified in more than one run and 9% were detected in all LC/LC-MS/MS runs. This pattern is similar to that observed for previously annotated proteins from the expressed proteome,11 indicating that the hypothetical proteins had similar cellular levels of abundance. The majority of proteins were identified from more than one peptide. All identifications met strict criteria, including a high SEQUEST or Mascot score (see Experimental Procedures), a monoisotopic mass error for precursor and product ions of less than 0.15 Da, and verification of all tandem mass spectra by visual inspection. The majority of the hypothetical proteins are predicted to be small (120/135 < 50 kDa) cytosolic proteins with acidic pI (99/135 pI 4-7) (Figure 1). A notable exception is encoded by gene 2680 (MW 6.5 kDa, pI 13.2), which is the most basic protein encoded in the M. burtonii genome. 129/135 proteins were classified as conserved hypothetical (one or more BLAST hits with E ≤ 1e-5 to a protein with no functional annotation). Many of these BLAST hits were predicted proteins from the family Methanosarcinaceae. The remaining 6/135 proteins were unique hypothetical (no significant BLAST hit).

Figure 1. Plot of isoelectric point versus molecular weight (log scale) for 135 expressed conserved and unique hypothetical proteins from M. burtonii

Journal of Proteome Research • Vol. 4, No. 2, 2005

Assignment of Functional Categories. We integrated 4 kinds of information (InterProScan, the Prospect threading package, predicted subcellular localization and visualization of conserved regions around genes of interest) to arrive at consensus annotations for unannotated proteins (Table 1). InterProScan provided information for 101/135 proteins. 73/101 proteins had a match to an InterPro database entry, 15 of which were annotated only as conserved proteins or domains of unknown function (Table 2). 28/101 proteins did not have a database match, but were predicted to contain coiled-coil and/or significant regions of low-complexity sequence using ncoils and seg. Using z-score to assess the reliability of Prospect threading predictions, fold assignment confidence level was ranked as high, very high or certain (corresponding to similarity at the level of superfamily or family) for 36/135 proteins and certain (similarity at level of family) for 17/135 proteins. 72/135 proteins had no significant threading score. Additional information regarding function was gained by predicting whether proteins may be secreted or contain transmembrane helices and by examining the degree to which gene arrangement was conserved around the gene of interest. We defined conserved gene order as the situation where a region of DNA around the M. burtonii hypothetical gene of interest contained significant sequence similarity to two or more genes from another genome. Using these criteria, 125/135 genes were found in a region with similarity to one or more genomes. Sorting by organism provided a phylogenetic breakdown. Of the 125 genes located in a region with some degree of conserved gene order, 32 conserved regions were found only in archaea, 7 only in methanogens and 5 only in bacteria. Confidence levels for each functional annotation were described as high, medium or low, based primarily on the degree of agreement between threading predictions and the search methods implemented in InterProScan.
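The phylogenetic breakdown of conserved regions reported above (regions found only in archaea, only in methanogens or only in bacteria) amounts to a simple set classification. The sketch below shows one way such a tally could be made; it is illustrative only. The group labels, the function names and the decision to count methanogen-only regions separately from other archaea-only regions are assumptions, since the paper does not say how the three categories were delimited.

```python
from collections import Counter

# Hypothetical group labels for target genomes (assumed, not from the paper).
METHANOGEN = "methanogen"
OTHER_ARCHAEA = "other_archaea"
BACTERIA = "bacteria"


def classify_region(groups):
    """Label a gene by the organism groups in which its region is conserved.

    `groups` is the set of group labels of every target genome that met the
    conserved gene order criteria for this gene.
    """
    if not groups:
        return "not conserved"
    if groups == {METHANOGEN}:
        return "methanogens only"
    if groups <= {METHANOGEN, OTHER_ARCHAEA}:
        return "archaea only"
    if groups == {BACTERIA}:
        return "bacteria only"
    return "archaea and bacteria"


def phylogenetic_breakdown(groups_by_gene):
    """Count genes per category, given {gene id: set of group labels}."""
    return Counter(classify_region(groups) for groups in groups_by_gene.values())


if __name__ == "__main__":
    # Invented example input, for illustration only.
    example = {
        "geneX": {METHANOGEN},
        "geneY": {METHANOGEN, OTHER_ARCHAEA},
        "geneZ": {BACTERIA},
    }
    print(phylogenetic_breakdown(example))
```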
Predicted subcellular localization also played a role in confidence assignment for several proteins (e.g., where a protein predicted to be a protease is also predicted to be secreted). Visualization of conserved gene order proved most useful in instances where little or no information was obtained using other methods. By combining these data, we were able to provide an annotation with some confidence level for 55/135 proteins. The results are summarized in Table 1.

Subcellular Localization. Four proteins (genes 294, 295, 1032, and 1420) were predicted to be secreted via the Sec-dependent pathway. However, compared with bacteria and eukaryotes, secretion across the cytosolic membrane is a poorly characterized process in archaea.17 In this study, we adopted a method in which archaeal proteins are predicted to be secreted based on a consensus of high scores using SignalP models for eukaryotes, Gram-negative and Gram-positive bacteria.18 This approach is likely to underestimate the true number of Sec-dependent secreted proteins when applied to archaeal genomes. Additional functional information was obtained for only one of the putative secreted proteins, gene 294, which is predicted by both threading and InterProScan analyses to be a trypsin-like serine/cysteine protease (Table 1). None of the 135 proteins were predicted to be secreted via the alternative Tat-dependent pathway.19 Although this pathway has been identified in several archaea, M. burtonii does not appear to contain tat genes and none of the 2782 ORFs contain the N-terminal twin arginine motif required for Tat-dependent secretion. These data on Sec and Tat pathways imply that secretion mechanisms for unfolded proteins (Sec pathway) operate in M. burtonii, and that folded proteins are either not secreted, or are secreted by a Tat-independent system.

Five proteins (genes 287, 1501, 1815, 2242, and 2563) were predicted to contain transmembrane helices. Gene 2242 is conserved in the Methanosarcina and contains a conserved TM-helix feature in association with an MscS/YggB-like mechanosensitive channel signature. Membrane proteins of this family occur in all 3 domains of life and respond to membrane stretching and depolarization.20 It is possible that in M. burtonii growing at 4 °C, changes in membrane fluidity provide mechanical stimulation of the gene 2242 product. This may arise from growth temperature-induced changes in lipid saturation.21 Gene 1501 is predicted to contain 8 transmembrane helices.
In addition, it contains a conserved domain of unknown function (DUF1119) and an aspartic endopeptidase signature belonging to the peptidase A22 presenilin signal peptide family. Members of this protein family are polytopic membrane proteins that promote intramembrane proteolysis of some signal peptides and generation of biologically active peptides. Given the lack of knowledge regarding signal peptide processing in archaea, gene 1501 provides a useful target for experimental verification.

DNA/RNA Binding and Modification. Nucleic acid binding and modification proteins made up the largest category of annotations (18/135). M. burtonii appears to express at least one (gene 1909) and possibly two (gene 1066) novel DNA polymerase B proteins, each with a polymerase and a ribonuclease-like domain. Regulation of gene expression at the transcriptional level is indicated by the presence of 7 newly annotated winged-helix DNA-binding proteins, 1 or 2 Zn-finger proteins and a lambda repressor-like protein. Gene 1556 contains a nucleic acid-binding OB fold and a DHH phosphoesterase domain similar to that found in RecJ, suggesting a role in nucleic acid repair. RNA binding and modification appear to be important processes in M. burtonii growing at low temperature. Genes 375, 874 and 2722 are small (63-68 residue) proteins containing a TRAM domain. This domain has been detected in 2 classes of tRNA-modifying enzymes and is thought to bind tRNA and deliver the modification domain to its target.22 Gene 1952 contains a THUMP domain, also predicted to bind RNA and deliver RNA-modifying enzymes.23 All trusted members of this domain family are archaeal and show some similarity to


Table 1. Functional Annotation of Expressed Hypothetical Proteins from M. burtonii

prospect assignment

gene

prospect confidence

InterProScan assignment

annotation

confidence

287 ABC transporter ATPase domain-like

certain

ABC transporter ATPase domain-like

1289 ABC transporter ATPase domain-like

certain

ABC transporter ATPase domain-like

2607 ABC transporter ATPase domain-like

certain

ABC transporter ATPase domain-like

13 1170 1171 1172 1514

v. high v. high v. high v. high v. high

CBS CBS CBS CBS CBS

CBS-domain containing CBS-domain containing CBS-domain containing CBS-domain containing CBS-domain containing

v. high high

CBS RmlC-like cupin

CBS-domain containing RmlC cupin-like

CBS-domain CBS-domain CBS-domain CBS-domain CBS-domain

2716 CBS-domain 1741 Germine/Seed storage 7S protein/RmlC-like cupins 2572 Cytochrome b5 1909 DNA polyerase I

certain certain

1066 DNA/RNA polymerase 203 ferritin 1233 Ferritin 1815 Integrin A (or I)like/vWA-like 1112 RecA protein-like (ATPase domain) 722 Lambda repressor-like DNA-binding domain 358 MTH1175-like 359 MTH1175-like 803 MTH1175-like 310 Nucleic acid-binding proteins 311 Nucleic acid-binding proteins 1556 Cysteine-rich domain of DnaJ 375 Nucleic acid-binding proteins 874 Nucleic acid-binding proteins 2722 Nucleic acid-binding proteins 1501 294 Prokaryotic proteases

Cytochrome b5 DNA-directed DNA polymerase B medium DNA-directed DNA polymerase B low Ferritin/ribonucleotide reductase-like certain Ferritin/ribonucleotide reductase-like v. high -

Ferritin/ribonucleotide reductase-like Ferritin/ribonucleotide reductase-like Integrin A (or I)-like

certain

KaiC-like

medium Lambda repressor-like, DNA-binding certain Nitrogen fixationrelated protein certain Nitrogen fixationrelated protein certain Nitrogen fixationrelated protein v.high Nucleic acid-binding OB-fold medium Nucleic acid-binding OB-fold v. high Nucleic acid-binding OB-fold/DHH-like low low low v. high

2687 HEAT repeat

certain

2418 PRC, H-chain, cytoplasmic domain 1721 Nucleic acid-binding proteins 2581 Rubredoxin-like

low

2242 2664 ATP-binding domain peptide synthetases 1952 Peridinin-chlorophyll protein 438 B-box Zn-binding protein

KaiC

low low

v. high low low

Cytochrome b5-like DNA-directed DNA polymerase B DNA polymerase like

Lambda repressor-like, DNA-binding Nitrogen fixation-related

medium medium Threading significant only using global•local alignment high Also similarity to rubrerythrin medium Threading not strictly applicable (membrane protein). high high high

Nitrogen fixation-related

high high

Nucleic-acid binding

high

Nucleic-acid binding

medium

Nucleic acid-binding OBfold/DHH-like

high

Nucleic-acid binding/ TRAM domain Nucleic-acid binding/ TRAM domain Nucleic-acid binding/ TRAM domain Peptidase, membranebound

Peptidase, trypsin-like serine and cysteine proteases PBS lyase HEATlike repeat

Peptidase, trypsin-like serine and cysteine proteases Phycocyanin R-phycocyanobilin lyase related protein PRC barrel-like

PRC-barrel

high high

Nitrogen fixation-related

Deoxyribonuclease/rho motif-related TRAM Deoxyribonuclease/rho motif-related TRAM Deoxyribonuclease/rho motif-related TRAM Peptidase A22, presenilin signal peptide

notes

medium InterProScan Superfamily detects P-loop containing nucleotide triphosphate hydrolase domain. Threading not strictly applicable (membrane protein). medium InterProScan Superfamily detects P-loop containing nucleotide triphosphate hydrolase domain medium InterProScan Superfamily detects P-loop containing nucleotide triphosphate hydrolase domain high high high high high Also contains DUF293 and winged-helix high Also contains DUF39 high

InterProScan also detects HTH motif MTH1175; probably RNAbinding MTH1175; probably RNAbinding MTH1175; probably RNAbinding

Contains DHH-like phosphoesterase domain and 1 or 2 RNAbinding domains medium Threading significant only using global•local alignment medium medium Threading significant only using global•local alignment high InterProScan also detects DUF1119. Threading not applicable (membrane protein). high Predicted secreted protein high

InterProScan also detects ARM repeat

medium Threading significant only using global•local alignment Protein of unknown Putative Fe-S binding medium UPF0179 contains conserved function UPF0179 Cys clusters that may be Fe-S Protein of unknown Putative Zn binding medium UPF0153 may bind Zn via 8 function UPF0153 conserved Cys. Threading significant only using global•local alignment. Conserved TM helix/ Putative mechanosensitive medium Threading not applicable Mechanosensitive channel channel/TM helix (membrane protein). MscS, transmembrane Protein of unknown function Putative peptide medium DUF1246/DUF1297 synthetase THUMP/Conserved Putative pseudouridine low Twilight-zone similarity to hypothetical protein 1213 synthase predicted RNA pseudouridine synthases. Putative Zn-binding medium HMMER detects similarity to protein UPF0148, a putative Zn-binding protein


Table 1. (Continued) prospect assignment

gene

147 PNP-oxidase-like/FMNbinding split barrel 1049 PNP-oxidase-like/FMNbinding split barrel 1227 TIM beta/alpha barrel 424 Thioltransferase 752 -

prospect confidence

v. high

1224 G proteins

certain

1509 G proteins

certain

1613 G proteins

certain

1629 Toll/Interleukin receptor TIR domain 715 Winged helix DNA-binding domain 813 Lrp/AsnC-like N-terminal domain 979 Winged helix DNA-binding domain

v. high

549 C2H2 and C2HC zinc fingers

high certain v. high

Pyridoxamine 5′-phosphate oxidase-related Pyridoxamine 5′-phosphate oxidase-related Radical SAM domaincontaining Redox-active disulfide protein S-layer-related duplication S-layer-related domain duplication domain GTP1/OBG, small GTP-binding Small GTP-binding protein domain Ras GTPase, small GTP-binding Small GTP-binding protein domain GTP1/OBG, nucleolar Small GTP-binding GTP-binding TIR TIR-domain containing Winged helix DNA-binding Winged helix DNA-binding Winged helix DNA-binding Winged helix DNA-binding Winged helix DNA-binding Winged helix DNA-binding

v. high

Winged helix DNA-binding

high

Winged helix DNA-binding

medium Winged helix DNA-binding medium Winged helix DNA-binding low

FYVE/PHD zinc finger

medium Zn-finger, ZPR1 type

Table 2. Expressed Proteins of M. burtonii Annotated as Conserved Protein/Domain of Unknown Function

gene             annotation
1104             DUF556
1448             DUF1130
1770             DUF89
1861             DUF555
1956             DUF655
2254             DUF124
2292             UPF0027
340, 451, 588    DUF169
449              UPF0147
586              UCP004929
683              DUF198
838              DUF75
886              UPF0044

pseudouridine synthase. Genes 358, 359 and 803 are most similar to the MTH1175 protein from Methanothermobacter thermautotrophicus. A crystal structure has been obtained for this protein and SCOP analysis reveals that it is most similar to the ribonuclease H superfamily. The protein is annotated as nitrogen-fixation related due to its similarity to NifB, a protein involved with nitrogenase cofactor synthesis. However, archaeal homologues lack the C-terminal region of this protein, possessing instead an Arg/Gly rich flexible C-terminal region suggestive of RNA-binding capability. Genes 194/195 encode small (75 and 78 residue) proteins, are transcribed in the same direction and are separated by only

annotation

FMN-binding split barrel/ Pyridoxamine 5′-phosphate oxidase-related v. high FMN-binding split barrel/ Pyridoxamine 5′-phosphate oxidase-related medium Radical SAM/Protein of unknown function DUF512 certain Redox-active disulfide protein 2 -

1395 Winged helix DNA-binding domain 2123 Winged helix DNA-binding domain 2507 Winged helix DNA-binding domain 2594 Winged helix DNA-binding domain 195 C2H2 and C2HC zinc fingers

InterProScan assignment


confidence

notes

high high high

DUF512 is often found C-terminal to Radical SAM

high medium Threading not applicable (membrane protein). high high high high high high high

Winged helix DNA-binding Winged helix DNA-binding Winged helix DNA-binding Winged helix DNA-binding Zn-finger

low

Zn-finger, ZPR1 type

high

Also contains domain found in phenylalanyl-tRNA synthetase, R chain

high high high medium Threading hit to one Zn finger domain only significant using global•local alignment. Additional evidence from conserved gene cluster. Very significant threading score only using global•local alignment.

2 bp in the M. burtonii genome. They were annotated as conserved hypothetical proteins on the basis of BLAST similarity to 2 similar genes from Methanosarcina acetivorans. Genes 194/195 are located in a region that is highly conserved in both M. acetivorans and M. mazei and in the latter organism, both of the equivalent genes are annotated as Zn-finger proteins (Figure 2a). Additional evidence for this annotation comes from weakly significant threading z-scores using global-local alignment and, in the case of gene 194, a weak Pfam hit to the FYVE zinc finger domain. Putative Zn-binding ligands (1 His and 7 Cys residues) are conserved in a multiple alignment of genes 194/195 and similar small proteins from M. mazei and M. acetivorans (data not shown).

Figure 2. Conserved gene context for selected expressed conserved hypothetical genes from M. burtonii. Gene pairs with BLAST similarity of E < 1e-5 are shaded black. Vertical spacing of genes is to permit labeling, not to indicate reading frame. (a) Annotation of genes 194/195 as zinc-finger proteins in M. mazei; (b) conservation of the CRISPR gene locus in the bacteria G. sulfurreducens and P. luminescens; (c) conserved association of gene 1227 with the methyl coenzyme M reductase operon in M. thermoautotrophicum; (d) annotation of gene 1956 as RNA-binding and association with rRNA dimethyladenosine transferase in T. volcanium.

Gene 150, a conserved hypothetical protein for which InterProScan and Prospect provided no new data, is located between a putative helicase (gene 149) and a conserved hypothetical protein (gene 151), all transcribed in the same direction. Surprisingly, gene context analysis revealed that this arrangement is specific to bacteria, being conserved to varying degrees in 11 bacterial genomes. In addition, a high degree of BLAST similarity downstream of gene 151 corresponds to a fourth ORF, gene 152, the product of which has significant similarity to several bacterial proteins. In Geobacter sulfurreducens, genes equivalent to 149 and 152 are annotated as CRISPR-associated helicase Cas3 and CRISPR-associated protein, CT1975 family, respectively (Figure 2b). The arrangement of genes 149, 150 and 152 was also conserved in E. coli (3 strains), Salmonella typhi (2 strains), S. typhimurium, Streptomyces avermitilis, Chlorobium tepidum, Corynebacterium diphtheriae and most notably, Photorhabdus luminescens, to which genes 149, 150, and 152 showed considerable identity (Figure 2b). Gene 150 is therefore likely to be a cas (CRISPR-associated) gene, where CRISPR stands for clustered regularly interspaced short palindromic repeats.24 CRISPR loci have been detected in many prokaryotes; they are a novel family of repetitive DNA sequences and the associated cas genes are predicted to be involved with DNA metabolism or gene expression. CRISPRs are believed to be mobile elements and it is particularly interesting that although they have been detected in several species of archaea, the CRISPR locus in M. burtonii is more similar to that found in a small group of bacteria, in terms of both gene order and sequence similarity. The locus is therefore a strong candidate for evidence of archaeal-bacterial lateral gene transfer and its similarity to the CRISPR locus in several pathogenic species of bacteria is notable in the light of recent discussion regarding pathogenicity and archaea.25,26

Gene 1956 is a conserved hypothetical gene containing a domain of unknown function, DUF655. This domain is unique to archaea. Genes similar to gene 1956 were found in 18/19 complete archaeal genome sequences and some conservation of the surrounding region was detected in all cases. The gene was found upstream of a gene encoding rRNA dimethyladenosine transferase in 14/18 cases and in the M. burtonii genome. Gene 1956 is annotated as a DNA-binding protein in Methanopyrus kandleri and Pyrococcus furiosus and as an RNA-binding protein in Thermoplasma volcanium (Figure 2d). Some evidence for a nucleic-acid binding role is provided by the presence of a second domain within DUF655 with similarity to the N-terminal DNA-binding domain of DNA polymerase beta.
The conserved association with rRNA dimethyladenosine transferase suggests an accessory role for gene 1956 in rRNA processing.

Metabolic and Enzymatic Functions. Several proteins are predicted to have primary/secondary metabolic or simple enzymatic functions. Genes 147 and 1049 contain both a pyridoxamine-5′-phosphate oxidase-like domain and an FMN-binding split barrel. While there are examples of proteins containing the former domain that lack enzymatic activity, the presence of both domains is indicative of an enzymatic function. However, a recent phylogenetic analysis suggests that pyridoxamine-5′-phosphate oxidase activity is unlikely to be involved with pyridoxal 5′-phosphate biosynthesis in archaea.27 In keeping with other members of this conserved hypothetical protein family, genes 147/1049 are therefore annotated as “pyridoxamine 5′-phosphate oxidase-related”. Gene 1741 is annotated as a cupin-like protein, containing an RmlC-like cupin domain. RmlC, the prototype for this protein family, is an epimerase involved with the synthesis of the saccharide L-rhamnose.28 Other members of the family include glucose-6-phosphate isomerase and various proteins involved with the metabolism and storage of carbohydrates in plants. An enzymatic function, possibly in carbohydrate metabolism, therefore seems a likely role for gene 1741.

Gene 1227 contains DUF512, a conserved domain of unknown function, and a radical SAM domain. Radical SAM domains are found in a superfamily of more than 2000 proteins that catalyze diverse reactions, but are related by the generation of a radical species by reductive cleavage of S-adenosylmethionine.29 Although not part of a putative operon, gene 1227 is located in a conserved region which was found only in the 6 complete methanogen genomes and the draft genome of Methanosarcina barkeri. In all cases, the gene is transcribed in the opposite direction to a cluster of genes which encode subunits of the enzyme methyl coenzyme M reductase (Figure 2c). This enzyme contains amino acids which are thought to be methylated via an S-adenosylmethionine-dependent posttranslational modification.30 The radical SAM domain is associated with unusual methylation reactions and the conserved association of gene 1227 with the methyl coenzyme M reductase genes suggests that the gene 1227 product may be the enzyme responsible for the posttranslational modification of methyl coenzyme M reductase. A high degree of sequence conservation was observed in the intergenic region between gene 1227 and the methyl coenzyme M reductase beta subunit gene in M. burtonii, M. mazei, and M. acetivorans, suggesting that although divergently transcribed, the genes may be coregulated.

Three other genes in the category metabolic/enzymatic are worthy of note, as they demonstrate how confidence in an annotation is assigned by consideration of multiple analyses. On the basis of several lines of evidence, gene 294 is a strong candidate for a protease: a predicted signal peptide, highly significant threading scores to the prokaryotic protease family and InterProScan detection of a trypsin-like serine protease domain. Gene 2664 is annotated as a putative peptide synthetase. The protein sequence of gene 2664 threads with significant scores to D-Ala:D-Ala and D-Ala:lactate ligases and contains a predicted ATP binding domain and 2 conserved domains of unknown function (DUF1246/DUF1297). The protein is found in multiple species of archaea and in one case (Pyrococcus abyssi) is annotated as a putative carboligase of the ATP-grasp superfamily.
Finally, gene 459 is an example where gene context analysis provides the only evidence as to function. The gene lies downstream of and is transcribed in the same direction as gene 458. In M. mazei, although the basis for this annotation is not clear, both of the gene products are annotated as putative phosphoserine phosphatases. In addition to M. mazei, this gene pairing is conserved in Methanopyrus kandleri and M. acetivorans (data not shown).

Redox or Metal-Binding. A number of proteins contain putative binding sites for metals, including iron-sulfur clusters (genes 1721, 2581), zinc (gene 438) and ferritin/ribonucleotide reductase-like domains (genes 203 and 1233). Gene 1721 is associated with a putative glycerol-1-phosphate dehydrogenase in 4 archaeal genomes and may be an electron carrier associated with this metabolic activity. The ferritin/ribonucleotide reductase proteins are a diverse class of nonhaem iron enzymes performing a wide range of redox reactions.31 There are few other clues to the precise role of genes 203 and 1233, other than that gene 203 is associated with iron-sulfur and flavodoxin proteins in the Methanosarcina.

Gene 424 contains a redox-active disulfide region (CXXC motif). This domain differs from that found in thioredoxin or glutaredoxin and has been identified in other archaea and a cyanobacterium. It may be more closely related to protein disulfide isomerases. In this context, it is noteworthy that the proteins that assist in correct protein folding have been highlighted in the expressed proteome in M. burtonii grown at 4 °C.11,12

Gene 2572 was classified using both threading and InterProScan as a cytochrome b5 protein. Several archaeal b-type cytochromes have been characterized, mainly from thermophilic sulfate reducers32 and methanogens.33 The fold is characteristic of a large protein family that includes b-type cytochromes, flavocytochrome b2 and sulfite oxidase and indicates an oxidoreductase function for gene 2572.
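The CXXC motif noted above for gene 424 (two cysteines separated by any two residues) is simple to locate in a protein sequence. The following is only a minimal sketch of such a scan, not the method used in this study: the function name `find_cxxc` and the example sequence are invented for illustration, and a pattern hit alone says nothing about whether the disulfide is redox-active.

```python
import re

# Two cysteines separated by any two residues. The zero-width lookahead
# lets overlapping motifs (e.g. CXXCXXC) be reported individually.
CXXC = re.compile(r"(?=(C[A-Z]{2}C))")


def find_cxxc(sequence):
    """Return (1-based position, motif) pairs for every CXXC match."""
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in CXXC.finditer(seq)]


# Hypothetical sequence, invented for illustration only.
example = "MKTLACPGCEALVNSWCGHCKKLAP"
for position, motif in find_cxxc(example):
    print(position, motif)
```

In practice a hit from a scan like this would only be a starting point; the domain-level evidence described in the text (threading and InterProScan) is what distinguishes this region from the thioredoxin and glutaredoxin folds.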

Cell-Sensing and Signaling. The CBS domain is a common feature of the conserved hypothetical proteins, found in genes 1170-1172 (4 domains each), 1514 and 2716 (2 domains each). CBS domains are found in a variety of proteins from all 3 domains of life and mutations in the CBS domain are associated with a number of hereditary human diseases.34 Pairs of CBS domains are thought to act as sensors of cellular energy status through the binding of ligands that contain adenosyl moieties. In M. burtonii, genes 1170-1172 are found as part of a unidirectional cluster of 8 genes, conserved in other Methanosarcina. Gene 1514 is of particular interest. In addition to 2 CBS domains it contains a conserved domain of unknown function, DUF293, and a strong signature for a winged-helix DNA-binding domain. Anecdotal evidence suggests DUF293 is related to the transcriptional regulator HrcA,35 raising the possibility that the product of gene 1514 is both a sensor and a DNA-binding transcriptional regulator.

Gene 1629 contains a TIR domain and, in the C-terminus, a weak match for P-loop NTPase activity. TIR refers to the Toll and interleukin-1 receptors which are members of this family.36 In eukaryotes, TIR domains are found in paired transmembrane receptors and cytosolic adaptor proteins which form a signal transduction system involved with detecting and responding to microbial invasion.37 TIR domains have also been detected in a number of bacterial proteins.38 The presence of a TIR domain and a putative NTPase activity in gene 1629 is especially interesting, as the fusion of TIR with an ATPase domain has been observed in bacterial proteins that have been linked with the evolution of apoptosis in eukaryotes.39 Gene 1629 is therefore likely to be involved in an as yet uncharacterized cell signaling pathway in M. burtonii.

Three proteins (genes 1224, 1509 and 1613) were annotated as GTP-binding proteins. Each of these proteins is distinct: gene 1613 contains a NOG1 (nucleolar-type) domain, genes 1224 and 1613 share a GTP1/OBG domain and gene 1509 is a Ras-type GTPase. A recent classification of these proteins40 indicates that they are likely to act as switches in cell signaling pathways. In particular, GTP1/OBG-type GTPases have been implicated in the initiation of replication, sporulation and differentiation, the sensing of intracellular GTP concentration and regulation of stress induced genes.

Structural Classification Only. In several cases, annotation suggests a structural fold, rather than a precise function. Gene 2418 contains a PRC barrel domain which takes its name from subunit H of the bacterial photosynthetic reaction center, but is widely distributed in archaea, bacteria and plants. The domain is found in RimM, a protein involved with the processing of 16S rRNA.41 An RNA-binding role would seem more likely for the M. burtonii protein.

Gene 2687 contains a repeated structure known as the ARM repeat fold, which in turn is made up of repeated units named HEAT repeats. The HEAT repeat found in gene 2687 is described as PBS lyase-like and is generally found in enzymes that attach open-chain tetrapyrrole chromophores known as phycobilins to apo-phycobiliproteins.42 These proteins make up part of a membrane-associated light-harvesting complex in cyanobacteria and red algae. The significance of this kind of repeat in gene 2687 is unclear, but the ARM repeat forms a right-handed superhelix, is common to many proteins and probably performs a rather general “scaffolding” role.43

Conclusion

The “hypothetical protein problem” remains a major challenge to a complete understanding of cell physiology. In this work, we have demonstrated that a large number of conserved/unique hypothetical genes are expressed during growth at 4 °C in M. burtonii. A recent study using the archaeon Halobacterium sp. NRC-1 also showed that the majority of ORFans, including short ORFs, are expressed,2 in contrast to the assertion that they are not genuine protein-coding genes.44 Using an integrated approach to annotation, 55/135 proteins were assigned putative functions. Our approach provides an appraisal of the methods which are currently available for gene annotation, as well as underscoring that a description of gene function should be a consensus derived from multiple approaches. The success of these predictive tools reinforces the observation that data and methods for studying both unique and conserved hypothetical genes are not so sparse as is commonly believed.45

We found genomic context to be a useful tool for gene annotation, particularly for cases where information derived from sequence or structural analysis was ambiguous or nonexistent. In contrast to other studies that have focused on conserved gene order for the gene of interest plus directly adjacent genes, we aimed to detect conserved gene context for 2 or more genes in the vicinity of the gene of interest. This approach revealed a surprisingly high degree of conserved gene structure between M. burtonii and other prokaryotic genomes. In the most significant cases, we were able to suggest a putative gene function either by association with a group of annotated genes (e.g., gene 1227) or by a suggested annotation from the target genome (e.g., genes 194/195 and gene 150). Frequently (as is the case for gene 150), a sequence match described as hypothetical in one instance has a more informative description in another. The quality of annotations provided for publicly available genomes is therefore a crucial factor in the annotation of a newly sequenced genome. An excellent illustration of this problem is provided by proteins that contain CBS domains, which are frequently misannotated in microbial genomes as inosine-5′-monophosphate dehydrogenases (IMPDH), due to local sequence similarity to the CBS domains found in IMPDH.46

Our annotation process has provided additional insight into the physiology of M. burtonii. In particular, the prevalence of genes involved with transcription, nucleic acid modification and cell signaling hints at a complex regulatory network which controls cell processes at low temperature. The focus for future research will be the experimental validation of predicted function through target selection, expression and structural/biochemical characterization.
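The idea of conserved gene context for 2 or more genes in the vicinity of a gene of interest can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the pipeline used in this work: each genome is reduced to an ordered list of gene-family labels, equivalent genes are assumed to have already been given the same label (for example by a prior sequence search), and the names `neighbours`, `conserved_context`, the `window` size and all labels in the example data are hypothetical.

```python
def neighbours(gene_order, gene, window):
    """Return the set of genes within `window` positions of `gene`."""
    if gene not in gene_order:
        return set()
    i = gene_order.index(gene)
    start = max(0, i - window)
    return set(gene_order[start:i] + gene_order[i + 1:i + 1 + window])


def conserved_context(query, targets, gene, window=5, min_shared=2):
    """Map each target genome to the neighbours it shares with the query.

    A target is reported only if at least `min_shared` neighbouring genes
    are common to both genomes (2 or more, following the text).
    """
    query_nb = neighbours(query, gene, window)
    hits = {}
    for name, order in targets.items():
        shared = query_nb & neighbours(order, gene, window)
        if len(shared) >= min_shared:
            hits[name] = shared
    return hits


# Hypothetical gene orders, invented for illustration only.
query = ["famA", "famB", "famC", "target", "famD", "famE"]
targets = {
    "genome_1": ["famX", "famB", "famC", "target", "famD", "famY"],
    "genome_2": ["famD", "famQ", "famR", "target", "famS", "famT"],
    "genome_3": ["famP", "famQ", "famR"],
}

for name, shared in conserved_context(query, targets, "target", window=2).items():
    print(name, sorted(shared))
```

Comparing neighbour sets rather than exact adjacent pairs tolerates insertions and local rearrangements, which matches the stated aim of looking beyond directly adjacent genes. The sketch ignores strand and transcription direction, both of which the text uses when interpreting individual cases such as gene 1227.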

Acknowledgment. We thank Stephen Harrop and Gregory Tyrelle for useful discussions and Paul Richardson and Frank Larimer for the provision of M. burtonii genome data. The work was supported by the Australian Research Council. Mass spectrometric analysis for the work was carried out at the Bioanalytical Mass Spectrometry Facility, UNSW, and was supported in part by grants from the Australian Government Systemic Infrastructure Initiative and Major National Research Facilities Program (UNSW node of the Australian Proteome Analysis Facility) and by the UNSW Capital Grants Scheme.

Journal of Proteome Research • Vol. 4, No. 2, 2005

Additional Information Available: Supplementary data including complete InterProScan, threading and gene context analysis for the 135 hypothetical proteins is available at our web server (http://psychro.bioinformatics.unsw.edu.au/genomes/index.php). The latest genome data and annotations are available at the same location and at the ORNL Genome Channel web server (http://maple.lsd.ornl.gov/microbial/mbur/).

References
(1) Siew, N.; Azaria, Y.; Fischer, D. Nucleic Acids Res. 2004, 32, 281-283.
(2) Shmuely, H.; Dinitz, E.; Dahan, I.; Eichler, J.; Fischer, D.; Shaanan, B. Bioinformatics 2004, 20, 1248-1253.
(3) Zdobnov, E. M.; Apweiler, R. Bioinformatics 2001, 17, 847-848.
(4) Bonneau, R.; Baliga, N. S.; Deutsch, E. W.; Shannon, P.; Hood, L. Genome Biol. 2004, 5, R52. Epub 2004 Jul 12.
(5) Rogozin, I. B.; Makarova, K. S.; Murvai, J.; Czabarka, E.; Wolf, Y. I.; Tatusov, R. L.; Szekely, L. A.; Koonin, E. V. Nucleic Acids Res. 2002, 30, 2212-2223.
(6) Paulsen, I. T.; Seshadri, R.; Nelson, K. E.; Eisen, J. A.; Heidelberg, J. F.; Read, T. D.; Dodson, R. J.; Umayam, L.; Brinkac, L. M.; Beanan, M. J. et al. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 13148-13153.
(7) Aravind, L. Genome Res. 2000, 10, 1074-1077.
(8) Pellegrini, M.; Marcotte, E. M.; Thompson, M. J.; Eisenberg, D.; Yeates, T. O. Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 4285-4288.
(9) Haas, B. J.; Delcher, A. L.; Wortman, J. R.; Salzberg, S. L. Bioinformatics 2004, Jul. 9, epub ahead of print.
(10) Osterman, A.; Overbeek, R. Curr. Opin. Chem. Biol. 2003, 7, 238-251.
(11) Goodchild, A.; Raftery, M.; Saunders, N. F. W.; Guilhaus, M.; Cavicchioli, R. J. Proteome Res. 2004, 3, 1164-1176.
(12) Goodchild, A.; Saunders, N. F. W.; Ertan, H.; Raftery, M.; Guilhaus, M.; Curmi, P. M. G.; Cavicchioli, R. Mol. Microbiol. 2004, 53, 309-321.
(13) Tabb, D. L.; McDonald, W. H.; Yates, J. R. 3rd. J. Proteome Res. 2002, 1, 21-26.
(14) Rice, P.; Longden, I.; Bleasby, A. Trends Genet. 2000, 16, 276-277.
(15) Kim, D.; Xu, D.; Guo, J. T.; Ellrott, K.; Xu, Y. Protein Eng. 2003, 16, 641-650.
(16) Stajich, J. E.; Block, D.; Boulez, K.; Brenner, S. E.; Chervitz, S. A.; Dagdigian, C.; Fuellen, G.; Gilbert, J. G.; Korf, I.; Lapp, H. et al. Genome Res. 2002, 12, 1611-1618.
(17) Ring, G.; Eichler, J. J. Bioenerg. Biomembr. 2004, 36, 35-45.
(18) Nielsen, H.; Brunak, S.; von Heijne, G. Protein Eng. 1999, 12, 3-9.
(19) Hutcheon, G. W.; Bolhuis, A. Biochem. Soc. Trans. 2003, 31, 686-689.
(20) Kloda, A.; Martinac, B. Eur. Biophys. J. 2002, 31, 14-25.
(21) Nichols, D.; Miller, M. R.; Davies, N. W.; Goodchild, A.; Raftery, M.; Cavicchioli, R. J. Bacteriol. 2004, 186 (24), 0000-0000.
(22) Anantharaman, V.; Koonin, E. V.; Aravind, L. FEMS Microbiol. Lett. 2001, 197, 215-221.
(23) Aravind, L.; Koonin, E. V. Trends Biochem. Sci. 2001, 26, 215-217.
(24) Jansen, R.; Embden, J. D.; Gaastra, W.; Schouls, L. M. Mol. Microbiol. 2002, 43, 1565-1575.
(25) Cavicchioli, R.; Curmi, P. M. G.; Saunders, N. F. W.; Thomas, T. Bioessays 2003, 25, 1119-1128.
(26) Faguy, D. M. BMC Infect. Dis. 2003, 3, 13.
(27) Mittenhuber, G. J. Mol. Microbiol. Biotechnol. 2001, 3, 1-20.
(28) Giraud, M. F.; Leonard, G. A.; Field, R. A.; Berlind, C.; Naismith, J. H. Nat. Struct. Biol. 2000, 7, 398-402.
(29) Sofia, H. J.; Chen, G.; Hetzler, B. G.; Reyes-Spindola, J. F.; Miller, N. E. Nucleic Acids Res. 2001, 29, 1097-1106.
(30) Selmer, T.; Kahnt, J.; Goubeaud, M.; Shima, S.; Grabarse, W.; Ermler, U.; Thauer, R. K. J. Biol. Chem. 2000, 275, 3755-3760.
(31) Harrison, P. M.; Arosio, P. Biochim. Biophys. Acta 1996, 1275, 161-203.
(32) Gomes, C. M.; Kletzin, A.; Teixeira, M. J. Biol. Inorg. Chem. 2002, 7, 483-489.
(33) Simianu, M.; Murakami, E.; Brewer, J. M.; Ragsdale, S. W. Biochemistry 1998, 37, 10027-10039.
(34) Scott, J. W.; Hawley, S. A.; Green, K. A.; Anis, M.; Stewart, G.; Scullion, G. A.; Norman, D. G.; Hardie, D. G. J. Clin. Invest. 2004, 113, 274-284.
(35) Schulz, A.; Schumann, W. J. Bacteriol. 1996, 178, 1088-1093.
(36) Mitcham, J. L.; Parnet, P.; Bonnert, T. P.; Garka, K. E.; Gerhart, M. J.; Slack, J. L.; Gayle, M. A.; Dower, S. K.; Sims, J. E. J. Biol. Chem. 1996, 271, 5777-5783.
(37) McGettrick, A. F.; O’Neill, L. A. Mol. Immunol. 2004, 41, 577-582.
(38) Turner, J. D. FEMS Immunol. Med. Microbiol. 2003, 37, 13-21.
(39) Koonin, E. V.; Aravind, L. Cell Death Differ. 2002, 9, 394-404.
(40) Pandit, S. B.; Srinivasan, N. Proteins 2003, 52, 585-597.
(41) Anantharaman, V.; Aravind, L. Genome Biol. 2002, 3, 0061.1-0061.9.
(42) Zhao, K. H.; Deng, M. G.; Zheng, M.; Zhou, M.; Parbel, A.; Storf, M.; Meyer, M.; Strohmann, B.; Scheer, H. FEBS Lett. 2000, 469, 9-13.
(43) Andrade, M. A.; Petosa, C.; O’Donoghue, S. I.; Muller, C. W.; Bork, P. J. Mol. Biol. 2001, 309, 1-18.
(44) Skovgaard, M.; Jensen, L. J.; Brunak, S.; Ussery, D.; Krogh, A. Trends Genet. 2001, 17, 425-428.
(45) Siew, N.; Fischer, D. J. Mol. Biol. 2004, 342, 369-373.
(46) Wu, C. H.; Huang, H.; Yeh, L. S.; Barker, W. C. Comput. Biol. Chem. 2003, 27, 37-47.
