Genome-Wide Proteomics of Natronomonas pharaonis

Kosta Konstantinidis, Andreas Tebbe, Christian Klein, Beatrix Scheffer, Michalis Aivaliotis, Birgit Bisle, Michaela Falb, Friedhelm Pfeiffer, Frank Siedler, and Dieter Oesterhelt*

Department of Membrane Biochemistry, Max Planck Institute of Biochemistry, Am Klopferspitz 18, 82152 Martinsried, Germany

Received July 17, 2006

The aerobic, haloalkaliphilic archaeon Natronomonas pharaonis is able to survive in salt-saturated lakes of pH 11. According to genome analysis, the theoretical proteome consists of 2843 proteins. To reach further conclusions about its cellular physiology, the cytosolic protein inventory of Nmn. pharaonis has been analyzed using MS/MS on an ESI-Q-TOF mass spectrometer coupled on-line with a nanoLC system. The efficiency of this shotgun approach is illustrated by the identification of 929 proteins of which 886 are soluble proteins representing 41% of the cytosolic proteome. Cell lysis under denaturing conditions in water with subsequent separation by SDS-PAGE prior to nanoLC-MS/MS resulted in identification of 700 proteins. The same number (but a different subset) of proteins was identified upon cell lysis under native conditions followed by size fractionation (retaining protein complexes) prior to SDS-PAGE. Additional size fractionation reduced sample complexity and increased identification reliability. The set of identified proteins covers about 60% of the cytosolic proteins involved in metabolism and genetic information processing. Many of the identified proteins illustrate the high genetic variability among the halophilic archaea.

Keywords: Natronomonas pharaonis • halophilic • alkaliphilic • archaea • proteomics • mass spectrometry • nanoLC-MS/MS • size exclusion chromatography • codon adaptation index • overprinting

Introduction

Strains of the extreme haloalkaliphilic archaeon Natronomonas pharaonis were isolated originally from saline soda lakes in Egypt1 and Kenya.2 Growth occurs in media containing NaCl at concentrations higher than 2 M. The cells thrive optimally in 3.5 M NaCl at a pH of 8.5-9.5 (surviving up to pH 11) and have a very low magnesium tolerance.3 Nmn. pharaonis belongs to the halophilic branch of the euryarchaeota, which form a homogeneous phylogenetic group according to rRNA analysis. The close phylogenetic relationship between the haloarchaea is further supported by the large number of highly homologous proteins. The genome of Nmn. pharaonis consists of a 2.6-Mb GC-rich chromosome and two plasmids of 131 kb and 23 kb.4 From genome analysis, it has been concluded that the metabolism of halophilic archaea is highly flexible4 and reflects an adaptation to the extreme habitats that share the characteristic of being hypersaline but may include additional environmental threats like high alkalinity as in the case of Natronomonas. This flexibility can originate from either (a) species-specific gene loss or (b) genetic exchange with other organisms that live (or more likely died) in the hypersaline environments like halophilic bacteria and algae. Genetic exchange may originate from DNA uptake or may involve gene transfer via phages or mobile genetic elements affecting also larger genome regions. Consistent with this, most halophiles have megaplasmids (or small chromosomes) and genome regions that have a GC content deviating considerably from the chromosome average.

The theoretical proteome of Nmn. pharaonis consists of 2843 protein-coding genes according to detailed evaluation of automatic gene prediction data,4 which has confirmed that gene prediction in GC-rich genomes is error-prone.5 Due to the statistical bias of GC-rich genomes, which results in a scarcity of stop codons, the number of reading frames that are open for considerable length greatly exceeds the number of reading frames representing protein-coding genes. This results in a large number of so-called spurious ORFs, most of which are longer than 100 codons in Natronomonas but are assumed to not code for real proteins. Most sequenced genomes contain a large number of organism-specific genes with neither sequence homology nor assigned function and it may be questioned if they are real genes. Experimental validation and confirmation can be obtained by proteomic data, which increase the reliability of the theoretical proteome. Such a reliable proteome is also required for genome-wide analyses, e.g., metabolic reconstruction. Natronomonas, which shows a high nutritional self-sufficiency and thus grows in a very simple synthetic medium,4 is well suited for metabolic analysis using labeling experiments (e.g., growth on labeled acetate). Corresponding studies are severely hampered for organisms like Halobacterium, which requires a rather complex set of nutrients for growth.

* To whom correspondence should be addressed. Tel: +49 8985782386; Fax: +49 8985783557; E-mail: [email protected].

10.1021/pr060352q CCC: $37.00 © 2007 American Chemical Society

Journal of Proteome Research 2007, 6, 185-193


Published on Web 11/15/2006

research articles


Classically, large-scale proteomic data are obtained by two-dimensional gel electrophoresis (2-DE) with subsequent mass spectrometric identification.6,7 Whereas peptide mass fingerprinting (PMF) analysis requires a rather pure protein sample such as can be obtained from 2-D gel spots, MS/MS permits the identification of proteins from complex mixtures. A limiting factor for protein identification may be the complexity of the sample, and several approaches were developed to solve this problem. One possibility is to fractionate proteins or their peptides prior to MS/MS analysis. Here we describe the establishment of the cytosolic protein inventory of Nmn. pharaonis by nanoLC-MS/MS and compare approaches with and without prefractionation by size-exclusion chromatography before separation on the protein level by SDS-PAGE.

Materials and Methods

Growth Conditions. Natronomonas pharaonis was grown aerobically to late-log phase at 37 °C, as described for Halobacterium salinarum.8 DSMZ medium 371 (http://www.dsmz.de) was modified as proposed:9 due to a sufficient concentration of Ca2+ and Mg2+ in NaCl (“reinst” quality), no MgSO4 and CaSO4 were added to avoid precipitation of calcium and magnesium carbonates. The pH was adjusted to 6.5 with NaOH before autoclaving and, after cooling, to pH 9.0 with sterilized Na2CO3. For protein preparation, Nmn. pharaonis was grown through three successive transfers. For each transfer, 35 mL of fresh medium was inoculated with 1 mL from the previous culture grown to late-log phase (∼40 Klett units).

Sample Preparation under Low Ionic Strength Conditions (Water Lysis). Cells were harvested by centrifugation at 6500 g for 20 min at 4 °C and lysed by exposure to pure water containing a protease inhibitor cocktail (“Complete without EDTA”, Roche, Basel, Switzerland) with subsequent tip-sonication (four exposures of 30 s, Branson Sonifier, cell disruptor B15, Danbury, CT). The cell suspension was ultracentrifuged at 213 000 g for 30 min at 4 °C to remove cell debris and cell envelope fragments.7 Proteins of the supernatant were precipitated by adding a 10-fold excess of acetone (-20 °C). After incubation for 30 min at -20 °C, the precipitated proteins were centrifuged at 20 000 g for 15 min at 4 °C and the supernatant was carefully removed. The pellet was washed with 80% (v/v) acetone (-20 °C) to remove salt, dried, and stored at -80 °C until use. The amount of protein was estimated on the basis of cell mass (determined by volume and turbidity of the suspension) used for protein preparation.10 Optical density of the cell suspension was measured at 578 nm with an Eppendorf spectrophotometer or a Klett-Summerson photoelectric colorimeter with a no. 66 filter.8 One hundred Klett units (Klett MFG. Co. Inc., NY) correspond to an optical density of 1 at 578 nm (Eppendorf spectrophotometer, Netheler and Hinz, Hamburg, Germany) and represent a protein concentration of 0.5 mg/mL. Cytosolic proteins are assumed to account for about 2/3 of the overall protein.

Sample Preparation under High Ionic Strength Conditions (Native Lysis) with Subsequent Size-Exclusion Chromatography. Cells were harvested, resuspended in basal salt (per liter: 200 g of NaCl, 1 g of sodium L-glutamate monohydrate, 1 g of KH2PO4, 1 g of KCl, 1 g of NH4Cl, pH 9.0) also containing a protease inhibitor cocktail and lysed by tip-sonication with subsequent ultracentrifugation as described for water lysis.
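The cell-mass-based protein estimate described for water lysis reduces to a simple proportionality. The following is a minimal sketch using only the conversion factors stated in the text (100 Klett units correspond to an optical density of 1 at 578 nm and to 0.5 mg/mL protein; cytosolic proteins are taken as about 2/3 of the total). The function and constant names are hypothetical, and linear scaling between turbidity and protein concentration is an assumption of this sketch.

```python
# Conversion factors taken from the text; linear scaling is assumed.
KLETT_PER_OD = 100.0        # 100 Klett units correspond to OD578 = 1
MG_PER_ML_PER_OD = 0.5      # OD578 = 1 represents 0.5 mg/mL protein
CYTOSOLIC_FRACTION = 2.0 / 3.0  # cytosolic share of overall protein


def estimate_protein_mg(klett_units, culture_volume_ml):
    """Return (total_mg, cytosolic_mg) estimated from turbidity and volume."""
    od578 = klett_units / KLETT_PER_OD
    total_mg = od578 * MG_PER_ML_PER_OD * culture_volume_ml
    cytosolic_mg = total_mg * CYTOSOLIC_FRACTION
    return total_mg, cytosolic_mg


if __name__ == "__main__":
    # Example: one 35 mL culture harvested at about 40 Klett units.
    total, cytosolic = estimate_protein_mg(40, 35)
    print(f"total ~{total:.1f} mg, cytosolic ~{cytosolic:.1f} mg")
```

For a 35 mL culture at about 40 Klett units, this gives roughly 7 mg of total protein, of which about 4.7 mg would be cytosolic under the 2/3 assumption.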


Figure 1. SDS-PAGE of the protein preparation of Nmn. pharaonis after lysis in water. After separation, proteins were stained by Coomassie and the lane was cut into 22 slices.

The concentration of native proteins was measured with a BCA (bicinchoninic acid) protein assay kit (Pierce, Rockford, IL), and 500 µg were used for size exclusion chromatography (SEC). A precision column PC 3.2/30 prepacked with Superose 6 (GE Healthcare Bio-Sciences AB, Uppsala, Sweden) was used in a SMART system (GE Healthcare Bio-Sciences AB, Uppsala, Sweden) for prefractionation. The optimal range for separation of globular proteins in this column is 5-5000 kDa, with an exclusion limit of 40 000 kDa. The column was calibrated with blue dextran (2000 kDa), ferritin (444 kDa), albumin (67 kDa), ribonuclease A (13.7 kDa), thyroglobulin (669 kDa), catalase (232 kDa), and chymotrypsinogen (25 kDa) (LMW and HMW calibration kits, GE Healthcare Bio-Sciences AB, Uppsala, Sweden, applied in basal salt). Before a chromatographic run, the column was first equilibrated with degassed and filtered distilled H2O and then with degassed and filtered basal salt. Cytoplasmic lysate (500 µg, 10 µL) was injected into the SMART system, the protein profile was monitored at 260 and 280 nm with a column flow rate of 40 µL/min at 4 °C, and eight fractions (I-VIII) were collected (Figure 2). The fractions were precipitated by adding 100% (w/v) TCA to a final concentration of 20% (45 min incubation, 15 min centrifugation at 20 000 g, 4 °C) and washed with 80% (v/v) acetone (-20 °C).

Electrophoresis and In-Gel Digestion. Approximately 250 µg protein from the water-lysed sample (Figure 1) as well as the eight collected SEC-fractions from one SEC run (Figure 3) were applied after resuspension in Laemmli sample buffer and denaturation for 5 min at 95 °C onto a 3% T stacking/12.5% T separation gel (11 cm long and 1.5 mm thick) for SDS-PAGE according to Laemmli.11 On each gel, 5 µL of a protein marker (precision plus protein standards, Bio-Rad, Hercules, CA) was also applied, and electrophoresis was carried out at a maximum of 40 mA/gel and 300 V limit setting. Protein staining was done with CBB [0.1% (w/v) CBB-R250 dissolved in 45/45/10 methanol/water/acetic acid (v/v/v)] and destained in 12.5/77.5/10 i-propanol/water/acetic acid (v/v/v). The lane from the water-lysed sample (Figure 1) and the lanes from the fractions II-VII (Figure 3) of the native sample were cut into slices and subsequently minced to small pieces before being transferred into reaction vessels. Excised slices of


Figure 2. SEC on a Superose 6 column after native lysis of Nmn. pharaonis in 3.4 M NaCl. A total of eight fractions (I-VIII) were collected as indicated. Elution of proteins was monitored at 260 nm (solid line) and 280 nm (dashed line).

Figure 3. SDS-PAGE of the protein preparation of Nmn. pharaonis after native lysis in 3.4 M NaCl. Proteins, prefractionated by SEC to eight fractions (I-VIII), were applied to a SDS-PAGE, and the resulting lanes were cut into a total of 93 slices. The estimated molecular mass ranges (in kDa) of the single fractions deduced from the calibration curve are shown above the gel.
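The molecular mass ranges in Figure 3 were deduced from a calibration curve. The text does not give the elution volumes or the fitting procedure, so the following is only a sketch of one common way to build such a curve: a least-squares line relating log10(molecular mass) to elution volume. The function names are hypothetical, the log-linear model is an assumption, and blue dextran is left out on the assumption that it serves as a void-volume marker rather than as a point on the linear range.

```python
import math

# Calibration standards named in the Methods (masses in kDa).
# Elution volumes are NOT given in the text and must be supplied by the user.
STANDARDS_KDA = {
    "thyroglobulin": 669.0,
    "ferritin": 444.0,
    "catalase": 232.0,
    "albumin": 67.0,
    "chymotrypsinogen": 25.0,
    "ribonuclease A": 13.7,
}


def fit_calibration(points):
    """Fit log10(mass_kDa) = slope * volume + intercept by least squares.

    points: iterable of (elution_volume_ml, mass_kda) pairs.
    Returns (slope, intercept).
    """
    pts = list(points)
    n = len(pts)
    if n < 2:
        raise ValueError("at least two calibration points are required")
    xs = [v for v, _ in pts]
    ys = [math.log10(m) for _, m in pts]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mass_at(volume_ml, slope, intercept):
    """Estimate the native molecular mass (kDa) eluting at a given volume."""
    return 10 ** (slope * volume_ml + intercept)
```

With measured elution volumes for the standards, `fit_calibration` yields the line, and `mass_at` applied to the start and end volume of each fraction gives an estimated mass range of the kind shown above the gel.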

the CBB stained gels were destained by adding 50% (v/v) acetonitrile and 50 mM NH4HCO3 for 15 min alternately until complete destaining. Reduction/alkylation12 and trypsin digestion/elution7 of the samples were performed according to published procedures. The eluates were frozen in liquid nitrogen and dried in a vacuum centrifuge.

NanoLC-MS/MS. For nanoLC-MS/MS, our standard protocol13 (based on the work of Gevaert et al.14) was used with minor modifications. In brief, dried protein digests were redissolved in 40 µL 5% (v/v) formic acid of which 10 µL were desalted on a self-prepared micro RP column.15 After elution with 10 µL 80/15/5 methanol/water/formic acid (v/v/v), peptide solutions were dried in a vacuum centrifuge, redissolved in 10 µL 5% (v/v) formic acid followed by sonication for 10 s, and subjected to nanoflow LC-MS/MS analysis. The column setup was essentially as described by Meiring et al.16 RP-C18 material was used for both the trapping and nanoscale analytical column. For chromatography, a 90 min gradient from solvent A (0.5% v/v formic acid in 2/98 ACN/water) to solvent B (0.5% v/v formic acid in 80/20 ACN/water) was applied. Throughout the analysis, 1.5-s MS acquisitions were followed by 6-s MS/MS experiments in information-dependent acquisition mode.

Protein Identification by MS/MS. After nanoLC-MS/MS, the obtained CID-spectra were converted to the pkl format accepted by Mascot (Matrix Science, London, UK)17 using ProteinLynx software. These peaklists were used for protein identification in a database of 11 874 potential ORFs from Nmn. pharaonis, consisting of the theoretical proteome of 2843 proteins as well as 9031 spurious ORFs, of which more than 99% are longer than 100 codons. Search parameters were used as previously described,13 which involves application of high stringency to reduce false positive identification. Even for “normal” identification, proteins are considered identified only if the summed peptide scores exceed the confidence threshold score for a 95% confidence level by at least 20, which is equivalent to application of a 99.95% confidence level. For “reliable” identification, the score must be at least 40 above the confidence threshold score, equivalent to a 99.9995% confidence level.

Bioinformatic Computations. Transmembrane helices were predicted by TMHMM18 and signal sequences by SIGNALP3.0.19 N-terminal lipid anchors (as found for halocyanin20) were characterized by a modified lipobox motif (LAGC), and protein secretion signals were predicted as described.4 Proteins are considered membrane-associated when they either have a lipid anchor or are annotated as membrane complex subunits lacking transmembrane domains. Proteins are considered extracellular when they have a secretion signal but are predicted to contain neither transmembrane domains nor a lipid anchor. The GRAVY index (grand average of hydrophobicity) was computed based on the empirical values of Kyte and Doolittle.21 The codon adaptation index was computed according to Carbone et al.22 Other statistical data were computed in the framework of HaloLex (www.halolex.mpg.de). Proteins from Natronomonas pharaonis were compared to those from other species using blast.23 Homology searches were made via HaloLex against three other completely sequenced halophiles: Halobacterium salinarum (strain R1 (F. Pfeiffer et al., manuscript in preparation; www.halolex.mpg.de), which is very closely related to strain NRC-124), Haloarcula marismortui,25 and Haloquadratum walsbyi.26 Homology searches against proteins from completely sequenced microbial genomes and public protein sequence databases were performed in the MiGenAS computing environment.27
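The confidence levels quoted under Protein Identification by MS/MS (99.95% and 99.9995%) can be reproduced with a short calculation. The sketch below assumes the usual Mascot convention that a score is -10·log10 of the probability of a random match, so that every 10 score units above the 95% threshold reduce the error probability by a factor of 10. The function name is hypothetical and the snippet is not part of the published search protocol.

```python
def confidence_level(score_margin, base_alpha=0.05):
    """Confidence level implied by exceeding the threshold score by score_margin.

    Assumes score = -10 * log10(p), so a margin of 10 divides the error
    probability by 10. base_alpha = 0.05 corresponds to the 95% threshold.
    """
    alpha = base_alpha * 10 ** (-score_margin / 10.0)
    return 1.0 - alpha


if __name__ == "__main__":
    for label, margin in (("threshold", 0), ("normal", 20), ("reliable", 40)):
        print(f"{label}: {100 * confidence_level(margin):.4f}%")
```

A margin of 20 gives an error probability of 0.05 × 10^-2 = 0.0005 (99.95%), and a margin of 40 gives 0.05 × 10^-4 = 0.000005 (99.9995%), in agreement with the values stated in the text.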

Results and Discussion

Preparation of Proteins. Two different sample preparations have been performed to achieve maximal identification of soluble proteins and protein complexes. In the first approach, cells were disrupted by osmotic shock, replacing the high molar salt concentration of the growth medium with pure water followed by sonication (water lysis). Proteins were collected by acetone precipitation prior to SDS-PAGE (Figure 1) and MS/MS analysis. The benefit of this method is that all membranes pellet upon ultracentrifugation due to the lower density of water (ρ = 1.0 g/cm3) in comparison to that of membranes (ρ = 1.05-1.18 g/cm3),13 ensuring a pure cytosolic solution in the supernatant. The drawback is the loss of some soluble proteins, which may become hydrophobic due to denaturation in H2O, stick to membranes, and pellet upon ultracentrifugation.
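The density argument above can be made explicit with a small comparison. This is a simplified sketch: it only checks whether a particle is denser than the surrounding medium, which is a necessary condition for pelleting, and ignores particle size, centrifugal force, and run time. The names are hypothetical; the density of water and of membranes is taken from this paragraph, and the value for basal salt (about 1.13 g/cm3) is the one reported later in this section for native lysis.

```python
# Densities in g/cm^3, as quoted in the text.
MEMBRANE_DENSITY_RANGE = (1.05, 1.18)
WATER_DENSITY = 1.0
BASAL_SALT_DENSITY = 1.13  # approximate value given for native lysis


def pelleting_summary(medium_density, density_range=MEMBRANE_DENSITY_RANGE):
    """Classify whether membranes in the given density range can sediment.

    Returns "all" if every membrane is denser than the medium,
    "none" if no membrane is denser, and "partial" otherwise.
    """
    low, high = density_range
    if low > medium_density:
        return "all"
    if high <= medium_density:
        return "none"
    return "partial"


if __name__ == "__main__":
    print("water lysis:", pelleting_summary(WATER_DENSITY))
    print("native lysis:", pelleting_summary(BASAL_SALT_DENSITY))
```

In water, the whole membrane density range lies above the medium density, whereas in basal salt only the denser part of the range does, which is consistent with the presence of membrane proteins in the native sample.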


Table 1. Number of Proteins Identified in Nmn. pharaonis^a

                        both approaches   water lysis   native lysis
total proteins                929             698           705
total soluble                 886             695           662
extracellular                  10               2             9
hydrophobic                    43              34            34
membrane-associated            40               8            38
transmembrane                  43               3            43

a Data are given for the two individual experiments as well as the overall data (column “both approaches”). Proteins that lack a predicted transmembrane domain are classified as soluble, more than 90% of these being cytosolic. Only a few of the identified proteins are probably extracellular as they carry an export signal.4 Membrane-associated proteins having an N-terminal lipid anchor4 were identified mainly upon native lysis. About 5% of the remaining identified soluble proteins are hydrophobic as indicated by a positive GRAVY index.
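The hydrophobicity criterion used in the table footnote (a positive GRAVY index) can be sketched as follows. The GRAVY index is the sum of the Kyte-Doolittle hydropathy values of all residues divided by the sequence length. The function names are hypothetical, and the snippet only handles the 20 standard amino acids; it is an illustration, not the HaloLex implementation.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def is_hydrophobic(sequence):
    """Apply the criterion from the footnote: GRAVY index above zero."""
    return gravy(sequence) > 0
```

For example, `gravy("LAGC")` averages 3.8, 1.8, -0.4, and 2.5, which is positive, so such a stretch would count as hydrophobic under this criterion.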

Figure 4. Classification of cytosolic proteins into 7 function superclasses. Columns represent identified cytosolic proteins (light gray) and all non-transmembrane domain proteins from the theoretical proteome (dark gray) (total number, left axis). The percentage of identified proteins for each superclass is also shown (curve, right axis). Abbreviations for superclasses: MET, metabolism; GIP, genetic information processing; TP_CP, transport and cellular processes; ENV, environmental information processing; MIS, miscellaneous; CHY, conserved hypothetical protein; HY, hypothetical protein.

In the second approach, disruption has been made just by sonication retaining high salt molarity (native lysis) to allow prefractionation of native protein complexes by SEC (Figure 2). Fractionated proteins were collected by TCA precipitation and subjected to SDS-PAGE (Figure 3) and MS/MS analysis. Here, small membrane vesicles do not pellet by ultracentrifugation due to the high density of the basal salt solution (ρ ≈ 1.13 g/cm3). Accordingly, a number of integral membrane proteins were identified in the native sample. Protein Inventory of the Cytosol. The combination of 1-D SDS-PAGE with nanoLC-MS/MS resulted in the identification of 929 proteins from Nmn. pharaonis (Table 1). The vast majority of these, 886, are soluble proteins representing 41% of the theoretical soluble proteome. This is a conservative estimate as we applied stringent search parameters (requesting a 99.95% confidence level). The remaining 43 proteins with predicted transmembrane domains make up less than 5% of the identification and were identified primarily after native lysis. The complete list of identified proteins is provided in Supplemental Table 2, Supporting Information. The classification of the cytosolic proteins by function is shown in Figure 4. More than 60% of the proteins involved in metabolism (MET) could be identified, thus covering most of the intermediary metabolism. A similar high identification level


Figure 5. Classification of cytosolic proteins by codon adaptation index (CAI). Columns represent identified cytosolic proteins (light gray) and all non-transmembrane domain proteins from the theoretical proteome (dark gray) (total number, left axis). The percentage of identified proteins increases as the CAI increases (curve, right axis).

was reached for proteins involved in genetic information processing (GIP) and for soluble subunits of transport complexes (TP_CP). In contrast, only 15% of the hypothetical proteins and 25% of the conserved hypothetical proteins could be identified. This may indicate persisting errors in gene assignment. Alternatively, these function classes may contain proteins with low expression levels thus having an increased tendency to escape proteomic identification. The lower identification rates for proteins involved in environmental information processing (ENV, 46%) could be due to their dependence on environmental conditions, and thus, a subset of the proteins may not be expressed under the applied growth conditions. This functional superclass is also expected to host proteins of low abundance like proteins involved in signal transduction. The improved identification rate of proteins from Nmn. pharaonis with a higher codon adaptation index (CAI), which reflects nonrandom usage of synonymous codons,28 is displayed in Figure 5. Proteins with a CAI higher than 0.8 could be identified to 60%, whereas only about 10% of those with a CAI lower than 0.7 were found. The median CAI for the theoretical proteome of Nmn. pharaonis is 0.736. Thus, a higher codon adaptation index indicates higher protein abundance which results in preferential identification. A similar correlation has been reported for several species.28 Proteins involved in genetic information processing and metabolism for which a high proportion has been identified, are likely to be abundant. Correspondingly, they have a high average CAI (above 0.76) that exceeds that of the proteins in the other function superclasses with lower identification rates (Supplemental Figure 9, Supporting Information). Prefractionation by SEC and Identification of Protein Complexes. Proteins were fractionated by size using a Superose 6 column and eight fractions (I-VIII) were collected (Figure 2). 
The eight SEC fractions were analyzed by SDS-PAGE, which confirmed that fractions I and VIII are devoid of protein. Fraction VII contains very few proteins as evident from the SDS-PAGE, consistent with the spectroscopic characteristics (comparison of the optical densities at 260 and 280 nm according to Warburg and Christian29). As expected, earlier fractions contain larger proteins than later fractions (Figure 3) but there is a large overlap in protein sizes as SEC is performed under native and SDS-PAGE under denaturing conditions.


Figure 6. Protein identifications for individual SEC fractions after native lysis (II-VII) and total identification for native lysis (dark gray), water lysis (light gray), as well as for both approaches (black).

Figure 7. Distribution of identified proteins in different SEC fractions. (Left) Majority of the proteins (88%, 618 of 705) are found in only 1 fraction or 2 adjacent fractions. (Right) Proteins identified in only one fraction (59%, 415 of 705) are quite evenly distributed over fractions II-VI.

Protein identification by nanoLC-MS/MS revealed a similar number of proteins throughout the SEC-fractions II-VI (Figure 6). The number of proteins identified for each fraction (about 250) is much lower than without prefractionation (about 700). This shows that protein identification is not limited by the capacity of the mass spectrometer used. As expected, only few proteins could be identified in fraction VII, which has a low protein content. Figure 7 reveals that most of the proteins occur in only a single SEC fraction or in two adjacent fractions (618 proteins, 88%). Fractions II and III probably contain residual membrane patches as they contained all except 6 integral and membrane-associated proteins. Protein complexes are expected to stay at least partially intact under native SEC. Several subunits from large protein complexes have been identified in our study: (a) RNA polymerase (subunits A′, A′′, B′, B′′, D, and N), (b) the thermosome (subunits 1, 2, and 3), and (c) the proteasome (α- and β-subunits as well as the proteasome-activating nucleotidases 1 and 2). We concentrated on fraction IV as fractions II and III seem to contain membrane patches and thus protein complex formation, co-localization in a membrane patch and unspecific membrane

attachment cannot be distinguished. According to our calibration, fraction IV corresponds to 300-630 kDa (Figure 3), although the identified subunits range in size from 7.3 to 118 kDa and thus are too small to be expected in fraction IV unless they are part of a complex. Five of the six RNA polymerase subunits have been found in fraction IV as well as three of the four proteasome and two of the three thermosome subunits (Supplemental Figure 10, Supporting Information). This strongly indicates that this fraction contains native protein complexes. One goal of the native separation by SEC was to reduce the complexity of the sample applied to LC-MS/MS with the expectation of an increased number of protein identifications. However, the total number of identified proteins in this experiment is similar to the sample without prefractionation (Table 1, Figure 6). Up to 300 proteins were identified for each SEC fraction whereas almost 700 proteins could be identified from a single 1-D gel lane without prefractionation. This proves the high efficiency of the nanoLC-MS/MS approach and excludes instrumental capacity problems. However, about one-third of the proteins from each experiment (223 of 698 after water lysis and 230 of 705 after native lysis, respectively) have not been identified by the other approach. This could be due to proteins randomly escaping detection but could also be caused by differences in cell lysis and sample preparation. For example, integral membrane proteins are nearly exclusively found in the native sample. Although prefractionation did not result in an increased number of identified proteins, it increased the reliability of the identification. Proteins identified in both experiments were compared concerning the MASCOT score and the number of identified peptides per protein. Both average score and sequence coverage were higher with prefractionation.
The MASCOT score increased on average by 32.5 and the number of identified peptides increased by 0.7. This can be attributed to the reduced complexity.

Genetic Flexibility of Halophilic Archaea Illustrated by the Set of Identified Proteins. Proteins from Nmn. pharaonis identified in our study were analyzed with respect to specific biological functions with an emphasis on gene conservation and variability within the branch of halophilic archaea, represented by the completely sequenced Hbt. salinarum (strain R11 (www.halolex.mpg.de), which is very closely related to strain NRC-124), Har. marismortui,25 and Hqr. walsbyi.26 For several gene clusters, a high proportion of the encoded proteins have been identified in our proteomic survey. Examples are gene clusters involved in amino acid and coenzyme biosynthesis which contribute to the nutritional self-sufficiency of Natronomonas. Seven of the nine gene products from an arginine biosynthesis gene cluster were identified. The equivalent clusters occur in Haloarcula and Haloquadratum with highly conserved gene order. Halobacterium contains only two genes involved in the conversion of ornithine to arginine but lacks the genes for conversion of glutamate to ornithine. Rather than synthesizing arginine, Halobacterium uses this amino acid as an energy source, fermenting it to ornithine via the arginine deiminase pathway,30 which does not occur in the other three halophiles. Additional metabolic flexibility with respect to arginine utilization has been described.4 A gene cluster (Figure 8A) with 12 of 15 gene products identified encodes enzymes involved in biotin biosynthesis (NP4230A-NP4236A) and fatty acid degradation (NP4240A-NP4256A), including two subunits of a biotin-containing acyl-CoA carboxylase (NP4250A, NP4252A). A highly homologous gene cluster involved in biotin biosynthesis is present on chromosome I of Haloarcula (the NP4230A homolog at 763649-764260 has not been annotated). Homologs are encoded neither in Halobacterium nor in Haloquadratum but in many bacteria. Most of the genes involved in fatty acid degradation have Haloarcula homologs clustered on chromosome II. Whereas no homologs for the fatty acid degradation enzymes are found in Haloquadratum, Halobacterium has some homologs spread throughout the genome. Besides the biotin-dependent carboxylase of Natronomonas, the cluster contains 3 of 12 acyl-CoA synthase homologs (10 identified), 1 of 13 enoyl-CoA hydratase homologs (7 identified), and 1 of 12 acyl-CoA dehydrogenase homologs (11 identified). Consistent with the high number of expressed fatty acid degradation enzymes, we could show that Natronomonas grows as efficiently on fatty acids as on acetate (L. Koenigsmaier, unpublished observations). All halophiles contain several enzymes that are dependent on cobalamin, and a large gene cluster in Natronomonas (NP1088A-NP1124A, 19 genes) codes for 14 proteins involved in cobalamin biosynthesis (10 of these identified). Genes from this cluster are completely conserved in Halobacterium, Haloarcula, and Haloquadratum. Also conserved are the subunits of cobalamin-dependent methylmalonyl-CoA mutase (A subunits NP1226A, NP2320A identified), which produce the citric acid cycle intermediate succinyl-CoA from methylmalonyl-CoA. The latter is synthesized from propionate and CO2 and thus allows CO2 fixation.31,32 Another enzyme that is likely to be involved in CO2 fixation and has been identified in our proteomic survey is ribulose-bisphosphate carboxylase (NP2770A), which occurs in Natronomonas but not in any of the other halophiles.
The enzyme is highly homologous to that from Methanococcus where it was shown to be catalytically active.33 In archaea, ribulose-1,5-bisphosphate is generated by oxidation of ribose-1,5-bisphosphate33 and the corresponding dehydrogenase is found in all haloarchaea and has been identified (NP5174A). The existence of several potential pathways for CO2 fixation in Natronomonas may be explained by the high concentration of carbonates in its natural environment. Another cobalamin-dependent enzyme is the typical haloarchaeal ribonucleotide reductase, which converts ribonucleotides to 2′-deoxyribonucleotides providing the precursors needed for both synthesis and repair of DNA. This type II enzyme is highly conserved (75% sequence identity) and consists only of an alpha-type chain (NP3346A). However, we identified not only this ribonucleotide reductase but also an additional enzyme consisting of alpha and beta chains (NP6168A and NP6166A, respectively). A second ribonucleotide reductase was also found in Halobacterium (classified as type Ia,34 consisting of alpha and beta chains), but the additional enzymes from Halobacterium and Natronomonas are only very distantly related (25% sequence identity). The other two haloarchaea, Haloarcula and Haloquadratum, do not have an additional ribonucleotide reductase. The second Natronomonas enzyme, which is encoded on plasmid PL131, is closely related to an enzyme from Salinibacter ruber (70% sequence identity), for which substantial lateral gene transfer to and from haloarchaea has been described.35 However, the closest homolog for this enzyme is from the haloarchaeal phage AAJ-2005 with 80% sequence identity. Consequently, we searched for other proteins with high homology to phage proteins and identified a probable restriction/modification protein (NP3112A, 1258 amino acid residues)


with 54% sequence identity to proteins from haloarchaeal phages HF1 and HF2. To identify potential phage remnants, we compared all proteins from these three phages against the theoretical Natronomonas proteome (using blastP) as well as the genome (using tblastN). Only few phage protein homologs were detected in Nmn. pharaonis, and all were more closely related to proteins from other halophiles. This excludes phage remnants, and thus, gene transfer from phages into the Natronomonas genome is unlikely for the proteins with suspiciously high homology to phage proteins. In our proteomic analysis, we identified 34% of the chromosomal proteins, 12% of the proteins from PL131, but only 1 of the 36 proteins encoded by PL23.4 However, we identified 4 of the 11 proteins encoded on the 13 kb GC-low region II, which is inserted into the chromosomal copy of PL23. One of these (NP3254A) belongs to a small gene family (10 genes) that is distantly related to a protein from Halobacterium phage phiH proposed to function as repressor.36 The other 3 belong to a long transcription unit (7 genes, NP3262A to NP3274A), which does not occur in other halophiles. However, a highly conserved gene cluster with nearly identical gene order exists in the archaeon Methanospirillum hungatei. More distant homologs, also clustered but with different gene order, occur in Anabaena variabilis. GC-low region III includes a number of remarkably long proteins, of which five (958-1999 amino acid residues long) have been identified in our proteomic analysis. Three of these proteins (helicase NP3872A, restriction/modification enzyme NP3874A, and helicase NP3876A) are likely to be translated as a transcription unit. Homologous genes with 30% sequence identity are clustered (but with altered gene order) in several bacteria, including Magnetococcus and Deinococcus. 
A highly homologous transcription unit with 60-80% sequence identity at the protein level and conserved gene order is found in Haloquadratum whereas Haloarcula and Halobacterium have no or only distant homologs. In Natronomonas, this transcription unit is followed by an ADP-ribosylglycohydrolase homolog (NP3878A) that occurs in many bacteria but not in halophiles. The largest protein in this region (NP3886A, 1999 amino acid residues) is specific for Natronomonas as no database homologs could be identified. In conclusion, relatively high proteomic identification ratios have been found for several GC-low regions. This contrasts with results from Halobacterium where a reduction in GC content correlates with decreased proteomic identification rates.7 The extreme diversity of metabolic enzymes among halophiles and the occasional extraordinary sequence conservation to proteins from unrelated species is a drastic example of the well-known fact that prokaryotic genomes represent a melting pot for DNA of diverse origin37 to an extent that is hard to reconcile with the classical view of vertical inheritance within independent species. Largely Overlapping Alternative Open Reading Frames Are Not Used in the GC-Rich Haloarchaeal Genomes. Commonly, neighboring genes in bacteria and archaea overlap for a few bases, allowing translational coupling.4,38,39 Such overlapping genes should be distinguished from the usage of alternative and largely overlapping open reading frames, by which more than one protein is produced from the same genomic sequence stretch. This phenomenon has been termed “overprinting”40 and is well-known for viruses and phages but has been observed in bacteria only in exceptional cases. We searched for protein identification from alternate overlapping reading frames in our


Figure 8. Gene cluster involved in biotin biosynthesis and fatty acid degradation. (A) Only protein-coding genes are shown (filled arrows). (B) All open reading frames are shown, including those that result in spurious ORFs (open arrows). The genome positions (in kbp) are indicated. Arrows pointing right indicate ORFs on the forward strand, those pointing left indicate ORFs on the reverse strand. The colors indicate the level of proteomic identification (based on data from this study and additional data from our group). Green indicates reliable identification (99.9995% confidence level), blue indicates normal identification (99.95% confidence level), gray and yellow indicate not identified.

proteomic data set to find evidence for the possible occurrence of overprinting in archaea. Figure 8 displays the gene cluster involved in biotin biosynthesis and fatty acid degradation (see above). In panel A, only genes considered to code for real proteins are displayed. Panel B also contains the alternative open reading frames, which show large-scale overlaps with the protein-coding genes but which we assume not to code for proteins (spurious ORFs).4 In addition to the 15 protein-coding genes, of which 13 have been identified (12 in this study), there are 61 spurious ORFs. Such a high number of spurious ORFs is seen throughout the genome. On average, 3.2 spurious ORFs are found for each protein-coding gene (Supplemental Table 3, Supporting Information). In Natronomonas, the average protein length is 290 amino acid residues (86% of the proteins are longer than 100 codons) whereas the average length of the spurious ORFs is 173 amino acid residues (99% of the spurious ORFs are longer than 100 codons). In all of our proteomic experiments on Natronomonas (this report and B. Bisle et al., manuscript in preparation), 1226 proteins were identified under high stringency (“normal” identification, 99.95% confidence), representing 42.9% of the theoretical proteome. Of these, 84% were “reliably” identified (99.9995% confidence). Only 3 spurious ORFs are classified as identified, none of these “reliably”. These 3 represent 0.03% of the set of spurious ORFs, which is consistent with the expected false positive rate of 0.05% at the applied 99.95% confidence level for “normal” identification. All three high-scoring spurious ORFs were rated to be false positives upon manual inspection of the MS/MS spectra. Thus, we conclude that not a single case of overprinting could be identified despite the very high incidence of potential alternative and largely overlapping open reading frames in the genome of Natronomonas. 
Reducing the confidence level to 99.5% would increase protein identifications by 114 but would also severely increase doubtful identifications (29 spurious ORFs, representing 0.3% of the spurious ORF set).
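
The false-positive argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' procedure: the size of the spurious ORF set is not stated in the text (it is given in Supplemental Table 3), so it is approximated here, as an assumption, by multiplying the 2843 predicted proteins by the average of 3.2 spurious ORFs per protein-coding gene. All function names are hypothetical.

```python
"""Worked arithmetic for the spurious-ORF false-positive check.

Figures are taken from the text; the set size is an ASSUMED estimate.
"""

THEORETICAL_PROTEOME = 2843  # predicted proteins, from the genome analysis
SPURIOUS_PER_GENE = 3.2      # average spurious ORFs per protein-coding gene


def estimate_spurious_set(genes: int, per_gene: float) -> int:
    """Assumption: set size is roughly average-per-gene times gene count."""
    return round(genes * per_gene)


def observed_percent(hits: int, total: int) -> float:
    """Share of the spurious ORF set that was scored as identified."""
    return 100.0 * hits / total


def expected_false_positives(total: int, confidence_percent: float) -> float:
    """Number of false hits expected if every entry is a true negative."""
    return total * (100.0 - confidence_percent) / 100.0


if __name__ == "__main__":
    total = estimate_spurious_set(THEORETICAL_PROTEOME, SPURIOUS_PER_GENE)
    # (confidence level in percent, spurious ORFs identified at that level)
    for confidence, hits in ((99.95, 3), (99.5, 29)):
        print(
            f"{confidence}%: observed {observed_percent(hits, total):.3f}% "
            f"({hits} hits), expected about "
            f"{expected_false_positives(total, confidence):.1f} false hits"
        )
```

Under this assumed set size, the observed counts (3 and 29) stay below the number of false hits expected by chance at the respective confidence level, which matches the reading given in the text that the identified spurious ORFs are false positives rather than cases of overprinting.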

The same conclusion is reached for Hbt. salinarum where 66.0% of the theoretical proteome has been identified (1869 of 2831; F. Siedler et al., unpublished observations; for more details see Supplemental Table 3, Supporting Information). The published proteomic data as well as the Nmn. pharaonis genome are accessible through HaloLex (www.halolex.mpg.de).
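
To make the notion of a spurious ORF used in this section concrete, the sketch below enumerates open reading frames in all six frames of a DNA sequence and flags those that largely overlap an annotated gene without sharing its reading frame. It is an illustration under stated assumptions, not the procedure used for the genome annotation: ORFs are defined here as stop-to-stop stretches, the 100-codon minimum simply mirrors the length statistic quoted above, the 50% overlap cutoff is arbitrary, and all names (`Orf`, `Gene`, `find_orfs`, `spurious_orfs`) are hypothetical.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    start: int   # 0-based, inclusive, forward-strand coordinates
    end: int     # 0-based, exclusive, stop codon not included
    strand: str  # "+" or "-"


class Gene(NamedTuple):
    start: int
    end: int
    strand: str


def find_orfs(seq: str, min_codons: int = 100) -> List[Orf]:
    """Stop-to-stop ORFs of at least min_codons in all six frames (assumed definition)."""
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            open_start = frame
            stretches = []
            for pos in range(frame, n - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    stretches.append((open_start, pos))
                    open_start = pos + 3
            # Trailing stretch that runs to the sequence end without a stop.
            stretches.append((open_start, frame + ((n - frame) // 3) * 3))
            for a, b in stretches:
                if (b - a) // 3 >= min_codons:
                    # Map reverse-strand coordinates back to the forward strand.
                    orfs.append(Orf(a, b, "+") if strand == "+" else Orf(n - b, n - a, "-"))
    return orfs


def overlap(a, b) -> int:
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def shares_frame(orf: Orf, gene: Gene) -> bool:
    """True if the ORF reads the same codons as the gene."""
    if orf.strand != gene.strand:
        return False
    if orf.strand == "+":
        return orf.start % 3 == gene.start % 3
    return orf.end % 3 == gene.end % 3


def spurious_orfs(orfs: List[Orf], genes: List[Gene], min_fraction: float = 0.5) -> List[Orf]:
    """ORFs that largely overlap a gene but carry no annotated gene themselves."""
    flagged = []
    for orf in orfs:
        touching = [g for g in genes if overlap(orf, g) > 0]
        if any(shares_frame(orf, g) for g in touching):
            continue  # this ORF is the reading frame of an annotated gene
        length = orf.end - orf.start
        if any(overlap(orf, g) / length >= min_fraction for g in touching):
            flagged.append(orf)
    return flagged
```

Overprinting would correspond to a flagged ORF that is nevertheless supported by peptide identifications; in the data reported here, no such case survived manual inspection.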

Conclusions. Including membrane proteomic data for Nmn. pharaonis (B. Bisle et al., manuscript in preparation), we have identified a total of 1226 proteins (43% of the theoretical proteome) under high stringency, showing that Nmn. pharaonis has one of the densest coverages with proteome data. The shotgun method with analysis of SDS-PAGE slices by nanoLC-MS/MS turned out to be a fast and effective way of examining the protein inventory of a given cell without the need for further prefractionation, although a high number of additional proteins were identified simply by using a different cell lysis procedure. Because not a single case of overprinting could be identified in the genome of either Nmn. pharaonis or Hbt. salinarum, we deduce that large-scale gene overlaps do not occur in the genomes of halophiles. Although this does not exclude that such cases will be detected in the future, they are likely to occur only very rarely. Many of the identified proteins exemplify the high variability among halophiles on a background of highly conserved and common proteins. This is attributed to the high variability of the environments that share the high salt concentration but may contain additional threats like high pH values. The necessity to cope with several extreme conditions is resolved by genetic exchange between organisms that are evolutionarily nearly unrelated. This genetic exchange may affect single genes or larger genome regions, and phages or plasmids may participate in the transfer of genetic material. In addition to the large number of functionally characterized proteins, a total of 200 conserved hypothetical proteins and Natronomonas-specific hypothetical proteins have been identified by this proteomic survey, proving their existence.

Acknowledgment. We are grateful to Martin Grininger for his introduction to the SMART system for SEC. The bioinformatical assistance of Rutger W. W. Brouwer, Volker Hickmann, Thomas Gillich, and Florian Schoetz is highly appreciated. We thank Markus Rampp for making the bioinformatic infrastructure, developed for the MiGenAS system, available to this project and to HaloLex.

Supporting Information Available: Supporting Figures 9 and 10 as well as supporting Tables 2 and 3. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Soliman, G. S. H.; Trueper, H. G. Halobacterium-pharaonis new species a new extremely halo alkaliphilic archaebacterium with low magnesium requirement. Zentralbl. Mikrobiol. 1982, 3 (2), 318-329.
(2) Tindall, B. J.; Ross, H. N. M.; Grant, W. D. Syst. Appl. Microbiol. 1984, (5), 41-57.
(3) Staley, J. T.; Bryant, M. P.; Pfennig, N.; Holt, J. G. Bergey’s Manual of Systematic Bacteriology. In Archaeobacteria; Koenig, H., Stetter, K. O., Eds.; Williams & Wilkins: Baltimore, MD, 1989; Vol. 3, pp 2230-2232.
(4) Falb, M.; Pfeiffer, F.; Palm, P.; Rodewald, K.; Hickmann, V.; Tittor, J.; Oesterhelt, D. Living with two extremes: conclusions from the genome sequence of Natronomonas pharaonis. Genome Res. 2005, 15 (10), 1336-1343.
(5) McHardy, A. C.; Goesmann, A.; Puhler, A.; Meyer, F. Development of joint application strategies for two microbial gene finders. Bioinformatics 2004, 20 (10), 1622-1631.
(6) Hecker, M. A proteomic view of cell physiology of Bacillus subtilis-bringing the genome sequence to life. Adv. Biochem. Eng. Biotechnol. 2003, 83, 57-92.
(7) Tebbe, A.; Klein, C.; Bisle, B.; Siedler, F.; Scheffer, B.; Garcia-Rizo, C.; Wolfertz, J.; Hickmann, V.; Pfeiffer, F.; Oesterhelt, D. Analysis of the cytosolic proteome of Halobacterium salinarum and its implication for genome annotation. Proteomics 2005, 5 (1), 168-179.
(8) Oesterhelt, D.; Krippahl, G. Phototrophic growth of halobacteria and its use for isolation of photosynthetically deficient mutants. Ann. Microbiol. 1983, 134B (1), 137-150.
(9) Balows, A.; Trueper, H. G.; Dworkin, M.; Harder, W.; Schleifer, K.-H. The Prokaryotes. In The Family Halobacteriaceae; Tindall, B. J., Ed.; Springer-Verlag: New York, 1992; Vol. 1, p 784.
(10) Michel, H.; Oesterhelt, D. Light-induced changes of the pH gradient and the membrane potential in H. halobium. FEBS Lett. 1976, 65 (2), 175-178.
(11) Laemmli, U. K. Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature (London) 1970, 227 (5259), 680-685.
(12) Shevchenko, A.; Wilm, M.; Vorm, O.; Mann, M. Mass spectrometric sequencing of proteins silver-stained polyacrylamide gels. Anal. Chem. 1996, 68 (5), 850-858.
(13) Klein, C.; Garcia-Rizo, C.; Bisle, B.; Scheffer, B.; Zischka, H.; Pfeiffer, F.; Siedler, F.; Oesterhelt, D. The membrane proteome of Halobacterium salinarum. Proteomics 2005, 5 (1), 180-197.
(14) Gevaert, K.; Van Damme, J.; Goethals, M.; Thomas, G. R.; Hoorelbeke, B.; Demol, H.; Martens, L.; Puype, M.; Staes, A.; Vandekerckhove, J. Chromatographic isolation of methionine-containing peptides for gel-free proteome analysis: identification of more than 800 Escherichia coli proteins. Mol. Cell. Proteomics 2002, 1 (11), 896-903.
(15) Rappsilber, J.; Ishihama, Y.; Mann, M. Stop and go extraction tips for matrix-assisted laser desorption/ionization, nanoelectrospray, and LC/MS sample pretreatment in proteomics. Anal. Chem. 2003, 75 (3), 663-670.
(16) Meiring, H. D.; van der Heeft, E.; ten Hove, G. J.; de Jong, A. P. J. M. Nanoscale LC-MS(n): Technical design and applications to peptide and protein analysis. J. Sep. Sci. 2002, 25 (9), 557-568.
(17) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551-3567.
(18) Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E. L. L. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 2001, 305 (3), 567-580.
(19) Nielsen, H.; Engelbrecht, J.; Brunak, S.; von Heijne, G. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 1997, 10 (1), 1-6.
(20) Mattar, S.; Scharf, B.; Kent, S. B. H.; Rodewald, K.; Oesterhelt, D.; Engelhard, M. The primary structure of halocyanin, an archaeal blue copper protein, predicts a lipid anchor for membrane fixation. J. Biol. Chem. 1994, 269 (21), 14939-14945.
(21) Kyte, J.; Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 1982, 157 (1), 105-132.
(22) Carbone, A.; Zinovyev, A.; Kepes, F. Codon adaptation index as a measure of dominating codon bias. Bioinformatics 2003, 19 (16), 2005-2015.
(23) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17), 3389-3402.
(24) Ng, W. V.; Kennedy, S. P.; Mahairas, G. G.; Berquist, B.; Pan, M.; Shukla, H. D.; Lasky, S. R.; Baliga, N. S.; Thorsson, V.; Sbrogna, J.; Swartzell, S.; Weir, D.; Hall, J.; Dahl, T. A.; Welti, R.; Goo, Y. A.; Leithauser, B.; Keller, K.; Cruz, R.; Danson, M. J.; Hough, D. W.; Maddocks, D. G.; Jablonski, P. E.; Krebs, M. P.; Angevine, C. M.; Dale, H.; Isenbarger, T. A.; Peck, R. F.; Pohlschroder, M.; Spudich, J. L.; Jung, K. W.; Alam, M.; Freitas, T.; Hou, S.; Daniels, C. J.; Dennis, P. P.; Omer, A. D.; Ebhardt, H.; Lowe, T. M.; Liang, P.; Riley, M.; Hood, L.; DasSarma, S. Genome sequence of Halobacterium species NRC-1. Proc. Natl. Acad. Sci. 2000, 97 (22), 12176-12181.
(25) Baliga, N. S.; Bonneau, R.; Facciotti, M. T.; Pan, M.; Glusman, G.; Deutsch, E. W.; Shannon, P.; Chiu, Y.; Weng, R. S.; Gan, R. R.; Hung, P.; Date, S. V.; Marcotte, E.; Hood, L.; Ng, W. V. Genome sequence of Haloarcula marismortui: a halophilic archaeon from the Dead Sea [erratum appears in Genome Res. 2004, 14 (12), 2510]. Genome Res. 2004, 14 (11), 2221-2234.
(26) Bolhuis, H.; Palm, P.; Wende, A.; Falb, M.; Rampp, M.; Rodriguez-Valera, F.; Pfeiffer, F.; Oesterhelt, D. The genome of the square archaeon “Haloquadratum walsbyi”: life at the limits of water activity. BMC Genomics 2006, in press.
(27) Rampp, M.; Soddemann, T.; Lederer, H. The MIGenAS integrated bioinformatics toolkit for web-based sequence analysis. Nucleic Acids Res. 2006, in press.
(28) Sharp, P. M.; Li, W. H. The codon Adaptation Index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15 (3), 1281-1295.
(29) Warburg, O.; Christian, W. Isolierung und Kristallisation des Gaerungsferments Enolase. Biochem. Z. 1942, 310, 384-421.
(30) Ruepp, A.; Soppa, J. Fermentative arginine degradation in Halobacterium salinarium (formerly Halobacterium halobium): genes, gene products, and transcripts of the arcRACB gene cluster. J. Bacteriol. 1996, 178 (16), 4942-4947.
(31) Danon, A.; Caplan, S. R. CO2 fixation by Halobacterium halobium. FEBS Lett. 1977, 74 (2), 255-258.
(32) Oren, A. Bacteriorhodopsin-mediated CO2 photoassimilation in the Dead Sea. Limnol. Oceanogr. 1983, 28 (1), 33-41.
(33) Finn, M. W.; Tabita, F. R. Modified pathway to synthesize ribulose 1,5-bisphosphate in methanogenic archaea. J. Bacteriol. 2004, 186 (19), 6360-6366.
(34) Borovok, I.; Kreisberg-Zakarin, R.; Yanko, M.; Schreiber, R.; Myslovati, M.; Aslund, F.; Holmgren, A.; Cohen, G.; Aharonowitz, Y. Streptomyces spp. contain class Ia and class II ribonucleotide reductases: expression analysis of the genes in vegetative growth. Microbiol. 2002, 148 (Pt 2), 391-404.
(35) Mongodin, E. F.; Nelson, K. E.; Daugherty, S.; DeBoy, R. T.; Wister, J.; Khouri, H.; Weidman, J.; Walsh, D. A.; Papke, R. T.; Sanchez Perez, G.; Sharma, A. K.; Nesbo, C. L.; MacLeod, D.; Bapteste, E.; Doolittle, W. F.; Charlebois, R. L.; Legault, B.; Rodriguez-Valera, F. The genome of Salinibacter ruber: Convergence and gene exchange among hyperhalophilic bacteria and archaea. Proc. Natl. Acad. Sci. 2005, 102 (50), 18147-18152.
(36) Ken, R.; Hackett, N. R. Halobacterium halobium strains lysogenic for phage phi H contain a protein resembling coliphage repressors. J. Bacteriol. 1991, 173 (3), 955-960.
(37) Doolittle, R. F. Evolutionary aspects of whole-genome biology. Curr. Opin. Struct. Biol. 2005, 15 (3), 248-253.
(38) Rogozin, I. B.; Spiridonov, A. N.; Sorokin, A. V.; Wolf, Y. I.; Jordan, I. K.; Tatusov, R. L.; Koonin, E. V. Purifying and directional selection in overlapping prokaryotic genes. Trends Genet. 2002, 18 (5), 228-232.
(39) Fukuda, Y.; Nakayama, Y.; Tomita, M. On dynamics of overlapping genes in bacterial genomes [erratum appears in Gene 2004, 330, 185]. Gene 2003, 323, 181-187.
(40) Keese, P. K.; Gibbs, A. Origins of genes: “big bang” or continuous creation? Proc. Natl. Acad. Sci. 1992, 89 (20), 9489-9493.

PR060352Q
