Determination and Comparison of the Baseline Proteomes of the Versatile Microbe Rhodopseudomonas palustris under Its Major Metabolic States

Nathan C. VerBerkmoes,*,†,‡ Manesh B. Shah,§ Patricia K. Lankford,§ Dale A. Pelletier,§ Michael B. Strader2,†, David L. Tabb2,§, W. Hayes McDonald,† John W. Barton2,§, Gregory B. Hurst,† Loren Hauser,‡,§ Brian H. Davison,§ J. Thomas Beatty,| Caroline S. Harwood,# F. Robert Tabita,⊥ Robert L. Hettich,† and Frank W. Larimer§

Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z3, Department of Microbiology, University of Washington, Seattle, Washington 98195, and Department of Microbiology, The Ohio State University, Columbus, Ohio 43210-1292

Received September 23, 2005

Rhodopseudomonas palustris is a purple nonsulfur anoxygenic phototrophic bacterium that is ubiquitous in soil and water. R. palustris is metabolically versatile with respect to energy generation and carbon and nitrogen metabolism. We have characterized and compared the baseline proteome of a R. palustris wild-type strain grown under six metabolic conditions. The methodology for proteome analysis involved protein fractionation by centrifugation, subsequent digestion with trypsin, and analysis of peptides by liquid chromatography coupled with tandem mass spectrometry. Using these methods, we identified 1664 proteins out of 4836 predicted proteins with conservative filtering constraints. A total of 107 novel hypothetical proteins and 218 conserved hypothetical proteins were detected. Qualitative analyses revealed over 311 proteins exhibiting marked differences between conditions, many of these being hypothetical or conserved hypothetical proteins showing strong correlations with different metabolic modes. For example, five proteins encoded by genes from a novel operon appeared only after anaerobic growth with no evidence of these proteins in extracts of aerobically grown cells. Proteins known to be associated with specialized growth states such as nitrogen fixation, photoautotrophic, or growth on benzoate, were observed to be up-regulated under those states. Keywords: Rhodopseudomonas palustris • metabolic growth states • liquid-chromatography • mass spectrometry • proteomics

* To whom correspondence should be addressed. Chemical Sciences Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831-6131. Tel: (865) 576-4546. Fax: (865) 576-8559. E-mail: verberkmoesn@ornl.gov.
† Chemical Sciences Division, Oak Ridge National Laboratory.
‡ Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory.
§ Life Sciences Division, Oak Ridge National Laboratory.
| Department of Microbiology and Immunology, University of British Columbia.
# Department of Microbiology, University of Washington.
⊥ Department of Microbiology, The Ohio State University.
2 Current Addresses: Michael B. Strader: NIMH/LNT MSC 1262 Bldg 10, Room 3D42, Bethesda, MD 20892-1262. David L. Tabb: Department of Biomedical Informatics, Vanderbilt University Medical Center, 465 21st Ave. U9211 MRB III, Nashville, TN 37232-8575. John W. Barton: Battelle, 1204 Technology Drive, Aberdeen, MD 21001.
10.1021/pr0503230 CCC: $33.50 © 2006 American Chemical Society
Journal of Proteome Research 2006, 5, 287-298. Published on Web 01/06/2006

Introduction

Rhodopseudomonas palustris is a purple nonsulfur anoxygenic phototrophic bacterium that belongs to the α-proteobacteria. It is found widely distributed in the environment, preferring soil and freshwater. Among bacteria, R. palustris is exceptional in its metabolic versatility. It can grow aerobically or anaerobically, and can utilize energy from light or organic compounds. It can degrade structurally diverse fatty acids, dicarboxylic acids, and aromatic compounds, including many lignin monomers, under both aerobic and anaerobic conditions.1-3 Furthermore, R. palustris is capable of producing hydrogen gas as a byproduct of nitrogen fixation, making it a potential biofuel producer. R. palustris also has the potential to act as a greenhouse gas sink by converting carbon dioxide into cell mass. Since most of these metabolic states can easily be attained in laboratory settings, R. palustris is an ideal model system for the study of diverse metabolic modes and their control within a single organism. The genome of this microbe has recently been completed and annotated,4 revealing 4,836 potential protein-encoding
genes in a 5.4 Mb genome. The genome sequencing effort has paved the way for a detailed systems biology characterization of this microbe at the genome, transcriptome, proteome, and metabolome level. Our long-term goal is to use a multidisciplinary, molecular level technology approach consisting of proteome profiling, protein-protein interaction studies,5 global gene knockouts, and transcriptome profiling,6 to obtain a comprehensive understanding of the diverse metabolic states of this microbe. The specific goal of this study was to determine the baseline proteome of an R. palustris wild-type strain under phototrophic and chemotrophic growth conditions, including variants of each state. This information would then serve as a dataset from which a variety of biological studies could be performed to understand this microbe’s life processes and metabolic diversity.

The experimental approach for these studies involved lysing cells grown under each condition and fractionating by centrifugation techniques into four major proteome fractions (crude, membrane, pellet, and cleared), followed by digestion of each fraction with trypsin and analysis by liquid chromatography coupled with electrospray-tandem mass spectrometry (LC-ES-MS/MS) (“shotgun” proteomics7-11). A previous paper reported proteomic analysis of membrane proteins,12 but this is the first proteome analysis of R. palustris to date that provides deep characterization of the diverse metabolic properties of this microbe.

The six metabolic states targeted in this study can be classified into two major categories: aerobic growth in dark (chemotrophic) and anaerobic growth in light (phototrophic), as illustrated in Figure 1a. The core state was the anaerobic photoheterotrophic growth mode, in which light provided the energy, organic carbon (in particular succinate) provided the carbon source for cell material, and ammonia served as the nitrogen source.
Four variations of the core phototrophic metabolic state were examined: (1) photoautotrophic carbon-fixing, where carbon dioxide served as the sole carbon source and hydrogen served as the electron donor; (2) photoheterotrophic nitrogen-fixing, in which nitrogen gas was substituted for ammonia as the nitrogen source; (3) photoheterotrophic benzoate, in which benzoate was substituted for succinate as the carbon source; and (4) photoheterotrophic stationary phase, where succinate-grown cells were harvested in stationary phase (all other states were harvested in mid-exponential phase). These four variations provided an opportunity to probe how R. palustris adjusts its molecular machinery to undertake the energetically expensive processes of fixing either gaseous carbon or nitrogen, how it copes under stationary phase photoheterotrophic growth conditions, and how it utilizes a more complex carbon material (benzoate) as its major carbon source. This latter growth state, in which benzoate is employed, has been the subject of numerous investigations for R. palustris,13,1,2,14-16 although no comprehensive proteome data have been published.

The core aerobic state was the chemoheterotrophic growth state, in which cells were grown aerobically in the dark, with succinate as both the carbon and energy source, and ammonia as the nitrogen source. An lhaA mutant,17,18 carrying a miniTn5-lacZ insertion in rpa1547, annotated as encoding a photosynthetic complex assembly protein, was grown as a secondary control for chemoheterotrophic growth. The primary goal for this state was to provide a second biological replicate of aerobic growth and to verify that the mutant shows no major phenotypic change at the proteome level under aerobic growth. The full functional characterization of the mutant was not undertaken in this study. This mutant grows very poorly under phototrophic conditions and is only very slightly pigmented under aerobic chemoheterotrophic growth conditions. This is in contrast to strain CGA010, which is pink when grown aerobically, indicating a basal level of photopigment production. The binary relationships of these various metabolic states are shown in Figure 1b.

Large-scale, qualitative characterization of these six growth states of R. palustris was obtained by shotgun liquid chromatography - tandem mass spectrometry (LC-ES-MS/MS) measurements. This global measurement strategy can provide deep proteome measurements, often identifying up to 1000 nonredundant proteins from a given growth state. This technological approach provides unambiguous identifications of both known and unknown proteins, and thus can serve as a powerful information platform for obtaining the detailed proteome information that is necessary not only to characterize a given growth condition, but also to distinguish it from other metabolic conditions. The objective of this work was to identify both known and uncharacterized proteins that are important for each growth state, compare metabolic states in a binary fashion as illustrated in Figure 1b, and then determine global trends in protein expression across all metabolic states.

Figure 1. a. The core metabolic states of Rhodopseudomonas palustris interrogated in this study. The top figure illustrates the basic anaerobic state for photoheterotrophic growth in light without oxygen. The bottom figure illustrates the basic aerobic state for chemoheterotrophic growth in the dark with oxygen present. The circle in the center of the cell represents central metabolism. This figure was adapted from Larimer et al. Nat. Biotechnol. 2004, 22, 55-60. b. This diagram illustrates the binary relationships between all R. palustris growth states examined in this study. Chemoheterotrophic (aerobic) and photoheterotrophic (anaerobic) are the core metabolic states, as defined in Figure 1a. All other growth states are directly compared in a binary fashion to one of these two core states. Proteins identified as being differentially expressed between any two growth states were then compared across the global dataset to determine overall trends reflecting generic phototrophic or chemotrophic growth conditions.

Materials and Methods

Chemicals and Reagents. All salts, dithiothreitol (DTT), trifluoroacetic acid (TFA), and guanidine used in this work were obtained from Sigma Chemical Co. (St. Louis, MO). Modified sequencing grade trypsin from Promega (Madison, WI) was used for all protein digestion reactions. The water and acetonitrile used in all sample cleanup and HPLC applications were HPLC grade from Burdick & Jackson (Muskegon, MI), and the 98% formic acid used in these applications was purchased from EM Science (an affiliate of Merck KGaA, Darmstadt, Germany).

Cell Growth and Production of Protein Fractions. R. palustris strain CGA010, a hydrogen-utilizing derivative of the sequenced strain (C. S. Harwood, unpublished) and referred to here as the wild-type strain, was grown under the six conditions outlined in the Introduction section. An lhaA mutant was grown as a secondary measurement for chemoheterotrophic growth. All cultures were grown anaerobically in light or aerobically in dark in 1.5 L of defined mineral medium at 30 °C to mid-log phase (OD660nm = 0.6) (except the stationary phase sample).14 Chemoheterotrophic cells were grown aerobically in the dark with shaking at 200 rpm; phototrophic cells were grown anaerobically in the light with mixing with a stir bar. All anaerobic cultures were illuminated with 40 or 60 W incandescent light bulbs from multiple directions. Carbon sources were added to a final concentration of 10 mM succinate (for all growth modes except benzoate and photoautotrophic), 3 mM benzoate (benzoate growth), or 10 mM sodium bicarbonate with H2 gas in the headspace (photoautotrophic growth). Growth was monitored spectrophotometrically at 660 nm. For the photoheterotrophic N2 fixing cultures, ammonium sulfate was replaced by sodium sulfate in the culture medium and N2 gas was supplied in the headspace.
The photoheterotrophic stationary phase culture was grown exactly as the photoheterotrophic log phase state except that the cultures were grown until they exhausted the substrate, became light limited, and stopped increasing in optical density (OD660nm > 2.0). Cell extracts were prepared as follows: cells were harvested by centrifugation, washed twice with ice-cold wash buffer (50 mM Tris-HCl buffer (pH 7.5) with 10 mM EDTA), and resuspended in ice-cold wash buffer. Cells were then lysed by sonication, and unbroken cells were removed with low-speed centrifugation (5000 × g for 10 min). Four proteome fractions were created from this cellular extract by ultracentrifugation (100 000 × g for 1 h led to membrane and crude fractions; this supernatant was then further centrifuged at 100 000 × g for 18 h, leading to pellet and cleared fractions). All four proteome fractions were quantified for total protein by Lowry’s analysis, aliquoted, and frozen at -80 °C until digestion.

Digestion of Proteome Fractions. Proteome fractions from each growth state were processed by the same protocol. Briefly, 5 mg of each proteome fraction were diluted in 6 M guanidine/5 mM DTT and heated at 60 °C for 1 h. The guanidine and DTT were diluted 6-fold with 50 mM Tris-HCl/10 mM CaCl2 (pH 7.8), followed by the addition of sequencing grade trypsin at 1:100 (wt/wt). The digestions were run with gentle shaking at 37 °C for 18 h, followed by a second addition of trypsin at 1:100 and an additional 5 h incubation. The samples were then treated with 10 mM DTT for 1 h at 60 °C as a final reduction step. Samples were immediately desalted with Sep-Pak Plus C18 solid-phase extraction (Waters, Milford, MA). All samples were concentrated and solvent exchanged into 0.1% TFA in water by centrifugal evaporation to ∼10 µg/µL starting material, filtered, aliquoted, and frozen at -80 °C until mass spectrometry analysis.

LC-ES-MS/MS Analysis. The four proteome fractions from each growth state were analyzed in duplicate via multiple one-dimensional LC-ES-MS/MS experiments performed with an Ultimate HPLC (LC Packings, a division of Dionex, San Francisco, CA) coupled to an LCQ-DECA or LCQ-DECAXPplus quadrupole ion trap mass spectrometer (Thermo Finnigan, San Jose, CA). Automated 50 µL injections were made with a Famos autosampler (LC Packings) onto the HPLC column. Flow rate was set at 4 µL/min with a 240-min gradient for each LC-ES-MS/MS run. A VYDAC 218MS5.325 (Grace-Vydac, Hesperia, CA) C18 column (300 µm id × 25 cm, 300 Å with 5 µm particles) was used for all separations. The column was directly connected to the Finnigan electrospray source with 100 µm id fused silica. For each new growth state, a new HPLC column was used to prevent cross-contamination. For all LC-ES-MS/MS data acquisitions, the LCQ was operated in the data-dependent mode with dynamic exclusion enabled. To increase dynamic range in the 1D-LC-ES-MS/MS analysis, separate injections were made with a total of 8 overlapping segmented m/z ranges scanned (referred to as gas-phase fractionation or multiple mass range scanning).7,19 The m/z ranges scanned included a full spectrum scan of m/z 400-2000 and 7 segmented overlapping scans at m/z 400-790, m/z 790-1000, m/z 990-1300, m/z 1290-2000, m/z 400-900, m/z 890-1400, and m/z 1390-2000.

Data Analysis. The resulting ∼450 LC-ES-MS/MS datasets were all processed as follows. The MS/MS spectra from all files were identified by SEQUEST20 in two modes: tryptic only (only fully tryptic peptides were considered) and nontryptic (fully tryptic, partially tryptic, and fully nontryptic peptides were considered). The MS/MS spectra from all files were then searched with DBDigger21,22 with only the fully tryptic option. For all database searches, a R. palustris proteome database containing 4833 proteins and 36 common contaminants (trypsin, keratin, etc.) was used. The database can be found at http://compbio.ornl.gov/rpal_proteome/databases. All resulting output files from SEQUEST and DBDigger were organized by growth state and run number (all fractions from a single proteome analysis were combined) and then filtered by DTASelect23 at the 1-peptide, 2-peptides, and 3-peptides level (the number of identified peptides required to identify a protein) with the following parameters: for SEQUEST, delCN of at least 0.08 and cross-correlation scores (Xcorrs) of at least 1.8, 2.5, and 3.5 for +1, +2, and +3 charge states, respectively; for DBDigger, delCN of at least 0.08 and MASPIC scores of at least 25 (+1), 30 (+2), and 45 (+3). The filtered DTASelect files from proteome replicates were compared using Contrast23 to ensure reproducibility (at least 70% overlap in protein identifications at the 2-peptide filter level between replicates was required). Contrast was then used to create pairwise comparisons of growth states as well as a global comparison of all growth conditions (http://compbio.ornl.gov/rpal_proteome/analysis). For the evaluation of protein fractionation and the creation of tandem affinity purification targets, the fractions from individual proteome datasets were
not concatenated but rather analyzed individually by DTASelect and then compared with Contrast (this was only done with the SEQUEST fully tryptic dataset).

Table 1. Total Identified Proteins and Peptides by Search Algorithm and Filtering Level

filtering level    SEQUEST fully tryptic (a)    SEQUEST nontryptic (a)    DBDigger fully tryptic (b)
Proteins
1-peptide          2752                         4482                      2785
2-peptides         1670                         2888                      1698
3-peptides         1317                         1833                      1338
Peptides
1-peptide          26607                        50999                     28813
2-peptides         24588                        42370                     26903
3-peptides         23261                        36802                     25613

(a) Filters for SEQUEST Xcorrs of at least 1.8 (+1), 2.5 (+2), and 3.5 (+3).
(b) Filters for DBDigger MASPIC scores of at least 25 (+1), 30 (+2), and 45 (+3).
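
The SEQUEST filtering rule described in Data Analysis (charge-dependent Xcorr cutoffs, a delCN cutoff, and a minimum number of peptides per protein) can be sketched as follows. This is a minimal illustration and not DTASelect itself: the function names, the tuple layout, and the example matches are hypothetical, and counting distinct peptide sequences per protein is an assumption about how the peptide levels are tallied.

```python
from collections import defaultdict

# Minimum SEQUEST Xcorr per precursor charge state and minimum delCN,
# as quoted in the Data Analysis section.
XCORR_MIN = {1: 1.8, 2: 2.5, 3: 3.5}
DELCN_MIN = 0.08


def passes_filter(charge, xcorr, delcn):
    """Return True if a single peptide-spectrum match meets the cutoffs."""
    threshold = XCORR_MIN.get(charge)
    if threshold is None:
        # Assumption: charge states outside +1..+3 are rejected in this sketch.
        return False
    return xcorr >= threshold and delcn >= DELCN_MIN


def proteins_at_level(matches, min_peptides):
    """Return proteins supported by at least min_peptides distinct peptides.

    matches: iterable of (protein, peptide, charge, xcorr, delcn) tuples.
    """
    peptides = defaultdict(set)
    for protein, peptide, charge, xcorr, delcn in matches:
        if passes_filter(charge, xcorr, delcn):
            peptides[protein].add(peptide)
    return {p for p, seqs in peptides.items() if len(seqs) >= min_peptides}


# Hypothetical matches, for illustration only (not data from this study).
matches = [
    ("PROT_A", "AAAK", 2, 2.9, 0.15),
    ("PROT_A", "CCCR", 3, 3.6, 0.10),
    ("PROT_B", "DDDK", 2, 2.4, 0.20),  # Xcorr below the +2 cutoff
    ("PROT_C", "EEEK", 1, 2.0, 0.09),
]

one_peptide = proteins_at_level(matches, 1)
two_peptide = proteins_at_level(matches, 2)
print(sorted(one_peptide), sorted(two_peptide))
```

Raising the required peptide count can only shrink the protein list, which is the pattern seen across the 1-, 2-, and 3-peptide rows of Table 1.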

Results and Discussion

The four proteome fractions from each culture were digested with trypsin and then analyzed in duplicate by an automated 1D-LC-ES-MS/MS technique employing multiple mass range scanning. Multiple mass range scanning is a simple technique for increasing the dynamic range of proteome measurements. It involves injecting the same sample repeatedly with overlapping narrow mass ranges analyzed by the mass spectrometer. We have found this technique to be highly reproducible and simple to implement for a large-scale study of multiple samples.7,19 Its main disadvantage is the large amount of sample material needed since multiple injections must be made. This was not a concern for this study since ample protein could be obtained from a single 1.5 L culture of R. palustris. A total of about 450 LC-ES-MS/MS data sets were generated from seven cultures (a total of six different metabolic growth states), with four proteome fractions prepared from each culture and eight samples analyzed per fraction, each run in duplicate. Each data set was searched with SEQUEST20 and DBDigger,21,22 filtered with DTASelect,23 and compared with Contrast.23 All resulting DTASelect files and Contrast files used in this study, as well as the protein database, are available from the Rhodopseudomonas palustris Proteome Study Website (http://compbio.ornl.gov/rpal_proteome/). Also accessible from this site are MS/MS spectra for all identified peptides, a step toward open access proteome results.24,25 These can be found as links from the DTASelect files. The entire dataset was analyzed in three different modes: SEQUEST fully tryptic, SEQUEST nontryptic, and DBDigger fully tryptic.
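
As a rough cross-check of the run count quoted above, the sketch below multiplies out the experimental design (seven cultures, four fractions, eight m/z windows, duplicate runs) and tests whether the segmented windows listed in Materials and Methods tile the full m/z 400-2000 range. Splitting the seven segments into one four-window set and one three-window set is an assumed reading of the list, not something the text states, and the helper name `covers` is illustrative.

```python
# Design values taken from the text.
cultures = 7
fractions_per_culture = 4
mz_windows_per_fraction = 8
replicates = 2

expected_runs = (
    cultures * fractions_per_culture * mz_windows_per_fraction * replicates
)
print(expected_runs)  # 448, consistent with "about 450" data sets

# Scan windows (m/z) from Materials and Methods. The grouping into two
# segment sets is an assumption made for this illustration.
full_scan = (400, 2000)
segment_set_a = [(400, 790), (790, 1000), (990, 1300), (1290, 2000)]
segment_set_b = [(400, 900), (890, 1400), (1390, 2000)]


def covers(segments, lo, hi):
    """Return True if the segments span [lo, hi] without leaving a gap."""
    reach = lo
    for start, end in sorted(segments):
        if start > reach:
            return False  # a gap before this segment begins
        reach = max(reach, end)
    return reach >= hi


print(covers(segment_set_a, *full_scan), covers(segment_set_b, *full_scan))
```

Under this reading, each segment set on its own spans the same range as the single full scan, so every precursor mass falls in at least three of the eight injections.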
DBDigger is a new search algorithm developed at Oak Ridge National Laboratory (ORNL), which provides better accuracy and sensitivity than SEQUEST.21,22 The results from each of these searches, with the number of identified proteins and peptides filtered at 1-peptide, 2-peptides, and 3-peptides, are shown in Table 1 (the numbers in this table include some common contaminants; the numbers listed below have these removed and are slightly smaller; counts at the 1-peptide level include 2-peptide hits, and 2-peptide hits include 3-peptide hits). Nonspecific searches identify many more proteins than fully tryptic SEQUEST searches, but with a greater number of false positives. Because of the current controversy in the proteomics field over the use of nonspecific searches for LC-ES-MS/MS data generated from trypsin digestions, we will confine our discussions to the fully tryptic
datasets. However, we do provide the nonspecific dataset for comparison. From the fully tryptic dataset of proteins identified by at least two peptides, 1691 proteins were identified by DBDigger and 1664 proteins were identified by SEQUEST. The combined list from these two algorithms resulted in the overall identification of 1805 proteins; 1549 are shared between the lists, with 140 proteins identified only by DBDigger and 116 identified only by SEQUEST. Supporting Information Table 1 contains all proteins identified by each algorithm as well as corresponding sequence coverage for each protein. This table also contains a scatter plot of percent sequence coverage for each protein and for each algorithm plotted against each other. This scatter plot shows that the two algorithms generated very similar results with few outliers. Because SEQUEST is a more widely accepted platform, we focused our biological analysis on those proteins confidently identified from the SEQUEST fully tryptic dataset.

Table 2. Identified Proteins by Growth State and Filtering Level

growth condition                                  3-peptide    2-peptide    1-peptide
lhaA mutant Chemoheterotrophic Run 1              714          931          1287
lhaA mutant Chemoheterotrophic Run 2              722          930          1322
WT Chemoheterotrophic Run 1                       721          941          1350
WT Chemoheterotrophic Run 2                       631          809          1235
WT Photoheterotrophic Run 1                       692          884          1263
WT Photoheterotrophic Run 2                       724          891          1251
WT Photoheterotrophic Stationary Phase Run 1      725          921          1369
WT Photoheterotrophic Stationary Phase Run 2      746          930          1405
WT Photoautotrophic Run 1                         764          961          1455
WT Photoautotrophic Run 2                         695          900          1391
WT Photoheterotrophic Nitrogen Fixation Run 1     768          995          1441
WT Photoheterotrophic Nitrogen Fixation Run 2     807          1018         1508
WT Benzoate Photoheterotrophic Run 1              624          814          1273
WT Benzoate Photoheterotrophic Run 2              686          884          1285
Total Proteins Identified                         1320         1670         2752

Major Features of the Proteome. The identified protein totals from fully tryptic SEQUEST searches of each growth state analysis filtered at 1-peptide, 2-peptides, and 3-peptides are shown in Table 2. As the number of observed peptides required for each protein increases, the number of proteins observed decreases. Statistical analysis of a single growth state by the Peptide/ProteinProphet software26,27 indicated a 5% false positive rate at the 1-peptide level, a 1% false positive rate at the 2-peptides level, and virtually no false positives at the 3-peptides level (A. Nesvizhskii, Institute for Systems Biology, Seattle, WA, personal communication). Only those proteins identified with at least two fully tryptic peptides were retained for further analysis. After the removal of common contaminants and some redundant proteins from the final protein list, a total of 1664 unique proteins were identified.
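
The DBDigger versus SEQUEST comparison reported earlier in this section (shared, algorithm-specific, and combined counts) is ordinary set arithmetic. The sketch below shows that bookkeeping under stated assumptions: `compare_identifications` is a hypothetical helper, the toy identifiers are invented, and expressing overlap as a percentage of the combined list is an illustrative choice, since the text does not say how Contrast defines overlap.

```python
def compare_identifications(ids_a, ids_b):
    """Summarize the overlap between two lists of protein identifiers."""
    a, b = set(ids_a), set(ids_b)
    shared = a & b
    combined = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "combined": len(combined),
        # Assumption: overlap is expressed relative to the combined list.
        "overlap_pct": 100.0 * len(shared) / len(combined) if combined else 0.0,
    }


# Toy identifiers, for illustration only.
summary = compare_identifications({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
print(summary)

# Counts quoted in the text: shared, DBDigger-only, SEQUEST-only.
shared, only_dbdigger, only_sequest = 1549, 140, 116
combined_reported = shared + only_dbdigger + only_sequest
print(combined_reported)  # 1805, the combined total given in the text
```

The same helper could be pointed at the two replicate runs of one growth state to examine reproducibility, with the caveat that the denominator used for the 70% criterion is not specified here.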
The entire list of identified proteins with predicted functions, percent sequence coverage, functional categories, pIs, and molecular weights (MW) can be found in Supporting Information Table 2. The identified proteins with their peptide count (number of identified peptides), spectral count (number of MS/MS spectra obtained from a protein), and percent sequence coverage (total percentage of the protein sequence covered by
tryptic peptides) from individual growth states can be found in Supporting Information Table 3. To determine whether our methodology had any major biases against certain protein forms, we compared the pI and molecular weight (MW) ranges of the identified proteins with those predicted from the entire genome. We found no major biases against the MW or pI of the proteins identified in this study.

Table 3. Functional Categories

category                              protein IDs    % of proteome    genome prediction    % of genome    % identified
unknowns and unclassified             325            19.53            1407                 29.22          23.1
cellular processes                    207            12.44            524                  10.88          39.69
transport                             168            10.10            699                  14.51          24.03
general function prediction           157            9.44             420                  8.72           37.38
energy metabolism                     142            8.53             306                  6.35           46.41
translation                           122            7.33             168                  3.49           72.62
amino acid metabolism                 104            6.25             181                  3.76           57.46
lipid metabolism                      98             5.89             158                  3.28           62.03
metabolism of cofactors/vitamins      80             4.81             150                  3.11           53.33
signal transduction                   73             4.39             231                  4.80           31.6
carbon and carbohydrate metabolism    67             4.03             107                  2.22           62.62
transcription                         59             3.55             283                  5.88           20.85
purine and pyrimidine metabolism      40             2.40             56                   1.16           71.43
replication and repair                22             1.32             126                  2.62           17.46
total                                 1664                            4816                                34.57

The functional categories for the identified proteins are shown in Table 3 (these functional categories are based on the ORNL annotation scheme for bacteria (http://genome.ornl.gov/microbial/)). For each category, the table lists the number of proteins identified and the percentage of the total detected proteome they represent, the number of proteins predicted from the genome and the percentage of the total genome they represent, and the percentage of the predicted proteins that were identified. In general, most categories made up a similar percentage of the detected proteome as of the predicted genome. The most notable exceptions were translation and amino acid metabolism, whose share of the detected proteome was nearly twice their share of the predicted genome. We mapped the proteins found in pathways predicted by the Kyoto Encyclopedia of Genes and Genomes (KEGG) onto KEGG pathways and indicated the metabolic states under which each was detected. The complete set of pathways can be found at http://compbio.ornl.gov/rpal_proteome/analysis/.

The most numerous category of identified proteins comprised hypothetical and conserved hypothetical proteins, with 325 in total identified with a high level of confidence. This represents 23% of the hypothetical proteins predicted to be encoded in the genome. In our classification scheme, protein names are changed from hypothetical and conserved hypothetical to unknown and conserved unknown when they are confidently identified with at least two unique peptides.
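
The three percentage columns of Table 3 follow from the two count columns and the table totals. The sketch below recomputes them for three rows as a worked example; the dictionary layout and the function name `category_percentages` are illustrative assumptions, while the counts are those given in Table 3.

```python
# Totals from Table 3.
TOTAL_IDENTIFIED = 1664
TOTAL_PREDICTED = 4816

# (proteins identified, proteins predicted from the genome) for three rows.
categories = {
    "translation": (122, 168),
    "transport": (168, 699),
    "replication and repair": (22, 126),
}


def category_percentages(identified, predicted):
    """Return (% of proteome, % of genome, % identified), rounded to 2 places."""
    pct_of_proteome = 100.0 * identified / TOTAL_IDENTIFIED
    pct_of_genome = 100.0 * predicted / TOTAL_PREDICTED
    pct_identified = 100.0 * identified / predicted
    return (
        round(pct_of_proteome, 2),
        round(pct_of_genome, 2),
        round(pct_identified, 2),
    )


results = {name: category_percentages(*counts) for name, counts in categories.items()}
for name, values in results.items():
    print(name, values)

# Ratio of proteome share to genome share: values near 2 mean a category is
# about twice as prominent among detected proteins as in the genome.
translation_ratio = results["translation"][0] / results["translation"][1]
print(round(translation_ratio, 2))
```

For translation the ratio comes out near 2, which is the over-representation noted in the text, whereas replication and repair falls to about half its genomic share.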
A total of 107 unknown proteins and 218 conserved unknown proteins were identified and renamed in this manner. The second most numerous category included proteins predicted to be involved in general cellular processes such as chaperones, proteases, flagellar proteins, stress proteins, and assorted enzymes. A total of 207 proteins were identified in this category, representing 40% of those predicted from the genome. The R. palustris genome contains two separate copies of groEL (RPA1140 and RPA2164) and groES (RPA1141 and RPA2165). We identified each of the proteins predicted to be encoded by these four genes and were able to confidently differentiate the two forms of each subunit via a number of unique peptides. Both types of GroEL/ES were expressed under all of the growth states that we examined.

A total of 168 transport proteins were detected, representing 24% of those predicted from the genome. R. palustris is predicted to have 325 complete transport systems corresponding to almost 15% of the genome.4 This is larger than in most microbes, where transport genes are generally predicted to comprise 5-6% of the genome. Approximately 10% of the detected proteome were proteins involved in transport. Many of the detected transport proteins belonged to ATP-binding cassette (ABC) systems. All eight of the TrapT periplasmic binding proteins were detected, as well as fourteen TonB-dependent receptors/iron transporters, confirming the genome annotation’s implication that iron acquisition is important for R. palustris.4 Thirty efflux pumps, twelve proteins involved in protein export, nine proteins involved in metal uptake other than iron, and seven outer membrane porins were also detected.

The categories identified with the highest percentage of proteins predicted by the genome sequence included translation (72%), purine/pyrimidine metabolism (71%), carbon and carbohydrate metabolism (62%), lipid metabolism (62%), and amino acid metabolism (57%). This is to be expected since many of the proteins in these categories are necessary under all metabolic modes. In a previous study of the purified 70S ribosome from R. palustris, we identified 53 of the 54 predicted ribosomal proteins.28 In the present study, we identified 50 of the 54 predicted ribosomal proteins without prior purification.
The missed ribosomal proteins are all small and rich in lysine residues, which suggests that they were digested into peptides too small for confident identification. A total of 18 of the potential 20 tRNA synthetases were confidently identified. Most of the ribosomal proteins and tRNA synthetases were found under every growth state characterized (Supporting Information Table 3). Although only 40 purine and pyrimidine metabolism proteins were detected, these represented most of the predicted proteins from this group. As was the case with the proteins involved in translation, many of the purine/pyrimidine metabolism proteins were found under all of the metabolic states.

Proteins from the proteome categories of replication and repair, energy metabolism, transcription, general function prediction, metabolism of cofactors and vitamins, and signal transduction were identified, but many with less than 50% of the number of proteins predicted by the genome sequence. The category of replication and repair was detected with the smallest percentage and smallest number of proteins, with only 22 proteins (17% of the predicted proteome). This may be due to the low abundance of these proteins and their rapid turnover; they may only be used during DNA replication and repair and then quickly degraded. Relatively few proteins involved in transcription were detected; only 59 proteins were detected, representing 21% of those predicted by the genome sequence. Transcription proteins known to be essential, such as the RNA polymerase complex, were detected with high sequence coverage, peptide count, and spectral counts from every growth state analyzed. Proteins involved in signal transduction are generally considered to be of lower abundance. A total of 73 proteins representing 31% of those predicted by the genome sequence were confidently detected. Many of these were putative methyl-accepting chemotaxis proteins, two-component response regulators, and two-component sensor histidine kinases.

A total of 142 proteins, nearly 50% of the proteins predicted to be involved in energy metabolism, were confidently detected. These include proteins involved in photosynthesis and oxidative phosphorylation. We found that often only a subset of the known proteins from protein complexes in these pathways was confidently identified. For example, ATP synthase is encoded by two operons, RPA0175-RPA0179 and RPA0843-RPA0846 (see KEGG map at http://compbio.ornl.gov/rpal_proteome/analysis/keggmaps/html/map00193.html). All of the proteins encoded by the first operon are predicted to be associated with the inner membrane but not directly embedded. All of these proteins were detected with high sequence coverage under every metabolic state. The second operon encodes two proteins (ATP synthase B chain and ATP synthase subunit B′), which are associated with the membrane but not directly embedded. These proteins were also readily detected under all metabolic states. The final two proteins of the latter operon, ATP synthase subunit C transmembrane protein and ATP synthase subunit A, are both small, membrane-embedded proteins. Neither the A nor the C subunit was detected in any of the metabolic states, even though they must be expressed at high abundance with the rest of the complex. This result underscores that small proteins embedded in the membrane are problematic for “shotgun” proteomics techniques using common trypsin digestion methodologies [note: these proteins have been detected with alternative digestion methodologies (data not shown)]. This same phenomenon limited the detection of membrane proteins involved in photosynthesis, the cytochrome-c oxidase complex, and the NADH-ubiquinone dehydrogenase complex.
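
The earlier remark that small, lysine-rich ribosomal proteins are cut into peptides too short to identify can be illustrated with a simple in silico digest. This sketch uses the common simplified trypsin rule (cleave after K or R unless the next residue is P); the two sequences are invented for illustration and are not R. palustris proteins, and the six-residue minimum length is a hypothetical cutoff rather than a parameter from this study.

```python
def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P (simplified rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides


def long_enough(peptides, min_length=6):
    """Keep peptides at or above a hypothetical minimum identifiable length."""
    return [p for p in peptides if len(p) >= min_length]


# Invented sequences, for illustration only.
lysine_rich = "MKKTKAKRGKKLSKAKK"
lysine_poor = "MSTLEDAVGSLPEIAKGTWDELLSAGR"

rich_peptides = tryptic_digest(lysine_rich)
poor_peptides = tryptic_digest(lysine_poor)
print(rich_peptides, long_enough(rich_peptides))
print(poor_peptides, long_enough(poor_peptides))
```

In this toy case the lysine-rich sequence yields no peptide longer than three residues, so nothing survives the length cutoff, while the lysine-poor sequence yields two long peptides.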
Two operons (RPA2937-RPA2952 and RPA4252-RPA4264) are each predicted to encode complete NADH-ubiquinone dehydrogenase complexes. The structure of each operon and the degree of divergence of the individual proteins suggest that these operons have different evolutionary lineages, perhaps due to lateral transfer, rather than originating by duplication and divergence within this genome. A total of 6 proteins were identified from the first operon and 4 proteins from the second operon, indicating that both operons are indeed expressed. These proteins were detected in every metabolic state, indicating expression under all conditions studied.

Metabolic State Comparisons. One of the primary goals of this study was to identify the major differences at the protein level between defined but distinct metabolic states of R. palustris. This was first done by binary comparisons of related metabolic states as illustrated in Figure 1b. Proteins identified as showing expression differences between metabolic states were then compared across all metabolic states to determine global trends in protein expression. A quantitative comparison of proteins present in different growth states is currently a serious challenge for MS-based proteomics efforts. While effort has been put into relative quantitation of proteins between different growth states using technologies such as isotope coded affinity tags (ICAT),29 metabolic labeling,30,31 and 18O water labeling,32 none of these techniques has been shown to be effective for large-scale studies of many growth states in microbial systems. Specifically, the ICAT technology requires the labeling of cysteine residues, which are not as prevalent in microbial systems as in eukaryotic systems. Indeed, ∼60% of the proteins in the R. palustris predicted proteome contain two or fewer cysteines, and 20% contain no cysteines at all. Metabolic labeling with 15N has shown the most promise in microbial systems9,33 but requires strict control of nitrogen intake. Indeed, many microbial species cannot be cultivated under conditions where strict control of nitrogen intake is required. Labeling peptides during trypsin digestion with 18O water is a potential alternative,32 but the expense involved in labeling the number of samples used in this study and the need for high-resolution mass spectrometers limit its use for large-scale proteome comparisons of many states.

For proteome datasets, the relative amount of each protein can be related to its percent sequence coverage, number of identified peptides, and number of spectra identified (multiple spectra may appear for each peptide). All of these values are semiquantitative indicators of protein abundance.34 In a previous study of Shewanella oneidensis, the percent sequence coverage and number of unique peptides per protein identified in triplicate analysis of a control and a fur mutant were used to compare relative protein abundances between proteomic and transcriptomic data.19 Here we used the same technique of comparing the percent sequence coverage and the number of unique peptides per protein between different metabolic states to identify those R. palustris proteins showing substantial differences between states. We used the general rule that a protein must have a replicated difference of at least four peptides and/or 30% sequence coverage between the two states being compared to be called a major difference.19 Proteins called as major differences were then evaluated at the spectral count level to ensure a replicated difference of at least 2×. Only those proteins passing all three criteria were kept for further analyses.

Table 4. Reproducibility of Identified Proteins by Growth State

| growth condition | 1 peptide (%) | 2 peptide (%) |
|---|---|---|
| lhaA mutant Chemoheterotrophic | 71.7 | 75.7 |
| WT Chemoheterotrophic | 68.1 | 70.8 |
| WT Photoheterotrophic | 70.5 | 77.7 |
| WT Photoheterotrophic Stationary | 70.1 | 79.1 |
| WT Photoheterotrophic Nitrogen Fixation | 69.8 | 75.7 |
| WT Photoautotrophic | 70.3 | 77.3 |
| WT Photoheterotrophic Benzoate | 67.7 | 73.7 |
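The rule for calling a major difference can be written down compactly. The sketch below is one possible reading of that rule, not the authors' software: it assumes that "replicated" means every replicate of the higher state exceeds every replicate of the lower state by the stated margin, and that the peptide and coverage criteria are alternatives ("and/or") while the 2× spectral count check is always required. The `Measurement` class, the function names, and all numeric values in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One protein in one replicate run (hypothetical container)."""
    peptides: int    # number of unique peptides
    coverage: float  # percent sequence coverage
    spectra: int     # spectral count


def _gap(high: list[Measurement], low: list[Measurement], attr: str) -> float:
    """Smallest value in the high state minus largest value in the low state."""
    return min(getattr(m, attr) for m in high) - max(getattr(m, attr) for m in low)


def is_major_difference(
    state_a: list[Measurement],
    state_b: list[Measurement],
    min_peptides: int = 4,
    min_coverage: float = 30.0,
    min_fold: float = 2.0,
) -> bool:
    """Apply the peptide/coverage rule, then the spectral count check."""
    for high, low in ((state_a, state_b), (state_b, state_a)):
        passes_first = (
            _gap(high, low, "peptides") >= min_peptides
            or _gap(high, low, "coverage") >= min_coverage
        )
        high_spectra = min(m.spectra for m in high)
        low_spectra = max(m.spectra for m in low)
        passes_fold = high_spectra > 0 and high_spectra >= min_fold * low_spectra
        if passes_first and passes_fold:
            return True
    return False


# Hypothetical replicate pairs for two proteins.
induced = [Measurement(12, 56.0, 400), Measurement(14, 82.0, 420)]
absent = [Measurement(0, 0.0, 0), Measurement(0, 0.0, 0)]
steady_a = [Measurement(5, 30.0, 20), Measurement(6, 33.0, 22)]
steady_b = [Measurement(5, 31.0, 19), Measurement(6, 30.0, 21)]

print(is_major_difference(induced, absent))     # True
print(is_major_difference(steady_a, steady_b))  # False
```

Comparing the minimum of one state with the maximum of the other is a deliberately conservative choice; a different definition of a replicated difference would change which proteins pass.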
The three statistics in Supporting Information Table 3 give a semiquantitative measure of each protein’s abundance: spectral count, peptide count, and sequence coverage. Each has individual merits. For detected proteins, sequence coverage values typically fall between 11% and 33% (25th and 75th percentiles, respectively), spectral counts fall between 4 and 17 (25th and 75th percentiles), and peptide counts range from 2 to 7 (25th and 75th percentiles). The highest values seen for these three measures were 100% sequence coverage, 2014 spectra, or 131 peptides observed for a single protein (different proteins and different states). A peptide count of two is quite common because this was the minimum number for a protein to be considered present. By evaluating each of these measures for comparing protein abundance, we reduced the impact of random differences in individual statistics. It is important in such comparisons to process and analyze samples under identical conditions, to achieve the best reproducibility possible. Table 4 illustrates the reproducibility of proteins detected between duplicate runs for each metabolic state. At the two peptide filtering level, 71-79% reproducibility of identified proteins was achieved for each metabolic state.
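The reproducibility percentages in Table 4 compare the proteins identified in duplicate runs, but the exact formula is not given in this section. The sketch below therefore rests on an assumption: reproducibility is taken as the number of proteins found in both runs divided by the number found in either run. The function names and the protein identifiers are hypothetical.

```python
def replicate_overlap(run1: set[str], run2: set[str]) -> float:
    """Percent of proteins seen in both runs, relative to those seen in either.

    Assumed definition (shared / union); the source does not state the formula.
    """
    union = run1 | run2
    if not union:
        return 0.0
    return 100.0 * len(run1 & run2) / len(union)


def replicated_proteins(run1: set[str], run2: set[str]) -> set[str]:
    """Proteins present in both replicates, as required for state comparisons."""
    return run1 & run2


# Hypothetical identifications from two duplicate runs.
run_a = {"protA", "protB", "protC", "protD"}
run_b = {"protB", "protC", "protD", "protE"}

print(replicate_overlap(run_a, run_b))            # 60.0
print(sorted(replicated_proteins(run_a, run_b)))  # ['protB', 'protC', 'protD']
```

Other denominators (for example, the size of the smaller run) are equally plausible and would give higher percentages for the same data.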

Table 5. Unknown Operon Identified in Only the Anaerobic Metabolic States

| locus | lhaa1 | lhaa2 | Aerob1 | Aerob2 | Anaerob1 | Anaerob2 | Stat1 | Stat2 | Auto1 | Auto2 | N2_1 | N2_2 | Benz1 | Benz2 | functional assignment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RPA2333 | 0 | 0 | 0 | 0 | 19 | 17 | 17 | 20 | 4 | 4 | 24 | 25 | 13 | 13 | Putative transport protein |
| RPA2334 | 0 | 0 | 0 | 0 | 34 | 38 | 55 | 30 | 12 | 12 | 28 | 55 | 55 | 26 | unknown |
| RPA2335 | 0 | 0 | 0 | 0 | 27 | 0 | 13 | 27 | 0 | 0 | 0 | 27 | 27 | 0 | unknown |
| RPA2336 | 0 | 0 | 0 | 0 | 84 | 74 | 74 | 74 | 74 | 74 | 74 | 74 | 74 | 64 | unknown |
| RPA2337 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | unknown |
| RPA2338 | 0 | 0 | 0 | 0 | 28 | 28 | 42 | 42 | 28 | 0 | 37 | 37 | 28 | 37 | unknown |

a Numbers in table report the percentage of residues in protein sequence that were identified by at least two peptides passing the filtering criteria. b Anaerobic states (highlighted in bold in the original table) are the Anaerob, Stat, Auto, N2, and Benz columns.
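The pattern reported in Table 5 (no detection in the aerobic runs, detection in at least one anaerobic run) can be checked programmatically. The following is a minimal sketch using the sequence coverage values transcribed from Table 5; the column grouping into aerobic and anaerobic runs follows the table footnote, and the function name `anaerobic_only` is hypothetical.

```python
STATES = [
    "lhaa1", "lhaa2", "Aerob1", "Aerob2", "Anaerob1", "Anaerob2", "Stat1",
    "Stat2", "Auto1", "Auto2", "N2_1", "N2_2", "Benz1", "Benz2",
]
AEROBIC = {"lhaa1", "lhaa2", "Aerob1", "Aerob2"}

# Percent sequence coverage per run, transcribed from Table 5.
COVERAGE = {
    "RPA2333": [0, 0, 0, 0, 19, 17, 17, 20, 4, 4, 24, 25, 13, 13],
    "RPA2334": [0, 0, 0, 0, 34, 38, 55, 30, 12, 12, 28, 55, 55, 26],
    "RPA2335": [0, 0, 0, 0, 27, 0, 13, 27, 0, 0, 0, 27, 27, 0],
    "RPA2336": [0, 0, 0, 0, 84, 74, 74, 74, 74, 74, 74, 74, 74, 64],
    "RPA2337": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "RPA2338": [0, 0, 0, 0, 28, 28, 42, 42, 28, 0, 37, 37, 28, 37],
}


def anaerobic_only(values: list[int]) -> bool:
    """True if undetected in every aerobic run but detected in some anaerobic run."""
    by_state = dict(zip(STATES, values))
    aerobic = [v for s, v in by_state.items() if s in AEROBIC]
    anaerobic = [v for s, v in by_state.items() if s not in AEROBIC]
    return all(v == 0 for v in aerobic) and any(v > 0 for v in anaerobic)


detected = [locus for locus, values in COVERAGE.items() if anaerobic_only(values)]
print(detected)
```

With these values, every locus except RPA2337 satisfies the check, matching the description of the operon in the text.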

This degree of overlap between duplicate experiments is typical for successful whole-proteome experiments in our experience. Proteins that appear in only one of the two replicates are typically those identified on the basis of very few spectra. Proteins were required to appear in both replicates to be included in the comparisons of the metabolic states. Proteins identified as showing major differences were then compared across all growth states to determine if trends in expression could be determined. In total we found 311 proteins exhibiting major differences between growth states. The entire list of proteins sorted by compared metabolic states can be found in Supporting Information Table 4. This table also contains the entire list of differentially expressed proteins with peptide count, spectral count and sequence coverage from every growth state. It should be noted that this technique is only useful in determining proteins exhibiting large-scale differences in expression between growth states and generating hypotheses about these proteins for future testing. Precise quantification of the protein differences cannot be established with this technique and additional validation by other methods in a targeted fashion would be needed to verify exact differences. Chemoheterotrophic Wild-Type Cells Compared to Chemoheterotrophic lhaA Mutant Cells. The chemoheterotrophic lhaA mutant17,18 was analyzed as a secondary control for aerobic growth. It was used in the global comparison of phototrophic (anaerobic) and chemotrophic (aerobic) states (Figure 1a and Figure 1b) because this mutation is not believed to have an effect on aerobic growth in the dark. For this to be effective, very few differences should be seen between the chemoheterotrophic wild-type and chemoheterotrophic lhaA mutant. This was indeed the case, as only six proteins showed significant differences between the WT and lhaA mutant (Table S4, 1st tab). 
As might be predicted based on the proposed function of the LhaA protein in the assembly of the photosynthetic apparatus, the difference in pigmentation between wild-type cells and lhaA mutant cells does not appear to be due to major differences in the amounts of photosynthesis proteins synthesized. Thus, we concluded that this mutant could be used as a secondary measurement of aerobic growth and directly compared with the anaerobic phototrophic states as discussed below.

Chemoheterotrophic Cells Compared to Photoheterotrophic Cells. The chemoheterotrophic and photoheterotrophic states are the base states against which all other states are compared in this study, as shown in Figure 1a and Figure 1b. One might expect the protein profiles of cells grown under these two conditions to be quite different because cells obtain energy from the oxidation of succinate during chemoheterotrophic growth and energy from light during photoheterotrophic growth. Moreover, chemoheterotrophic cells were grown aerobically whereas photoheterotrophic cells were grown anaerobically. Succinate was the source of carbon for both growth modes. Interestingly, many of the proteins involved in photosynthesis were found under every state studied, regardless of whether the samples were grown aerobically in the dark or anaerobically in the light. For example, the product of gene RPA1548, which encodes the H subunit of the photosynthetic reaction center, was found under every metabolic state, although spectral counts were two to three times lower under the dark aerobic states. Indeed, the hallmark phenotype of photosynthesis, the red coloring of the cell membranes, was observed for every metabolic state, though the red coloring was much more pronounced under phototrophic states. Nonetheless, many differences were found at the protein level between these two states. In total, 31 proteins were found to be up-regulated under the chemoheterotrophic state and 56 proteins were found to be up-regulated under the photoheterotrophic state (Table S4, 2nd tab). A total of eight unknown and conserved unknown proteins were found to be up-regulated in the chemoheterotrophic state. The unknown proteins RPA2269, RPA2471, and RPA3930 all showed strong correlation with the aerobic states with very little expression under any of the anaerobic states. The conserved unknown protein RPA4179 was found with ∼90% sequence coverage and relatively high spectral counts (70-80) for both the wild type and lhaA mutant under aerobic conditions, and was not found under any anaerobic states except nitrogen fixation and benzoate growth, where it was found with high sequence coverage (>50%) but very low spectral counts (2-10). A total of 17 unknown and conserved unknown proteins were found to be up-regulated under the photoheterotrophic state.
The unknown proteins RPA1494, RPA1495, RPA1620, RPA2333, RPA2334, RPA2335, RPA2336, RPA2338, and RPA3786 all showed strong correlation with the anaerobic states and very little expression under any of the aerobic states. The unknown protein RPA3011 was found with high sequence coverage only under photoheterotrophic growth conditions. With all other anaerobic states it was found with less than 10% coverage or not at all. The operon of genes encoding unknown proteins, from RPA2333 to RPA2338, is an especially interesting case; this entire operon, except RPA2337, was found to show relatively strong expression under anaerobic states but no expression in the aerobic states (Table 5). The lack of detection of RPA2337 cannot be explained; the protein has no predicted transmembrane domains and predicted cleavage by trypsin indicates 5-6 peptides that should easily be detected. None of the proteins in this operon have been found to have strong similarity to any genes in sequenced microbial genomes to date except RPA2333. RPA2333 has similarity to segments of a putative cation transport ATPase but does not have the predicted transmembrane domains generally associated with such a transport ATPase. This, coupled with the fact that the rest of the proteins in the operon do not show similarity to any other known protein, indicates that we have detected a completely novel, expressed operon with potential function under anaerobic illuminated growth. The proteins of this putative operon that were detected were mainly in the membrane fraction (see Discussion below), suggesting a potential membrane-associated protein complex or large protein complex. This operon is a good target for future functional studies such as gene knockouts and protein interaction studies through tagging protocols or other biochemical enrichment techniques.

The conserved unknown proteins RPA0932 and RPA0934, along with the putative protease RPA0933, also appear to make up an operon that shows expression only under the anaerobic states. RPA0933 may be involved in processing of protein(s) necessary for anaerobic growth that are not required under aerobic growth. The functions of the two adjacent proteins from the operon are unknown. The conserved unknown proteins RPA1659, RPA3501, and RPA4217 all showed strong correlation with the anaerobic light states with very little expression under any of the aerobic dark states. The conserved unknown protein RPA3501 is an interesting case, showing 80-90% sequence coverage with a high spectral count (>80) and a high number of peptides (>13) under all anaerobic states (suggesting high expression), yet it was undetected in the aerobic states. Thus, all indicators of abundance point to this protein being highly expressed under anaerobic states, yet we have no clue to its function.

Photoheterotrophic Cells Compared to Nitrogen-Fixing Cells.
Evaluating the proteomic differences associated with nitrogen fixation in the photoheterotrophic state was an ideal test for the effectiveness of our methodology since many of the proteins associated with nitrogen fixation are known and should be present in substantial numbers only when this process is needed for cell growth.6,35 This was indeed found to be the case. In total, 12 proteins were found to be up-regulated under the photoheterotrophic state and 40 proteins were found to be up-regulated under the photoheterotrophic nitrogen fixation state (Table S4, 3rd tab). Most of the proteins thought to be involved in nitrogen fixation were found to be distinctly up-regulated under nitrogen-fixing conditions and not detected to any great extent under any of the other conditions. These include RPA0274, a nitrogen regulatory protein; RPA2593 and RPA2595, nitrogen assimilation regulatory proteins; RPA4209, glutamine synthetase; and the entire nif regulon RPA4602-4632 (RPA4633 is also part of this regulon but was barely detected). The nif regulon contains the catalytic nitrogenase protein complex for nitrogen fixation (NifHDK, RPA4618-4620), which was clearly up-regulated. The difference in spectral counts for these proteins is one of the largest seen, going from zero in all other metabolic states to 651 for NifK, 401 for NifD, and 1605 for NifH during nitrogen fixation (numbers are averages of replicated runs). The major nitrogen fixation proteins, as well as some other proteins showing expression only under nitrogen fixation conditions, are compared with expression levels for all other metabolic states in Table 6. The clear identification of these proteins under the nitrogen fixation states and not under other states indicates the effectiveness of this qualitative technique for comparing large numbers of metabolic states in microbial systems without accurate quantitation. As indicated in Table 6, a number of proteins predicted to be involved in fixed nitrogen acquisition were identified only under nitrogen-fixing conditions. Examples include RPA0761, a possible oligopeptide transporter, and RPA3669, a putative urea short-chain amide/branched-chain amino acid uptake ABC transporter periplasmic solute-binding protein precursor.

Photoheterotrophic Cells Compared to Photoautotrophic Cells. The photoheterotrophic state was compared with the photoautotrophic state to evaluate proteins important for carbon dioxide fixation. In total, 18 proteins increased in abundance under the photoheterotrophic state and 37 proteins increased in abundance under the photoautotrophic state (Table S4, 4th tab). Three unknown proteins and one conserved unknown protein were found to be down-regulated in the photoautotrophic state. The glutamate synthase complex, as well as the protochlorophyllide reductase complex, was also found to be down-regulated. As expected, the ribulose-bisphosphate carboxylase large chain (RPA1559) and small chain (RPA1560) (RubisCO form I) were up-regulated under autotrophic growth. The only growth states in which RubisCO form I was detected with high sequence coverage were photoautotrophic and phototrophic benzoate growth (see below). This is expected because RubisCO is the key enzyme in CO2 fixation. Both components of RubisCO had relatively high spectral counts under both photoautotrophic and benzoate growth conditions, with an average of 199 for the large subunit and 131 for the small subunit from the photoautotrophic state. Interestingly, RPA1561, a CbbX protein homologue potentially associated with form I RubisCO, was up-regulated under photoautotrophic growth. RPA1561 was also detected under benzoate growth, indicating coexpression with RubisCO form I. The spectral counts for RPA1561 were much smaller, with an average of 30 for replicate runs, suggesting that it may not be as highly expressed as the known components of RubisCO.
Table 6. Proteins Identified Only under Nitrogen Fixation (partial list)a

| locus | lhaa1 | lhaa2 | Aerob1 | Aerob2 | Anaerob1 | Anaerob2 | Stat1 | Stat2 | Auto1 | Auto2 | N2_1 | N2_2 | Benz1 | Benz2 | functional assignment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RPA0274 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 88 | 92 | 29 | 0 | GlnK, nitrogen regulatory protein P-II |
| RPA0761 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 29 | 0 | 0 | possible oligopeptide ABC transporter |
| RPA1206 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 25 | 0 | 0 | aldehyde dehydrogenase |
| RPA1927 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 49 | 58 | 0 | 0 | unknown protein |
| RPA1928 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 62 | 72 | 0 | 0 | ferredoxin-like protein [2Fe-2S] |
| RPA2156 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 28 | 0 | 0 | unknown protein |
| RPA2593 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 14 | 4 | 0 | nitrogen assimilation regulatory protein ntrC |
| RPA3669 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 74 | 75 | 0 | 0 | amino acid uptake ABC transporter |
| RPA4209 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 44 | 0 | 0 | glutamine synthetase II |
| RPA4602 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 65 | 0 | 0 | ferredoxin like protein, fixX |
| RPA4603 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 35 | 40 | 0 | 0 | nitrogen fixation protein, fixC |
| RPA4604 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29 | 38 | 0 | 0 | electron transfer flavoprotein alpha chain |
| RPA4605 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | 54 | 0 | 0 | electron transfer flavoprotein beta chain fixA |
| RPA4608 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 7 | 0 | 0 | nitrogenase cofactor synthesis protein nifS |
| RPA4610 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 21 | 0 | 0 | Protein of unknown function |
| RPA4612 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | 17 | 0 | 0 | ferredoxin 2[4Fe-4S] III, fdxB |
| RPA4613 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 65 | 0 | 0 | DUF683 |
| RPA4614 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 36 | 46 | 0 | 0 | DUF269 |
| RPA4615 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 76 | 76 | 0 | 0 | nitrogenase molybdenum-iron protein nifX |
| RPA4618 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 82 | 0 | 0 | nitrogenase molybdenum-iron protein beta chain |
| RPA4619 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 68 | 75 | 0 | 0 | nitrogenase molybdenum-iron protein alpha chain |
| RPA4620 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 73 | 0 | 0 | nitrogenase iron protein, nifH |
| RPA4623 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 86 | 88 | 0 | 0 | fixU, nifT |
| RPA4631 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 56 | 0 | 0 | ferredoxin 2[4Fe-4S], fdxN |
| RPA4632 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 18 | 0 | 0 | NIFA, NIF-SPECIFIC REGULATORY protein |
| RPA4714 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 33 | 51 | 0 | 0 | unknown protein |

a Numbers in table report the percentage of residues in protein sequence that are represented by at least two peptides passing the filtering criteria.

A total of 10 conserved unknown proteins and 3 unknown proteins were also indicated as up-regulated under photoautotrophic growth. A number of these unknown proteins, such as RPA1114, RPA1243, RPA1244, RPA2786, RPA3309, RPA3568, and RPA4704, were strongly expressed under photoautotrophic, stationary phase, and benzoate growth according to all three indicators of expression.

Photoheterotrophic Exponential Phase Compared to Photoheterotrophic Stationary Phase. In this comparison, a total of 13 proteins were found to be down-regulated under the photoheterotrophic stationary state and 25 proteins were found to be up-regulated under the photoheterotrophic stationary phase (Table S4, 5th tab). Of the 13 proteins detected as down-regulated in stationary phase, five were also found down-regulated in the photoautotrophic state. These include RPA1542 and RPA1545, components of the protochlorophyllide reductase complex; RPA1975, a periplasmic mannitol binding protein; RPA2977, a ribonucleotide reductase; and two unknown proteins, RPA2297 and RPA3011. Previous work in our laboratories on Escherichia coli K12 comparing mid-log and stationary phase indicated that many proteins involved in protein turnover, folding, and stress response were up-regulated in the stationary phase (VerBerkmoes, unpublished data); however, this was not found to be the case for R. palustris. Not a single protein annotated as involved in stress response, protein turnover, or protein folding was detected to be up-regulated in the stationary phase. This may be due to a lack of knowledge of the function of proteins involved in the R. palustris stress response process. Indeed, nine conserved unknown proteins and two unknown proteins were identified as up-regulated in the stationary phase samples. Seven of these, RPA1114, RPA1243, RPA1244, RPA2786, RPA3309, RPA3568, and RPA4704, were also up-regulated during photoautotrophic growth. Many of these proteins had high spectral counts ranging from 50 to 100 in the photoautotrophic, benzoate and stationary phase states with very low spectral counts (