Large-Scale Identification of N-Terminal Peptides in the Halophilic Archaea Halobacterium salinarum and Natronomonas pharaonis
Michalis Aivaliotis,† Kris Gevaert,‡ Michaela Falb,† Andreas Tebbe,†,§ Kosta Konstantinidis,† Birgit Bisle,† Christian Klein,† Lennart Martens,‡,| An Staes,‡ Evy Timmerman,‡ Jozef Van Damme,‡ Frank Siedler,† Friedhelm Pfeiffer,† Joël Vandekerckhove,‡ and Dieter Oesterhelt*,†
Department of Membrane Biochemistry, Max Planck Institute of Biochemistry, 82152 Martinsried, Germany, and Department of Biochemistry and Medical Protein Research, Faculty of Medicine and Health Sciences, Ghent University and Flanders Interuniversity Institute for Biotechnology, Ghent, Belgium
Received January 22, 2007
Characterization of protein N-terminal peptides supports the quality assessment of data derived from genomic sequences (e.g., the correct assignment of start codons) and hints at in vivo N-terminal modifications such as N-terminal acetylation and removal of the initiator methionine. The current work represents the first large-scale identification of N-terminal peptides from prokaryotes, namely the two halophilic euryarchaeota Halobacterium salinarum and Natronomonas pharaonis. Two methods were used that specifically allow the characterization of protein N-terminal peptides: combined fractional diagonal chromatography (COFRADIC) and strong cation exchange chromatography (SCX), both known to enrich for N-terminally blocked peptides. In addition to these specific methods, N-terminal peptide identifications were extracted from our previous genome-wide proteomic data. Combining all data, 606 N-terminal peptides from Hbt. salinarum and 328 from Nmn. pharaonis were reliably identified. These results constitute the largest available dataset of identified and characterized protein N-termini for prokaryotes (archaea and bacteria). They allowed start codon assignments to be validated and improved, which matters because automatic gene finders tend to misassign start codons in GC-rich genomes. In addition, the dataset allowed N-terminal protein maturation in archaea to be unravelled, showing that 60% of the proteins undergo methionine cleavage and that, in contrast to current knowledge, Nα-acetylation is common in the archaeal domain of life, with 13-18% of the proteins being Nα-acetylated. The protein sets described in this paper are available by FTP and might be used as reference sets to test the performance of new gene finders.
Keywords: Halobacterium salinarum • Natronomonas pharaonis • archaea • halophilic • SCX • ESI Q-TOF • LC-MS/MS • N-terminal peptide • COFRADIC • gene finder
* To whom correspondence should be addressed. Tel: +49 8985782386. Fax: +49 8985783557. E-mail: [email protected]. † Max Planck Institute of Biochemistry. ‡ Ghent University and Flanders Interuniversity Institute for Biotechnology. § Current address: Hoffmann-La Roche Ltd., Roche Center for Medical Genomics, CH-4070 Basel, Switzerland. | Current address: EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
10.1021/pr0700347 CCC: $37.00 © 2007 American Chemical Society. Journal of Proteome Research 2007, 6, 2195-2204 (Vol. 6, No. 6; research article). Published on Web 04/20/2007.
Introduction
Haloarchaea, a phylogenetically well-defined group of the euryarchaeota, are able to grow under extreme conditions in environments such as solar salterns and hypersaline lakes, where NaCl concentrations may reach saturation.1-3 Haloarchaea use the salt-in strategy to cope with osmotic stress by accumulation of intracellular potassium ions to very high concentrations.4,5 Proteins are adapted to this intracellular condition by using an excess of acidic amino acids,6,7 especially on their surface,8,9 and consequently the proteome is highly acidic.10 The unique and highly unusual characteristics of halophiles make them attractive for basic research and biotechnology11 and also important model organisms for the development of systems biology.12,13 Many halophilic archaea have GC-rich chromosomes (e.g., Hbt. salinarum 68%14,15 and Nmn. pharaonis 63%16). Prediction of the theoretical proteome by commonly used gene finders like Glimmer17 is error-prone for GC-rich organisms.16,18,19 This can be attributed to an underrepresentation of the three TA-rich stop codons, resulting in a severe overprediction of open reading frames (ORFs). In Hbt. salinarum, there are on average 2.5 additional spurious ORFs with a length of at least 100 codons for each protein-coding gene, reaching a length of up to 1300 codons.10 An additional problem is the correct start codon assignment, which applies to many microbial genomes, especially to those with a high GC content. Correct start codon assignments are crucial for creating an “error-free” theoretical proteome that may then be used for the analysis of N-terminal protein processing such as methionine cleavage and N-terminal acetylation, which was previously considered not to occur in archaea.20 In addition, the identification of protein export signals (e.g., signal sequences or twin-arginine motifs) depends on a correct start codon assignment. Finally, long extensions beyond the in vivo start codon may seemingly exclude the existence of neighboring protein-coding genes. Therefore, experimental identification of protein N-termini is crucial for the validation of start codon assignments and allows the systematic study of the in vivo pattern of archaeal N-terminal protein processing. In the present work, we report a large-scale systematic identification of N-terminal peptides from the halophilic archaea Hbt. salinarum and Nmn. pharaonis using proteomic tools. Our aims were (i) to identify as many N-terminal peptides as possible from both organisms and (ii) to use the resulting data to verify our genomic data by checking the start codon assignments and, additionally, to study N-terminal modifications. Both specific and general proteomic approaches were used to obtain optimal results concerning the number and the reliability of the identified N-terminal peptides (Figure 1).
Figure 1. Workflow for the identification of N-terminal peptides from Hbt. salinarum and Nmn. pharaonis.
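The link drawn in the Introduction between GC content and ORF overprediction can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes a deliberately simple model in which bases are drawn independently with A = T and G = C, and the function names are illustrative only.

```python
def stop_codon_probability(gc_content: float) -> float:
    """Chance that a random codon is TAA, TAG or TGA.

    Assumption (not from the paper): bases are independent,
    with P(A) = P(T) and P(G) = P(C).
    """
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    p_at = (1.0 - gc_content) / 2.0  # probability of A (and of T)
    p_gc = gc_content / 2.0          # probability of G (and of C)
    # TAA = T*A*A; TAG = T*A*G; TGA = T*G*A
    return p_at ** 3 + 2.0 * p_at ** 2 * p_gc


def open_frame_probability(gc_content: float, codons: int) -> float:
    """Chance that a random reading frame has no stop in `codons` codons."""
    return (1.0 - stop_codon_probability(gc_content)) ** codons


if __name__ == "__main__":
    # 63% and 68% are the GC contents quoted for the two halophiles.
    for gc in (0.50, 0.63, 0.68):
        p_stop = stop_codon_probability(gc)
        p_open = open_frame_probability(gc, 100)
        print(f"GC={gc:.2f}  P(stop)={p_stop:.4f}  P(open >= 100 codons)={p_open:.4f}")
```

Under this toy model, a random frame at 68% GC stays open for 100 codons more than ten times as often as one at 50% GC, which matches the qualitative trend described in the text. Real genomes deviate from the independence assumption, so the numbers are only indicative.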
Materials and Methods
Archaeal Strains, Growth Conditions, and Protein Preparations. Reference cultures of the archaea Hbt. salinarum (strain R1, DSM 671) and Nmn. pharaonis (strain Gabara, DSM 2160) were grown aerobically in complex medium in the dark to late-log phase (about 40 Klett units) at 37 °C as described previously.10,21,22 Protein preparations for Hbt. salinarum and Nmn. pharaonis were performed according to Tebbe et al. (2005)10 and Konstantinidis et al. (2006),22 respectively.
General Methods for the Identification of N-Terminal Peptides. The mass spectrometric data obtained from our previous proteomic analyses (further details are provided below) of Hbt. salinarum and Nmn. pharaonis were reprocessed with MASCOT for the identification of N-terminal peptides and modified forms thereof (see Data Analysis). In particular, the following data sources were used: (i) proteome analysis of Hbt. salinarum using 2-DE followed by in-gel digestion and automated MALDI-TOF MS,10 (ii) membrane proteome analysis of Hbt. salinarum using 1-DE, followed by in-gel digestion and on-line LC-MS/MS on a Q-TOF,21 (iii) quantitative proteomics data from Hbt. salinarum using 1-DE of ICPL-labeled proteins, followed by in-gel digestion and data-dependent, off-line LC-MS/MS on a MALDI-TOF/TOF,23,24 (iv) analysis of low molecular weight proteins from Hbt. salinarum using 1-DE and in-gel digestion with optimized parameters for this protein class, followed by LC-MS/MS on a Q-TOF or an LTQ-FT,25 (v) analysis of the proteome of Nmn. pharaonis using protein prefractionation by SEC in combination with 1-DE, followed by in-gel digestion and LC-MS/MS on a Q-TOF,22 and (vi) some additional data from recently initiated proteomic projects. Experimental details can be found in the cited articles.
Specific Methods for the Isolation and Identification of N-Terminal Peptides. COFRADIC: For the specific enrichment of N-terminal peptides from the proteins in cytosolic preparations of both archaea, COFRADIC was applied as described by Gevaert et al. (2003).26 Peptides isolated this way were analyzed by automated LC-MS/MS either on a Q-TOF (Micromass UK Limited, Cheshire, UK) mass spectrometer (for Nmn. pharaonis; for MS settings see Konstantinidis et al. (2006)22) or on an Esquire HCT ion trap (Bruker Daltonics, Bremen, Germany) mass spectrometer (for Hbt. salinarum; for MS settings see Lavens et al. (2006)27). SCX Chromatography: SCX chromatography as described by Beausoleil et al. (2004)28 for the enrichment of phosphopeptides was applied with minor modifications for the enrichment of N-terminal peptides.
In brief, about 10 mg of protein extract was dissolved and denatured in 6 M urea and 2 M thiourea, reduced with 1 mM dithiothreitol (DTT) for 45 min at room temperature (RT) and carbamidomethylated with 5 mM iodoacetamide for 45 min at RT in the dark. Alkylated proteins were diluted four times with deionized water and then digested with sequencing grade modified trypsin (Promega) overnight, with a protease/protein ratio of 1/100. The resulting peptide mixture was acidified with trifluoroacetic acid (TFA) to pH < 3, and SCX chromatography was performed on a SMART HPLC system (Amersham Biosciences) at a flow rate of 200 µL/min, using a 3.0 mm × 20 cm column (Poly LC, Columbia, MD) containing 5 µm polysulfoethyl aspartamide beads with a 200 Å pore size. UV detection was set at both 220 and 280 nm. After equilibrating the column with at least three column volumes of solvent B (5 mM KH2PO4, 30% acetonitrile, 350 mM KCl, pH 2.7) and solvent A (5 mM KH2PO4, 30% acetonitrile, pH 2.7), the peptides were dissolved in solvent A, loaded on the column and eluted with a linear gradient from solvent A to B over 200 min. Fractions of 2 mL were collected, dried under vacuum, and analyzed by LC-ESI-MS using an LTQ-FT mass spectrometer (Thermo Electron) and a Q-TOF Ultima (Waters-Micromass) as described previously.29-31 Two independent biological replicates were analyzed and combined, giving similar identifications.
Mass Spectrometry. The proteomic data described and used in the present work were obtained from different mass spectrometers. For ESI-MS/MS, a Q-TOF instrument (Ultima, Waters-Micromass) and two ion-trap instruments (Esquire 3000+ ion trap and HCT ion trap, Bruker Daltonics) were used. For MALDI-MS/MS, a MALDI-TOF/TOF instrument (4700 Proteomics Discovery System, Applied Biosystems) was used. An LTQ-FT instrument (Thermo Electron) was used to collect LC-MS/MS and MS/MS/MS data. In addition, data obtained on a MALDI-TOF MS (Reflex III MALDI-TOF, Bruker Daltonics) are included. Details of the mass spectrometers and the search parameters used are provided in Table 1.
Table 1. Mass Spectrometers Used and Their Mass Tolerance Settings
instrument                                             ion source  mass spectrometer  peptide tolerance  MS/MS tolerance
Q-TOF Ultima (Micromass)                               ESI         Q-TOF              1.00 Da            0.20 Da
Q-TOF Ultima (Micromass) (COFRADIC)                    ESI         Q-TOF              0.30 Da            0.30 Da
LTQ-FT (Thermo)                                        ESI         Ion-trap/FT-ICR    0.15 Da            0.15 Da
Esquire 3000+ ion trap (Bruker)                        ESI         Ion-trap           1.50 Da            0.50 Da
HCT ion trap (Bruker) (COFRADIC)                       ESI         Ion-trap           2.00 Da            0.50 Da
4700 Proteomics Discovery System (Applied Biosystems)  MALDI       TOF/TOF            150 ppm            0.50 Da
Reflex III MALDI-TOF (Bruker)                          MALDI       TOF                200 ppm            -
Data Analysis. Database Search: The mass spectrometric data from the different experimental approaches were analyzed using the MASCOT 2.1.0.4 algorithm32 to search against the Hbt. salinarum and Nmn. pharaonis protein databases.10 The searches were executed at the 95% significance level with carbamidomethylation of cysteines as a fixed modification and trypsin as the digesting enzyme, with a maximum of 3 missed cleavages allowed. The choice of the variable modifications and the tolerances for the peptide mass and the MS/MS fragment masses depended on the experiment and the mass spectrometer used (see Table 1). In all cases, oxidation of methionines and N-terminal acetylation and formylation of the proteins were used as variable modifications. For SILAC experiments, isotopic labeling with 13C leucine was chosen as a variable modification, whereas for ICPL experiments, modification of the lysines and the protein N-terminus with 12C-6-nicotinoyl-N-hydroxy-succinimide (light label) and 13C-6-nicotinoyl-N-hydroxy-succinimide (heavy label) were chosen as variable modifications. The COFRADIC MS/MS data were first searched against the protein databases, followed by searches of the non-identified spectra against peptide databases reflecting in vivo trimming at protein N-termini. These peptide databases were made by the DBToolKit algorithm.33 The following variable amino acid modifications were considered when searching the protein databases: acetyl (N-term), N-formyl (protein), oxidation (M),
pyro-cmC (N-term camC), pyro-glu (N-term Q), and carbamidomethyl (C), whereas acetyl (K) was considered as a fixed modification.
Estimation of the False Positive Rate: For the estimation of the false positive rate, several criteria were used. Initially, the MASCOT search was executed at the 95% significance level, which means that 5% false positives are allowed.32 Only rank 1 hits exceeding Mascot’s identity threshold score at the 95% significance level were used for identification. This procedure was applied to two databases for each organism: (i) the normal database of Hbt. salinarum and Nmn. pharaonis, respectively, and (ii) a “reverted database” of both organisms where all sequences are written from the C-terminus to the N-terminus.
Data Analysis at Different Significance Cutoff Values: We estimated the effects of modulating the scoring system by analyzing the data at different significance cutoff values. The MASCOT peptide score is not dependent on the setting of the significance parameter. MASCOT provides a significance threshold score at a given significance level, commonly 95% (i.e., 5% false positives). A 10-fold increase of the significance level (99.5% significance, 0.5% false positives) corresponds to an increase of the significance threshold score by 10; a 100-fold higher significance level (99.95% significance, 0.05% false positives) corresponds to an increase by 20. Thus, the data could be filtered at different significance levels, expressed as a “difference score” by which the 95% significance threshold score had to be exceeded. Difference scores from 5 to 20 were applied to investigate whether data subsets were found predominantly in the low-scoring region.
Computation of Statistical Data for Several Genomes: Several genome statistics were compiled for halophiles sequenced in the Department of Membrane Biochemistry at the MPI of Biochemistry (Hbt. salinarum strain R1, Hqr. walsbyi, Nmn.
pharaonis strain Gabara) and for a selection of published genomes (obtained from NCBI). The GC content and the ORF data were computed for the longest replicon of the species, representing its (major) chromosome. Genomes contain two types of open reading frames (ORFs): (i) ORFs that code for proteins (hereafter referred to as “real proteins”) and (ii) ORFs for which genome annotators assume that they are not translated to proteins, although many of them are rather long (i.e., open for more than 100 codons, in Hbt. salinarum up to 1340 codons). ORFs from this second set (hereafter called “spurious ORFs”) are not provided in the NCBI ORF lists. In the HaloLex system,15 where the genome of Hbt. salinarum strain R1 (sequenced in the Department of Membrane Biochemistry at the MPI of Biochemistry) is available, they are provided but marked as “spurious ORFs”. It should be noted that spurious ORFs are especially frequent in GC-rich genomes. The complete set of real proteins (i.e., the “real protein” set from HaloLex and the annotated ORF set from NCBI) represents the theoretical proteome of the organism.
Figure 2. Loss of identifications upon increasing the stringency of evaluation in Hbt. salinarum. The lines show the percentage of identifications lost when scores must exceed the significance threshold score at 95% significance by the indicated difference score. A difference score of 10 indicates a 10-fold increase in significance. “Real proteins” are indicated in blue, spurious ORFs in red. In the pie diagrams, red indicates the fraction of identifications that are attributed to spurious ORFs.
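The relation between the difference score and the significance level follows directly from the statement that every 10 score units correspond to a 10-fold change in the allowed false positive rate. A minimal sketch of that conversion is given below; the function names are illustrative, and treating the cutoff as inclusive ("at least" the chosen difference score) is an assumption.

```python
def false_positive_rate(difference_score: float, base_rate: float = 0.05) -> float:
    """Expected false positive rate when the 95% threshold score must be
    exceeded by `difference_score` (10 score units = 10-fold change)."""
    return base_rate * 10.0 ** (-difference_score / 10.0)


def significance_level(difference_score: float) -> float:
    """Significance level equivalent to a given difference score."""
    return 1.0 - false_positive_rate(difference_score)


def passes_cutoff(peptide_score: float, threshold_95: float, min_difference: float) -> bool:
    """True if the peptide score exceeds the 95% threshold score by at
    least `min_difference` (inclusive comparison is an assumption)."""
    return (peptide_score - threshold_95) >= min_difference


if __name__ == "__main__":
    for d in (0, 5, 10, 15, 20):
        print(f"difference score {d:>2}: "
              f"significance {100 * significance_level(d):.2f}%, "
              f"false positives {100 * false_positive_rate(d):.3f}%")
```

Because the peptide score itself does not change with the significance setting, a filter like `passes_cutoff` can be applied to existing search results without repeating the database search.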
We analyzed whether ORFs can be extended beyond their annotated start codon, as illustrated in Figure 4. For this purpose, all ORFs coding for “real proteins” were extended up to the previous in-frame stop codon (indicated by the gray line preceding the coding region in Figure 4A), and then alternative start codons in the extension were identified (indicated by light blue triangles in Figure 4A). The bioinformatic analysis was supported by the MiGenAS software infrastructure.34 Six-frame translation was performed with a minimum length of 100 codons, allowing ATG and GTG start codons. ORFs from six-frame translation were mapped to those of the theoretical proteome by the position of the stop codon, which is identical and unique. ORFs from six-frame translation that could not be mapped to the theoretical proteome by this procedure are considered to be spurious ORFs. Such spurious ORFs are indicated by empty red arrows in Figure 4A. Three computations were performed with these ORF sets. (i) The “multiplicity of coding” (red diamonds in Figure 4B) was obtained in three steps. The length of each ORF in bp was computed. After summing up all ORF lengths, the resulting number was divided by the chromosome length. A value of 1 indicates that, on average, the chromosome is completely covered with ORFs, whereas a value of 6 is the theoretical maximum. A value of 3 as approximated by Hbt. salinarum (having 68% GC) is consistent with the fact that 5333 spurious ORFs are found in the chromosome of Hbt. salinarum, which codes for 2131 proteins, resulting in a ratio of 2.5 (Figure 4B, illustrated in the inset boxed red). (ii) The maximal ORF extension was computed by comparing the ORF as currently annotated to the longest possible ORF starting with ATG or GTG. From these data, the average length of the ORF extensions
was computed (blue triangles in Figure 4C, illustrated in the inset of the figure). (iii) The average number of alternative start codons in the ORF extensions was computed after maximal extension of the ORF, counting ATG and GTG codons in the extension (blue squares in Figure 4B, illustrated in the inset boxed blue). From these data, the average over all ORFs was computed.
ORF sets for evaluation of gene finders: For each of the species, three ORF sets were computed to allow assessment of microbial gene finders (Supplementary Table 1, see Supporting Information). These ORF sets and the underlying genome sequences are publicly available for FTP download from the MiGenAS portal (www.migenas.org). (a) The expert-validated ORF set (EV set) for the chromosomes of Hbt. salinarum strain R115 and Nmn. pharaonis strain Gabara16 was obtained by rigorous evaluation of automatic gene finder data based on additional data and evidence: (i) genome-wide proteomic data with stringent settings for protein identification (see above), (ii) intergenome comparison of the ORF sets from four halophiles (Hbt. salinarum,14,15 Nmn. pharaonis,16 Hqr. walsbyi,35 Har. marismortui36) using blast (e-values usually better than e-20) with subsequent evaluation by PERL scripts with respect to protein existence and start codon assignment, or by manual inspection, and (iii) consideration of general characteristics of halophilic proteins, which tend to be highly acidic unless containing predicted transmembrane domains.8,10,16 The resulting EV set represents the chromosomal subset of the theoretical proteome of the organism to the best of our knowledge. (b) The proteomics-validated ORF set (PV set) for both species contains all chromosomal proteins identified with high stringency in our genome-wide set of proteomic data. For this set, the existence of the proteins is considered to be proven (the false positive rate is expected to be maximally 0.05%). Start codons have been assigned to the best of our knowledge, but only a subset of the start codons has been experimentally validated. The PV set is a subset of the EV set. (c) The chromosomal ORF set with proteomic validation of the N-terminus (NV set) under high stringency (see above for details; the false positive rate is expected to be only 0.2%) is a subset of the PV set.
Figure 3. Identification of proteins, N-terminal peptides, and acetylated N-terminal peptides attributed to N-terminus-specific methods, in both organisms. The black section indicates the fraction of proteins uniquely identified by the specific method, and the blue section proteins that were identified by the specific as well as by other methods. The light gray section indicates the fraction that was not identified by the specific method. Identification of N-acetylated peptides is not possible for COFRADIC as applied in the present work.
Assessment of microbial gene finders for GC-rich genomes: The chromosomes of the two GC-rich halophilic archaea, Hbt. salinarum strain R1 and Nmn. pharaonis strain Gabara, were subjected to the gene finder Reganor18 within the GenDB genome annotation system as described.16 In brief, Reganor first calls the gene finder Critica.37 Subsequently, the gene finder Glimmer17 is trained with the Critica ORF set because using this training set is reported to ensure good performance of Glimmer.18 Reganor matches the two ORF sets by the position of the stop codon and selects ORFs and start codons in a rule-based step. One of the fundamental rules is to avoid extensive gene overlaps (illustrated in Figure 5). After execution of Reganor, the ORF set was overloaded within the GenDB genome annotation system with the EV set, again mapping ORFs by their stop codons. GenDB allows retrieval of the original results from the various gene finders. Gene finder results were compared to the EV set to identify false positive and false negative ORF predictions. (i) Predicted ORFs that cannot be mapped to the EV set are considered to be false positives. (ii) ORFs from the EV set that have not been predicted are considered to be false negatives. Gene finder
results were compared to the PV set in order to identify false negatives. Despite the high ratio of identified proteins, the PV set is incomplete and thus cannot be used to compute false positives. Gene finder results were compared to the EV, PV, and NV sets in order to assess the performance of start codon assignment, assuming that the annotated start codons are correct. Assignment of a different alternative start codon by a gene finder is considered an error. The length difference between the two ORF versions was computed.
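Two of the computations described in this section, the "multiplicity of coding" and the comparison of gene finder output with a reference ORF set by stop codon position, can be sketched in a few lines. This is an illustration under assumptions rather than the authors' scripts: ORFs are represented as a mapping from a hypothetical (strand, stop position) key to a start position, and all names are invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical ORF key: (strand, stop codon position). The text states that
# the stop codon position identifies an ORF uniquely.
StopKey = Tuple[str, int]


def multiplicity_of_coding(orf_lengths_bp: List[int], chromosome_length_bp: int) -> float:
    """Sum of all ORF lengths divided by the chromosome length."""
    if chromosome_length_bp <= 0:
        raise ValueError("chromosome length must be positive")
    return sum(orf_lengths_bp) / chromosome_length_bp


def compare_to_reference(predicted: Dict[StopKey, int],
                         reference: Dict[StopKey, int]) -> dict:
    """Map predicted ORFs to a reference set by stop codon position.

    Both arguments map an ORF key to its start codon position.
    Returns false positives, false negatives, and, for shared ORFs with a
    different start codon, the length difference in codons.
    """
    false_positives = sorted(k for k in predicted if k not in reference)
    false_negatives = sorted(k for k in reference if k not in predicted)
    start_errors = {
        k: abs(predicted[k] - reference[k]) // 3  # same frame, so a multiple of 3
        for k in predicted
        if k in reference and predicted[k] != reference[k]
    }
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "start_errors": start_errors,
    }


if __name__ == "__main__":
    reference = {("+", 1200): 300, ("-", 5000): 5900}
    predicted = {("+", 1200): 240, ("+", 3000): 2500}
    print(compare_to_reference(predicted, reference))
    print(multiplicity_of_coding([300, 600, 900], 600))
```

Against an incomplete reference such as the PV set, only the false negative and start codon parts of this comparison are meaningful, for the reason given in the text.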
Figure 4. (A) Illustration of the ORF overprediction problem and the origin of the statistical data represented in B and C. Two “real proteins” with proteomic identification are shown as filled green arrows. Possible N-terminal extensions of the reading frames up to the preceding in-frame stop codon are indicated as a gray line. Possible N-terminal extensions that start with ATG or GTG are indicated as a light blue line. The average length of these extensions has been computed (plotted for several species in C). All potential start codons in the extension are indicated as dark blue triangles (their number being plotted for several species in B). Additional long spurious ORFs (starting with ATG or GTG) are indicated as open red arrows. Up to four distinct reading frames are open. The average number of open frames (for real proteins plus spurious ORFs) is referred to as “multiplicity of coding” (plotted in B). (B) Statistical data were computed for the chromosome from 26 completely sequenced organisms, including the four halophiles Hbt. salinarum (HS), Nmn. pharaonis (NP), Hqr. walsbyi (HQ), and Har. marismortui (HM). Full details are provided in Supplementary Table 5 (see Supporting Information). Non-halophilic organisms were selected so as to cover a wide range of GC contents. Red, “multiplicity of coding” (computation detailed in Materials and Methods). Blue, average number of alternative start codons in possible N-terminal extensions. (C) Statistical data were computed for the organisms described in B. Plotted is the average length of N-terminal extensions. Note that the scale on the y-axis is logarithmic.
Figure 5. A region of the Nmn. pharaonis genome, visualized by the GenDB43 genome annotation system, illustrates various gene finder problems. Protein-coding genes are represented by blue arrows, spurious ORFs by empty arrows. Original ORF predictor results are indicated by colored lines below the ORFs (blue, Critica;37 red, Glimmer;17 green, six-frame translation). The type A overlap contains two ORFs predicted by Glimmer, one being a false positive. In the type B overlaps, Critica did not annotate the corresponding genes (false negatives). In both cases, the start codon assignment for the neighboring genes was incorrect, leading to long extensions that result in large gene overlaps with subsequent elimination of the ORF from the gene list.
Results and Discussion
Results Overview. In the present work, general as well as specific approaches were used for the identification of N-terminal peptides (Figure 1), leading to the reliable identification of 606 and 328 N-terminal peptides from the two halophilic archaea Hbt. salinarum and Nmn. pharaonis, respectively (Supplementary Table 2, see Supporting Information). As a general approach, we used our previous data from genome-wide inventory proteomics on Hbt. salinarum and Nmn. pharaonis. Both organisms are phylogenetically related and have GC-rich genomes10,16 that are known to cause severe problems in gene prediction.18,19 Results from gene finders (Reganor, Glimmer, Critica) are especially error-prone in assigning start codons.16 To reach our goal of reliably identifying N-terminal peptides, we searched our large-scale proteomic data for N-terminal peptides. In addition, we present results from two methods tailored to identify N-terminal peptides. The first method is COFRADIC, in which different chemical modifications of N-terminal amines are introduced before and after tryptic cleavage, resulting in a high probability to identify
N-terminal peptides.26 The other method is peptide separation by SCX, which was recently shown to enrich less positively charged peptides like phosphopeptides in the low-salt fractions.28 As N-terminally acetylated peptides lack a positive charge at their α-N-terminal amino group, such peptides are also enriched in the non-binding fraction and in the early fractions under low-salt eluting conditions.30,38 This resulted in the identification of a large number of N-terminally acetylated peptides, which was surprising because N-terminal acetylation had rarely been detected in archaea before and was considered to be very infrequent. A full description of the N-terminal modification analysis by bioinformatics methods, including all data on N-terminal peptides shown in the present work, is presented in Falb et al. (2006).39 In the present work, we describe the methodological details of the identification of N-terminal peptides by proteomics. In addition, we compare the accuracy of gene prediction, and especially of start codon assignment, by several gene finders using proteomics-validated ORF sets (the set of reliably identified proteins and the set of proteins with a reliably identified N-terminal peptide). These validated ORF sets will be made available by FTP and thus can be used to evaluate the performance of other gene finders. The quality of this dataset and the reliability of our biological conclusions critically depend on the reliability of the underlying proteomic data. Therefore, we first describe our evaluation of the false positive rate, which leads us to apply a stringent scoring system.
Reliability of Proteomic Identification. A standard method to estimate the false positive rate is a database search against a reverted database where the sequences are written from the C-terminus to the N-terminus.40 The resulting sequences are highly similar to the original sequences with respect to amino acid composition as well as to the proteolytic fragments.
However, the MS/MS fragmentation pattern is different. The false positive rate is estimated under the assumption that all identifications in the reverted database are false positives and all identifications in the normal database are correct calls. We used MASCOT as the search engine, which provides a probability-based MOWSE score and a significance threshold score for the applied significance level,32 commonly set at 95%
(i.e., 95% of the results are expected to be correct, while 5% are potentially false). In our case, the false positive rate as determined by searches against the reverted database is comparable to the expected false positive rate for the applied 95% significance level. Owing to the mathematical foundations of the MASCOT scoring system, data analysis at a different significance level does not require repeating the database search, as only the significance threshold score is affected by the significance level and not the MOWSE score itself. We define a “difference score” as the MOWSE score minus the threshold score for identity at the 95% significance level. A difference score of 10 indicates a 10-fold increase in reliability, being equivalent to application of a 99.5% significance level (i.e., a 0.5% false positive rate). The commonly used significance level of 95% still allows for a sometimes intolerable number of false positive identifications, which may lead to misinterpretation of data, especially from peptide-centric (gel-free) proteome analyses. This is illustrated by the proteomic “identification” of N-terminal peptides from spurious ORFs, i.e., from open reading frames longer than 100 codons for which we assume that they are not translated into proteins (for further details see below). A first interpretation of the data at the 95% significance level in Hbt. salinarum results in 88 “identifications” from spurious ORFs, which would indicate that they are translated even though this is considered highly unlikely by the genome annotators. However, with increasing stringency, spurious ORF “identifications” are nearly completely lost while “normal” protein identifications are only moderately reduced (Figure 2, Supplementary Table 3, see Supporting Information). This is consistent with the expectation that false positive identifications are overrepresented in the low-scoring region.
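The reverted-database check described above can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the `REV_` name prefix is a hypothetical convention, and the rate is taken as reverted-database hits divided by normal-database hits, which is one common way to express the estimate but is not spelled out in the text.

```python
from collections import Counter
from typing import Dict


def revert_database(proteins: Dict[str, str]) -> Dict[str, str]:
    """Write every sequence from the C-terminus to the N-terminus.

    The 'REV_' prefix is a hypothetical naming convention for this sketch.
    """
    return {f"REV_{name}": seq[::-1] for name, seq in proteins.items()}


def same_composition(seq_a: str, seq_b: str) -> bool:
    """True if both sequences contain the same residues in the same counts."""
    return Counter(seq_a) == Counter(seq_b)


def estimated_false_positive_rate(hits_normal: int, hits_reverted: int) -> float:
    """Reverted-database hits relative to normal-database hits.

    Assumes every hit in the reverted database is a false positive; the
    choice of denominator is an assumption of this sketch.
    """
    if hits_normal <= 0:
        raise ValueError("need at least one hit in the normal database")
    return hits_reverted / hits_normal


if __name__ == "__main__":
    normal = {"example_protein": "MTDEKLAR"}  # made-up sequence
    reverted = revert_database(normal)
    print(reverted)
    print(same_composition(normal["example_protein"], reverted["REV_example_protein"]))
```

Reversal keeps the amino acid composition intact, which is why the reverted database behaves like the original in terms of peptide masses while giving different fragmentation patterns.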
With a difference cutoff value of 10 (i.e., at a 99.5% significance level), spurious ORF “identifications” are reduced by 86.4% from 88 to 12, while “real protein” identifications are only reduced by 16.4% from 794 to 664. With a difference cutoff of 15 (i.e., at a 99.8% significance level) spurious ORF “identifications” are nearly completely eliminated (96.6% reduction, 3 spurious ORFs remaining), retaining 603 of the “real protein” identifications (reduction by 24.1%). We therefore evaluate proteomic data under high stringency, requesting a difference score of at least 15, to minimize false positive identifications although this also results in false negatives and thus reduces our results set. This conceptual decision avoids drawing conclusions that are incorrect, whereas still allowing to obtain sufficient large datasets for biologically meaningful conclusions. Accordingly, throughout this report, peptides are considered as identified when their difference score exceeds 15, i.e., data are accepted at a 99.8% significance level. General Approach for Identification of N-Terminal Peptides. In the general approach, proteomic identification data obtained from various proteomic projects on Hbt. salinarum and Nmn. pharaonis10,21-25 were evaluated for the identification of N-terminal peptides and modified forms thereof. A high number of protein and N-terminal peptide identifications were obtained by scanning through huge sets of proteomic data as presented in Supplementary Tables 2 and 4 (Supporting Information) (Figure 3). Specific Approaches for Identification of N-Terminal Peptides and Modified Forms Thereof. In addition to the identification of N-terminal peptides from general proteomic datasets, two specific N-terminal peptide methods were applied. Using Journal of Proteome Research • Vol. 6, No. 6, 2007 2201
these methods, 283 and 220 N-terminal peptides were reliably identified from Hbt. salinarum and Nmn. pharaonis, respectively, with most peptides identified by COFRADIC (240 and 220, respectively) (Figure 3; Supplementary Table 4, Supporting Information). Figures 3 and 6 show that COFRADIC, and the specific methods in general, contribute a much larger share of the N-terminal peptide identifications when few large-scale proteomic data are available. COFRADIC, as one of the specific methods, preferentially retrieves N-terminal peptides by applying in vitro acetylation to block protein N-termini and free amino groups.26 This hides information about in vivo acetylation of the proteins, and this dataset was therefore excluded from our statistics regarding Nα-acetylation of the proteins (Figure 3). It was previously shown that N-terminally blocked peptides are also strongly enriched in the flow-through and low-affinity binding SCX fractions at low pH: non-acetylated N-terminal peptides carry a positive charge at the N-terminus, which is removed upon N-acetylation with a concomitant decrease in affinity to the SCX matrix. As can be seen in Supplementary Table 4 (Supporting Information) and in Figure 3, SCX chromatography with subsequent mass spectrometry resulted in a comparatively high number (11) of unique identifications of acetylated N-termini, although only few data were collected and accordingly only few proteins were uniquely identified. The identification of N-terminal acetylation in archaea was rather unexpected, since only few acetylated proteins have been reported for archaea so far. Also, only few bacterial proteins have been found to be N-acetylated, and it was assumed that this holds true for all prokaryotes, i.e., also for archaea.20 A full report on N-terminal processing in archaea, based on the proteomic datasets described here, is given in Falb et al. (2006).39

Figure 6. Circle diagrams for Hbt. salinarum and Nmn. pharaonis showing the overlap of N-terminal peptide identifications between specific and general approaches. The blue circle represents the N-terminal peptides identified from general approaches, and the light turquoise circle the N-terminal peptides identified from specific approaches.

N-Terminal Protein Maturation in Archaea. N-terminal protein maturation39 consists of two common processes, methionine cleavage and N-terminal acetylation. In organisms from all three domains of life, the initiator methionine is removed from more than half of the proteins if the second residue is small. In contrast, each domain of life shows a specific pattern with respect to N-terminal acetylation. Proteins are only rarely acetylated in bacteria, whereas most eukaryotic proteins are acetylated (using yeast and mammals as reference systems), and the acetylation occurs on the initiator methionine as well as on the second residue after methionine removal.20 About 15%
of the archaeal proteins are N-terminally acetylated (although partial acetylation is quite common). Nα-acetylation occurs nearly exclusively subsequent to methionine removal. Only N-terminal serine and alanine were found to be acetylated, whereas N-terminal threonine, which occurs in many proteins, was never found to be acetylated. It is not yet known whether an N-end rule pathway for specific protein degradation41 also exists in archaea and how it relates to the N-terminal modifications studied in the present work.

GC-Rich Halophilic Archaea Exemplify an ORF Overprediction Problem. Genome annotation starts with gene prediction, for which a series of standard tools has been developed, among them Critica,37 Glimmer,17 and Reganor.18 GC-rich genomes are a considerable challenge to gene finders18,19 as revealed by a large number of gene prediction errors, especially concerning start codon assignment.16 GC-rich genomes have a reduced density of the AT-rich stop codons (TAA, TAG, and TGA), so that long open reading frames arise by chance more often. Figure 4 illustrates this ORF overprediction problem by three theoretical computations. The “multiplicity of coding” (Figure 4B, red diamonds) indicates how many reading frames are on average open at every base position of the chromosome. This value was computed (for details, see Materials and Methods) after six-frame translation using ATG and GTG as start codons and a minimum ORF length of 100 codons. Whereas organisms with moderate or low GC content have a “multiplicity of coding” of about 1 as expected, GC-rich organisms reach values up to 3 (Supplementary Table 5, Supporting Information), as also illustrated by the high number of spurious ORFs (illustrated in the inset boxed blue, Figure 4B). In addition, the genes from the theoretical proteome were extended beyond the assigned start codon up to the previous in-frame stop codon, and the number of start codons in this extension was computed (Figure 4B, blue squares, illustrated in the inset boxed red).
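The “multiplicity of coding” described above can be approximated with a short script. This is a minimal sketch and not the procedure from Materials and Methods: it assumes that an ORF runs from the first ATG or GTG after a stop codon to the next in-frame stop codon (TAA, TAG, or TGA), treats the chromosome as linear, and ignores ORFs that are not terminated by a stop codon. All function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def open_bases_in_frame(seq, frame, min_codons=100):
    """Count bases covered by sufficiently long ORFs in one reading frame.

    An ORF spans from the first start codon after a stop codon up to, but
    not including, the next in-frame stop codon (an assumed definition).
    """
    covered = 0
    start = None  # position of the first start codon since the last stop
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            if start is not None and (i - start) // 3 >= min_codons:
                covered += i - start
            start = None
        elif start is None and codon in START_CODONS:
            start = i
    return covered


def multiplicity_of_coding(seq, min_codons=100):
    """Average number of open reading frames per base over all six frames."""
    seq = seq.upper()
    total = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            total += open_bases_in_frame(strand, frame, min_codons)
    return total / len(seq)


# Toy example with a lowered length cutoff; real genomes would use 100 codons.
toy = "ATG" + "GCC" * 5 + "TAA"
print(multiplicity_of_coding(toy, min_codons=3))
```

With the default cutoff of 100 codons, the toy sequence contains no qualifying ORF and the function returns 0. A value near 1 would be expected for a densely coding genome without spurious ORFs.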
On average, proteins from GC-rich organisms have a higher number of additional start codons in the extensions. The average length of the extensions increases with increasing GC content of the chromosome (Figure 4C, illustrated in the inset of the figure). These data illustrate why gene finder performance is rather weak for GC-rich genomes. The resulting problems can be overcome by proteomic validation, which allows us to obtain a reliable theoretical proteome.

Validated Datasets Allow Evaluation of Gene Finders. Here, we describe three protein sets with an increasing level of reliability with respect to protein existence and start codon assignment. These sets will be made publicly available to provide reference sets for the performance analysis of gene finders (Supplementary Table 1, Supporting Information). The first set (EV set) corresponds to the theoretical proteome of Hbt. salinarum and Nmn. pharaonis. To generate the EV set, automatic gene finder results were scrutinized (as further detailed in Materials and Methods) based on sequence homology with known proteins and other characteristics of halophilic proteins such as acidic pI values.6-10 Nevertheless, a significant fraction of the theoretical proteome is not based on experimental evidence and thus may contain errors concerning both ORF selection and start codon assignment. A second set contains all proteins that have been reliably identified by proteomics (PV set). This set allows the determination of false negative rates for gene finders and the derivation of additional statistical data for halophilic proteins. The third set (NV set)
contains only proteins with a reliably identified N-terminal peptide as described in this work. It can be used to evaluate the performance of gene finders with respect to start codon assignments. The use of the three sets to assess the performance of the three gene finders used by our group is described below.

Gene Finders Exhibit Severe Start Codon Assignment Problems. Using the datasets described above, three microbial gene finders were assessed with respect to their performance on gene prediction and start codon assignment. The three gene predictors are Critica,37 Glimmer,17 and Reganor,18 the latter combining results from the two former and being part of the GenDB annotation platform.42 For this purpose, predicted genes and gene starts for the GC-rich Hbt. salinarum and Nmn. pharaonis chromosomes were compared with the established validated gene sets. Compared to the EV gene set (Supplementary Tables 6 and 7, Supporting Information), Critica tends to underpredict genes (9.5-13.3% false negatives), whereas Glimmer tends to overpredict genes (13.2-18.5% false positives). Reganor combines the strengths of Critica and Glimmer, so that it has the best overall performance. Thus, Reganor is well suited for gene prediction in GC-rich genomes, as stated previously.18 Furthermore, the proteome-validated gene set PV was used to determine false negative genes. The results confirm Critica’s tendency to underpredict genes. However, the false negative rates are lower for the PV set than for the EV set. This may indicate that the EV set is not error-free despite rigorous postprocessing of automatic gene finder results and that some ORFs currently classified as “real proteins” should actually be marked as “spurious ORFs”. Besides, proteins that are difficult to predict could also be difficult to identify by proteomics (small proteins25 or proteins with low expression rates exhibiting low CAI values22).
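The comparison of a gene finder's output with a reference set such as EV or PV can be sketched as follows. This is an illustrative sketch only: it assumes that a predicted gene and a reference gene are the same gene when they share strand and stop codon position (so that start codon differences do not count as gene-level errors), that the false negative rate is expressed relative to the reference set, and that the false positive rate is expressed relative to the predictions. The `Gene` type and the function names are hypothetical and are not taken from Critica, Glimmer, or Reganor.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    strand: str  # "+" or "-"
    start: int   # position of the first base of the start codon
    stop: int    # position of the last base of the stop codon


def gene_key(gene):
    """Identify a gene by strand and stop position, independent of its start."""
    return (gene.strand, gene.stop)


def prediction_error_rates(predicted, reference):
    """Return false negative and false positive rates of a prediction."""
    predicted_keys = {gene_key(g) for g in predicted}
    reference_keys = {gene_key(g) for g in reference}
    false_negatives = reference_keys - predicted_keys  # missed reference genes
    false_positives = predicted_keys - reference_keys  # genes absent from reference
    return {
        "false_negative_rate": len(false_negatives) / len(reference_keys),
        "false_positive_rate": len(false_positives) / len(predicted_keys),
    }


# Toy example: one reference gene is missed and one extra gene is predicted.
reference = [Gene("+", 100, 400), Gene("+", 600, 900),
             Gene("-", 1500, 1200), Gene("+", 2000, 2600)]
predicted = [Gene("+", 70, 400), Gene("+", 600, 900),
             Gene("-", 1500, 1200), Gene("-", 3300, 3000)]
print(prediction_error_rates(predicted, reference))
```

In the toy example, the first predicted gene has a shifted start but still counts as found, because only the stop position is used for matching.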
Gene starts from the EV set as well as proteomics-verified N-termini (NV set) were used to assess the performance of gene finders with respect to start codon assignments (Supplementary Table 7, Supporting Information). The gene finders differ markedly in their performance, which points to defective gene prediction. Critica and Reganor have an error rate of 10-18%, whereas Glimmer has an error rate of 32-44%. All three gene finders frequently predict genes that are too long (genes that are too short are much less frequent). Start codon misassignments do not necessarily affect overall gene prediction significantly in the case of small start shifts. However, for all gene finders, about 3-6% of the predicted genes were too long by more than 50 amino acids. This systematic preference for longer gene versions also leads to misinterpretations in subsequent domain prediction; e.g., signal sequences and lipid anchor motifs, which are searched for in regions close to the N-terminus, could escape detection. An example is a large N-terminal cytoplasmic domain of TatC1 assigned for Hbt. salinarum strain NRC-1,43 which is probably the result of a misassigned start codon. Start codon selection also affects the overall gene assignment in GC-rich genomes. Due to potential gene extensions of considerable length (Figure 4C), large overlaps of neighboring genes are rather frequent and have to be corrected by expert validation, i.e., by shortening one or both overlapping genes. Overlapping genes are not solely the result of start codon misassignments (overlap type B) but can also arise from gene overprediction (overlap type A) (Figure 5). Thus, for each of the overlapped ORFs, it has to be decided whether genes should be shortened (type B) or whether one of the genes should be excluded (type A). This overlap ambiguity (misassigned gene or misassigned start codon) leaves space for improvement in
the handling of overlaps by Reganor and Critica in GC-rich genomes. These tools select only one gene per genome position and generally assume type A overlaps. This program behavior, which always results in exclusion of one of the overlapping genes, even if the overlap can be resolved by gene shortening, significantly contributes to the false negative rate of Reganor and Critica (Supplementary Tables 6 and 7, Supporting Information). For Reganor, this problem is reduced by the inclusion of Glimmer data, resulting in an acceptable false negative rate. Glimmer generally permits overlaps and Reganor excludes only Glimmer genes with extensive overlaps, which are commonly true type A overlaps.
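The start codon comparison discussed in this section (predicted genes that are too long or too short relative to a validated N-terminus) can be sketched as follows. This is a minimal illustration under stated assumptions: predicted and validated genes are already paired by a shared stop codon, coordinates refer to the first base of the start codon, and on the reverse strand the start is the larger coordinate. The function names and the data layout are hypothetical.

```python
def classify_start(strand, predicted_start, validated_start):
    """Classify a predicted start codon against a validated start codon.

    Returns a label and the size of the shift in codons.
    """
    if predicted_start == validated_start:
        return "correct", 0
    # An upstream start yields a longer gene: smaller coordinate on "+",
    # larger coordinate on "-".
    if strand == "+":
        shift_nt = validated_start - predicted_start
    else:
        shift_nt = predicted_start - validated_start
    label = "too long" if shift_nt > 0 else "too short"
    return label, abs(shift_nt) // 3


def summarize_starts(pairs, long_extension_codons=50):
    """Count start assignment outcomes for (strand, predicted, validated) tuples."""
    counts = {"correct": 0, "too long": 0, "too short": 0, "too long by > cutoff": 0}
    for strand, predicted_start, validated_start in pairs:
        label, shift_codons = classify_start(strand, predicted_start, validated_start)
        counts[label] += 1
        if label == "too long" and shift_codons > long_extension_codons:
            counts["too long by > cutoff"] += 1
    return counts


# Toy example with four gene pairs.
pairs = [
    ("+", 100, 100),  # identical start
    ("+", 100, 280),  # predicted start 60 codons upstream
    ("-", 500, 470),  # reverse strand, predicted start 10 codons upstream
    ("+", 130, 100),  # predicted start 10 codons downstream
]
print(summarize_starts(pairs))
```

The cutoff of 50 codons mirrors the threshold of 50 amino acids used in the text for strongly extended genes. Distinguishing type A from type B overlaps would require additional evidence and is not attempted here.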
Conclusions

In the present work, we performed a large-scale identification of N-terminal peptides from the two halophilic archaea Hbt. salinarum and Nmn. pharaonis. Combining general and specific proteomic methods, 606 N-terminal peptides from Hbt. salinarum and 328 from Nmn. pharaonis were reliably identified, constituting the largest available dataset of identified and characterized protein N-termini for prokaryotes. Its quality was ensured by applying a very stringent scoring system to increase the reliability of the proteomic identifications. The dataset is valuable for the study of N-terminal protein maturation in archaea and in prokaryotes in general. In addition, the set of proteins with reliably identified N-terminal peptides was used to compare the accuracy of gene prediction, and especially of start codon assignment, by several gene finders. This validated protein set may be a valuable tool for groups developing or testing gene finders and will therefore be made available by FTP.
Acknowledgment. We thank Prof. Dr. M. Mann and his co-workers for the LTQ-FT data that are included in the data pool for the general approach. We also thank Dr. F. Lottspeich and his co-workers for contributing MALDI-TOF/TOF data, and Dr. Markus Rampp for supporting bioinformatic computations, including access to the completely sequenced genomes, through the MiGenAS infrastructure. The excellent technical support of Sigrid Bauer and Beatrix Scheffer is highly appreciated. M.A. thanks the Alexander von Humboldt Foundation for financial support. The lab in Ghent acknowledges the support of research grants from the Fund for Scientific Research-Flanders (Belgium) (Project Number G.0008.03) and the GBOU-research initiative (Project Number 20204) of the Flanders Institute of Science and Technology (IWT). Both labs acknowledge the financial support of the European Union Interaction Proteome (6th Framework Program).

Supporting Information Available: Supplementary Tables 1-7. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) Valera, F. R. Characteristics and microbial ecology of hypersaline environments; CRC Press: Boca Raton, FL, 1988; Vol. I.
(2) Tindall, B. J.; Trueper, H. G. Syst. Appl. Microbiol. 1986, 7, 202-212.
(3) Gunde-Cimerman, N.; Oren, A.; Plemenitaš, A. Adaptation to Life at High Salt Concentrations in Archaea, Bacteria, and Eukarya. Springer: New York, 2005.
(4) Christian, J. H.; Waltho, J. A. Solute concentrations within cells of halophilic and non-halophilic bacteria. Biochim. Biophys. Acta 1962, 65, 506-508.
(5) Ginzburg, M.; Sacks, L.; Ginzburg, B. Z. Ion metabolism in Halobacterium. J. Gen. Physiol. 1970, 55, 178-207.
(6) Lanyi, J. K. Salt-Dependent Properties of Proteins from Extremely Halophilic Bacteria. Bacteriol. Rev. 1974, 38(3), 272-290.
(7) Danson, M. J.; Hough, D. W. The structural basis of protein halophilicity. Comp. Biochem. Physiol. 1997, 117A, 307-312.
(8) Kennedy, S. P.; Ng, W. V.; Salzberg, S. L.; Hood, L.; DasSarma, S. Understanding the adaptation of Halobacterium species NRC-1 to its extreme environment through computational analysis of its genome sequence. Genome Res. 2001, 11, 1641-1650.
(9) Marg, B. L.; Schweimer, K.; Sticht, H.; Oesterhelt, D. A Two-Helical Extra Domain Mediates the Halophilic Character of a Plant Type Ferredoxin from the Archaeon Halobacterium salinarum. Biochemistry 2005, 44, 29-39.
(10) Tebbe, A.; Klein, C.; Bisle, B.; Siedler, F.; Scheffer, B.; Garcia-Rizo, C.; Wolfertz, J.; Hickmann, V.; Pfeiffer, F.; Oesterhelt, D. Analysis of the cytosolic proteome of Halobacterium salinarum and its implication for genome annotation. Proteomics 2005, 5(1), 168-179.
(11) Hampp, N.; Oesterhelt, D. Bacteriorhodopsin and its Potential in Technical Applications. Wiley-VCH-Verlag: Weinheim, 2004; pp 146-167.
(12) Gan, R. R.; Yi, E. C.; Chiu, Y.; Lee, H.; Kao, Y. P.; Wu, T. H.; Aebersold, R.; Goodlett, D. R.; Ng, W. V. Proteome Analysis of Halobacterium sp. NRC-1 Facilitated by the Biomodule Analysis Tool BMSorter. Mol. Cell. Proteomics 2006, 5(6), 987-997.
(13) Nutsch, T.; Oesterhelt, D.; Gilles, E. D.; Marwan, W. A quantitative model of the switch cycle of an archaeal flagellar motor and its sensory control. Biophys. J. 2005, 89(4), 2307-2323.
(14) Ng, W. V.; Kennedy, S. P.; Mahairas, G. G.; Berquist, B.; Pan, M.; Shukla, H. D.; Lasky, S. R.; Baliga, N.
S.; Thorsson, V.; Sbrogna, J.; Swartzell, S.; Weir, D.; Hall, J.; Dahl, T. A.; Welti, R.; Goo, Y. A.; Leithauser, B.; Keller, K.; Cruz, R.; Danson, M. J.; Hough, D. W.; Maddocks, D. G.; Jablonski, P. E.; Krebs, M. P.; Angevine, C. M.; Dale, H.; Isenbarger, T. A.; Peck, R. F.; Pohlschroder, M.; Spudich, J. L.; Jung, K. H.; Alam, M.; Freitas, T.; Hou, S. B.; Daniels, C. J.; Dennis, P. P.; Omer, A. D.; Ebhardt, H.; Lowe, T. M.; Liang, R.; Riley, M.; Hood, L.; DasSarma, S. Genome sequence of Halobacterium species NRC-1. Proc. Natl. Acad. Sci. U.S.A. 2000, 97(22), 12176-12181.
(15) Pfeiffer, F. et al. Unpublished, http://www.halolex.mpg.de/
(16) Falb, M.; Pfeiffer, F.; Palm, P.; Rodewald, K.; Hickmann, V.; Tittor, J.; Oesterhelt, D. Living with two extremes: Conclusions from the genome sequence of Natronomonas pharaonis. Genome Res. 2005, 15(10), 1336-1343.
(17) Delcher, A. L.; Harmon, D.; Kasif, S.; White, O.; Salzberg, S. L. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999, 27(23), 4636-4641.
(18) McHardy, A. C.; Goesmann, A.; Puhler, A.; Meyer, F. Development of joint application strategies for two microbial gene finders. Bioinformatics 2004, 20(10), 1622-1631.
(19) Nielsen, P.; Krogh, A. Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 2005, 21, 4322-4329.
(20) Polevoda, B.; Sherman, F. N-terminal acetyltransferases and sequence requirements for N-terminal acetylation of eukaryotic proteins. J. Mol. Biol. 2003, 325(4), 595-622.
(21) Klein, C.; Garcia-Rizo, C.; Bisle, B.; Scheffer, B.; Zischka, H.; Pfeiffer, F.; Siedler, F.; Oesterhelt, D. The membrane proteome of Halobacterium salinarum. Proteomics 2005, 5(1), 180-197.
(22) Konstantinidis, K.; Tebbe, A.; Klein, C.; Scheffer, B.; Aivaliotis, M.; Bisle, B.; Falb, M.; Pfeiffer, F.; Siedler, F.; Oesterhelt, D. Genomewide proteomics of Natronomonas pharaonis. J. Proteome Res. 2007, 6, 185-193.
(23) Tebbe, A.; Schmidt, A.; Konstantinidis, K.; Falb, M.; Bisle, B.; Klein, C.; Kellermann, J.; Siedler, F.; Pfeiffer, F.; Lottspeich, F.; Oesterhelt, D. Life-Style changes of a Halophilic Archaeon analyzed by Quantitative Proteomics. Mol. Cell. Proteomics 2007, submitted.
(24) Bisle, B.; Schmidt, A.; Scheibe, B.; Klein, C.; Tebbe, A.; Kellermann, J.; Siedler, F.; Pfeiffer, F.; Lottspeich, F.; Oesterhelt, D. Quantitative profiling of the membrane proteome in a halophilic archaeon. Mol. Cell. Proteomics 2006, 27, 1543-1558.
(25) Klein, C.; Aivaliotis, M.; Olsen, J. V.; Falb, M.; Besir, H.; Scheffer, B.; Bisle, B.; Tebbe, A.; Konstantinidis, K.; Siedler, F.; Pfeiffer, F.; Mann, M.; Oesterhelt, D. Proteome analysis of low molecular weight proteins in Halobacterium salinarum. J. Proteome Res. 2007, 6, 1510-1518.
Journal of Proteome Research • Vol. 6, No. 6, 2007
(26) Gevaert, K.; Goethals, M.; Martens, L.; Van Damme, J.; Staes, A.; Thomas, G. R.; Vandekerckhove, J. Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted N-terminal peptides. Nat. Biotechnol. 2003, 21(5), 566-569.
(27) Lavens, D.; Montoye, T.; Piessevaux, J.; Zabeau, L.; Vandekerckhove, J.; Gevaert, K.; Becker, W.; Eyckerman, S.; Tavernier, J. A complex interaction pattern of CIS and SOCS2 with the leptin receptor. J. Cell Sci. 2006, 11(119), 2214.
(28) Beausoleil, S. A.; Jedrychowski, M.; Schwartz, D.; Elias, J. E.; Villen, J.; Li, J.; Cohn, M. A.; Cantley, L. C.; Gygi, S. P. Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc. Natl. Acad. Sci. U.S.A. 2004, 101(33), 12130-12135.
(29) Olsen, J. V.; Mann, M. Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proc. Natl. Acad. Sci. U.S.A. 2004, 101(37), 13417-13422.
(30) Gruhler, A.; Olsen, J. V.; Mohammed, S.; Mortensen, P.; Faergeman, N. J.; Mann, M.; Jensen, O. N. Quantitative phosphoproteomics applied to the yeast pheromone signaling pathway. Mol. Cell. Proteomics 2005, 4(3), 310-327.
(31) Gruhler, A.; Schulze, W. X.; Matthiesen, R.; Mann, M.; Jensen, O. N. Stable isotope labeling of Arabidopsis thaliana cells and quantitative proteomics by mass spectrometry. Mol. Cell. Proteomics 2005, 4(11), 1697-1709.
(32) Pappin, D.; Hojrup, P.; Bleasby, A. J. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 1993, 3(6), 327-332.
(33) Martens, L.; Vandekerckhove, J.; Gevaert, K. DBToolkit: processing protein databases for peptide-centric proteomics. Bioinformatics 2005, 21(17), 3564.
(34) Rampp, M.; Soddemann, T.; Lederer, H. The MIGenAS integrated bioinformatics toolkit for web-based sequence analysis. Nucleic Acids Res. 2006, 34(Web Server issue), W15-9.
(35) Bolhuis, H.; Palm, P.; Wende, A.; Falb, M.; Rampp, M.; Rodriguez-Valera, F.; Pfeiffer, F.; Oesterhelt, D. The genome of the square archaeon Haloquadratum walsbyi: life at the limits of water activity. BMC Genomics 2006, 7(1), 169.
(36) Baliga, N. S.; Bonneau, R.; Facciotti, M. T.; Pan, M.; Glusman, G.; Deutsch, E. W.; Shannon, P.; Chiu, Y.; Weng, R. S.; Gan, R. R.; Hung, P.; Date, S. V.; Marcotte, E.; Hood, L.; Ng, W. V. Genome sequence of Haloarcula marismortui: a halophilic archaeon from the Dead Sea. Genome Res. 2004, 14(12), 2510.
(37) Badger, J. H.; Olsen, G. J. CRITICA: coding region identification tool invoking comparative analysis. Mol. Biol. Evol. 1999, 16(4), 512-524.
(38) Crimmins, D. L.; Gorka, J.; Thoma, R. S.; Schwartz, B. D. Peptide characterization with a sulfoethyl aspartamide column. J. Chromatogr. 1988, 29(443), 63-71.
(39) Falb, M.; Aivaliotis, M.; Garcia-Rizo, C.; Bisle, B.; Tebbe, A.; Klein, C.; Konstantinidis, K.; Frank, S.; Pfeiffer, F.; Oesterhelt, D. Archaeal N-terminal protein maturation commonly involves N-terminal acetylation: a large-scale proteomics survey. J. Mol. Biol. 2006, 362(5), 915-924.
(40) Qian, W. J.; Liu, T.; Monroe, E. M.; Strittmatter, F. E.; Jacobs, M. J.; Kangas, J. L.; Petritis, K.; Camp, G. D., II; Smith, D. R. Probability-Based Evaluation of Peptide and Protein Identifications from Tandem Mass Spectrometry and SEQUEST Analysis: The Human Proteome. J. Proteome Res. 2005, 4, 53-62.
(41) Erbse, A.; Schmidt, R.; Bornemann, T.; Schneider-Mergener, J.; Mogk, A.; Zahn, R.; Dougan, D. A.; Bukau, B. ClpS is an essential component of the N-end rule pathway in Escherichia coli. Nature 2006, 439, 753-756.
(42) Meyer, F.; Goesmann, A.; McHardy, A. C.; Bartels, D.; Bekel, T.; Clausen, J.; Kalinowski, J.; Linke, B.; Rupp, O.; Giegerich, R.; Puhler, A. GenDB-an open source genome annotation system for prokaryote genomes. Nucleic Acids Res. 2003, 31(8), 2187-2195.
(43) Bolhuis, A.
Protein transport in the halophilic archaeon Halobacterium sp. NRC-1: a major role for the twin-arginine translocation pathway? Microbiol. Mol. Biol. Rev. 2002, 148(11), 3335-3346.
PR0700347