Analysis of Shotgun Proteomics and RNA Profiling Data from Arabidopsis thaliana Chloroplasts
Sacha Baginsky,* Torsten Kleffmann, Anne von Zychlinski, and Wilhelm Gruissem†
Institute of Plant Science and Functional Genomics Center Zürich, Swiss Federal Institute of Technology, ETH Zürich, Universitätstrasse 2, 8092 Zürich, Switzerland
Received December 14, 2004
The integration of data from transcriptional profiling and shotgun proteomics experiments provides information about the identified proteins that goes beyond their plain detection. We have analyzed results from MS/MS shotgun detection of 426 Arabidopsis chloroplast proteins and genome-wide RNA profiling to identify correlations between gene expression, protein abundance, and protein characteristics that influence their detection in high-throughput proteome analyses. The integrated data analysis revealed a significant molecular mass bias for the detection of proteins that were expressed at low transcript levels. Overall, the sequence coverage of most of the identified proteins increases with transcript levels, indicating a positive correlation between transcript and relative protein abundance. This does not apply to a subset of the identified proteins, suggesting specific properties that alter their detection in shotgun proteomics. This integrative comparison is a suitable strategy to validate large-scale proteomics data and offers an assessment of the depth of the proteome analysis and the confidence in protein identification.
Keywords: mass spectrometry • shotgun proteomics • transcriptional profiling • chloroplast
* To whom correspondence should be addressed. Sacha Baginsky, Institute of Plant Sciences, Swiss Federal Institute of Technology, ETH Zentrum, LFW E51.1, Universitätstrasse 2, CH-8092 Zürich, Switzerland. Phone: +41 1 632 3866. Fax: +41 1 632 10 79. E-mail: sacha.baginsky@ipw.biol.ethz.ch.
† Functional Genomics Center Zürich.
10.1021/pr049764u CCC: $30.25 © 2005 American Chemical Society
Introduction
Systems biology attempts to understand biological processes through an integrative analysis of gene expression patterns, proteomes, and metabolomes (reviewed in refs 1-4). To approach a full understanding of system functions, it is necessary to understand the relationship between transcript and protein levels as well as how enzyme concentrations and activities contribute to the steady-state concentration of a metabolite. These relationships contain the actual information on the construction and function of cellular networks and ultimately the workings of the cell. Several studies have already analyzed the relationship between transcript and protein levels to understand the extent to which transcriptional or posttranscriptional regulation is active for multiple pathways under different experimental conditions at a specific point in time. For example, the results obtained with yeast range from “no correlation” to “pathway-specific correlation” or “positive correlation”.5-7 This broad variance confirms that gene expression is regulated at different levels and that cells can shift regulatory levels to adjust to prevailing conditions. These experiments also point out that more data are required to establish the biological significance of such correlations. We report here a detailed analysis of transcript levels and protein detection in a recent analysis of the Arabidopsis thaliana chloroplast proteome.8 Our results show that the majority of all identified proteins are also expressed at high RNA levels. Identified proteins whose genes are expressed at low transcript levels were on average significantly larger than identified proteins from higher expression level groups or all proteins whose genes were detected on the GeneChip. These data suggest a technical bias of shotgun proteomics that significantly contributes to protein detection. The knowledge about protein detection requirements in our shotgun proteome study helped us to assess the depth of our analysis and the confidence of plastid protein identification.
Journal of Proteome Research 2005, 4, 637-640
Published on Web 03/16/2005
Experimental Procedures
Protein Identification. Chloroplast proteins were identified from chloroplasts that were isolated by Percoll density gradient centrifugation from 7-week-old plants grown under short-day conditions (8 h light at 100 µE, 16 h dark), as described before.8 In brief, proteins from purified Arabidopsis chloroplasts were separated into 300 complex fractions using a combination of solubility-based serial extraction, ion exchange, and Cibacron blue affinity chromatography. All fractions were further separated by SDS-PAGE followed by in-gel tryptic digestion and reversed-phase chromatography of the peptides on C18 material. Columns were laboratory-made silica tubing nano-tip capillaries with an inner diameter of 75 µm filled with C18 material to a length of 8 cm. For the analysis, the flow rate was adjusted to approximately 300 nL/min. Peptide and protein identification were performed as described in Kleffmann and colleagues.8 In brief, peptides eluting from the RP column were analyzed in the data-dependent acquisition mode on an LCQDeca XP (Thermo Finnigan, San Jose), with one full scan and four data-dependent MS/MS scans of the four most intense parent ions. MS/MS data interpretation was based on the SEQUEST software (Thermo Finnigan, San Jose) with the following hierarchical criteria for peptide identification:
(1) XCorr value of at least 2.0.
(2) Ion ratio (ratio between detected and expected b- and y-ions for each peptide) of at least 40%.
(3) Doubly charged parent ion.
(4) Long peptides with more than 50 theoretical ions were discarded.
(5) At least four different peptides with the above criteria per identified protein.
(6) Protein identifications based on fewer than four peptides [as detailed in (1)-(5)] were manually examined for correct peak assignment and spectra quality.
Transcriptional Profiling. The mRNA expression analysis was performed as described previously.8 In brief, 100 mg of leaves were harvested from Arabidopsis plants of identical growth conditions and developmental age as the plants used for the proteome analysis (see Isolation of intact chloroplasts). Total RNA was isolated in triplicate using the Trizol reagent according to the manufacturer’s instructions (Life Technologies, USA). The RNA was further purified using the Qiagen RNeasy miniprep kit (Qiagen GmbH, Germany). Twenty µg of total RNA was used to prepare the cDNA and biotin-labeled cRNA as recommended by Affymetrix (Santa Clara, USA). Hybridization to the full genome Arabidopsis GeneChip, detection of the labeled cRNA with the streptavidin-phycoerythrin system, and subsequent scanning of the GeneChip with a confocal scanner were performed according to the manufacturer’s instructions (Affymetrix). Raw data were processed using the Microarray Suite 5.0 software (Affymetrix).
Figure 1. Box plots depicting RNA expression levels of identified chloroplast proteins (PLprot), all proteins encoded in the Arabidopsis genome (ATH), and all proteins predicted to localize to the chloroplast by TargetP (TargetP). The RNA expression levels of all identified proteins (PLprot) were compared with those of all genes on the Arabidopsis thaliana GeneChip and all predicted chloroplast proteins (TargetP). Indicated are the median values (line in the box), the 25th and 75th percentiles (lower and upper ends of the box), the 5th and 95th percentiles (lower and upper bars), and the outliers (single dots).
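The peptide and protein criteria listed under Protein Identification can be illustrated with a minimal Python sketch. It is not the authors' software: the `PeptideHit` record, its field names, and both function names are hypothetical, and the criteria are applied here as a simple conjunction, which is only one possible reading of "hierarchical". Only the thresholds (XCorr of at least 2.0, ion ratio of at least 40%, doubly charged parent ion, no more than 50 theoretical ions, four different peptides per protein) come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    """Hypothetical record for one SEQUEST peptide match (field names are assumptions)."""
    protein: str
    sequence: str
    xcorr: float
    ion_ratio: float        # detected / expected b- and y-ions, as a fraction 0..1
    charge: int             # charge state of the parent ion
    theoretical_ions: int   # number of theoretical fragment ions


def passes_peptide_criteria(hit: PeptideHit) -> bool:
    """Apply criteria (1)-(4) from the text to a single peptide match."""
    return (
        hit.xcorr >= 2.0
        and hit.ion_ratio >= 0.40
        and hit.charge == 2
        and hit.theoretical_ions <= 50
    )


def classify_proteins(hits, min_peptides=4):
    """Split proteins into accepted ones (criterion 5) and ones needing manual review (criterion 6)."""
    peptides = defaultdict(set)
    for hit in hits:
        if passes_peptide_criteria(hit):
            peptides[hit.protein].add(hit.sequence)  # count distinct peptides only
    accepted = sorted(p for p, seqs in peptides.items() if len(seqs) >= min_peptides)
    manual_review = sorted(p for p, seqs in peptides.items() if len(seqs) < min_peptides)
    return accepted, manual_review
```

Proteins with no passing peptide appear in neither list. The manual examination of peak assignment and spectra quality described in criterion (6) is not modeled; the sketch only flags the proteins that would need it.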
Results and Discussion
Judging the Depth of a Proteome Analysis. We previously reported the identification of 499 proteins from a multidimensional protein fractionation approach.8 After subtracting all proteins for which no information on their transcript levels was available (e.g., all plastid-encoded genes), we calculated average RNA levels for all remaining proteins detected by shotgun proteomics (Figure 1, n = 426). The RNA expression levels of the identified proteins (PLprot) are significantly higher than the average RNA expression levels of all genes on the Arabidopsis genome chip (ATH) that gave a presence call and all
TargetP-predicted chloroplast proteins (Figure 1; the geometric mean is presented). Similar results were reported previously from a shotgun analysis of the Escherichia coli proteome, where the RNA expression levels of all identified proteins were three to five times higher than the mean of all genes present on the E. coli genome chip.10 Our results suggest that correlations exist between transcript levels and protein detection by MS/MS shotgun proteomics. Using the protein coverage (i.e., how much of the amino acid sequence of an identified protein is covered by the identified tryptic peptides) as a semiquantitative measure of protein abundance, we previously determined a weak (Spearman rank correlation of 0.53) but significant positive correlation between transcript and protein levels.8 The expression data presented in Figure 1 therefore suggest that we detected the most abundant chloroplast proteins with our experimental setup. This is expected, as several proteome analyses have described difficulties in the detection of low-abundance proteins. Thus, we can expect that a significant number of chloroplast proteins are still undetected and that a deeper analysis is necessary to approach the full plastid proteome. Information of this type is important for two main reasons. First, the mean RNA expression levels of genes whose protein products were identified from different protein fractions allow us to judge how efficiently a fractionation strategy enhances proteome coverage. On this basis, optimal fractionation strategies can be designed. Second, organelle proteomics as reported here creates knowledge about protein targeting and the sensitivity of targeting prediction programs. These numbers have become important figures for computational biology. The reliability of this information depends to a significant extent on the depth of the analysis, which can be judged by an integrative analysis as reported here.
Revealing a Detection Bias for Large Proteins.
With the known abundance bias, it is notable that we still detected some proteins whose genes are expressed at low RNA levels. Therefore, we used “detected” as a criterion to compare the characteristics of the identified proteins in five different classes of RNA levels ranging from 5000 arbitrary units (these units are normalized signal intensity readouts from the GeneChip scanner and provide a direct assessment of transcript abundance). Proteins that were identified in the first two classes (RNA levels 2000) this bias becomes progressively less relevant, i.e., we also identified more low molecular weight proteins (Table 1, Figure 2). The molecular mass bias for proteins in the classes of RNA levels 60% for each expression level bin.
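The comparison of protein characteristics across RNA-level classes can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the bin edges 500, 1000, 2000, and 5000 are assumed for the example (the text fixes only that there are five classes and that the top class lies above 5000 arbitrary units), and the function names and the choice of the median as summary statistic are likewise hypothetical.

```python
from bisect import bisect_right
from statistics import median

# Assumed class boundaries in arbitrary GeneChip units; only the five-class
# layout and the top boundary of 5000 are taken from the text.
CLASS_EDGES = [500, 1000, 2000, 5000]
CLASS_LABELS = ["<500", "500-1000", "1000-2000", "2000-5000", ">=5000"]


def expression_class(rna_level):
    """Return the class label for one RNA level (lower bounds are inclusive)."""
    return CLASS_LABELS[bisect_right(CLASS_EDGES, rna_level)]


def median_mass_by_class(proteins):
    """Median molecular mass (kDa) per expression class.

    `proteins` is an iterable of (rna_level, mass_kda) pairs for detected
    proteins; classes without any detected protein are omitted.
    """
    groups = {label: [] for label in CLASS_LABELS}
    for rna_level, mass_kda in proteins:
        groups[expression_class(rna_level)].append(mass_kda)
    return {label: median(masses) for label, masses in groups.items() if masses}
```

A mass bias of the kind described would show up as a higher median in the low-expression classes than in the high-expression classes. Whether such a difference is significant needs a statistical test, which this sketch does not perform.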
value of 94.6 kDa for proteins that were predicted to localize to “any other location”. Our analysis clearly shows that high molecular mass is not a genuine feature of plastid proteins that are expressed at low RNA levels but rather a result of technical limitations of shotgun proteomics. Consequently, this observation has no biological basis and thus no biological relevance. Since statistics about protein characteristics that were extracted from shotgun proteomics data are important for bioinformatics approaches, it is necessary to be aware of any bias that is created by the employed technique or the experimental design. Our data uncover such a technical bias that creates a potential pitfall for statistics-based computational biology in cases where molecular mass is used as a variable.
Figure 4. Distribution of proteins predicted to localize to mitochondria among different confidence criteria for true plastid proteins. Three different criteria were used to characterize the confidence in protein localization: sequence coverage above 20%, expression levels below 2000, and absence from a dedicated analysis of the mitochondrial (mt) proteome.14 Only three of the 25 identified proteins did not comply with any of the above criteria.
We made an interesting observation concerning the TargetP prediction sensitivity for the proteins in the different expression level groups. TargetP prediction is best for the group of highly expressed proteins [83% for those in >5000 (Table 1)] and decreases progressively to 51% sensitivity for the proteins in the expression group 40%, and >60% and plotted sequence coverage against transcript levels (Figure 3). As expected, the sequence coverage for the majority of the identified proteins increases with the transcript levels (Figure 3). Thus, the detection characteristics of proteins that fall at the edges of Figure 3, i.e., low sequence coverage/high transcript levels and high sequence coverage/low transcript levels, deviate from those of the majority of the other proteins. We find, e.g., that known contaminants from other cellular compartments fall into the region that contains highly expressed proteins that were detected with sequence coverage below 20% (Figure 3). Examples of these contaminants are catalase, plasma membrane intrinsic protein 2A, glycine decarboxylase complex-protein, and delta tonoplast integral protein (>5000/