Technical Note pubs.acs.org/jpr
Proteomics Wants cRacker: Automated Standardized Data Analysis of LC−MS Derived Proteomic Data
Henrik Zauber and Waltraud X. Schulze*
MPI for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam-Golm, Germany
Supporting Information
ABSTRACT: The large-scale analysis of thousands of proteins under various experimental conditions or in mutant lines has become increasingly important in hypothesis-driven scientific research and systems biology in recent years. Quantitative analysis by large-scale proteomics using modern mass spectrometry usually results in long lists of peptide ion intensities. The main interest for most researchers, however, is to draw conclusions on the protein level. Postprocessing and combining peptide intensities of a proteomic data set requires expert knowledge, and the often repetitive and standardized manual calculations can be time-consuming. The analysis of complex samples can result in very large data sets (lists with several thousand to 100 000 entries of different peptides) that cannot easily be analyzed using standard spreadsheet programs. To improve the speed and consistency of the analysis of LC−MS derived proteomic data, we developed cRacker. cRacker is an R-based program for automated downstream proteomic data analysis, including data normalization strategies for metabolic labeling and label-free quantitation. In addition, cRacker includes basic statistical analysis, such as clustering of data, or ANOVA and t tests for comparison between treatments. Results are presented in editable graphic formats and in list files.
KEYWORDS: quantitative proteomics, large scale data analysis, multivariate statistics, automation, software tool, ion intensities
■
INTRODUCTION
The field of mass spectrometry-based proteomics, that is, the large-scale quantitative analysis of proteins and their modifications, is a fast-evolving research field in the life sciences. In high-throughput proteomic experiments, liquid chromatography-coupled tandem mass spectrometric (LC−MS) analysis of digested peptide mixtures has found wide application.1−4 In general, modern LC−MS is able to identify up to several thousand proteins in a few runs of complex peptide mixtures.5 Additionally, during the past decade, the development of mass spectrometers with high mass accuracy and high dynamic range has turned mass spectrometric analysis from a merely qualitative into a quantitative discipline.6−8 To account for technical variance between different LC−MS runs, it is important to normalize the quantitative values of each sample before an overall comparative analysis can be done. In general, there are two major strategies to perform this sample-to-sample normalization. One is to provide a labeled standard mixture, so that peptide ion intensities are compared to those of their coeluting isotopically labeled counterparts. In the other, label-free quantitation, peptide ion intensities are usually expressed relative to total ion intensities before samples can be compared. Both quantitation strategies have their advantages and disadvantages, and depending on the type of biological experiment either method may be used.9 The advantage of the isotope labeling strategy, such as 15N-labeling10 or SILAC,11 clearly lies in the high accuracy of quantitative ratios achieved by providing an internal standard, but it may result in lower proteome coverage due to higher sample complexity.9 In the label-free strategy, by contrast, samples have lower complexity, and the strategy can be applied to any kind of proteomic data set. It requires less laboratory work but results in peptide ion intensities relative to total ion intensities. In contrast to the isotopic labeling strategy, the label-free strategy is more affected by technical variations and biases. Therefore, appropriate data normalization and filtering of peptide intensities are crucial for meaningful data interpretation.12−17 Irrespective of the quantitation strategy applied, in modern quantitative proteomic experiments the majority of the data processing work takes place after peptide identification and quantitation.4 Bioinformatic processing steps, such as functional categorization,18,19 statistical analysis, and correction for multiple testing, need to be considered. For protein identification, commercial and noncommercial search engines (e.g., Mascot,20 Sequest, Andromeda,21 OMSSA,22 X!Tandem23) are available, and quantitation of ion intensities can be done by the respective quantitation platforms such as MaxQuant,21 MSQuant,24 the Trans-Proteomic Pipeline,25 or Thermo Proteome Discoverer. Although they differ in their precise quantitation algorithms (peak recognition, etc.), all of these tools in the end provide a
Received: May 3, 2012 Published: September 17, 2012 © 2012 American Chemical Society
dx.doi.org/10.1021/pr300413v | J. Proteome Res. 2012, 11, 5548−5555
Journal of Proteome Research
format of different software packages can vary, defined cRacker import settings for different quantitation outputs are necessary. The supplied cRacker version includes predefined definitions for MSQuant and MaxQuant. Users can add new definitions by editing the import-config.csv file in the program folder “import definitions”.
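To illustrate what an import definition has to accomplish, the following minimal sketch maps software-specific column headers onto the four fields that cRacker requires (peptide sequence, mass-to-charge ratio, ion intensity, and sample identity). The dictionary layout, the function name `read_peptide_list`, the software labels, and all column headers are assumptions chosen for this example; they do not reproduce the actual layout of import-config.csv or the output of any particular quantitation software.

```python
import csv
import io

# Hypothetical import definitions: internal field -> column header in the
# quantitation output. Labels and header names are illustrative only.
IMPORT_DEFINITIONS = {
    "software_a": {"sequence": "Sequence", "mz": "m/z",
                   "intensity": "Intensity", "sample": "Raw file"},
    "software_b": {"sequence": "Peptide", "mz": "MZ",
                   "intensity": "Area", "sample": "Run"},
}


def read_peptide_list(handle, definition, delimiter="\t"):
    """Read a tabular peptide list and return rows keyed by internal field names."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [col for col in definition.values() if col not in header]
    if missing:
        raise ValueError(f"input list lacks required columns: {missing}")
    rows = []
    for record in reader:
        rows.append({
            "sequence": record[definition["sequence"]],
            "mz": float(record[definition["mz"]]),
            "intensity": float(record[definition["intensity"]]),
            "sample": record[definition["sample"]],
        })
    return rows


# Made-up example rows: the same sequence observed at two mass-to-charge ratios.
example = io.StringIO(
    "Sequence\tm/z\tIntensity\tRaw file\n"
    "AAVEEGIVPGGGVALLR\t790.45\t1250000\tcontrol_1\n"
    "AAVEEGIVPGGGVALLR\t527.30\t830000\tcontrol_1\n"
)
peptides = read_peptide_list(example, IMPORT_DEFINITIONS["software_a"])
```

Keeping the column mapping in data rather than in code is what would allow a new quantitation output to be supported by editing a configuration file only.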
list of quantified peptide ions. Based on these ion intensity lists, researchers in most cases aim at combining peptide-level information into quantitative information on protein groups. With cRacker (Figure 1), we aimed at designing a program that
General Processing of Peptide Intensity Lists
Before cRacker can process the imported peptide lists, the user needs to set the processing parameters. cRacker has a graphical user interface (GUI) for setting the parameters of the current session. Detailed information about all editable parameters and their functions is given in the cRacker manual at http://cracker.mpimp-golm.mpg.de (Supporting Information document 1). Depending on the type of experiment to be processed, different options need to be set to achieve an optimal analysis. In the following, we describe a general workflow for processing peptide ion intensities with cRacker, which is divided into five major steps:
I. All peptide intensities in the list are parsed according to the experimental design of the data set. Peptide species are distinguished based on their mass-to-charge ratios. Thus, peptides with the same identified sequence but different mass-to-charge ratios are handled separately at this step.
II. All ion intensity values are normalized according to the selected quantitation mode (Table 1). cRacker is
Figure 1. Schematic of the cRacker structure. Peptide intensity lists are automatically imported and analyzed by the cRacker algorithms. For each processing run, experiment-specific parameters can be set using the graphical user interface (GUI). The implemented statistics and graphical outputs enable an immediate interpretation of the calculated results.
integrates different normalization strategies for various quantitative proteomic experimental designs and thus automates and standardizes the calculation and general statistical analysis of peptide ion intensities in proteomic experiments.
■
Table 1. Equations for Calculating Fraction of Total (fot) and Fraction of Total with n Correction, Performed on Each Sample in the Label Free Normalization in cRacker^a
RESULTS
Installation of cRacker
cRacker is written in R and distributed freely at http://cracker.mpimp-golm.mpg.de. Furthermore, an R source package is distributed at http://r-forge.r-project.org.26 It is recommended to install the latest R version. cRacker itself needs no installation and works directly out of any directory. The cRacker Web site provides additional information about cRacker: a manual, a step-by-step installation guide for different operating systems, and movie tutorials. Furthermore, settings files for different general pipelines can be downloaded and used as parameter templates in a cRacker session. In principle, cRacker can run on all machines that can execute R. For fast calculations, we recommend at least a 1 GHz CPU and 2 GB of RAM.
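fraction of total (fot):
$$x_{\mathrm{normalized}} = \frac{x_{\mathrm{raw}}}{\sum x_{\mathrm{raw}}}$$
fot with n correction:
$$x_{\mathrm{normalized}} = \frac{x_{\mathrm{raw}}}{\sum x_{\mathrm{raw}}} \cdot \frac{n_{\mathrm{sample}}}{\max(n_{\mathrm{sample}\,1}, n_{\mathrm{sample}\,2}, \ldots, n_{\mathrm{sample}\,n})}$$
z-score scaling:
$$x_{\mathrm{scaled}} = \frac{x - \bar{x}}{\sigma} + \min(\mathrm{dataset}) + \min(\mathrm{dataset}) \times 10^{-6}$$
mean/median scaling:
$$x_{\mathrm{scaled}} = \frac{(x_1, x_2, \ldots, x_n)}{\mathrm{mean/median}(x_{\mathrm{normalized}})}$$
^a After normalization, z-score, mean, or median scaling is applied to the intensities of the same peptide species across all samples. x, ion intensity; n, peptide count.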
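The calculations summarized in Table 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not cRacker's R implementation: all function names are hypothetical, σ is taken as the sample standard deviation, and the z-score variant shifts by the absolute value of the data set minimum. The last point is an assumption, because Table 1 prints min(dataset) without a sign convention, while a shift by the magnitude of the minimum is what yields strictly positive values suitable for log2 calculations. The intensities are made-up numbers.

```python
from statistics import mean, median, stdev


def fraction_of_total(intensities):
    """fot: divide each raw intensity by the sum of intensities in the sample."""
    total = sum(intensities.values())
    return {pep: x / total for pep, x in intensities.items()}


def fot_with_n_correction(samples):
    """fot multiplied by n_sample / max(n), with n the peptide count per sample."""
    n_max = max(len(intensities) for intensities in samples.values())
    corrected = {}
    for name, intensities in samples.items():
        factor = len(intensities) / n_max
        fot = fraction_of_total(intensities)
        corrected[name] = {pep: x * factor for pep, x in fot.items()}
    return corrected


def center_scale(values, center=mean):
    """Mean or median scaling of one peptide species across all samples."""
    c = center(values)
    return [x / c for x in values]


def zscore_shifted(matrix):
    """z-score each peptide row, then shift the whole matrix to positive values."""
    z = []
    for row in matrix:
        m, s = mean(row), stdev(row)  # assumption: sample standard deviation
        z.append([(x - m) / s for x in row])
    shift = abs(min(min(row) for row in z))  # assumption: magnitude of the minimum
    return [[x + shift + shift * 1e-6 for x in row] for row in z]


# Made-up intensities for two samples of different complexity.
samples = {
    "gel_slice_1": {"pepA": 200.0, "pepB": 600.0, "pepC": 200.0},
    "gel_slice_2": {"pepA": 300.0, "pepB": 100.0},
}
fot = {name: fraction_of_total(s) for name, s in samples.items()}
fot_n = fot_with_n_correction(samples)
scaled = zscore_shifted([[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]])
```

In this example the sample with only two peptides has its fot values multiplied by 2/3, which counteracts the inflated fractions that arise when a total is shared among few peptides.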
optimized for processing intensity ratios from metabolic labeling and label-free data sets. For label-free data sets, normalization is done by calculating peptide ion intensities as fractions of the total sum of intensities (“fraction of total”).17 This type of normalization is optimal for complex samples under the assumption that the majority of protein abundances in a cell remain unchanged between different samples or treatments. In less complex samples (e.g., individual gel slices), fraction of total normalization can create artificially high intensity representations. To compensate for this effect, cRacker provides the option to correct the fraction of total values based on the number of identified peptides in each sample (Table 1). For metabolic labeling (15N metabolic labeling,10 SILAC11), ion intensity ratios of labeled and nonlabeled peptides are calculated and averaged, in the end also resulting in protein average ratios. Resulting ratio distributions from metabolic labeling can be mean
Import of Peptide Lists
cRacker reads lists of peptide ion intensities after the raw files have been processed by another quantitation software and after the database search. All text-based input formats containing a list of identified peptides in tabular format can be imported into cRacker. Input lists are independent of the mass spectrometer instrument. For correct data handling, input lists have to include at least information about the peptide sequence, mass-to-charge ratio, ion intensity, and sample identity in a standard column-based format. cRacker can handle file batches coming from the same quantitation software. After the working directory containing the peptide lists has been set, cRacker automatically imports the list files using preset and user-editable import definitions. cRacker was particularly built to process data coming from various quantitation software packages. Since the output
Figure 2. (a) The k-means clusters of 146 proteins with increased (cluster 1), 103 proteins with decreased (cluster 3) and 419 proteins with almost constant relative abundance under sucrose starvation compared to full nutrition (cluster 2). (b) The MapMan53 functional mappings (including only representation counts >2) of these two clusters of responsive proteins revealed functional differences. Fisher’s Exact test has been used (** Benjamini Hochberg corrected p-value < 0.05; * uncorrected p-value < 0.05). (c) Median distribution of relative standard deviation (sd) between all calculated protein intensities ranges between 14 and 18% using additional scaling of peptides (scaled data). Without scaling, the relative standard deviation is drastically higher (unscaled data).
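As an illustration of the Fisher's exact test named in the caption, the sketch below computes a two-sided p-value for one functional category from a 2×2 table: proteins in the cluster with and without the category versus all other proteins with and without it. The function name and the counts are hypothetical and are not taken from the data set; the sketch uses the common convention of summing all table probabilities that do not exceed the probability of the observed table, which may differ in detail from the implementation used by cRacker.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: cluster proteins in the category, b: cluster proteins outside it,
    c: remaining proteins in the category, d: remaining proteins outside it.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(k):
        # Numerator of the hypergeometric probability of seeing k in cell a;
        # integer arithmetic keeps the comparison with the observed table exact.
        return comb(col1, k) * comb(n - col1, row1 - k)

    observed = weight(a)
    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    extreme = sum(weight(k) for k in range(k_min, k_max + 1)
                  if weight(k) <= observed)
    return extreme / comb(n, row1)


# Hypothetical counts: 8 of 10 cluster proteins carry the category,
# compared with 1 of 6 proteins outside the cluster.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

When many categories are tested in this way, the resulting p-values are usually corrected for multiple testing, as indicated by the double asterisks in the caption.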
or median centered.27 Furthermore, relative protein normalization based on a specific reference protein is possible. In this case, all ion intensities in a sample are normalized to the mean protein intensity of a particular standard protein, either a spiked-in protein or a housekeeping reference protein.
III. In a third step, peptides are filtered out if the number of missing ion intensity values for a given peptide ion across the analyzed sample set is higher than the selected threshold. This step is optional and can be disabled.
IV. In a fourth, optional step, the peptide intensities are scaled across samples or treatments. This step can significantly reduce the variance in the data sets (Figure 2a). cRacker provides three scaling types: In the first two, the intensities of a peptide are scaled to the median or mean intensity across all samples or treatments. The third type is based on z-scoring but linearly transforms the whole data matrix to positive values to allow log2 calculations (Table 1).
V. Finally, all peptide ion intensities related to the same protein group are averaged using the mean or median, resulting in a protein intensity with a corresponding standard deviation. Optionally, protein group averages can be calculated from proteotypic peptides only, thereby
excluding those peptides that are shared between different proteins.
Use of emPAI as a Relative Protein Abundance Measure within Samples
Instead of using ion intensities, it can be necessary to use spectral count-derived measures for the quantitation of protein abundances. Within cRacker, it is possible to calculate emPAI28 values for each protein per data set in an independent run. emPAI can be used to calculate the abundance representation of a given protein in the total of analyzed proteins, and it is as accurate as the calculation of cellular protein abundances from enzyme activity measurements.29 However, to calculate emPAI values in cRacker, it is necessary to load the corresponding sequence library in FASTA format. Each library is trypsin-digested in silico during the import process, and peptides shorter than 6 amino acids are rejected for the emPAI calculation.
cRacker Includes Multivariate Statistical Analysis
cRacker provides an automated multivariate statistical analysis module that helps to mine and interpret the processed data set. Statistical methods implemented include heatmap visualization, principal component analysis, hierarchical clustering, and k-means clustering. Additionally, custom sample mappings can be applied to the data set by import of a specific
Figure 3. Combined analysis of replicate samples is useful for statistical testing of differences in protein intensities between experimental treatments. For the analysis of sucrose starvation vs full nutrition, (a) 668 proteins could be tested using t test analysis (α = 0.01) and visualized in a volcano plot. (b) Functional categorizations with a representation count >2 of the significantly different abundant proteins (** Benjamini Hochberg corrected p-value < 0.05; * uncorrected p-value < 0.05) have been used to compare systemic status of sucrose starvation response and full nutrition.
Example of cRacker Functions: Analyses of Sucrose Starvation Response in Arabidopsis Cell Cultures
sample mapping table. Multiple mapping types, like gene ontology terms19 or functional classification by MapMan,30 are already supported, and cRacker treats each class independently in the analysis. ANOVA and pairwise t tests are used to test for significant differences between quantitative values on the protein level in the various samples and treatments of the data set.15,27 Volcano plots can be used for efficient pairwise comparison of significant differences between samples and experimental treatments. In general, p-value adjustments are used to overcome the multiple testing problem.31−33 cRacker exports all plots in editable pdf or eps format. Processed data files, generated matrices, and statistical results are available in csv (comma separated values) format and can be used for further analysis.
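As an example of a p-value adjustment, the following minimal sketch implements the Benjamini-Hochberg procedure, the correction named in the figure captions of this note. The function name and the input p-values are hypothetical, and the sketch is not taken from the cRacker source code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone
    # and never exceed 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical uncorrected p-values from four protein-wise tests.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adjusted_p]
```

Each p-value is multiplied by the number of tests divided by its rank, so the smallest p-values receive the strongest correction while the largest one is left unchanged.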
Plants produce sucrose during photosynthesis throughout the light period, and sucrose is produced from starch to support growth and metabolism at night. In plants, sucrose is not just a metabolite but, similar to hormones, it can induce developmental changes through direct influence on gene expression and protein activities. Therefore, the study of protein changes after sucrose starvation gives insights into regulatory mechanisms associated with this important metabolite. A sucrose starvation response experiment has been analyzed with cRacker to demonstrate its range of functions. The data set contained extracted soluble proteins from A. thaliana cell suspension cultures which were exposed to sucrose starvation for two days or kept under full nutrition as a control. A peptide list text file (“evidence.txt” file) containing quantified peptide ion intensities was obtained from MaxQuant and was subsequently loaded into cRacker for downstream data analysis. The list contains 17 455 peptide identifications from four treatment (sucrose starvation) and four control (full nutrition) LC−MS/MS runs. Altogether, 1945 unique peptide sequences identifying 690 proteins have been processed. With the requirement that each peptide ion be quantified in at least 50% of the replicates, cRacker could analyze quantitative ion intensities for 668 proteins. The calculation for the example data set, containing these eight samples, took 3 min with averaging between replicates and 5 min when keeping the values from replicates separate. Proteins responding to sucrose starvation by a change in their abundance could clearly be identified using k-means clustering and comparative t test analysis. The median of the relative standard deviation of the averaged protein intensities ranges between 14 and 18% (Figure 2a). An automated k-means clustering analysis by cRacker forming three clusters revealed a strong increase in relative abundance
Experimental Design
cRacker can read the experimental design out of the imported peptide list if specific columns have been created. In some cases, however, this information is not readily provided, or the user wants to change the experimental design. Therefore, it is possible to create a new experimental design for a given data set using cRacker. Besides options for changing names and comparative groupings, the cRacker experimental design can be filled with more specific but optional information, such as coloring schemes for graphic outputs. Also, filter groups can be defined for independent missing value filtering of peptides in defined groups of samples, and time points can be defined for time series data.
Analysis of Phosphopeptide Enriched Samples
cRacker includes an option for phosphopeptide analysis on the individual peptide level. In this case, ion intensities from all phosphopeptides are excluded from averaging to protein groups, and the phosphopeptides undergo a separate multivariate analysis and appear separately on graphic outputs.
detected with both workflows, of which 34% showed a significant difference in abundance only with the cRacker workflow and 15% only with the LFQ-Perseus workflow (Supporting Information Figure 1b). These differences can be explained by differences in the algorithms applied for the calculation of protein mean values. cRacker uses all peptides passing the filtering threshold (i.e., peptides quantified in at least half of the replicates) for testing. Displayed differences on the protein level are always based on statistics carried out on the peptide intensity level, whereas in the chosen LFQ-Perseus workflow, the significance testing uses protein-based LFQ values from MaxQuant. Therefore, since cRacker performs the comparison on the peptide level, proteins can be quantified based on different peptide intensities even when identified in only one sample. In contrast, with LFQ values, low-abundance proteins quantified in only one sample cannot be tested using a two-paired t test, because the information is already reduced to a single protein value. The proteins identified as significantly different only in cRacker are particularly those with few peptide counts (Supporting Information Figure 2). This explains the lower number of differentially expressed proteins detected in Perseus. However, most of the significantly differentially represented categories between the two nutrient treatments have been identified with Perseus (Supporting Information Figure 1c) as well as with the cRacker workflow (significantly differentially abundant in both workflows: “protein − synthesis”, “misc − peroxidases”, “protein − degradation”, “misc − gluco-, galacto- and mannosidases”; significant only in the LFQ-Perseus workflow: “stress biotic”). Thus, both workflows extracted similar biologically meaningful results.
The higher number of significant differentially abundant proteins identified with cRacker can be explained by differences in the applied algorithms for testing (based on peptides or protein LFQ in cRacker or LFQ-Perseus workflows).
of 146 proteins (cluster 1, Figure 2b) and a strong decrease of 103 proteins (cluster 3) upon sucrose starvation. The majority of the proteins (419 proteins, in cluster 2) showed no strong changes between full nutrition and sucrose starvation. The functional categorization according to MapMan30 of proteins with strongly increased or decreased abundance upon sucrose starvation revealed a coordinated and rather specific cellular response (Figure 2c). Compared to full nutrition, proteins with functions in protein synthesis (i.e., ribosomal proteins) in particular were significantly less represented in sucrose-starved cells (Fisher’s exact test; uncorrected p values, α = 0.05). A similar but not significant underrepresentation under sucrose starvation was found for proteins involved in “amino acid metabolism”, “protein targeting”, “RNA processing”, “RNA regulation of transcription”, “RNA − RNA Binding”, “protein folding” as well as “development unspecified”. Proteins with functions in “DNA − synthesis/chromatin structure” and “misc − peroxidases” showed a significant increase in abundance (uncorrected p values, α = 0.05). Examples of protein functional categories with a tendency for up-regulation were “misc − gluco-, galacto- and mannosidases” or “signalling − in sugar and nutrient physiology”. This global analysis already revealed functional differences between up- and down-regulated protein classes under sucrose starvation. To extract the individual significantly differentially abundant proteins, cRacker performs an automated pairwise t test analysis for all proteins. This feature performs a combined and comparative analysis of all replicate measurements within a treatment class. For the analysis of the sucrose starvation response, 637 proteins were analyzed using an unpaired two-sample t test. In total, based on individual protein analysis, 134 proteins were significantly down-regulated (uncorrected p-value