Anal. Chem. 2009, 81, 8015–8024
Workflow for Large Scale Detection and Validation of Peptide Modifications by RPLC-LTQ-Orbitrap: Application to the Arabidopsis thaliana Leaf Proteome and an Online Modified Peptide Library

Boris Zybailov,*,† Qi Sun,‡ and Klaas J. van Wijk*,†

Department of Plant Biology and Computational Biology Service Unit, Cornell University, Ithaca, New York 14853

Post-translational modifications (PTMs) of proteins add to the complexity of proteomes, thereby complicating the task of proteome characterization. An efficient strategy to identify this peptide heterogeneity is important for determination of protein function, as well as for mass spectrometry-based protein quantification. Furthermore, studies of allelic variation or single nucleotide polymorphisms (SNPs) at the proteome level, as well as mRNA editing, are increasingly relevant, but validation and determination of false positive rates are challenging. Here we describe an effective workflow for large scale PTM and amino acid substitution identification based on high resolution and high mass accuracy RPLC-MS data sets. A systematic validation strategy of PTMs using RPLC retention time shifts was implemented, and a decision tree for validation is presented. This workflow was applied to Arabidopsis proteome preparations; 1.5 million MS/MS spectra were processed resulting in 20% sequence assignments, with 5% from modified sequences and matching to 2904 proteins; this high assignment rate is in part due to the high quality spectral data. A searchable modified peptide library for Arabidopsis is available online at http://ppdb.tc.cornell.edu/. We discuss confidence in peptide and PTM assignment based on the acquired data set, as well as implications for quantitative analysis of physiologically induced and preparation-related modifications.

Reversible or irreversible amino acid modifications of proteins are often critical to regulate protein function and protein-protein interactions.
Such protein modifications typically occur after the corresponding mRNA has been translated; hence, they are termed post-translational modifications (PTMs). Given the recognized importance of PTMs and the increased mass accuracy and resolution of the latest generation of mass spectrometers,1 interest in large-scale analysis of PTMs is rapidly increasing.2 Several search algorithms and search strategies for PTM detection have been developed in recent years,3-5 and the main challenges include the computational intensity of PTM searches, in particular for truly large scale data sets with millions of MS/MS spectra, determination and control of false positive rates, and experimental validation of PTMs.

Protein variants that result from alternative splicing, allelic variation, or single nucleotide polymorphisms (here named PMs for peptide modifications), while etiologically different from PTMs and chemical modifications, can in most cases be treated as PTMs during proteome analysis. Consequently, proteomics-based PTM discovery methods can be applied to SNP discovery and validation, complementary to high-throughput DNA sequencing. In fact, the new generation DNA-sequencing technologies, Illumina/Solexa, 454/Roche, and related approaches,6 have matured to the point that relatively inexpensive high-throughput analysis of gene-associated SNPs and allelic variation has become feasible. For example, 454 sequencing was used to discover SNPs between two inbred lines of maize, with about 85% of found SNPs validated by traditional sequencing methods, and ∼4900 SNPs discovered in total within ∼2400 maize genes.7

A typical MS-based approach to large-scale proteome analysis is the “bottom-up” method (as opposed to the “top-down” method8,9), whereby proteins present in a complex mixture are inferred from the presence of signature peptides, obtained by proteolysis and MS/MS analysis. Peptides in the proteolytic digests are identified through interpretation of tandem MS spectra together with the known mass of the fragmented precursor; identification of PTMs is usually done as part of this analysis. Rapid cycling hybrid mass spectrometers, such as the LTQ-Orbitrap,10 offer excellent mass accuracy, resolution, and sensitivity and therefore allow for more in-depth proteome coverage and increased confidence in protein identifications.

* To whom correspondence should be addressed: Tel.: 1-607-255-3664. Fax: 1-607-255-5407. E-mail: [email protected] (K.v.W.) or [email protected] (B.Z.).
† Department of Plant Biology. ‡ Computational Biology Service Unit.
10.1021/ac9011792 CCC: $40.75 2009 American Chemical Society. Published on Web 09/02/2009
(1) Mann, M.; Kelleher, N. L. Proc. Natl. Acad. Sci. U.S.A. 2008, 105 (47), 18132–18138.
(2) Larsen, M. R.; Trelle, M. B.; Thingholm, T. E.; Jensen, O. N. Biotechniques 2006, 40, 790–798.
(3) Havilio, M.; Wool, A. Anal. Chem. 2007, 79, 1362–1368.
(4) Liu, J.; Erassov, A.; Halina, P.; Canete, M.; Nguyen, D. V.; Chung, C.; Cagney, G.; Ignatchenko, A.; Fong, V.; Emili, A. Anal. Chem. 2008, 80, 7846–7854. (5) Tanner, S.; Payne, S. H.; Dasari, S.; Shen, Z.; Wilmarth, P. A.; David, L. L.; Loomis, W. F.; Briggs, S. P.; Bafna, V. J Proteome Res 2008, 7, 170–181. (6) Morozova, O.; Marra, M. A. Genomics 2008, 92, 255–264. (7) Barbazuk, W. B.; Emrich, S. J.; Chen, H. D.; Li, L.; Schnable, P. S. Plant J. 2007, 51, 910–918. (8) Siuti, N.; Kelleher, N. L. Nat. Methods 2007, 4, 817–821. (9) Breuker, K.; Jin, M.; Han, X.; Jiang, H.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 2008, 19, 1045–1053. (10) Makarov, A.; Denisov, E.; Lange, O.; Horning, S. J. Am. Soc. Mass Spectrom. 2006, 17, 977–982.
Analytical Chemistry, Vol. 81, No. 19, October 1, 2009
For identifying PTMs, SNPs, or other types of mismatches between the predicted and the actual protein sequence, high mass resolution and accuracy are required to distinguish with confidence between PTMs that result in relatively small mass shifts. Furthermore, high mass accuracy also provides a much smaller effective peptide “search space” and therefore potentially a lower false positive rate in PTM discovery. The confidence in determination of the exact modification site also depends on the number of possible sites present, on the quality of y and b ions in the MS/MS spectra, and on the mass shift of the modification. Reverse phase (RP) chromatography is used in most online separations of complex peptide mixtures; the peptide hydrophobicity, which is measured as elution time from the RP column, can be employed as an additional piece of information to validate peptide modifications. Smith’s research group developed an accurate mass and time approach11 for analysis of LC-MS runs, as well as a number of software tools for accurate retention time determination, such as MASIC.12 These tools can be employed in the validation of PTMs, as will be demonstrated in the current study.

Thorough characterization of the peptide heterogeneity due to PTMs is important in mass-spectrometry-based quantitative proteomics studies, since differences in analytical properties between modified and nonmodified peptides, if not addressed, can lead to inaccurate quantification. This is particularly relevant in so-called “targeted” quantitative proteomics analysis with a predefined list of peptides or spiking of isotopically labeled peptides;13-15 it is important to select those peptides that are least subject to PTMs.
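To make the mass-accuracy argument above concrete, the following minimal sketch compares two modifications with nearly identical nominal mass shifts, acetylation and trimethylation, and checks whether a given precursor tolerance can separate them. The monoisotopic mass shifts are standard values; the function names and the example peptide mass of 1500 Da are illustrative assumptions and are not part of the workflow described in this paper.

```python
# Monoisotopic mass shifts (Da) of two near-isobaric modifications.
ACETYL = 42.010565     # C2H2O
TRIMETHYL = 42.046950  # C3H6


def ppm_window(mass_da: float, tol_ppm: float) -> float:
    """Half-width (Da) of a +/- tol_ppm tolerance window at the given mass."""
    return mass_da * tol_ppm / 1e6


def distinguishable(shift_a: float, shift_b: float,
                    peptide_mass_da: float, tol_ppm: float) -> bool:
    """True if no single measured mass can match both candidates.

    A measurement matches a candidate when it lies within the tolerance
    window around that candidate, so two candidates are separable when
    their mass difference exceeds twice the window half-width.
    """
    heavier = peptide_mass_da + max(shift_a, shift_b)
    return abs(shift_a - shift_b) > 2 * ppm_window(heavier, tol_ppm)


# Hypothetical 1500 Da peptide: narrow versus wide precursor tolerance.
narrow = distinguishable(ACETYL, TRIMETHYL, 1500.0, 3.0)
wide = distinguishable(ACETYL, TRIMETHYL, 1500.0, 50.0)
print(f"3 ppm separates the two: {narrow}; 50 ppm separates them: {wide}")
```

Under these assumptions the two candidates differ by about 0.036 Da, which is resolved at a few ppm but not at tens of ppm for a peptide of this size.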
In the current study, we used the maximum resolving power (100 000) of the LTQ-Orbitrap for detection of precursor ions, followed by Mascot 2.2 error-tolerant searches to identify peptide modifications (PMs) present in a complex mixture of leaf proteins of Arabidopsis thaliana. A systematic validation strategy of PMs using RPLC retention time shifts was implemented, and a decision tree for validation is presented. In this work we consider three classes of PMs: (1) PTMs, which are the biologically induced chemical modifications of amino acids in proteins; (2) chemical modifications that occur due to sample handling and processing; and (3) amino acid sequence changes due to single amino acid substitutions. In the case of amino acid substitutions, a protein may consist solely of nonmodified amino acids, yet its sequence will deviate from the sequence in the protein database. By detection of protein products of edited mRNAs in chloroplasts of the plant model species Arabidopsis thaliana, we demonstrate that such amino acid substitutions can be discovered by treating them as PTMs. A searchable modified peptide library for Arabidopsis is available online through the Plant Proteome Database at http://ppdb.tc.cornell.edu/. We discuss confidence in peptide and PTM assignment, as well as implications for quantitative analysis of physiologically induced and preparation-related modifications.

(11) Monroe, M. E.; Tolic, N.; Jaitly, N.; Shaw, J. L.; Adkins, J. N.; Smith, R. D. Bioinformatics 2007, 23, 2021–2023.
(12) Monroe, M. E.; Shaw, J. L.; Daly, D. S.; Adkins, J. N.; Smith, R. D. Comput. Biol. Chem. 2008, 32, 215–217.
(13) Mallick, P.; Schirle, M.; Chen, S. S.; Flory, M. R.; Lee, H.; Martin, D.; Ranish, J.; Raught, B.; Schmitt, R.; Werner, T.; Kuster, B.; Aebersold, R. Nat. Biotechnol. 2007, 25, 125–131.
(14) Kirkpatrick, D. S.; Gerber, S. A.; Gygi, S. P. Methods 2005, 35, 265–273.
(15) Pratt, J. M.; Simpson, D. M.; Doherty, M. K.; Rivers, J.; Gaskell, S. J.; Beynon, R. J. Nat. Protoc. 2006, 1, 1029–1043.

METHODS

Plant Materials and LC-MS/MS. A. thaliana wt (Col-0 background) plants were grown on soil or on agar plates supplemented with 2% sucrose (for details see Supporting Information). Total leaf proteomes were extracted as described in ref 16, and 400 µg of total leaf protein from each preparation was separated on a 1D SDS-PAGE gel. Each of the gel lanes was cut into 12 bands, followed by reduction, alkylation, and in-gel digestion with trypsin as described previously.17 Peptides extracted from these gel bands were analyzed in duplicate by data-dependent tandem mass spectrometry (MS/MS) using online LC-LTQ-Orbitrap (Thermoelectron) as described in detail.16 Finally, we also included previously acquired data sets, obtained with the same gel-based separation and MS/MS settings, on chloroplast proteomes enriched for either thylakoid or soluble chloroplast proteins, isolated from fully grown plants.18

Protein Database Searches with Mascot 2.2. The Arabidopsis protein database, ATH1v8, was downloaded (www.arabidopsis.org) in FASTA format, and known potential contaminants, including human keratins and trypsin, were added. LTQ-Orbitrap RAW files were analyzed with DTA Supercharge 1.18 (ref 19), yielding extracted fragmentation spectra in MGF format. The initial MGF files were recalibrated off-line with an in-house Perl script, using the results of a Mascot 2.2 search with a widened precursor mass tolerance window (50 ppm) and a fragment ion tolerance of 1 Da against all predicted Arabidopsis protein models. The recalibrated MGF files (this calibration affects only precursor ions) were then searched against all predicted Arabidopsis protein models, now using a narrow precursor ion mass tolerance of 3 or 6 ppm.
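The off-line recalibration step can be illustrated with a short sketch. The in-house Perl script is not described in detail here, so the following is only one plausible approach and not the authors' implementation: it assumes a single systematic offset per file, estimated as the median precursor error (in ppm) of the peptides identified in the wide-tolerance (50 ppm) first-pass search. All function names and example values are hypothetical.

```python
from statistics import median


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed precursor mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def estimate_offset_ppm(matches) -> float:
    """Median ppm error over (observed_mz, theoretical_mz) pairs.

    Assumption: the pairs come from confident first-pass identifications.
    """
    return median(ppm_error(obs, theo) for obs, theo in matches)


def recalibrate(observed_mz: float, offset_ppm: float) -> float:
    """Remove a systematic ppm offset from an observed precursor m/z."""
    return observed_mz / (1.0 + offset_ppm / 1e6)


# Hypothetical first-pass matches with a systematic error near +10 ppm.
first_pass = [
    (500.0 * (1 + 9e-6), 500.0),
    (600.0 * (1 + 10e-6), 600.0),
    (700.0 * (1 + 11e-6), 700.0),
]
offset = estimate_offset_ppm(first_pass)
corrected = recalibrate(600.0 * (1 + 10e-6), offset)
residual = ppm_error(corrected, 600.0)
print(f"offset = {offset:.2f} ppm, residual after correction = {residual:.3f} ppm")
```

A median is used in this sketch because it is insensitive to the occasional false identification in the first-pass results; after such a correction, a precursor that was about 10 ppm off would fall inside a 3 ppm search window.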
Three separate searches were performed in parallel, with methionine oxidation set as a variable modification and cysteine carbamidomethylation as a fixed modification: (#1) a normal tryptic search allowing up to 2 missed cleavages and a 6 ppm maximum precursor error, (#2) an error-tolerant search, using the corresponding Mascot 2.2 feature and a 3 ppm maximum precursor error, and (#3) a semi-tryptic search including variable N-terminal acetylation at a 3 ppm precursor mass error. Maximum fragment ion error tolerance was 1 Da. Standard scoring, rather than “MudPIT” scoring, was used because all MS analyses were done on individual 1D gel sections and each section was searched as an individual sample. The results of the three searches were merged, ensuring that one MS/MS spectrum yielded no more than one peptide ID. When the same MS/MS spectrum produced identifications above score thresholds in more than one search, the result with the highest ion score was chosen. In the case of equal scores, the regular search (#1) was given priority. For a protein to be identified, at least one peptide identification in search #1 was required. Ion score thresholds for each of the searches were adjusted to 33 so that the peptide false positive rate was below 1% as determined by target-decoy database searches.

(16) Zybailov, B.; Friso, G.; Kim, J.; Rudella, A.; Ramirez Rodriguez, V.; Asakura, Y.; Sun, Q.; van Wijk, K. J. Mol. Cell Proteomics 2009, 8 (8), 1789–1810.
(17) Shevchenko, A.; Wilm, M.; Vorm, O.; Mann, M. Anal. Chem. 1996, 68, 850–858.
(18) Zybailov, B.; Rutschow, H.; Friso, G.; Rudella, A.; Emanuelsson, O.; Sun, Q.; van Wijk, K. J. PLoS ONE 2008, 3, e1994.
(19) Andersen, J. S.; Wilkinson, C. J.; Mayor, T.; Mortensen, P.; Nigg, E. A.; Mann, M. Nature 2003, 426, 570–574.

Calculation of the Precursor Elution Times from the Reverse-Phase Gradient. To extract elution times, each individual RAW file was processed with MASIC, a program developed in Richard Smith’s group (www.pnl.gov). MASIC extracts the selected ion chromatogram for each of the precursors, determines peak apexes, and calculates the retention times. The MASIC output was combined with peptide identification data from the previous step, yielding the final data set (Supporting Information Table S1).

Peptide Hydrophobicity and Retention Time Prediction. To calculate predicted hydrophobicities of the precursor peptides, Version 3.2 of the Sequence Specific Retention Calculator program (peptide retention prediction in RP HPLC) was used via an online interface at the Manitoba Center for Proteomics and Systems Biology (http://www.proteome.ca/p/tools.html).

Plant Proteome Database. Results of the merged and filtered Mascot searches are uploaded to the Plant Proteomics DataBase, PPDB (http://ppdb.tc.cornell.edu/).20 The output parameters of Mascot searches and postsearch filtering are captured and stored in the database. This information is available to online users through the search function “Proteome Experiments” by selecting the desired output parameters; this search can be restricted to specific experiments. Alternatively, information for specific accessions can be extracted using the search function “Accessions”, and if desired, this search can be limited to specific experiments. Finally, information for a particular accession can also be found on each “protein report page”.

RESULTS

Total proteomes were extracted in the presence of SDS from leaves of Arabidopsis thaliana plants of different developmental stages and grown under different conditions.
In addition, to compare between organelles and the rest of the cell, we also analyzed chloroplast proteomes enriched for either chloroplast membranes or soluble chloroplast proteins, isolated from fully grown plants. A total of 1 508 518 MS/MS spectra were processed, with 607 038 spectra from five chloroplast preparations and 901 480 spectra from four independent leaf extracts. For the chloroplast preparations and total leaf extracts, the percentages of MS/MS spectra that led to sequence assignments were 18% and 22%, respectively, with modified sequences contributing 5% and 6%, respectively. In comparison, a comprehensive analysis of the Drosophila melanogaster proteome, using a combination of different MS platforms, yielded an MS/MS-to-sequence assignment rate of ∼5%.21 In total 2904 Arabidopsis proteins were identified, corresponding to 3589 gene models; for identification only “full-tryptic” peptides were considered (at