ArMone: A Software Suite Specially Designed for Processing and Analysis of Phosphoproteome Data

Xinning Jiang,†,‡ Mingliang Ye,*,† Kai Cheng,†,‡ and Hanfa Zou*,†

CAS Key Laboratory of Separation Sciences for Analytical Chemistry, National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China, and Graduate School of Chinese Academy of Sciences, Beijing 100049, China

Received July 22, 2009

Abstract: The development of new phosphoproteomic technologies has led to a rapid increase in the number of phosphoprotein identifications. Managing phosphoproteome data sets, extracting valuable information from them, and generating output in user-friendly formats require a dedicated data management and processing platform. Although a few proteome pipelines have been developed, they are mainly designed for processing data sets of unmodified peptide/protein identifications. Because phosphorylated peptides/proteins have different characteristics, these pipelines are inconvenient, and sometimes inappropriate, for processing phosphoproteome data sets. In this study, a software suite named ArMone was specially designed for the management and analysis of phosphoproteome data. It can readily identify phosphopeptides with high reliability and high sensitivity, and can effectively pinpoint the most probable phosphorylation site. A few well-designed postvalidation tools are also available to extract and export valuable information. ArMone is a stand-alone application with a friendly graphical user interface. It runs on different operating systems and can process data sets obtained by most of the commonly used database search engines.

Keywords: phosphoproteome analysis • automatic validation • bioinformatics • software

* To whom correspondence should be addressed. Prof. Dr. Mingliang Ye, National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China. Tel: +86-411-84379620. Fax: +86-411-84379620. E-mail: [email protected]. Prof. Dr. Hanfa Zou, National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian 116023, China. Tel: +86-411-84379610. Fax: +86-411-84379620. E-mail: [email protected].
† Dalian Institute of Chemical Physics, Chinese Academy of Sciences.
‡ Graduate School of Chinese Academy of Sciences.
10.1021/pr9009904 © 2010 American Chemical Society

Introduction

Protein phosphorylation plays a key role in eukaryotic signal transduction, gene regulation, and metabolic control in cells. Abnormal phosphorylation is a cause of various diseases, including cancer.1,2 It is one of the most extensively studied protein post-translational modifications (PTMs). It has been predicted that about 30% or more of proteins are phosphorylated at some point during their life cycle.3 The confident localization of phosphorylation sites on phosphoproteins is critical for elucidating the systems biology of complex disease mechanisms and global regulatory networks. The tandem mass spectrometer is an important tool for analyzing protein phosphorylation, as it can not only identify the amino acid sequence of a peptide but also pinpoint the location of the phosphorylation site(s) within the sequence. It has become one of the most commonly used tools for high-throughput analysis of protein phosphorylation.4-11

As most of the database search engines, especially the free search engines, do not provide user-friendly pre- and postsearch data processing tools, various proteome pipelines have been developed.12-15 These pipelines commonly contain modules with various functions such as validation of peptide and protein identifications, quantitative analysis, and so on. They allow valuable information to be extracted from proteome data more efficiently and more reliably. However, the existing pipelines were developed for processing unmodified peptide/protein identifications and typically do not take into account some important features of phosphopeptide/phosphoprotein identifications. For example, because of the insufficient fragmentation of phosphopeptides in collision induced dissociation (CID) MS, confident identification of phosphopeptides and accurate localization of phosphorylation sites need special tools, which the existing pipelines typically do not have. Detailed information on phosphorylation sites, such as the number of distinct phosphorylation sites, the localization of the identified phosphorylation sites on the protein sequence, and so on, is needed to report the results of a phosphoproteomic analysis. Unfortunately, this information is lacking in most of the existing pipelines.
In addition, the proteomics guideline16 requires that the identified phosphopeptides/phosphoproteins be accompanied by annotated and mass-labeled spectra for the convenience of other researchers.16,17 Although some programs can draw labeled mass spectra one at a time, generating the thousands of annotated spectrum images derived from a phosphoproteomic analysis remains a challenge.

In this study, we developed and integrated a series of modules into a software suite named ArMone to facilitate the processing of phosphoproteome data. A module implementing the MS2/MS3 strategy18 is used to improve the reliability and sensitivity of phosphopeptide identifications. A phosphorylation site localization module determines the phosphorylation sites on phosphopeptides using mass spectra obtained with different fragmentation modes. Manual validation and auto filtering modules are included in the software suite to facilitate the validation of phosphoproteome data with customizable parameters. Other modules, such as the batch spectra drawing module and the phosphorylation site statistic module, facilitate the reporting of phosphoproteome data sets according to the proteomics guideline. Overall, the ArMone software greatly simplifies the processing, analysis, and reporting of phosphoproteome data sets.

Journal of Proteome Research 2010, 9, 2743–2751 (technical notes). Published on Web 03/24/2010.

Figure 1. Modules for the processing of phosphoproteome data set.

Methods

ArMone Software Suite. The ArMone phosphoproteome software was written in Java (J2SDK 6.0 update 12), a platform-independent programming language. The JFreeChart class library (http://www.jfree.org/jfreechart/index.html) was used for drawing spectra with matched ion information, and the iText library (http://www.lowagie.com/iText/) was used for creating PDF files. ArMone is a lightweight Java application with a friendly graphical user interface that can be used without any further settings. As both the ArMone program and the third-party class libraries are written entirely in Java, ArMone is completely portable and runs on Windows, Mac OS X, and Linux operating systems. For academic usage, ArMone is free and can be downloaded from http://bioanalysis.dicp.ac.cn/proteomics/software/ArMone.html.

Modules for the Processing of Phosphoproteome Data. ArMone contains a few modules to facilitate the management and analysis of phosphoproteome data. These modules allow the preprocessing of the mass spectra, the parsing of the search results, and the validation of the peptide and protein identifications (Figure 1).

(1) The mass spectra preprocess modules include (i) a charge evaluation module (ZEvaluer),18 which determines the charge states of precursor ions from the mass shift observed when neutral loss occurs. This module is especially useful when the charge states of precursor ions cannot be determined from isotopic peaks because of low resolution mass spectrometry; and (ii) a peak list format conversion module (PklistConverter), which converts the peak list files (MS/MS data files) between different formats (e.g., .dta, .mgf, .ms2, and so on), allowing the data to be searched by different types of database search algorithms.

(2) The search result parser modules extract peptide identification information from the search results obtained from different database search algorithms. For example, the SEQUESTOutParser handles the information from the “.out” output files from SEQUEST, while the MascotDatParser handles the “.dat” output files from Mascot.

(3) The validation modules include a peptide list conversion module, which combines the peptides identified by one of the database search algorithms and the MS/MS spectra data into a well-formatted peptide list file (ppl); the ppl file is used as the input or output for all the following data processing modules. The second module utilizes the MS2/MS3 and MS2-only strategies to identify phosphopeptides from mass spectra acquired by data dependent neutral loss triggered MS3 (NLMS3) or MS2 methodologies. The third module (Sitelocalizer) implements and extends the Ascore algorithm19 to effectively determine the most probable site of phosphorylation for phosphopeptides identified by MS2 and MS3. Furthermore, this module can handle spectra obtained using different types of fragmentation modes, including CID, electron transfer dissociation (ETD), and electron capture dissociation (ECD).

(4) Postvalidation modules. The auto filtering module and manual validation module can be used to validate phosphopeptide identifications with customizable criteria. The phosphorylation site statistic module provides information on the phosphorylated proteins, including the localization of the phosphorylation sites, the amino acid sequences around the phosphorylation sites, and so on. The protein clustering module integrates the high confidence peptide identifications into protein identifications. The batch spectra drawing module can draw the annotated mass spectra for all identified peptides in html format or PDF format. The file format conversion modules allow data to be exported to external tools for further processing.
For example, the Pepxml converter converts the ppl file into another popular file format, pepxml,12 which is used by Census for quantification of identified proteins.20

The majority of the modules in ArMone are compatible with results derived from different search algorithms, including the two most popular commercial algorithms, SEQUEST21 and Mascot,22 as well as free database search algorithms such as X!Tandem,23 OMSSA,24 Inspect,25 Crux,26 and others. Once the database search results have been converted into ppl files, the remaining data analysis steps are nearly the same. However, the MS2/MS3 target-decoy strategy module is only applicable to results from SEQUEST and Mascot, as the remaining search algorithms cannot handle the neutral loss MS3 spectra. For example, X!Tandem cannot perform a database search with more than one variable modification on a specific amino acid, and Crux or Inspect may produce erroneous identifications if searches include neutral loss MS3 spectra.
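To make the peak list conversion described for PklistConverter concrete, the following is a minimal Python sketch of one such conversion, from a SEQUEST “.dta” peak list to a Mascot generic format (MGF) block. It is not ArMone's code (ArMone is written in Java). The function name `dta_to_mgf` and the `title` argument are hypothetical, and the sketch assumes the usual .dta layout: a first line holding the singly protonated precursor mass and the charge, followed by one m/z and intensity pair per line.

```python
PROTON = 1.007276  # mass of a proton in Da


def dta_to_mgf(dta_text, title):
    """Convert the text of one .dta peak list into one MGF ion block (hypothetical helper)."""
    rows = [ln.split() for ln in dta_text.strip().splitlines() if ln.strip()]
    mh_plus, charge = float(rows[0][0]), int(rows[0][1])
    # .dta stores the singly protonated mass (M+H)+, whereas MGF expects the precursor m/z.
    precursor_mz = (mh_plus + (charge - 1) * PROTON) / charge
    out = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"PEPMASS={precursor_mz:.4f}",
        f"CHARGE={charge}+",
    ]
    for fields in rows[1:]:
        # Only the first two columns (m/z and intensity) are used.
        out.append(f"{float(fields[0]):.4f} {float(fields[1]):.1f}")
    out.append("END IONS")
    return "\n".join(out)


if __name__ == "__main__":
    print(dta_to_mgf("1000.5000 2\n200.1 10\n300.2 20\n", "scan1"))
```

The only non-trivial step is the precursor conversion, since the two formats describe the same ion in different terms (mass versus m/z).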

Results

Overview of ArMone. In this study, we developed a software suite for the management and processing of phosphoproteome data sets to meet the special needs of phosphoproteome analysis. Most existing pipelines are designed for the processing of unmodified peptides/proteins, with limited or no functionality dedicated to modified peptides. Here we present a specially designed software suite with modules and functions dedicated to phosphoproteome data sets. In particular, it provides modules to control the confidence of phosphopeptide identifications and to localize the phosphorylation sites to the most probable sites on peptide sequences. In addition, user-friendly modules that provide auto filtering and manual validation are integrated in ArMone for the validation of phosphopeptide identifications. Other postvalidation modules are also provided to mine and export the valuable information on protein phosphorylation.

Figure 2. Flowchart of ArMone for the processing of phosphoproteome data set.

As shown in Figure 2, there are seven steps in the processing of a phosphoproteome data set.

1. Acquisition of the Mass Spectra for Phosphopeptides. Different types of dissociation, including CID, ETD, and ECD, can be applied to fragment phosphopeptides. With CID, MS2 spectra can be acquired for phosphopeptide identification. However, because of the dominant neutral loss, CID-based MS2 spectra often do not have enough fragment ions to confidently identify phosphopeptides. To improve the sensitivity and reliability of phosphopeptide identifications, the NLMS3 methodology is preferably employed, especially for a low accuracy mass spectrometer. Thus, two fragmentation patterns, based on MS2 and MS3, are obtained per phosphopeptide. With ETD/ECD, as the phosphorylated peptide does not lose the phosphate group, a data dependent MS2 strategy is able to confidently identify the phosphopeptides. In these instances, only one fragment spectrum based on MS2 is necessary per phosphopeptide.

2. Preprocessing of the Mass Spectra Data. The raw fragment ion spectra are extracted by peak list extraction algorithms such as extract_msn.exe in Bioworks or mzxml2other.exe in TPP.27 If the mass spectra are acquired by the NLMS3 method, the MS2/MS3 DTA preprocess module is used to validate the charge states based on the loss of a neutral fragment, as described in ref 18. For convenience, a spectra data format conversion module is provided by ArMone so that the data can be searched by different database search algorithms (e.g., conversion from SEQUEST “.dta” format to Mascot generic format).

3. Parsing of the Database Search Results. The results from the database search algorithm are parsed and exported to a peptide list file (ppl) using the relevant search result parser designed for that database search algorithm. All further processing is performed using the ppl file.
For the NLMS3 method, two ppl files are generated, containing the peptides identified from the MS2 spectra and from the MS3 spectra, respectively.

4. Identification of Phosphopeptides with Specific FDR. (i) If the mass spectra are acquired using the NLMS3 method, the MS2/MS3 module is used to confidently identify the phosphopeptides and sites of phosphorylation by combining the information from the MS2 and the associated MS3 spectra.18 Scores are refined to combine the identification information from the MS2 and MS3 spectra. The newly defined scores are then used to filter the identifications to a specific FDR. (ii) Besides the MS2/MS3 strategy, ArMone also provides an easy to use MS2 filtering module to process the search results from mass spectra collected by MS2 only.

5. Localization of the Phosphorylation Site. After the phosphopeptide sequences are identified, the next task is the localization of the phosphorylation sites. Using the phosphorylation site localization module, which implements the Ascore algorithm,19 phosphorylation sites on the phosphopeptides can be effectively and easily determined.

6. Validation of the Identified Phosphopeptides. After the peptide identifications are generated, the auto filtering module and the manual validation module can be used to process the phosphoproteome data set with customizable criteria to further improve the confidence of the identifications.

7. Extraction of Information on Protein Phosphorylation. A few convenient postvalidation modules were designed for phosphorylation analysis, such as the phosphorylation site statistic module, the protein clustering module based on the Occam’s razor strategy,28 the batch spectrum drawing module, and others. These modules make the reporting of phosphoproteome data much more convenient.

ArMone does not acquire the mass spectra (step 1). Instead, it provides integrated modules for the downstream processing of phosphoproteome data. As described in step 1, a single phosphopeptide can be identified by consecutive MS2 and MS3 spectra using the NLMS3 method, and a phosphopeptide can also be identified by a single MS2 spectrum. In the following sections, these two phosphopeptide identification strategies are introduced first, and then the functions of the other modules are described in detail.
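The charge validation in step 2 rests on simple arithmetic: a precursor of charge z that loses phosphoric acid (about 97.98 Da) shifts by 97.98/z in m/z, so the position of the dominant neutral loss peak reveals z. The sketch below illustrates that idea only. It is not the ZEvaluer implementation from ref 18, and the function name, the candidate charges, the m/z tolerance, and the relative intensity cutoff are all assumptions chosen for illustration.

```python
H3PO4 = 97.9769  # monoisotopic mass of the phosphoric acid neutral loss, in Da


def charge_from_neutral_loss(precursor_mz, peaks, charges=(1, 2, 3, 4),
                             tol=0.5, min_rel_intensity=0.5):
    """Guess the precursor charge from the neutral loss peak (illustrative only).

    peaks is a list of (mz, intensity) pairs from the MS2 spectrum.
    Returns the charge whose expected neutral loss peak is the most intense,
    or None when no candidate peak reaches min_rel_intensity of the base peak.
    """
    if not peaks:
        return None
    base = max(intensity for _, intensity in peaks)
    best_charge, best_intensity = None, 0.0
    for z in charges:
        expected = precursor_mz - H3PO4 / z  # m/z shift is the lost mass divided by z
        hit = max((i for mz, i in peaks if abs(mz - expected) <= tol), default=0.0)
        if hit >= min_rel_intensity * base and hit > best_intensity:
            best_charge, best_intensity = z, hit
    return best_charge


if __name__ == "__main__":
    spectrum = [(651.0, 100.0), (300.0, 20.0), (602.1, 5.0)]
    print(charge_from_neutral_loss(700.0, spectrum))
```

In this toy spectrum the base peak sits about 49 m/z units below the precursor, which matches a doubly charged ion. A spectrum with no sufficiently intense candidate returns None, which corresponds to the spectra the preprocess module would discard as lacking a significant neutral loss peak.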
Phosphopeptide Identification using MS2/MS3 Target-Decoy Strategy for MS2 and MS3 Spectra Collected from CID. For phosphopeptide identification by CID, better results are obtained when both MS2 and MS3 mass spectra are acquired for each phosphopeptide, because MS2 alone typically does not contain sufficient fragment ions due to neutral loss. Combining the results from consecutive MS2 and MS3 spectra can improve the sensitivity and reliability of phosphopeptide identification.29,30 Recently, we presented an MS2/MS3 target-decoy strategy for confident phosphopeptide identification, which integrates the search results from both MS2 and MS3 spectra.18 In the MS2/MS3 strategy, MS2 spectra and MS3 spectra are searched separately against the same target-decoy database, and the search results are combined before filtering. Phosphopeptides that are identified by both an MS2 spectrum and its corresponding MS3 spectrum are retained, and their identification scores are refined by combining the identification information from the two spectra. For SEQUEST database search results, the newly defined scores are Xcorr′s, which represents the cross-correlation between the phosphopeptide and the MS2/MS3 spectra pair, and ∆Cn′m, which is the minimum ∆Cn score for the phosphopeptide identification by the MS2 or MS3 spectrum. For Mascot, under the assumption that peptide matches to MS2 and MS3 spectra are independent events,31 the new score (Ionscores) is defined as the sum of the ion scores for the phosphopeptide identification from the MS2 and MS3 spectra. These newly defined scores are then used to filter the results to an acceptable FDR. As the phosphopeptides are identified from both MS2 and MS3 spectra, this strategy can greatly improve the reliability of the identified phosphopeptides. Furthermore, because false positive identifications are drastically reduced by the MS2/MS3 strategy, the final filtering threshold can be less stringent and still generate phosphopeptide identifications with high confidence, which increases the sensitivity of phosphopeptide identification.

In the ArMone software suite, the MS2/MS3 target-decoy strategy is implemented by several modules. First, the MS2/MS3 DTA preprocess module is used to process the “.dta” files of MS2 and MS3 spectra to remove the spectra with invalid charge states or without significant neutral loss peaks.
This significantly reduces the database search time (by nearly 1 order of magnitude) with limited loss in sensitivity for the MS2/MS3 strategy.18 After the database search by SEQUEST or Mascot, the search results are converted to ppl files for MS2 and MS3 spectra separately. Then the MS2/MS3 target-decoy module is used to combine the search results from MS2 and MS3 spectra. The output of the MS2/MS3 module is also a ppl file, containing phosphopeptide identifications from MS2/MS3 spectra pairs. Finally, phosphopeptide identifications with a specific confidence can be generated easily using the FDR control module by setting appropriate filters for the newly defined scores. As the ppl file contains information on both the peptide identifications and the mass spectra, it can be used as the input file for other postvalidation modules, for example, for protein clustering, phosphorylation site statistics, and so on.

Phosphopeptide Identification with FDR Control for MS2 Collected from CID/ETD/ECD. The MS2 spectra generated by CID/ETD/ECD are searched directly by any database search algorithm with variable modification of phosphorylation on S, T, and Y residues, and the results can be filtered to a specific confidence level by ArMone. To generate peptide identifications with different FDRs, ArMone provides an easy to use module for the creation of a composite database containing both target and decoy sequences, and an FDR control module for the generation of peptide identifications with different confidence levels. After the search against the target-decoy database, the FDR of the final identifications can be controlled by setting different criteria, which provides a convenient way to filter peptide identifications.
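The two filtering ideas described so far, combining MS2 and MS3 evidence and then cutting at a score that yields a chosen FDR, can be sketched in a few lines of Python. This is an illustration for the Mascot case only, where the text defines the combined score as the sum of the two ion scores. The data layout, the function names, and the way peptides are matched (by comparing plain sequence strings) are assumptions. The FDR estimate used here, decoy hits divided by target hits above the threshold, is one common convention and is not necessarily the formula ArMone applies.

```python
def combine_mascot_pairs(ms2_hits, ms3_hits):
    """Keep scans where MS2 and MS3 agree on the peptide and sum their ion scores.

    Both arguments map a scan id to (peptide, ion_score, is_decoy).
    Returns a list of (peptide, combined_score, is_decoy) tuples.
    """
    combined = []
    for scan, (peptide, score2, is_decoy) in ms2_hits.items():
        if scan in ms3_hits and ms3_hits[scan][0] == peptide:
            combined.append((peptide, score2 + ms3_hits[scan][1], is_decoy))
    return combined


def estimate_fdr(hits, threshold):
    """Estimate FDR as decoys / targets among hits scoring at or above threshold."""
    targets = sum(1 for _, s, d in hits if s >= threshold and not d)
    decoys = sum(1 for _, s, d in hits if s >= threshold and d)
    if targets == 0:
        return 0.0 if decoys == 0 else 1.0
    return decoys / targets


def lowest_threshold_for_fdr(hits, max_fdr=0.01):
    """Return the smallest observed score cutoff whose estimated FDR is <= max_fdr."""
    for threshold in sorted({s for _, s, _ in hits}):
        if estimate_fdr(hits, threshold) <= max_fdr:
            return threshold
    return None


if __name__ == "__main__":
    ms2 = {1: ("PEPA", 30.0, False), 2: ("PEPB", 20.0, False),
           3: ("XXXX", 15.0, True), 4: ("PEPC", 25.0, False)}
    ms3 = {1: ("PEPA", 25.0, False), 2: ("PEPB", 10.0, False),
           3: ("XXXX", 5.0, True), 4: ("OTHER", 40.0, False)}
    pairs = combine_mascot_pairs(ms2, ms3)
    print(pairs, lowest_threshold_for_fdr(pairs))
```

In the toy data, scan 4 is dropped because its MS2 and MS3 spectra disagree, which is the step that removes many false positives before any score threshold is applied. The same `estimate_fdr` function could be run on the phosphopeptide subset alone.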


Although many phosphoproteomic approaches include phosphopeptide enrichment, there is always a proportion of unmodified peptides that is analyzed by the mass spectrometer.32 Therefore, both phosphopeptides and nonphosphopeptides can be identified after database searching. Supplementary Figure 1 shows that the FDR values for all peptide identifications, for the subset of phosphopeptide identifications, and for the subset of nonphosphopeptide identifications always differ when the same filtering score is applied to the data. This may be due to the differences in fragmentation between phosphopeptides and nonphosphopeptides in MS. Thus, the criteria for filtering unphosphorylated and phosphorylated identifications should be different; otherwise, bias is introduced into the identification of phosphorylated peptides when a global FDR is used. In ArMone, phosphopeptide identifications can be easily separated from the total peptide identifications, so filters can be set to control the FDR for the subset of phosphorylated peptides. The idea is that an FDR determined for the subset of phosphopeptides should be more closely aligned with their actual confidence. In addition, the subset of unphosphorylated peptide identifications can also be extracted and filtered separately, which enables comparison between the two subsets of peptide identifications.

Phosphorylation Site Localization and Ascore Calculation Modules. The next step after the phosphopeptides are identified is to localize the site of phosphorylation for phosphopeptides with more than one potential phosphorylation site (i.e., S, T, or Y). Currently, nearly all of the proteome pipelines, which were originally developed for processing unmodified peptide data sets, do not provide modules to localize the phosphorylation sites after the identification of phosphopeptide sequences.
Therefore, when these pipelines are used to process phosphoproteome data sets, external algorithms are needed for phosphorylation site localization. Beausoleil et al. presented the Ascore algorithm for the localization of phosphorylation sites.19 Ascore is a probability based score that can effectively localize the phosphorylation sites to the most probable location on a peptide sequence. However, the online version of Ascore (http://ascore.med.harvard.edu/ascore.php) can only process phosphopeptides identified by SEQUEST, which seriously limits the use of this algorithm. To overcome this limitation, the Ascore module in ArMone was modified to localize phosphorylation sites on phosphopeptides identified by other search engines, including Mascot, X!Tandem, OMSSA, and others. After phosphopeptides are identified by any of the above database search algorithms, the Ascore module is used to localize the phosphorylation sites to the most probable location. In this way, highly confident phosphorylation site localization can be achieved. The performance of the Ascore module was also evaluated using the test data set (134 phosphopeptides, 161 phosphorylation sites) available on the Ascore web site (http://ascore.med.harvard.edu/ascore.php). The Ascores calculated by the Ascore module in ArMone and by the online Ascore algorithm showed high correlation (R² = 0.96), indicating that the Ascore calculation module in ArMone correctly implements the Ascore algorithm (Supplemental Figure 2, Supporting Information).

Unlike the online version of the Ascore algorithm, the Ascore module in ArMone is also designed to localize the phosphorylation sites for phosphopeptides identified from mass spectra with different types of fragment ions (e.g., b and y or c and z). In this way, the Ascore algorithm can also be used to localize the phosphorylation sites for phosphopeptides analyzed by ETD or ECD. That is, instead of b and y ions, the ion types are set to c and z ions for the localization of phosphorylation sites on phosphopeptides identified from ETD/ECD data.

Figure 3. Graphic user interface for the setting of criteria for auto filtering.

Auto Filtering and Manual Validation Modules. Even though the target-decoy strategy can effectively determine the overall confidence of the peptide identifications in a data set, it cannot determine the confidence of an individual peptide identification. To determine which peptide identifications are more likely to be true positives, customized criteria are needed to further verify the peptide identifications. Manual verification of the match between the predicted and experimental spectra is one of the most widely used strategies to improve the reliability of identifications. It may be the most direct way to judge whether an identification is a true positive, as it does not make any assumptions about the data set. This is a very important validation strategy for the identification of phosphorylated peptides, especially when the spectra are collected by low accuracy mass spectrometry. However, because manual validation is very time-consuming, it is the biggest bottleneck for large scale proteome analysis. To circumvent this limitation, an auto filtering module is provided in ArMone to process the data set with customizable criteria. As the auto filtering module is only applicable when uniform validation criteria are available, an easy to use manual validation module is also available in ArMone for validating individual peptide identifications with special criteria.

Auto Filtering Module. The auto filtering module in the ArMone software suite can automatically process the peptide identifications in a batch using the filtering criteria configured by users. In this way, researchers can validate their data set on the fly using uniform criteria for all the peptide identifications.
As shown in Figure 3, researchers can customize the auto filtering by specifying the fragment tolerance, intensity threshold, spectrum intensity filters, and the number of continuous b and y or c and z type ions. After the criteria are set, all the peptide identifications are filtered one by one using these parameters. When the auto filtering is completed, the subset of peptide identifications that do not pass the preset filtering criteria is removed from the original peptide list and shown in a new window. The auto filtering module was developed to provide a simple automated way to further process the peptide identifications using customizable filter criteria. However, a small portion of the peptide identifications that do not pass the customized filter criteria may also be true identifications, so manual validation of these identifications is required. To improve the efficiency of data processing, we recommend first using strict criteria to generate a subset of peptide identifications with high confidence, and then manually checking the identifications that do not pass the uniform criteria in the new window provided by ArMone.

Manual Validation Module. An easy to use manual validation module is also integrated in ArMone. As shown in Figure 4, when the manual validation mode is switched on, the spectra panel is shown as the top level frame, and the spectrum marked with matched ions is shown on the spectra panel when a peptide in the peptide list table is selected. In this way, the matched spectrum for each peptide is shown in real time. A peptide identification can be deselected from the peptide list if it does not meet the peptide identification criteria or if the spectrum is of poor quality. The option for different fragment types (e.g., b and y or c and z) allows ArMone to process mass spectra acquired by different types of dissociation modes. For mass spectra acquired by CID, the neutral loss peak, which is very useful for the validation of phosphopeptide identifications, can also be selected to be labeled on the spectrum. In addition, for phosphopeptides identified by MS2/MS3 spectra pairs, both mass spectra, that is, the MS2 spectrum and its consecutive MS3 spectrum, are shown (Figure 4). After all the peptides are validated manually, the resulting peptide identifications can be exported to a new peptide list file or used to generate protein identifications directly. By using the auto filtering and manual validation modules, the time needed to validate the results is greatly reduced.
Besides the processing of phosphopeptide identifications, the manual validation module and auto filtering module can also be used for the validation of unmodified peptides and peptides with other PTMs such as glycosylation. Modules for Generating Data with Suitable Format for Publication. After the generation of phosphopeptide identifications with acceptable confidence level in large scale phosphoproteome analysis, useful information can be extracted from the phosphopeptide identifications. Those include the identified phosphorylation sites and their localization on the protein, the total number of identified phosphorylation sites and others. This information is generally not provided by existing proteomic software. ArMone includes modules that can extract and export this information. Phosphorylation Site Statistic Module. Two types of reports can be generated by the phosphorylation site statistic module. One is the global report in which only global information for the data set is included, such as: (1) number of the distinct phosphopeptides, number of nonphosphopeptides and the percentage of these two types of peptides in the data set; (2) total number of identified phosphorylation sites; (3) number and percentage of distinct phosphopeptides with one, two, and three phosphorylation sites; (4) the number and percentage of phosphorylation on STY. A global report can also be provided in a separate window once the statistical command are activated. Finally, the a detail report can be exported to a csv file (a comma delimited plain file which could be opened and edited by Excel). The detail report contains information of the identified phosphorylation sites on every phosphoprotein: (1) the information on the identified phosphoproteins such as protein name and accession number, identified phosphopepJournal of Proteome Research • Vol. 9, No. 5, 2010 2747

technical notes

Jiang et al.

Figure 4. Graphic user interface of the annotated mass spectrum viewer and the manual validation module in ArMone. For phosphopeptides identified from both MS2 and its consecutive MS3 spectra, both of the MS2 and MS3 spectra with annotated ions are shown.

tides; (2) the identified phosphorylation sites and the location on the protein; and (3) the peptide sequence with 7 additional amino acids around the identified phosphorylation sites. A snapshot of a detail report is shown in Figure 5 and an example of the csv file for the full report can be found at the homepage of ArMone. Using the information provided in the detail report, one can easily carry on additional investigations, for example, the peptide sequence with 7 amino acids around the identified phosphorylation sites is given in the report, which can be conveniently submitted to phosphorylation site prediction algorithms (e.g., ScanSite33). Module for Drawing Spectra Match Information in Batch. Researchers may want to report the peptide or protein identifications together with the original mass spectra with matched ions information as part of submission or manuscripts.16,17 Even though some of the current proteome software suites have modules to show mass spectra with matched ion information for the identified peptides, graphics of the matched spectra are shown one by one. There is currently no publicly available tool for the batch creation of graphics of the matched spectra for all of the identified peptides. The batch drawing module of ArMone provides an easy way for the generation of mass spectra graphics for public evaluation. It draws spectrum graphics containing labeled peaks of the matched theoretical ions for all the identified phosphopeptides in a batch. The results can be exported in PDF format and as html file with hyper links to the spectrum images. For laboratories with home pages, the html format may be preferred so that others can access the data spectra using the Internet. Files in PDF format are easy to handle and can be 2748

Journal of Proteome Research • Vol. 9, No. 5, 2010

uploaded as Supporting Information for manuscripts. However, the file size grows with the number of identified peptides. If the phosphopeptide identifications are generated by the MS2/MS3 target-decoy strategy, the module can draw the MS2 spectrum and its consecutive MS3 spectrum together. Example files can be found at the homepage of ArMone. A representative graph for the phosphopeptides identified by MS2/MS3 spectra is shown in Figure 6.
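The sequence window listed as item (3) of the detail report can be illustrated with a short sketch. This is not ArMone's implementation: the function name `site_window`, the 1-based site numbering, the `-` padding character, and the reading of "7 additional amino acids" as 7 residues on each side of the site are all assumptions made for illustration.

```python
def site_window(protein_seq: str, site_pos: int, flank: int = 7, pad: str = "-") -> str:
    """Return the residues around a 1-based site position.

    Assumption: `flank` residues are taken on EACH side of the site, and
    positions beyond the protein termini are filled with `pad` so that the
    window always has length 2 * flank + 1.
    """
    if not 1 <= site_pos <= len(protein_seq):
        raise ValueError("site position lies outside the protein sequence")
    idx = site_pos - 1  # convert to 0-based index
    left = protein_seq[max(0, idx - flank):idx]
    right = protein_seq[idx + 1:idx + 1 + flank]
    return left.rjust(flank, pad) + protein_seq[idx] + right.ljust(flank, pad)


# Hypothetical protein sequence, used only to show the call.
protein = "MKRSPEKTLSAAGSPQDEVY"
print(site_window(protein, 4))   # ----MKRSPEKTLSA
print(site_window(protein, 14))  # KTLSAAGSPQDEVY-
```

A fixed-length, padded window keeps the site in the center position, which is the layout that motif-based tools typically expect; whether a given prediction server accepts padding characters would need to be checked against its own input rules.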

Discussion

Compared with unmodified peptides and proteins, phosphopeptides and phosphoproteins have some unique properties, for example, low stoichiometry, variable phosphorylation sites, poor fragmentation in tandem mass spectrometry, and the fact that phosphoproteins are often identified by only one distinct peptide after phosphopeptide enrichment. These properties make it difficult to confidently identify phosphoproteins and localize their phosphorylation sites. For example, poor fragmentation often yields low search scores for phosphopeptides, which significantly reduces the sensitivity of phosphopeptide identification. In addition, the enlarged search space resulting from treating phosphorylation on S, T, or Y as a variable modification introduces more false positive identifications. Therefore, specific validation methods are required to generate high-confidence phosphopeptide identifications, and an additional phosphorylation site localization algorithm is needed because the phosphorylation site within a phosphorylated peptide can vary. Because most of the existing pipelines


technical notes

Figure 5. Example of the phosphorylation site detail report. The detail information includes the phosphopeptide sequence, protein name, phosphorylation site localization on proteins, the amino acid sequence around the phosphorylation site, and so on.

Figure 6. Example of identified peptides and the image of the mass spectra in html format. Mass spectra with labeled ions are shown following the hyperlink of each identified peptide. If phosphopeptides are identified from MS2/MS3 spectra pairs, two spectra are shown: the MS2 spectrum and the MS3 spectrum.

do not provide validation algorithms for phosphopeptide identification, the accuracy and sensitivity of these pipelines for the processing of phosphoproteome data sets tend to be poor. Moreover, current pipelines generally do not provide modules for the precise localization of phosphorylation sites or for phosphorylation-specific statistics. These drawbacks make it difficult to process phosphoproteome data sets with these pipelines.
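The enlargement of the search space by variable phosphorylation on S, T, or Y can be made concrete with a small counting sketch. It assumes that a search engine considers every combination of up to `max_sites` phosphorylated residues per peptide; the function name and the `max_sites` parameter are hypothetical and do not correspond to any particular search engine's settings.

```python
from math import comb


def count_phosphoforms(peptide: str, max_sites: int = 3, residues: str = "STY") -> int:
    """Count candidate forms of a peptide when phosphorylation is variable.

    Assumption: any subset of up to `max_sites` S/T/Y residues may carry a
    phosphate. The unmodified form (zero phosphates) is included in the count.
    """
    n = sum(peptide.count(r) for r in residues)
    return sum(comb(n, k) for k in range(min(n, max_sites) + 1))


# Hypothetical peptide with four S/T/Y residues: 1 + 4 + 6 + 4 candidate forms.
print(count_phosphoforms("SAYTPSK"))  # 15
# A peptide without S, T, or Y has only its unmodified form.
print(count_phosphoforms("GAVLK"))    # 1
```

Each additional candidate form is one more sequence that a spectrum can be matched against by chance, which is why the variable modification tends to raise the number of false positive identifications.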

The ArMone software suite was developed specifically to manage and process phosphoproteome data sets. For single MS2 spectra collected by CID/ETD/ECD, ArMone has convenient modules to control the confidence of phosphopeptide identifications. For MS2/MS3 spectrum pairs collected by CID, the automatic phosphopeptide validation algorithm can significantly improve the sensitivity and reliability of phosphopeptide identification. Phosphopeptides and unphosphorylated peptides can be easily distinguished and separated into different subsets so that the FDR of the two subsets can be controlled separately. The manual validation and auto filtering modules can help to reduce false positive identifications using customizable criteria. In addition, modules are available for the automatic generation of reports on the phosphoproteomic data set, such as statistical reports on the phosphorylation sites and matched spectra information in accordance with the proteome guidelines (refs 16 and 17). Such reports are not provided by most existing proteome software suites and pipelines.

It should be noted that even though ArMone is designed for the processing of phosphoproteome data sets, it can also be used to process data sets of unmodified peptides/proteins and data sets of proteins with other post-translational modifications, such as glycosylation. In addition, the ArMone phosphoproteome software suite is designed in a pluggable manner, so other modules such as a quantification module can be easily integrated. These new tools are under design and will be released in future versions of ArMone.
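The separate FDR control for the phosphopeptide and unphosphorylated-peptide subsets described above can be sketched as follows. This is a minimal illustration, not ArMone's code: the `PSM` record layout, the single `score` field, and the estimator FDR = decoys / targets are assumptions (other target-decoy estimators exist), and real search results carry several scores per match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """One peptide-spectrum match (hypothetical record layout)."""
    peptide: str
    score: float
    is_decoy: bool
    is_phospho: bool


def estimate_fdr(psms, cutoff):
    """Estimate FDR as decoys / targets among matches scoring >= cutoff (assumed estimator)."""
    kept = [p for p in psms if p.score >= cutoff]
    decoys = sum(p.is_decoy for p in kept)
    targets = len(kept) - decoys
    if targets == 0:
        return float("inf") if decoys else 0.0
    return decoys / targets


def lowest_cutoff(psms, max_fdr):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr, or None."""
    for cutoff in sorted({p.score for p in psms}):
        if estimate_fdr(psms, cutoff) <= max_fdr:
            return cutoff
    return None


def filter_by_subset(psms, max_fdr=0.01):
    """Split matches into phospho / non-phospho subsets and filter each one separately."""
    accepted = {}
    for label, want_phospho in (("phospho", True), ("non_phospho", False)):
        subset = [p for p in psms if p.is_phospho == want_phospho]
        cutoff = lowest_cutoff(subset, max_fdr)
        accepted[label] = [] if cutoff is None else [
            p for p in subset if p.score >= cutoff and not p.is_decoy
        ]
    return accepted
```

Because the cutoff is chosen per subset, a score distribution that is shifted lower for phosphopeptides does not force the same threshold onto unphosphorylated peptides, and vice versa.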

Conclusions

In summary, ArMone is a software suite specially designed for the processing of phosphoproteome data sets. It can process search results obtained by most of the commonly used database search algorithms and can be run on different operating systems. Several well-designed modules, including data preprocessing modules, search result parsers, identification validation modules, and postvalidation modules, are integrated to facilitate the processing and reporting of phosphoproteome data sets. Furthermore, ArMone allows the export of results in formats required for publication. We are currently developing additional modules and new features to further enhance ArMone as a powerful program for the management and processing of proteome data.

Abbreviations: FDR, false discovery rate; NLMS3, data dependent neutral loss-triggered MS3; PTMs, post-translational modifications; CID, collision induced dissociation; ETD, electron transfer dissociation; ECD, electron capture dissociation.

Acknowledgment. Financial support from the National Natural Sciences Foundation of China (No. 20675081, 20735004), the China State Key Basic Research Program Grant (2005CB522701, 2007CB914104), the China High Technology Research Program Grant (2006AA02A309), the National Key Special Program on Infection Diseases (2008ZX10002-017), the Analytical Method Innovation program of MOST (2009IM031800), the Knowledge Innovation program of CAS (KJCX2.YW.HO9, KSCX2.YW.079), and the Knowledge Innovation program of DICP to H.Z., and from the National Key Special Program on Infection Diseases (2008ZX10002-020) and the National Natural Sciences Foundation of China (No. 20605022, 90713017) to M.Y., is gratefully acknowledged. We also thank Dr. Daniel Figeys of the University of Ottawa for his valuable comments and suggestions.



Supporting Information Available: Supplemental Figures S1 and S2. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Hunter, T. Signaling - 2000 and beyond. Cell 2000, 100 (1), 113–127.
(2) Mazanetz, M. P.; Fischer, P. M. Untangling tau hyperphosphorylation in drug design for neurodegenerative diseases. Nat. Rev. Drug Discovery 2007, 6 (6), 464–479.
(3) Cohen, P. The regulation of protein function by multisite phosphorylation - a 25 year update. Trends Biochem. Sci. 2000, 25 (12), 596–601.
(4) Beausoleil, S. A.; Jedrychowski, M.; Schwartz, D.; Elias, J. E.; Villen, J.; Li, J. X.; Cohn, M. A.; Cantley, L. C.; Gygi, S. P. Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (33), 12130–12135.
(5) Gruhler, A.; Olsen, J. V.; Mohammed, S.; Mortensen, P.; Faergeman, N. J.; Mann, M.; Jensen, O. N. Quantitative phosphoproteomics applied to the yeast pheromone signaling pathway. Mol. Cell. Proteomics 2005, 4 (3), 310–327.
(6) Larsen, M. R.; Thingholm, T. E.; Jensen, O. N.; Roepstorff, P.; Jorgensen, T. J. D. Highly selective enrichment of phosphorylated peptides from peptide mixtures using titanium dioxide microcolumns. Mol. Cell. Proteomics 2005, 4 (7), 873–886.
(7) Lee, J.; Xu, Y. D.; Chen, Y.; Sprung, R.; Kim, S. C.; Xie, S. H.; Zhao, Y. M. Mitochondrial phosphoproteome revealed by an improved IMAC method and MS/MS/MS. Mol. Cell. Proteomics 2007, 6 (4), 669–676.
(8) Olsen, J. V.; Blagoev, B.; Gnad, F.; Macek, B.; Kumar, C.; Mortensen, P.; Mann, M. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 2006, 127 (3), 635–648.
(9) Pinkse, M. W. H.; Uitto, P. M.; Hilhorst, M. J.; Ooms, B.; Heck, A. J. R. Selective isolation at the femtomole level of phosphopeptides from proteolytic digests using 2D-nanoLC-ESI-MS/MS and titanium oxide precolumns. Anal. Chem. 2004, 76 (14), 3935–3943.
(10) Trinidad, J. C.; Specht, C. G.; Thalhammer, A.; Schoepfer, R.; Burlingame, A. L. Comprehensive identification of phosphorylation sites in postsynaptic density preparations. Mol. Cell. Proteomics 2006, 5 (5), 914–922.
(11) Villen, J.; Beausoleil, S. A.; Gerber, S. A.; Gygi, S. P. Large-scale phosphorylation analysis of mouse liver. Proc. Natl. Acad. Sci. U.S.A. 2007, 104 (5), 1488–1493.
(12) Keller, A.; Eng, J.; Zhang, N.; Li, X. J.; Aebersold, R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005, 1, E1-E8.
(13) Hartler, J.; Thallinger, G. G.; Stocker, G.; Sturn, A.; Burkard, T. R.; Koerner, E.; Rader, R.; Schmidt, A.; Mechtler, K.; Trajanoski, Z. MASPECTRAS: a platform for management and analysis of proteomics LC-MS/MS data. BMC Bioinformatics 2007, 8, 13.
(14) Kohlbacher, O.; Reinert, K.; Gropl, C.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Sturm, M. In TOPP - the OpenMS proteomics pipeline; Oxford Univ. Press: Cambridge, 2007; pp E191-E197.
(15) Hakkinen, J.; Vincic, G.; Mansson, O.; Warell, K.; Levander, F. The Proteios software environment: An extensible multiuser platform for management and analysis of proteomics data. J. Proteome Res. 2009, 8 (6), 3037–3043.
(16) Molecular & Cellular Proteomics, Apr. 2007, http://www.mcponline.org/misc/ParisReport_Final.dtl.
(17) Wilkins, M. R.; Appel, R. D.; Van Eyk, J. E.; Chung, M. C. M.; Gorg, A.; Hecker, M.; Huber, L. A.; Langen, H.; Link, A. J.; Paik, Y. K.; Patterson, S. D.; Pennington, S. R.; Rabilloud, T.; Simpson, R. J.; Weiss, W.; Dunn, M. J. Guidelines for the next 10 years of proteomics. Proteomics 2006, 6 (1), 4–8.
(18) Jiang, X. N.; Han, G. H.; Feng, S.; Jiang, X. G.; Ye, M. L.; Yao, X. B.; Zou, H. F. Automatic Validation of Phosphopeptide Identifications by the MS2/MS3 Target-Decoy Search Strategy. J. Proteome Res. 2008, 7 (4), 1640–1649.
(19) Beausoleil, S. A.; Villen, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24 (10), 1285–1292.
(20) Park, S. K.; Venable, J. D.; Xu, T.; Yates, J. R. A quantitative analysis software tool for mass spectrometry-based proteomics. Nat. Methods 2008, 5 (4), 319–322.
(21) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976–989.
(22) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551–3567.
(23) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–1467.
(24) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X. Y.; Shi, W. Y.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3 (5), 958–964.
(25) Tanner, S.; Shu, H. J.; Frank, A.; Wang, L. C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. InsPecT: Identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005, 77 (14), 4626–4639.
(26) Park, C. Y.; Klammer, A. A.; Kall, L.; MacCoss, M. J.; Noble, W. S. Rapid and accurate peptide identification from tandem mass spectra. J. Proteome Res. 2008, 7 (7), 3022–3027.
(27) The Trans-Proteomic Pipeline (TPP), http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP.
(28) Nesvizhskii, A. I.; Aebersold, R. Interpretation of shotgun proteomic data - The protein inference problem. Mol. Cell. Proteomics 2005, 4 (10), 1419–1440.
(29) Olsen, J. V.; Mann, M. Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (37), 13417–13422.
(30) Feng, S.; Ye, M. L.; Zhou, H. J.; Jiang, X. G.; Jiang, X. N.; Zou, H. F.; Gong, B. L. Immobilized zirconium ion affinity chromatography for specific enrichment of phosphopeptides in phosphoproteome analysis. Mol. Cell. Proteomics 2007, 6, 1656–1665.
(31) Ulintz, P. J.; Bodenmiller, B.; Andrews, P. C.; Aebersold, R.; Nesvizhskii, A. I. Investigating MS2/MS3 matching statistics. Mol. Cell. Proteomics 2008, 7 (1), 71–87.
(32) Dai, J.; Jin, W. H.; Sheng, Q. H.; Shieh, C. H.; Wu, J. R.; Zeng, R. Protein phosphorylation and expression profiling by Yin-yang multidimensional liquid chromatography (Yin-yang MDLC) mass spectrometry. J. Proteome Res. 2007, 6 (1), 250–62.
(33) Obenauer, J. C.; Cantley, L. C.; Yaffe, M. B. Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 2003, 31 (13), 3635–3641.

PR9009904
