GAPP: A Fully Automated Software for the Confident Identification of Human Peptides from Tandem Mass Spectra

Ian Shadforth,† Weibing Xu,† Daniel Crowther,‡ and Conrad Bessant*,†

Cranfield University, Silsoe, Bedfordshire MK45 4DT, United Kingdom, and Target Bioinformatics, GlaxoSmithKline, Stevenage, SG2 1NY, United Kingdom

Received May 2, 2006

Abstract: This paper introduces the genome annotating proteomic pipeline (GAPP), a totally automated publicly available software pipeline for the identification of peptides and proteins from human proteomic tandem mass spectrometry data. The pipeline takes as its input a series of MS/MS peak lists from a given experimental sample and produces a series of database entries corresponding to the peptides observed within the sample, along with related confidence scores. The pipeline is capable of finding any peptides expected, including those that cross intron-exon boundaries, and those due to single nucleotide polymorphisms (SNPs), alternate splicing, and posttranslational modifications (PTMs). GAPP can therefore be used to re-annotate genomes, and this is supported through the inclusion of a Distributed Annotation System (DAS) server, which allows the peptides identified by the pipeline to be displayed in their genomic context within the Ensembl genome browser. GAPP is freely available via the web, at www.gapp.info.

Keywords: bioinformatics • mass spectrometry • proteomics • high throughput • grid

Introduction

Proteomics based on tandem mass spectrometry is a powerful tool for identifying novel biomarkers and drug targets. A major bottleneck in high-throughput proteomics is that the quantity of data generated is substantial, and the computational techniques needed to reliably identify proteins from proteomic data have to date lagged behind the ability to collect it. Although a number of powerful peptide identification algorithms have been produced,1 the output from these algorithms often requires human intervention to make calls regarding the presence or absence of peptides in borderline cases, of which there can be many. Furthermore, such tools promote a piecemeal approach to proteomic data analysis, where data is not automatically analyzed in the context of other data available from the same experiment, or other experiments. The computational challenge for high-throughput proteomics is therefore to automatically and rapidly generate high-confidence protein identifications, including post-translational modifications (PTMs), from large datasets in a searchable format covering multiple experimental runs. GAPP addresses this challenge, by providing a totally automated peptide identification pipeline with a false-positive rate of close to zero.

* Corresponding author, [email protected]. † Cranfield University. ‡ GlaxoSmithKline.

10.1021/pr060205s CCC: $33.50 © 2006 American Chemical Society

Experimental Procedures

1. System Overview. A high level overview of the GAPP system is shown in schematic form in Figure 1. The heart of the system is the peptide identification pipeline, which runs on a high-performance computing facility (HPCF) and is described in the next section. In essence, GAPP retrieves MS/MS peak list data from fileservers in remote laboratories, identifies the peptides within the data, and deposits the resulting identifications in a database for subsequent analysis. Flow of information through the system is controlled by a pipeline request processor, which watches fileservers for new data, and monitors capacity on the high-performance computer facility, initiating the transfer of new data from laboratory fileservers to the HPCF when computer power is available. Secure data transfer is mediated by Storage Resource Broker (SRB).2 The distributed system design has two distinct advantages. First, laboratory groups do not need to manually upload data, and second, it negates the need for the actual peak list files, which are relatively large (typically 500 MB per experiment), to be transported to, or stored on, the GAPP server. The GAPP server is therefore only lightly loaded, as the burden of storage and analysis is distributed among the laboratory fileservers and the HPCF. All GAPP identifications are retained on the GAPP server regardless of ongoing access to the original spectra, so there is no impact on the availability of the identifications even if the original MS/MS data goes off-line.

1.1. Automated Data Submission. To make their data available to GAPP, a proteomics laboratory need only register their fileserver at the GAPP Web site and install a small piece of software called GAPPclient on their fileserver, to facilitate the transfer of data to GAPP. Submitting groups may flag data as either public or private.
When data is flagged as public, it is processed and the peptides identified are deposited in the publicly accessible peptide identification database. If data is flagged as private, it is still processed, but access to the resulting peptide identifications is restricted to the group that provided the data. Ideally, data made available to GAPP should conform to the open standard mzData format, so that relevant metadata can be transported through the pipeline and stored alongside identified proteins. However, GAPP accepts popular legacy peak list formats including PKL and MGF. Some basic information about the experiment is requested, to facilitate more useful querying of the peptide identifications that result.

Technical Notes. Journal of Proteome Research 2006, 5, 2849-2852. Published on Web 08/16/2006.

Figure 1. Schematic overview of the GAPP system. Dashed boxes indicate physical entities within the system. Arrows show data flow, with the weight of the arrow roughly indicating the amount of data being transferred.

Figure 2. Schematic of the peptide identification pipeline.

1.2. Peptide Identification Database. Identified peptides are stored in a relational database on the GAPP server. This database schema is closely mapped to that of the European Bioinformatics Institute’s PRIDE3 database, facilitating future export of peptide identifications to PRIDE using PRIDE XML and ultimately to other databases via AnalysisXML, once that schema is finalized. The detailed scoring information related to the identification of each peptide is stored alongside the data. Furthermore, any annotations associated with the peak list data from which the peptides were identified are carried through to this database, creating the potential to search the database according to tissue type, and so forth. At present, simple database searches can be carried out via the GAPP Web site (www.gapp.info). The data are also made available via a DAS server, such that the peptides may be visualized in their genomic context in Ensembl.
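The public/private flag from Section 1.1 and the annotation-based querying from Section 1.2 can be illustrated with a small relational sketch. The actual GAPP schema (mapped to PRIDE) is not given in the text, so every table name, column name, and value below is a hypothetical stand-in chosen only to show the idea: identifications carry their experiment's annotations, and private identifications are visible only to the submitting group.

```python
import sqlite3

# Hypothetical, simplified schema. The real GAPP schema is not described in
# the text, so all table and column names here are assumptions.
SCHEMA = """
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    submitting_group TEXT NOT NULL,
    tissue_type TEXT,
    is_public INTEGER NOT NULL
);
CREATE TABLE peptide_identification (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    sequence TEXT NOT NULL,
    gene_product TEXT NOT NULL,
    score REAL NOT NULL
);
"""


def visible_peptides(conn, tissue_type, requesting_group):
    """Return (sequence, gene_product, score) rows for one tissue type.

    A row is visible if its experiment is public, or if the requesting
    group is the group that submitted the data.
    """
    query = """
        SELECT p.sequence, p.gene_product, p.score
        FROM peptide_identification AS p
        JOIN experiment AS e ON e.id = p.experiment_id
        WHERE e.tissue_type = ?
          AND (e.is_public = 1 OR e.submitting_group = ?)
        ORDER BY p.sequence
    """
    return conn.execute(query, (tissue_type, requesting_group)).fetchall()


def build_demo_db():
    """Create an in-memory database with one public and one private experiment."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?, ?)",
        [(1, "lab_a", "plasma", 1), (2, "lab_b", "plasma", 0)],
    )
    conn.executemany(
        "INSERT INTO peptide_identification VALUES (?, ?, ?, ?, ?)",
        [(1, 1, "AAAK", "GENE1", 0.99), (2, 2, "CCCR", "GENE2", 0.97)],
    )
    return conn
```

In this toy database, `visible_peptides(build_demo_db(), "plasma", "lab_a")` returns only the public identification, while the same call for `"lab_b"` also returns that group's private one.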

Journal of Proteome Research • Vol. 5, No. 10, 2006

2. Peptide and Gene-Product Identification Pipeline Algorithms. The peptide identification pipeline used in GAPP is shown in Figure 2. The input to the pipeline consists of MS/MS spectra in the form of peak lists. The spectra are analyzed in batches, where each batch comprises all spectra acquired from a given experimental sample. The aim of the identification algorithms is to assign each of these spectra to a specific peptide and to group these assignations into gene product identifications. The steps in this process are described below, in the order in which they occur.

2.1. Primary Scoring. GAPP’s primary scoring consists of a peptide database search, for which there are many well-established implementations to choose from, such as X!Tandem,4 Mascot,5 and SEQUEST.6 For license-independent scalability, we have chosen to use X!Tandem for the public release of GAPP, but have also implemented GAPP systems based around Mascot for in-house use. The result of the primary scoring is a list of candidate peptides from the search database (in this case Ensembl), along with the probability that each peptide matches the search spectrum. The subsequent elements of the peptide identification pipeline aim to convert these probability scores into higher confidence observations, and identify any spectra that could not be assigned during primary scoring.

2.2. Secondary Scoring. One of the biggest problems with peptide database searching is that a single spectrum can be matched to multiple peptides within the 95% confidence level commonly considered. GAPP tackles this issue by grouping peptides according to gene product, such that the existence of a peptide is given added credibility if other peptides from the same protein have been observed in the same experiment. This is achieved using the advanced average peptide score (advanced APS) which we described in a previous publication.7 Briefly, the average peptide score (the sum of peptide probability scores divided by the number of peptides) is computed for each gene within which a peptide has been matched by X!Tandem. Gene products with average peptide scores below a given threshold are discarded, along with the peptide identifications linked to them. This threshold is determined for each experiment by performing a peptide database search against a copy of the search database in which all sequences are reversed. Any matches found during the reversed database search are necessarily not statistically significant, so adopting the maximum APS found in the reverse search as the threshold for gene-product observations in the forward search allows the assignation of peptides with a false-positive rate very close to zero.

2.3. PTMs and Splice Variants. Many spectra (typically more than 75%) emerge from the secondary scoring without being assigned to a peptide. This is typical of proteomic MS/MS analysis and is generally caused by either poor spectral quality (due to contamination, experimental noise, etc.) or absence of the exact peptide sequence in the search database.
The latter is typically due to PTMs, splice variants, and single amino acid polymorphisms (SAPs). The GAPP pipeline employs a second stage of peptide assignment with the aim of capturing good quality unassigned spectra. The obvious solution is to search all unassigned spectra against a database containing the peptides produced by all possible splice variants, mutations, and PTMs, but the immense increase in search space would render the search intractable. GAPP therefore employs a two-pronged approach to substantially reduce the search space. First, the ‘Pexon’ search database, consisting of all possible peptides from the exons of each gene (hence the name ‘Pexon’), is filtered to leave only peptides originating from gene products that the secondary scoring algorithm indicates are present in the sample. This is similar in logic to the error-tolerant search approach available in Mascot.8 Second, only spectra which meet a certain quality threshold are considered. As the X!Tandem algorithm considers the top n ions in each spectrum (set to 50 in the GAPP system), the quality threshold used here is based on comparing the normalized intensity of the top 50 ions in each spectrum. If the mean normalized intensity of a candidate unidentified spectrum exceeds the mean of the mean intensities of all successfully identified spectra, then it is considered for further processing using the ProbID algorithm9 and the Pexon-derived search space. As the search space has been substantially reduced, a large number of variable PTMs may be searched for without incurring punitively long search times. Currently GAPP considers all modifications from the UniMod database10 that are not marked as “Hidden”.

ProbID is a Bayesian-based peptide identification system that reports a probability that each peptide has been correctly matched alongside the number of ions that were accounted for by the match. As the probabilities generated are inflated due to the small search space being considered, these factors are further processed by GAPP by combining them using a set of Gaussian distributions, calculated with simulated data, such that the accuracy of identifications and location of any PTM may be more accurately assessed.11

Table 1. Comparison of GAPP Performance to the Performances of Six Other Peptide Identification Systems^a

system                     true positive spectrum identifications
GAPP                       291
Peptide Prophet            279
Spectrum Mill (tag > 1)    254
SEQUEST                    231
X!Tandem                   191
Mascot                     123
Sonar                      100

^a All results except those for GAPP are taken from the comparison by Kapp et al.12,13 in which correct identifications are those ranked first in each respective results list, with the FP rate set at 0.1%, calculated separately for each algorithm.
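The two filtering steps in Sections 2.2 and 2.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the GAPP implementation: it uses the plain average peptide score as defined in the text (the refinements of the advanced APS are in ref 7 and are not reproduced), it assumes that "normalized intensity" means intensity divided by the most intense peak in the spectrum, and all function names and data layouts are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_peptide_scores(matches):
    """matches: iterable of (gene_id, peptide, probability) tuples.

    Returns {gene_id: sum of peptide probabilities / number of peptides}.
    """
    by_gene = defaultdict(list)
    for gene_id, _peptide, probability in matches:
        by_gene[gene_id].append(probability)
    return {gene: sum(p) / len(p) for gene, p in by_gene.items()}


def accepted_gene_products(forward_matches, reversed_matches):
    """Keep gene products whose APS is not below the best reversed-database APS.

    The threshold is the maximum APS seen in the reversed-database search;
    if that search produced no matches, the threshold defaults to 0.0
    (an assumption, the text does not cover this case).
    """
    forward = average_peptide_scores(forward_matches)
    decoy = average_peptide_scores(reversed_matches)
    threshold = max(decoy.values(), default=0.0)
    kept = {gene for gene, aps in forward.items() if aps >= threshold}
    return kept, threshold


def mean_top_n_intensity(intensities, n=50):
    """Mean of the n most intense peaks after dividing by the base peak.

    Base-peak normalization is an assumption; the text only says
    'normalized intensity'.
    """
    top = sorted(intensities, reverse=True)[:n]
    if not top or top[0] <= 0:
        return 0.0
    return sum(i / top[0] for i in top) / len(top)


def spectra_for_second_stage(unassigned, identified, n=50):
    """Select unassigned spectra whose quality exceeds the identified average.

    unassigned, identified: {spectrum_id: [peak intensity, ...]}.
    """
    if not identified:
        return []
    cutoff = mean(mean_top_n_intensity(s, n) for s in identified.values())
    return [sid for sid, s in unassigned.items()
            if mean_top_n_intensity(s, n) > cutoff]
```

For example, with forward matches giving GENE_A an APS of 0.875 and GENE_B an APS of 0.375, and a best reversed-database APS of 0.5, only GENE_A survives. The spectra returned by `spectra_for_second_stage` would then be the candidates passed on to the ProbID search against the filtered Pexon database.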

Results

To demonstrate the efficacy of the GAPP system, the pipeline has been used to analyze a subset of the data obtained by the HUPO Human Plasma Proteome project, which has recently been the focus of a comparative analysis of peptide identification algorithms by Kapp et al.12 The dataset and list of manually confirmed peptide identifications were downloaded.13

Test Data. A total of 5675 spectra was provided12,13 from which 671 correct spectrum identifications had been manually verified, based on results from Mascot and SEQUEST. These represented 448 unique peptide sequences. Kapp et al. report two sets of results provided by the algorithms tested: total correct, top-ranked peptide identifications not constrained by false-positive (FP) rates and total correct, top-ranked hits constrained to a 0.1% FP rate.12 Peptide Prophet performed best in both cases, with 499 correctly identified spectra with unconstrained FP rate, reducing to 279 spectra at the 0.1% FP rate. The list of manually confirmed identifications13 at the 0.1% FP rate was loaded into a relational database table to allow straightforward comparison of these results with those from GAPP.

GAPP Peptide and Gene-Product Identifications. The 5675 spectra were processed using GAPP with the Ensembl protein database (Ensembl Core 36, NCBI 35, February 2006). The X!Tandem search parameters were set as for the comparator set, specifically two missed cleavages allowed, peptide mass tolerance set to ±3 Da, and fragment mass tolerance to ±0.5 Da. No fixed or variable modifications were set. GAPP identified 36 distinct gene products. The veracity of the peptide identifications comprising these was tested by comparison with the manually verified set described above.13 Table 1 shows the numbers of distinct spectra and peptides identified by GAPP’s primary scoring system, at its near 0% false-positive gene-product identification rate, compared to the numbers of spectra identified by the other algorithms considered by Kapp et al. at the 0.1% false-positive rate. It can be seen from these results that GAPP’s performance across this sample of the HPP dataset compares favorably with that of the other algorithms considered.


In addition to the 291 spectra correctly identified by GAPP’s primary search algorithm and also present in the manually verified set, four spectra were identified by the secondary search, and 13 could be confidently assigned to the gene products identified by the primary system, although these additional spectra did not match sequences included in the manually verified dataset. It could be argued that these may be false positives given that they do not appear in the manually verified reference set. However, the reference set was derived with the aid of Mascot and SEQUEST results, and it is more likely that these identifications were simply not suggested for manual consideration by these algorithms. In the case of those identified by the secondary scoring system, three of these contained PTMs, which were not searched for in the previous work. Furthermore, some sequences are likely to be present in the Ensembl and Pexon databases that were not in those used by Kapp et al. and hence could not have been identified. Finally, as each algorithm can identify slightly different sets of peptides, it is likely that when two algorithms are used in such an advisory capacity, another algorithm may well suggest further, equally correct, identifications. In terms of distinct peptides identified, GAPP found 225, including the 17 additions to the manually verified dataset. It may appear surprising that GAPP can identify a greater number of peptides than X!Tandem at similar FP rates. To understand the difference, it should be remembered that the reported X!Tandem results are for correct top-ranked peptide identifications at this FP rate. The additional peptide identifications found by GAPP are originally made by X!Tandem as well, but are not top-ranked. The advanced APS algorithm essentially lifts these lower ranked, but correct, identifications into consideration through the combination of gene grouping and thresholding.7
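The comparison against the manually verified reference set amounts to partitioning the pipeline's spectrum identifications into those confirmed by the reference and those absent from it. A minimal sketch with set operations is given below; the function name, the (spectrum, peptide) pair representation, and the toy identifiers are all illustrative assumptions, and the text itself states only that the reference list was loaded into a relational database table for the comparison.

```python
def compare_to_reference(identified, reference):
    """Partition spectrum identifications relative to a verified reference set.

    Both arguments are sets of (spectrum_id, peptide_sequence) pairs.
    Returns (confirmed, additional, missed):
      confirmed  - identifications present in both sets
      additional - identifications made by the pipeline but absent from the reference
      missed     - reference identifications the pipeline did not report
    """
    confirmed = identified & reference
    additional = identified - reference
    missed = reference - identified
    return confirmed, additional, missed


# Toy data, not taken from the HUPO dataset.
pipeline_ids = {("spec1", "AAAK"), ("spec2", "CCCR"), ("spec3", "DDDK")}
verified_ids = {("spec1", "AAAK"), ("spec2", "CCCR"), ("spec4", "EEEK")}
confirmed, additional, missed = compare_to_reference(pipeline_ids, verified_ids)
```

In the terms used above, the 291 spectra correspond to the `confirmed` partition and the 4 + 13 = 17 further spectra to the `additional` partition, which is why the latter need separate justification.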

Conclusions and Future Work

GAPP has been designed, implemented, and tested and is, at the time of writing, the only totally automated publicly available system for the identification of peptides from MS/MS data. The automated nature of the system and its ability to analyze data from distributed databases are key in keeping pace with the ever increasing throughput of proteomic analysis. Priorities for future work on GAPP include a more advanced database interface, to allow biologically relevant querying without recourse to SQL, and support for quantitative proteomic data from techniques such as iTRAQ.14,15 Support for other organisms is also planned, with a local installation for Arabidopsis having already been implemented and used in novel research.16

Acknowledgment. We thank GlaxoSmithKline and EPSRC for funding the initial development of GAPP through an EngD studentship, and BBSRC for funding the further development of GAPP into a public system. We also thank the following colleagues for invaluable scientific discussions regarding GAPP: Kieran Todd at GlaxoSmithKline, and Kathryn Lilley and Tom Dunkley at Cambridge University.

References

(1) Shadforth, I.; Crowther, D.; Bessant, C. Proteomics 2005, 5, 4082-4095.
(2) Rajasekar, A.; Wan, M.; Moore, R.; Schroeder, W.; Kremenek, G.; Jagatheesan, A.; Cowart, C.; Zhu, B.; Chen, S. Y.; Olschanowsky, R. Comput. Soc. India J. 2003, 33, 42-54.
(3) Jones, P.; Cote, R.; Martens, L.; Quinn, A.; Taylor, C.; Derache, W.; Henning, H.; Apweiler, R. Nucleic Acids Res. 2006, 34, D659-D663.
(4) Craig, R.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2003, 17, 2310-2316.
(5) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(6) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(7) Shadforth, I.; Dunkley, T.; Lilley, K. S.; Crowther, D.; Bessant, C. Rapid Commun. Mass Spectrom. 2005, 19, 3363-3368.
(8) Creasy, D. M.; Cottrell, J. S. Proteomics 2002, 2, 1426-1434.
(9) Zhang, N.; Aebersold, R.; Schwikowski, B. Proteomics 2002, 2, 1406-1412.
(10) Creasy, D. M.; Cottrell, J. S. Proteomics 2004, 4, 1534-1536.
(11) Shadforth, I. P. Ph.D. Thesis, Cranfield University, 2005, pp 106-115 (available at www.gapp.info).
(12) Kapp, E. A.; Schutz, F.; Connolly, L. M.; Chakel, J. A.; Meza, J. E.; Miller, C. A.; Fenyo, D.; Eng, J. K.; Adkins, J. N.; Omenn, G. S.; Simpson, R. J. Proteomics 2005, 5, 3475-3490.
(13) Human Plasma Proteome Dataset and Peptide Identifications downloaded from http://ludwig.edu.au/archive/, 2005.
(14) Ross, P. L.; Huang, Y. L. N.; Marchese, J. N.; Williamson, B.; Parker, K.; Hattan, S.; Khainovski, N.; Pillai, S.; Dey, S.; Daniels, S.; Purkayastha, S.; Juhasz, P.; Martin, S.; Bartlet-Jones, M.; He, F.; Jacobson, A.; Pappin, D. J. Mol. Cell. Proteomics 2004, 3, 1154-1169.
(15) Shadforth, I.; Dunkley, T.; Lilley, K. S.; Bessant, C. BMC Genomics 2005, 6, 145.
(16) Dunkley, T. P. J.; Hester, S.; Shadforth, I. P.; Runions, J.; Weimar, T.; Hanton, S. L.; Griffin, J. L.; Bessant, C.; Brandizzi, F.; Hawes, C.; Watson, R. B.; Dupree, P.; Lilley, K. S. Proc. Natl. Acad. Sci. U.S.A. 2006, 103, 6518-6523.
