Parallel Tandem: A Program for Parallel Processing of Tandem Mass Spectra Using PVM or MPI and X!Tandem
Dexter T. Duncan,† Robertson Craig,‡ and Andrew J. Link*,†
Department of Microbiology and Immunology, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, and Beavis Informatics Ltd., Winnipeg, MB, Canada, R3B 1G7
Received March 9, 2005
Abstract: A method for the rapid correlation of tandem mass spectra to a list of protein sequences in a database has been developed. The combination of the fast and accurate computational search algorithm, X!Tandem, and a Linux cluster parallel computing environment with PVM or MPI significantly reduces the time required to correlate tandem mass spectra to protein sequences in a database. A file of tandem mass spectra is divided into a specified number of files, each containing an equal number of the spectra from the larger file. These files are then searched in parallel against a protein sequence database. The parallel output files are collated into one file for viewing through a web interface. Thousands of spectra can be searched in an accurate, practical, and time-effective manner. The source code for running Parallel Tandem utilizing either PVM or MPI on the Linux operating system is available from http://www.thegpm.org. This source code is made available under the Artistic License from the authors.
Keywords: proteomics • mass spectrometry • tandem MS/MS • database search • protein identification • X!Tandem • PVM • MPI • post-translational modifications
Journal of Proteome Research 2005, 4, 1842-1847. Published on Web 08/09/2005.
* To whom correspondence should be addressed. Tel: (615) 343-6823. Fax: (615) 343-7392. E-mail: [email protected].
† Vanderbilt University School of Medicine. ‡ Beavis Informatics Ltd.
Introduction
Proteomics increasingly needs faster methods for identifying proteins and post-translational modifications. Advanced chromatography methods coupled with tandem mass spectrometry (MS/MS) can produce tens of thousands of spectra in a single experiment (1-3). High-performance, high-throughput computer algorithms are therefore required to correlate tandem mass spectra with a protein sequence database. To interpret the acquired spectra, several computer algorithms and statistical approaches have been developed to compare the unedited experimental MS/MS spectra to the theoretical values in a protein sequence database (4-8). The combination of chromatography, tandem mass spectrometry, and computational database searching allows thousands of proteins to be identified without the need for initial separation of individual proteins (3). This “shotgun” approach has reduced the amount of time necessary to identify proteins in complex mixtures. However, the increasing amount of data being generated in protein identification experiments, as well as the application of these methods to post-translational modification research, demands an even greater reduction in analysis time.
Post-translational modification (PTM) identification is necessary for the comprehensive understanding of protein function (9). However, most modified amino acids are not represented in a protein database. Computer algorithms that interpret MS/MS spectra must therefore generate “in silico” a database of modified protein sequences by theoretically adding specified post-translational modifications to sequences in a standard database (10). The search time required for MS/MS spectra with these modifications increases dramatically with each additional modified residue. The otherwise efficient “shotgun” approach is a slow process when applied to post-translational modifications (9). A two- or three-modification search of several thousand MS/MS spectra against a large protein sequence database on a single microprocessor can take several days with current search algorithms.
Parallel computer message passing libraries, such as the Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI), operate in a networked computer cluster environment and boost the performance of computer search engines severalfold (11,12). When applied to computational analysis of MS/MS spectra, this increase in performance results in a scalable reduction in the time needed to complete spectrum-to-protein matching (13). For example, if one computer analyzes 25 000 spectra against a database of 50 000 protein sequences in 100 h, then 20 computers operating in parallel (assuming 100% scalability) can perform the same analysis in one-twentieth of the time, or 5 h.
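The arithmetic behind this example can be written out as a short calculation. The following is a minimal sketch and not part of Parallel Tandem: the function name `parallel_hours` and its `scalability` parameter are illustrative assumptions, with `scalability` standing for the fraction of the ideal speedup that is actually achieved.

```python
def parallel_hours(serial_hours: float, workers: int, scalability: float = 1.0) -> float:
    """Estimate wall-clock hours for a search split across `workers` computers.

    Hypothetical helper: `scalability` is the fraction of the ideal
    `workers`-fold speedup that is achieved (1.0 means 100% scalable).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not 0.0 < scalability <= 1.0:
        raise ValueError("scalability must be in (0, 1]")
    # Effective speedup is the worker count scaled by the achieved fraction.
    speedup = workers * scalability
    return serial_hours / speedup


if __name__ == "__main__":
    # The example from the text: 100 h on one computer, 20 computers, 100% scalability.
    print(parallel_hours(100, 20))  # 5.0
```

Passing a `scalability` below 1.0 gives a rough estimate for clusters that fall short of ideal scaling; the model ignores any step that cannot be parallelized.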
Generally, PVM is used with a heterogeneous population of computer architectures and operating systems, whereas MPI is the choice for a performance boost when message passing is extensive and a large homogeneous cluster of computers is utilized (14). Programs and clusters utilizing MPI tend to be easier to implement and maintain. However, the ultimate decision to use either PVM or MPI tends to be a matter of personal choice and experience.
X!Tandem is an open-source application that computationally compares MS/MS spectra to amino acid sequences in a protein sequence database (7,15). It was designed to perform these analyses with high speed and accuracy on a single computer in a shared memory environment (7). X!Tandem initially searches a complete protein sequence database for matches to MS/MS spectra by assuming consensus proteolytic digestion. This method is known as a nonrefined search. The application then conducts a refined search of the “candidate proteins” identified in the nonrefined search, but this time allows for missed proteolytic cleavages and post-translational amino acid modifications (16). X!Tandem uses survival functions and expectation values to statistically estimate the likelihood that a particular peptide sequence and protein represent a stochastic match to the acquired MS/MS data (15). In particular, a protein’s “expectation value” statistically measures the combined chance of stochastic peptide matches to a protein in a protein database.
In this paper, we demonstrate the computer algorithm “Parallel Tandem”, which utilizes PVM or MPI with X!Tandem to significantly reduce the time necessary for computational analysis of MS/MS spectra against a protein sequence database. The speed of this parallel driver makes it practical to perform searches for multiple, independent post-translational amino acid modifications. The smaller database file of proteins generated by a parallel, nonrefined search can be used for subsequent refined searches with X!Tandem or for searches by other MS/MS search algorithms.
technical notes. Duncan et al. 10.1021/pr050058i CCC: $30.25. © 2005 American Chemical Society.
Materials and Methods
A White Box Linux cluster, consisting of a front-end dual-processor node and 20 back-end dual-processor nodes, was used for the development and testing of the PVM and MPI versions of Parallel Tandem (17). The front-end node was a 2 GHz Intel dual-Xeon processor computer with 1 GB of RAM. The 20 back-end nodes were 2.4 GHz AMD dual-Opteron processor nodes with 1 GB of RAM each. For experiments running X!Tandem sequentially, the application was run on one processor of a 2.4 GHz AMD dual-Opteron processor node with 1 GB of RAM. PVM version 3.4.5 and X!Tandem version 2004.11.15 were installed on the front-end node (18,19). For MPI experiments, MPICH version 1.2.5.2 was installed (20). To view benchmark results, the Global Proteome Machine (GPM) web interface was installed on a 1 GHz Intel dual-P3 processor computer with 1 GB of RAM. The parallel driver for X!Tandem and the utility programs were written in the C programming language and compiled with GCC 3.2 on the front-end node (21). The source code for applications developed in this project can be obtained at http://www.thegpm.org, along with documentation describing installation and operation.
Experimental MS/MS spectral data were generated employing DALPC techniques (1). A FASTA database of yeast Saccharomyces cerevisiae proteins was obtained from the Saccharomyces Genome Database (SGD), and a FASTA database of human proteins from the International Protein Index was obtained from the European Bioinformatics Institute (EMBL-EBI) (22,23). The directories containing the acquired spectra files and the FASTA database files were mounted across all computational nodes. All benchmarks were performed utilizing the PVM version of Parallel Tandem and the default input parameters of X!Tandem, unless otherwise specified (7).
Results and Discussion
This study investigated the ability to launch X!Tandem in parallel on several computers utilizing the PVM or MPI computing environments in order to significantly reduce the time necessary to match MS/MS spectra against proteins in a database. Multiple utility programs were developed to facilitate this parallelization of X!Tandem (Figure 1).
Figure 1. Flowchart of Parallel Tandem. A file of MS/MS spectra is divided into a specified number of smaller MS/MS files using “catfiledivide”. Next, “tandem_pvm” or “tandem_mpi” searches these files against a database of protein sequences, producing XML output files. The output files are parsed by either “candidates” or “dbasebuild” in order to build a sub-database of candidate proteins for further searches, either in parallel with “tandem_pvm” or “tandem_mpi” or with any MS/MS search engine. The program “xmlcat” collates the results of the XML output files of the second parallel search. Finally, the concatenated parallel output XML file is run through X!Tandem for the computation of the protein expectation values. “autotandem_pvm” or “autotandem_mpi” functions as a wrapper to automate the applications on the right side of the flowchart. Bold text represents computer programs.
First, the program “catfiledivide” divides a file of MS/MS spectral data (DTA, PKL, or MGF file formats are acceptable) into a specified number of smaller files depending on the number of parallel processors. Each file contains an equal number of spectra. Second, the program “tandem_pvm” or “tandem_mpi” launches X!Tandem in parallel for a nonrefined search of each of these smaller files. Upon completion, the program “candidates” extracts the candidate proteins from the individual XML output files and builds a sub-database of those proteins identified in the parallel, nonrefined search. Additionally, “candidates” retrieves the MS/MS spectra files that identified each candidate protein. One MS/MS spectrum with the highest peptide expectation value for each distinct candidate protein is then appended to each file generated by the program “catfiledivide”. This step is necessary to ensure that each candidate protein is selected for refinement on each parallel processor. Next, a refined search of the experimental spectra is performed in parallel using the candidate protein sub-database and “tandem_pvm” or “tandem_mpi”. Upon completion, the program “xmlcat” concatenates the results of the individual XML output files into one XML output file. Redundant information is removed from the file, and candidate peptides are sorted into groups according to the identified protein. The “xmlcat” program sorts these groups in descending order with respect to the number of MS/MS spectra assigned per protein, while retaining the X!Tandem output file format. To compute the protein expectation values, X!Tandem, running on a single processor (uniprocessor) or in a shared memory environment, is launched on the single XML output file generated by “xmlcat”. This step is necessary to replicate the protein expectation value computation method provided by the original X!Tandem (7,15,16). One of the features of X!Tandem is the ability to use the XML output files as inputs
to the program. For nonrefined protein expectation scores, the concatenated parallel output XML file is run against the original database. For the refined search, the concatenated parallel XML output file is run against the candidate protein sub-database. Because the XML output files generated by “tandem_pvm” or “tandem_mpi” only contain output data for MS/MS spectra that meet the requirements of the X!Tandem input parameters, this final step is significantly faster than sequentially running X!Tandem on all the acquired MS/MS data. Finally, the program “autotandem_pvm” or “autotandem_mpi” acts as a wrapper for these programs to automate the execution of the entire process.
Figure 2. Benchmarks of nonrefined PTM searches. (A) Benchmarks of nonrefined searches, including six modifications, for 24 000 spectra against a 6400-entry protein database, comparing uniprocessor (sequential) and multiple multi-processor (“tandem_pvm”) searches. (B) Benchmarks of nonrefined searches, including six modifications, for 25 000 spectra against a 50 000-entry protein database, comparing uniprocessor (sequential) and multiple multi-processor (“tandem_pvm”) searches. (C) Benchmarks of nonrefined searches, including six modifications, for 24 000 spectra against a 6400-entry protein database, comparing uniprocessor (sequential) and multiple multi-processor (“tandem_pvm”) searches with the additional step of running the concatenated parallel output XML file through X!Tandem on one processor. (D) Benchmarks of nonrefined searches, including six modifications, for 25 000 spectra against a 50 000-entry protein database, comparing uniprocessor (sequential) and multiple multi-processor (“tandem_pvm”) searches with the additional step of running the concatenated output XML file through X!Tandem on one processor.
To evaluate the performance of Parallel Tandem in nonrefined searches, timed trials were performed in which the program “tandem_pvm” searched for peptides with any of three potential modifications on six different amino acids (acetylation at lysine (K) and arginine (R); phosphorylation at tyrosine (Y), threonine (T), and serine (S); and methylation at histidine (H)). A sequential X!Tandem search of 24 000 spectra against a 6400-entry yeast protein database was completed in 150 min as compared to 4 min with “tandem_pvm” running on 40 processors (Figure 2A). Another sequential search of 25 000 spectra against a 50 000-entry human protein database was completed in 18 h as compared to 0.5 h with “tandem_pvm” running on 40 processors (Figure 2B). Hence, the improvement in performance of nonrefined searches with “tandem_pvm” over a
sequential search on a single computer scaled closely to the number of processors used, e.g., 18-fold faster with 20 processors and 36-fold faster with 40 processors. These results show that “tandem_pvm” is 90% scalable for the parallel portion of a nonrefined search. Finally, the derivation of the protein expectation values by running the concatenated parallel output XML file through X!Tandem adds a nonscalable compute time to these parallel benchmarks (Figure 2C,D). Nonetheless, Parallel Tandem significantly reduces the time required for post-translational modification searches of MS/MS spectra against a protein database.
To evaluate the performance of Parallel Tandem in refined searches, experiments were performed in which the program “autotandem_pvm” searched MS/MS spectra for any of three potential modifications. A sequential search of 24 000 spectra against the 6400-entry yeast protein database on a single computer required 24 h (Figure 3A). An “autotandem_pvm” search of the 24 000 spectra against the yeast protein database was completed in 4 h on 10 processors. Another sequential search of 25 000 spectra against a 50 000-entry protein database with six modifications was completed in 200 h on a single processor compared to 21 h on 40 processors (Figure 3B). Overall, the timed benchmarks showed a marked performance improvement with Parallel Tandem.
Figure 3. Benchmarks of refined PTM searches. (A) Benchmarks of a fully automated refined search using “autotandem_pvm”, including three modifications, for 24 000 spectra against a 6400-entry protein database. (B) Benchmarks of a fully automated refined search using “autotandem_pvm”, including six modifications, for 25 000 spectra against a 50 000-entry protein database.
Figure 4. Comparison of nonrefined outputs from X!Tandem and Parallel Tandem. These results represent a small portion of peptide and protein expectation values from a search of 24 000 spectra against a 6400-protein yeast database. (A) Nonrefined peptide expectation value outputs. (B) Nonrefined protein expectation value outputs. The minor differences in peptide and protein expectation values were due to running the concatenated parallel output XML file through X!Tandem to compute the protein expectation values. The differences in the spectra numbers in the column “spectrum” are a result of the original file of spectra being divided into several smaller files by “catfiledivide”.
To evaluate the outputs from X!Tandem and Parallel Tandem, the peptide and protein expectation scores were compared. A nonrefined search illustrated that X!Tandem running on a single computer and Parallel Tandem on a cluster of computers had minor differences in peptide and protein expectation scores (Figure 4A,B). Comparison of the performance parameters in the output files generated by sequential X!Tandem and Parallel Tandem revealed a small difference in
the number of unique spectra assigned in the nonrefined search (24). This accounted for the small differences in peptide and protein expectation values. When comparing the peptide and protein expectation scores from the refined searches generated by X!Tandem and Parallel Tandem, minor differences in the peptide and protein expectation values were also observed (Figure 5A,B). Once again, examination of the performance parameters in the output files generated by sequential X!Tandem and Parallel Tandem revealed a small difference in the number of unique spectra assigned, as well as a difference in protein input models for the refined search step. We expect that newer versions of the applications will eventually resolve these minor discrepancies. Overall, the peptide and protein expectation values are in excellent agreement between X!Tandem and Parallel Tandem. Comparisons between the PVM and MPI versions of Parallel Tandem produced identical nonrefined and refined search outputs.
Figure 5. Comparison of refined outputs from X!Tandem and Parallel Tandem. These results represent a small portion of the scores from a search of 24 000 spectra against a 6400-protein yeast database. (A) Refined peptide expectation value outputs. (B) Refined protein expectation value outputs. The minor difference between X!Tandem and Parallel Tandem peptide and protein expectation values is due to minor differences in the protein input models used for refinement and a difference in the number of unique spectra assigned. The differences in the spectra numbers in the column “spectrum” are a result of the original file of spectra being divided into several smaller files by “catfiledivide”.
“tandem_pvm” and “tandem_mpi” launch an instance of X!Tandem on each processor in their respective parallel environments. Each instance of X!Tandem, executing in parallel, writes its results directly to a mounted working directory on the front-end node or a file server. Therefore, there is very little message passing in either parallel method, and hence neither PVM nor MPI should demonstrate a performance advantage over the other. Timed benchmark comparisons of the PVM and MPI parallel methods confirmed this hypothesis.
Parallel Tandem can also be used to generate sub-databases of candidate proteins that may be used by any MS/MS search algorithm. Upon the completion of “catfiledivide” followed by “tandem_pvm” or “tandem_mpi”, the program “dbasebuild” can be used to create a candidate protein sub-database (Figure 1). Because the candidate protein sub-database represents a smaller number of proteins from a larger database, the search time for other algorithms to analyze the original MS/MS spectra is significantly reduced.
Because X!Tandem is an open-source application, it is expected that various versions of the application will be developed. To update Parallel Tandem with a newer version of X!Tandem, all that is required is to download and compile the latest version of X!Tandem into the proper directory on the front-end compute node.
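The splitting step performed by “catfiledivide” at the start of the workflow can be illustrated with a short sketch. This is not the authors' C implementation. It assumes MGF input, in which each spectrum is delimited by `BEGIN IONS` and `END IONS` lines, and it assumes that spectra are assigned to consecutive chunks whose sizes differ by at most one when the total is not evenly divisible. The function names `read_mgf_spectra` and `divide_spectra` are hypothetical.

```python
from typing import List


def read_mgf_spectra(text: str) -> List[str]:
    """Return each BEGIN IONS ... END IONS block of MGF text as one string."""
    spectra: List[str] = []
    current = None  # lines of the spectrum being read, or None between spectra
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            current = [stripped]
        elif stripped == "END IONS" and current is not None:
            current.append(stripped)
            spectra.append("\n".join(current))
            current = None
        elif current is not None and stripped:
            current.append(stripped)
    return spectra


def divide_spectra(spectra: List[str], parts: int) -> List[List[str]]:
    """Split spectra into `parts` consecutive chunks of near-equal size.

    Assumption for illustration: the first `len(spectra) % parts` chunks
    receive one extra spectrum, so chunk sizes differ by at most one.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(spectra), parts)
    chunks: List[List[str]] = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(spectra[start:start + size])
        start += size
    return chunks
```

Each chunk would then be written to its own file and handed to one parallel X!Tandem instance. In the workflow described above, the number of parts is chosen according to the number of parallel processors.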
Conclusions
The time required to correlate large numbers of tandem mass spectra to proteins in a database is significantly reduced by performing searches in parallel with the combination of PVM or MPI and X!Tandem. Although X!Tandem, as a multithreaded algorithm, has inherent scalability, Parallel Tandem provides an alternative to costly shared memory, multiprocessor systems by taking advantage of a distributed parallel computing environment. Furthermore, searches of an entire database that include post-translational modifications are time-efficient and practical with this method. The method is simple to implement in a highly cost-effective parallel-computing environment.
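The size of the reduction can be recomputed from the benchmark times reported in Results and Discussion. The sketch below is illustrative only: the helper name `speedup_and_efficiency` is an assumption, and the input times and processor counts are those quoted in the text for the nonrefined (“tandem_pvm”) and refined (“autotandem_pvm”) searches.

```python
from typing import Tuple


def speedup_and_efficiency(serial_time: float, parallel_time: float,
                           processors: int) -> Tuple[float, float]:
    """Return (speedup, efficiency), where efficiency = speedup / processors."""
    speedup = serial_time / parallel_time
    return speedup, speedup / processors


# (label, sequential time, parallel time, processors), as reported in the text.
BENCHMARKS = [
    ("nonrefined, yeast database (min)", 150.0, 4.0, 40),
    ("nonrefined, human database (h)", 18.0, 0.5, 40),
    ("refined, yeast database (h)", 24.0, 4.0, 10),
    ("refined, 50 000-entry database (h)", 200.0, 21.0, 40),
]

if __name__ == "__main__":
    for label, serial, parallel, n in BENCHMARKS:
        speedup, efficiency = speedup_and_efficiency(serial, parallel, n)
        print(f"{label}: {speedup:.1f}-fold on {n} processors "
              f"({efficiency:.0%} of ideal)")
```

For the nonrefined search against the human database, this gives a 36-fold speedup on 40 processors, which is the 90% scalability quoted in the text. The refined figures come from the fully automated “autotandem_pvm” runs, which include the single-processor steps of the workflow.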
Acknowledgment. We thank Jill McAfee for numerous discussions during the development of these applications. We thank Elizabeth M. Link for comments during the preparation of this manuscript. We thank the entire staff at Vanderbilt University’s Advanced Computing Center for Research and Education (ACCRE). The project was supported by NIH grants ES11993 and GM64779. This work was conducted extensively using the resources of ACCRE. In addition, this project was funded in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN266200400079C/N01-AI-40079. D.T.D. is supported by NIH grant ES11993. A.J.L. is supported by NIH grants GM64779, HL68744, ES11993, and CA098131.
References
(1) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., 3rd Nat. Biotechnol. 1999, 17, 676-682.
(2) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd Nat. Biotechnol. 2001, 19, 242-247.
(3) Cantin, G. T.; Yates, J. R., 3rd J. Chromatogr. A 2004, 1053, 7-14.
(4) Eng, J. K.; McCormack, A. L.; Yates, J. R., 3rd J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(5) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(6) Field, H. I.; Fenyo, D.; Beavis, R. C. Proteomics 2002, 2, 36-47.
(7) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466-1467.
(8) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. J. Proteome Res. 2004, 3, 958-964.
(9) Lu, B.; Chen, T. Bioinformatics 2003, 19 Suppl 2, II113-II121.
(10) Yates, J. R., 3rd; Eng, J. K.; McCormack, A. L.; Schieltz, D. Anal. Chem. 1995, 67, 1426-1436.
(11) Geist, G. A.; Beguelin, A. L.; Dongarra, J. J.; Jiang, W.; Manchek, R. J.; Sunderam, V. S. PVM: Parallel Virtual Machine - A Users Guide and Tutorial for Network Parallel Computing; MIT Press: Cambridge, MA, 1994.
(12) Snir, M.; Otto, S.; Huss-Lederman, S.; Walker, D.; Dongarra, J. MPI - The Complete Reference, 2nd ed.; MIT Press: Cambridge, MA, 1998.
(13) Sadygov, R. G.; Eng, J.; Durr, E.; Saraf, A.; McDonald, H.; MacCoss, M. J.; Yates, J. R., 3rd J. Proteome Res. 2002, 1, 211-215.
(14) Geist, G. A.; Kohl, J. A.; Papadopoulos, P. M. Calculateurs Paralleles 1996, 8, 137-150.
(15) Fenyo, D.; Beavis, R. C. Anal. Chem. 2003, 75, 768-774.
(16) Craig, R.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2003, 17, 2310-2316.
(17) http://www.accre.vanderbilt.edu/.
(18) http://www.netlib.org/pvm3/.
(19) http://www.thegpm.org/.
(20) http://www.mpi-forum.org/.
(21) http://gcc.gnu.org/.
(22) Balakrishnan, R.; Christie, K. R.; Costanzo, M. C.; Dolinski, K.; Dwight, S. S.; Engel, S. R.; Fisk, D. G.; Hirschman, J. E.; Hong, E. L.; Nash, R.; Oughtred, R.; Skrzypek, M.; Theesfeld, C. L.; Binkley, G.; Lane, C.; Schroeder, M.; Sethuraman, A.; Dong, S.; Weng, S.; Miyasato, S.; Andrada, R.; Botstein, D.; Cherry, J. M. ftp://ftp.yeastgenome.org/yeast/.
(23) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. Proteomics 2004, 4, 1985-1988.
(24) http://thegpm.org/TANDEM/api/.
PR050058I
Journal of Proteome Research • Vol. 4, No. 5, 2005 1847