Editorial

Editor-in-Chief
William S. Hancock
Barnett Institute and Department of Chemistry, Northeastern University, Boston, MA 02115; 617-373-4881; fax 617-373-2855; [email protected]

Associate Editors
Joshua LaBaer, Harvard Medical School; György Marko-Varga, AstraZeneca and Lund University; Martin McIntosh, Fred Hutchinson Cancer Research Center

Consulting Editor
Jeremy K. Nicholson, Imperial College London

Editorial Advisory Board
Ruedi H. Aebersold, ETH Hönggerberg; Leigh Anderson, Plasma Proteome Institute; Rolf Apweiler, European Bioinformatics Institute; Ronald Beavis, Manitoba Centre for Proteomics; John J. M. Bergeron, McGill University; Rainer Bischoff, University of Groningen; Richard Caprioli, Vanderbilt University School of Medicine; R. Graham Cooks, Purdue University; Thomas E. Fehniger, AstraZeneca; Catherine Fenselau, University of Maryland; Daniel Figeys, University of Ottawa; Sam Hanash, Fred Hutchinson Cancer Research Center; Stanley Hefta, Bristol-Myers Squibb; Denis Hochstrasser, University of Geneva; Michael J. Hubbard, University of Melbourne; Donald F. Hunt, University of Virginia; Barry L. Karger, Northeastern University; Joachim Klose, Charité-University Medicine Berlin; Matthias Mann, Max Planck Institute of Biochemistry; David Muddiman, North Carolina State University; Robert F. Murphy, Carnegie Mellon University; Gilbert S. Omenn, University of Michigan; Akhilesh Pandey, Johns Hopkins University; Aran Paulus, Bio-Rad Laboratories; Jasna Peter-Katalinić, University of Muenster; Peipei Ping, University of California, Los Angeles; Henry Rodriguez, National Cancer Institute; Michael Snyder, Yale University; Clifford H. Spiegelman, Texas A&M University; Ruth VanBogelen, Pfizer Global Research & Development; Timothy D. Veenstra, SAIC-Frederick, National Cancer Institute; Scot R. Weinberger, GenNext Technologies; Susan T. Weintraub, University of Texas Health Science Center; John R. Yates, III, The Scripps Research Institute
© 2007 American Chemical Society
Open Access to Proteomics Data: A Valuable Resource for Biology and Medicine
Changing technologies are important drivers of proteomics. Continued improvements in mass accuracy, resolution, fragmentation methods, and throughput all have significant impacts on experimental design and outcome. New algorithms and computational approaches are also changing the landscape with improved database search engines and tools for data reduction. This situation raises various issues for the development of analytical software for the valid identification and quantification of proteins and their modifications. For these and several other reasons, it is in the interest of the field that proteomics data sets be made publicly accessible upon publication of manuscripts.

Although most proteomics studies are targeted, at least in the sense that they focus on identifying changes in protein levels or the protein complement of a cell or tissue, these studies harbor additional information beyond the scope of the original experiment. For example, the data could reveal a novel tissue localization or modification of a particular protein, or a new splice variant. From this perspective, public access to proteomics data sets enables their broader use in other studies and allows them to serve as a resource for genome annotation. Considerable additional value exists in the analysis of data across projects, species, and tissues, and even for the same sample across laboratories and analytical platforms.

Under the best of circumstances, current major search engines will find only those modifications specified in the search parameters and only those proteins included in the database. In other words, they cannot find what they are not told to look for. This leaves considerable room for better interpretation of data sets as search engines, databases, and related data-analysis tools improve.
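The "cannot find what they are not told to look for" limitation can be made concrete with a toy example. The sketch below is not a real search engine; the sequences, the tiny database, and the single-modification matching are all invented for illustration. It shows how a mass-based lookup silently misses a phosphorylated peptide unless phosphorylation appears in the search parameters.

```python
# Toy illustration of search-parameter blindness. Monoisotopic residue
# masses (Da) for a few amino acids; values are the standard ones.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276}
WATER = 18.01056

def peptide_mass(seq, mod_shift=0.0):
    """Neutral monoisotopic mass of a peptide, plus an optional mod shift."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER + mod_shift

def search(observed_mass, database, allowed_mods, tol=0.01):
    """Return (sequence, mod_name) pairs whose mass matches the observation.

    Only modifications listed in `allowed_mods` are ever considered --
    anything else is invisible to the search.
    """
    hits = []
    for seq in database:
        for name, shift in allowed_mods.items():
            if abs(peptide_mass(seq, shift) - observed_mass) <= tol:
                hits.append((seq, name))
    return hits

database = ["GASP", "PASS", "GAGA"]
# The "sample" peptide GASP actually carries a phospho group (+79.96633 Da).
observed = peptide_mass("GASP", 79.96633)

# Search 1: phosphorylation not in the parameters -> no identification.
print(search(observed, database, {"unmodified": 0.0}))          # []
# Search 2: phosphorylation specified -> the peptide is found.
print(search(observed, database, {"unmodified": 0.0,
                                  "phospho": 79.96633}))        # [('GASP', 'phospho')]
```

Reinterpreting the same archived spectrum later, with a richer modification list or an updated sequence database, is exactly the kind of reanalysis that public data access makes possible.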
This situation is inherent to current proteomics studies: few labs have access to all data-processing software, and differences in algorithms can result in significant variations in the lists of proteins identified and in the modifications observed. A key point is that search results are reproducible only when identical databases, processing parameters, and software versions are used. Acquisition of data, even from the same sample, is variable at some level because of duty cycle, chemical background, and variations in instrument performance. This underscores that access to the data files is crucial for reproducing results. In addition, the generation of peak lists represents very efficient data compression, but key information can be lost. The final results of a proteomics experiment depend not only on the search tools but also on the data-reduction software and its parameters; this suggests that raw data sets may be of particular value in certain cases.

Some public resources and commercial tools rely on access to the raw spectra or peak lists. For example, PeptideAtlas (www.peptideatlas.org) reprocesses all data through the Trans-Proteomic Pipeline. The Global Proteome Machine (www.thegpm.org), the Computational Portal and Analysis System (CPAS; https://proteomics.fhcrc.org/CPAS), and the commercial Scaffold software (www.proteomesoftware.com) reanalyze data sets using X!Tandem and/or other search engines. These resources also provide a range of tools for examining the data. Although two of the open-source data analysis and management systems (CPAS and the Proteome Research Information Management Environment [known as PRIME]; https://www.prime-sdms.org/prime/index.htm) can provide public access to data, the use of data management systems for high-traffic downloads can be inefficient and can interfere with function.
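The lossiness of peak-list generation can also be sketched in a few lines. The profile points and thresholds below are invented, and real data-reduction software is far more sophisticated, but the sketch shows the essential point: once a centroiding threshold discards a weak signal, no downstream search engine can recover it, which is why the raw spectra retain value.

```python
# Sketch of why peak-list generation is lossy data compression: a profile
# spectrum is reduced to centroids (local maxima above a noise threshold),
# and anything below the threshold is discarded for good.

def to_peak_list(profile, noise_threshold):
    """Reduce (m/z, intensity) profile points to local-maximum centroids."""
    peaks = []
    for i in range(1, len(profile) - 1):
        mz, inten = profile[i]
        if (inten >= noise_threshold
                and inten > profile[i - 1][1]
                and inten >= profile[i + 1][1]):
            peaks.append((mz, inten))
    return peaks

profile = [
    (400.10, 120), (400.11, 900), (400.12, 130),   # strong peak
    (400.20, 40),  (400.21, 55),  (400.22, 35),    # weak peak near noise
    (400.30, 10),  (400.31, 12),  (400.32, 9),     # baseline noise
]

# A permissive threshold keeps both real peaks...
print(to_peak_list(profile, noise_threshold=50))    # two centroids
# ...a stricter one silently drops the weak peak.
print(to_peak_list(profile, noise_threshold=100))   # one centroid
```

Two labs running different reduction parameters on the same raw file would thus hand different peak lists to the same search engine, which is one source of the cross-laboratory variation discussed above.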
Centralized databases (www.ebi.ac.uk/pride, www.hprd.org, www.thegpm.org, www.peptideatlas.org, and others) are valuable resources, and they typically support a subset of annotation data and peak lists but not raw data. A need still exists in proteomics for a trusted dissemination system. The open-source Tranche project (tranche.proteomecommons.org) represents one approach to
this problem, with a distributed system for sharing data (raw data, peak lists, search results, etc.) both securely and for public access. Currently, Tranche is the largest public repository for raw and processed proteomics data. Its use of a transactional system on a peer-to-peer infrastructure provides simplicity, security, efficiency, and scalability. Tranche operates like a data bank, with data being deposited and withdrawn; the difference is that three copies of each file are made and the files are load-balanced across many servers. Uploading and downloading are simple, and the system provides a unique hash that serves as a permanent address for the data set and guarantees file integrity and provenance. Tranche increases the stability of data access, greatly reduces the effort of sharing data, and allows bioinformaticians and other researchers facile access to large proteomics data sets.

The two key issues described here, realizing the broader value of proteomics data sets and ensuring the reproducibility of results, argue for data repositories and open access to data from publicly funded research. One of the challenges in proteomics is reproducing results across laboratories, instrument platforms, and data pipelines. The most rigorous solution is to open up access to the original data sets, whether peak lists or raw data. Public access to raw data files and peak lists is of particular value at this stage in the field of proteomics.

PHILIP ANDREWS and JAYSON FALKNER
National Center for Research Resources and National Resource for Proteomics and Pathways, University of Michigan

Journal of Proteome Research • Vol. 6, No. 6, 2007
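The hash-as-permanent-address idea behind a repository like Tranche can be sketched with standard content addressing. This is our own illustration of the general technique, not Tranche's actual protocol or API; the three-dict "servers" stand in for the replicated, load-balanced machines the editorial describes.

```python
# Minimal content-addressed store: the upload returns a cryptographic hash
# that serves as the file's permanent address, replicas guard against loss,
# and recomputing the hash on download verifies integrity.
import hashlib

class ContentStore:
    def __init__(self, n_replicas=3):
        # Each "server" is just a dict here; a real system spreads
        # replicas across independent machines.
        self.servers = [{} for _ in range(max(n_replicas, 1))]

    def upload(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        for server in self.servers:          # write every replica
            server[address] = data
        return address                       # the hash doubles as the address

    def download(self, address: str) -> bytes:
        for server in self.servers:          # any intact replica will do
            if address in server:
                data = server[address]
                # Integrity check: recompute the hash before trusting data.
                if hashlib.sha256(data).hexdigest() == address:
                    return data
        raise KeyError("no intact replica found for " + address)

store = ContentStore(n_replicas=3)
addr = store.upload(b"scan=1 mz=400.11 intensity=900\n")
assert store.download(addr) == b"scan=1 mz=400.11 intensity=900\n"

# Losing one replica does not lose the data.
store.servers[0].clear()
assert store.download(addr) == b"scan=1 mz=400.11 intensity=900\n"
```

Because the address is derived from the content itself, anyone holding the hash can verify that a downloaded data set is bit-for-bit the one originally deposited, which is what gives such a repository its integrity and provenance guarantees.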