Editorial pubs.acs.org/jpr
A First Step Toward Completion of a Genome-Wide Characterization of the Human Proteome
■
THE GOALS AND ORGANIZATION OF THE CHROMOSOME-CENTRIC HUMAN PROTEOME PROJECT (C-HPP)
This Special Issue Documents the Progress of the C-HPP
Almost 10 years after the human genome project (HGP) was completed in 2003, the characterization of proteins encoded by each gene located on a human chromosome remains elusive with respect to annotation, abundance, function, structural modification and post-transcriptionally modified isoforms. The Human Proteome Organization (HUPO) has recently established the Human Proteome Project (HPP) initiative under which the Chromosome-centric Human Proteome Project (C-HPP) and Biology/Disease-Driven HPP (B/D-HPP)1 were placed. The C-HPP, a 10-year project with a finite end point (2012.9−2022.9), has progressed steadily since it was officially launched in Geneva in 2011 and following publication of a landmark paper.2 A general collaborative consensus has been built up and resulted in the C-HPP guidelines3 with description of defined goals that have been further consolidated through numerous workshops and conference calls during the past years, under the aegis of the HUPO HPP.3

The idea of documenting progress in C-HPP in a special issue of this journal was born during the Geneva HUPO congress at a meeting of the PIs of the chromosome teams, with the goal of recording demonstrable progress by the initiative as well as outlining the planned impact of this project on biological research. This special issue will provide examples of the coming explosion of information in genomics, transcriptomics and proteomics. The papers published here cover a diverse set of topics including an industrial viewpoint, related technology development, examples of chromosome parts lists and databases with large scale protein identifications from samples of interest to the initiative, for example, placenta and liver.

The goals of the C-HPP are to map and annotate the entire human protein set encoded in each chromosome through the cooperation of the 25-member international consortium covering 24 chromosomes and mitochondria2 (see Figure 1). The C-HPP combines two types of project: hypothesis-driven initiatives (seeking the biology, disease proteins and pathways) and a multinational data-gathering project (constructing a knowledge base). For example, each chromosome group can select a chromosome based on specific disease interests (e.g., cancer, genetic disease) or targets (e.g., biomarker). The initiative does not change the typical proteomic experiment; rather, data sets are shared with all of the chromosome teams, with a resulting growth in aggregated knowledge and more information on “missing or poorly characterized proteins”. We believe that the C-HPP initiative will support ongoing evolution in the proteomics field, including the earlier adoption of information flowing from molecular biology advances such as ENCODE,5 integrated transcriptomic/proteomic measurements, evolution in research organizations (from individual laboratories to international research alliances), deep proteomic discovery experiments, top-down analyses of protein variants generated from alternative splicing variants (ASVs) and alternative splicing transcripts (ASTs), the public deposition of data sets, better statistical tools for assessing extremely large data sets, and the development of informatics systems and interfaces to allow the integration of genomic, proteomic and individual protein variation information.

© 2012 American Chemical Society
A Master Table of Chromosome-Specific Baseline Metrics for the C-HPP
To establish a starting point for the HPP and specifically for the C-HPP, the HPP executive committee and C-HPP investigators agreed on the five standard baseline metrics for each chromosome presented in Table 1. Ensembl v69 provides the number of protein-coding genes; neXtProt (gold), PeptideAtlas (canonical), and GPMdb (green) provide the numbers of confidently identified proteins from mass spectrometry studies, with special features for each; and the Human Protein Atlas provides the number of proteins for which polyclonal antibodies directed at one or two different epitopes along the protein sequence have been generated and used to characterize protein expression across 46 cell types, intracellular organelles, and selected cancer cells (with evidence at the medium or high levels). All of these data sets were updated during the October to December 2012 period. Leaders of each resource contributed and confirmed the figures in this Table (see Acknowledgments). Together we agreed on the thresholds for high confidence protein identifications that are used here.

neXtProt is a specialized resource for human proteins, developed at the Swiss Institute of Bioinformatics in concert with the emergence of the HPP; it evolved from the widely used SwissProt and UniProtKB resources through extensive curation of published data sets (www.neXtProt.org). neXtProt also presents “PE1 proteins”, figures about 5−10% larger than the mass-spec-only gold entries, reflecting proteins detected by additional experimental methods, such as Edman sequencing, X-ray, and immunohistochemistry.

PeptideAtlas (www.peptideatlas.org, Institute for Systems Biology, Seattle, WA) uses the Trans-Proteomic Pipeline and embedded statistical algorithms to reprocess raw spectra from many data sets for specific biofluids or organs, and for the entire human proteome. In this way, PeptideAtlas eliminates many of the variables, often proprietary, embedded in the mass spectrometry instruments.
A multitiered scheme of progressively more stringent criteria6 yields the canonical list with 1% FDR at the protein level and about 0.2% FDR at the peptide level.

The Global Proteome Machine database, GPMDB, based in Alberta, Canada, is an even larger database built from raw spectra from anonymized data sets using X!Tandem to give a series of progressively more stringent expectation values (www.gpmdb.org). These data resources are now linked through the ProteomeXchange (PX; www.proteomexchange.org), which ensures distribution of MS/MS and SRM data sets uploaded to PX or EBI/PRIDE (European Bioinformatics Institute) or PeptideAtlas.

The Human Protein Atlas (HPA) is entirely different, built through genomically guided epitope predictions from peptide sequences, polyclonal antibodies stimulated against these immunogenic epitopes, and immunohistochemistry with antibodies on arrays of 46 cell types (plus cell lines and cancer cells). One of the goals for the HPP is to compare results from mass spectrometry with results from immunohistochemistry, cross-validating the findings from each. Uhlen et al.7 have produced ten versions of the HPA, progressing from 600 proteins identified in version 1 to over 14000 proteins identified in the preview of version 10 at the 2012 HUPO meeting in Boston. HPA protein evidence scores were calculated based on the manual curation of Western blots, immunohistochemistry and immunofluorescence, and on whether two or more antibodies exist for the protein target (paired) or only one antibody (single). Table 1 shows that 10794 proteins have medium to high evidence scores.

The Master Table for mass spectrometry and protein capture results shows that approximately 65−70% of the expected proteins have been confidently identified. We estimated the proportion of missing proteins by simply subtracting the average of the three mass spectrometry database figures from the number of Ensembl genes. However, there are important complications in comparing Ensembl gene numbers and protein numbers, as discussed at the Boston Congress. The case of chromosome 6 is particularly instructive. There are 1787 genes, but only 1108 different proteins (entries in neXtProt), due to the high proportion of multigene proteins. For example, six different genes encode exactly the same protein IER3; eight genes encode the olfactory receptor 2W1. It will never be possible to use proteomics to disambiguate these gene-level identifications. Conversely, the high degree of polymorphism in Major Histocompatibility Complex (MHC) molecules leads to multiple protein entries for these gene products. There are 21 entries for the single HLA-A gene, 35 entries for HLA-B, 14 for HLA-C and 13 for HLA-DRB1. Such entries may be distinguishable at the protein level. GPMDB includes Supporting Information, for example, on Chr 6 MHC haplotypes, major haplotypes for other chromosomes, current patches to the genome, and unplaced contigs.

We are compelled to ask whether finding the remaining 30−35% of proteins is primarily a matter of lowering the limit of detection, allowing better capture of published data sets, or is primarily biological. First, we may be systematically missing proteins expressed significantly only in unusual organs and cell types. However, Uhlen et al.7 did not find much evidence of tissue-specific expression by immunohistochemistry, though they do report striking differences in level of expression. The same may be concluded from the high overlap among the 11 cell lines from the Mann Lab, for which 7000 to 10000 proteins were identified in each, with a grand total of about 11000 proteins.8−11 Promising sites include brain (with extreme histologic and functional heterogeneity), nasal epithelium/olfactory cortex, testis, and placenta. However, there are very few additional protein identifications among the large number of proteins detected in placenta by the Chr 13 team (this journal). A second explanation is developmental, suggesting we might expect to find many proteins expressed only in the embryo or fetus; but the cell lines from the Heck laboratory do not strongly support this prediction. Third, there may be families of proteins (olfactory receptors, cytokeratins, histocompatibility antigens) and classes of proteins (especially membrane-embedded proteins) that we miss systematically due to sample preparation challenges or inability to distinguish highly homologous families of proteins through peptide matches. Finally, there may be proteins that have short half-lives or proteins that act at very low abundance, like intranuclear regulatory proteins.

For most of these explanations, the determination of transcript levels would be a useful starting point. Several of the C-HPP teams have used this strategy, combined with shotgun proteomics. A more comprehensive strategy would use SRM peptides and the SRM spectral library to target genes with readily detectable transcript levels in specific tissues. This approach is certain to become a major feature of the HPP, both the C-HPP and the B/D-HPP.

It is also logical to perform cross-analyses of the proteins in each of the lists in the Master Table. That is a much more complex task than might be obvious, since each resource has a different way of choosing a “representative” protein among pairs or larger groups of proteins with indistinguishable matches to the confidently identified peptides. There are several new mass spectrometry-based databases and browsers emerging from C-HPP teams in China, Australia, and Japan, while the MaxQB database9 is based on the output of a single laboratory.

Special Issue: Chromosome-centric Human Proteome Project
Published: December 20, 2012
dx.doi.org/10.1021/pr301183a | J. Proteome Res. 2013, 12, 1−5
Journal of Proteome Research
Editorial

Figure 1. Current 25 members of the Chromosome-centric Human Proteome Project.

Table 1. Current Status of the Human Protein-coding Genes and the Missing Proteinsᵃ
ᵃ Ensembl v69 (October 2012), neXtProt (October 10, 2012; confirmed 12/2012), PeptideAtlas (2012-07 build; confirmed 12/7/2012), GPMdb (October 1, 2012; confirmed 11/26/2012), HPA (September 12, 2012; confirmed 12/4/2012). Approximation of missing proteins based on mass spectrometry: #genes − [(B + C + D)/3] (× 100% for the percentage).
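The missing-protein approximation given in the Table 1 footnote, #genes − [(B + C + D)/3], can be illustrated with a short sketch. This is only an illustration under stated assumptions: the class and field names are hypothetical, columns B, C, and D are assumed to be the neXtProt, PeptideAtlas, and GPMdb counts in the order the footnote lists them, the percentage is assumed to be taken relative to the Ensembl gene count, and the example counts are invented rather than taken from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChromosomeBaseline:
    """One row of a Master Table (hypothetical field names)."""
    chromosome: str
    ensembl_genes: int            # protein-coding genes (Ensembl)
    nextprot_gold: int            # assumed column B
    peptideatlas_canonical: int   # assumed column C
    gpmdb_green: int              # assumed column D

    def missing_proteins(self) -> float:
        # Average the three mass spectrometry counts, then subtract
        # that average from the number of protein-coding genes.
        ms_mean = (
            self.nextprot_gold
            + self.peptideatlas_canonical
            + self.gpmdb_green
        ) / 3
        return self.ensembl_genes - ms_mean

    def missing_percent(self) -> float:
        # Assumption: the percentage is expressed relative to the gene count.
        return 100.0 * self.missing_proteins() / self.ensembl_genes


# Invented counts for illustration only; NOT values from Table 1.
row = ChromosomeBaseline("hypothetical chromosome", 1000, 700, 640, 670)
print(row.missing_proteins())
print(row.missing_percent())
```

As the chromosome 6 example in the text shows, gene counts and protein counts are not strictly comparable (several genes can encode the same protein), so a figure computed this way is a rough baseline rather than an exact count.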
Future Planned Progress in C-HPP
During the review process for papers contained in this issue, we realized that it would be beneficial for all authors to address a few common approaches which should be applied by each chromosome group throughout their project. First, the C-HPP initiative has recognized the need for the development of specific technologies (for example, deep proteomics integrated with RNA-Seq), reagents (for example, monoclonal antibodies), and data sets (for example, on poorly characterized protein families such as olfactory receptors). The 25 C-HPP teams will be able to work closely with the technology development groups of HUPO such as the Knowledge-Base, Mass Spectrometry, and Antibody Resources. An important need is the development of Global Resource Centers for the supply of essential technologies and samples to support the goals of C-HPP such as the characterization of a full protein “parts list”2 (see Figure 2). Examples of the capabilities of planned resource centers are as follows:
• Sample resource banks of specific cell lines and clinical material, that is, normal and disease tissue samples.
• Proteomics of rare tissues (e.g., nasal epithelial cells or hair cortex to identify likely sources of “missing” proteins such as olfactory receptors and keratin binding proteins).
• Top-down mass spectrometry for analysis of intact proteins to facilitate the characterization of alternative spliced (ASV) and single nucleotide (SNV) variants.
• Ultrasensitive liquid chromatography/mass spectrometry analyses for unique micro clinical samples.
• Specific technology and data generating centers for PTM characterization.
• Metabolomics for integration of -omics data into the phenome.

Figure 2. Future direction of the Chromosome-centric Human Proteome Project (C-HPP) in concert with other omics fields to bring about a new paradigm for integrating comprehensive protein research into the biomedical community. We expect that, in the future, healthcare will have high demands on understanding the interconnections between the genome and proteome and the mechanisms that control the subset of isoforms, that is, PTMs for a given protein.

■
CONCLUSIONS
The first special issue of the C-HPP initiative describes the initial activities of the chromosome teams and gives a view of the value of evaluating proteomic experiments in the context of genomic information as well as providing a biochemical readout of the effect of genomic regulatory processes. Alterations in protein isoform profiles need to be correlated to changes in the interactome, pathways and the metabolome; doing so will promote an effective integration of the two parts of the Human Proteome Project, namely the chromosome-centric and the biology/disease-driven initiatives. In the long run, we also aim to integrate other genomics data sets, for example, ENCODE,4 which will provide additional synergistic readouts for biological research and produce many potential benefits in the understanding of disease processes.5 We anticipate that the next special issue, targeted for 2014, will describe more practical outputs from this linkage with ENCODE as well as other advances in molecular biology. Thus, the C-HPP will stimulate our community in many ways, such as focused industrial and academic collaborations, better use of genomic tools, and proteogenomic technology development. From this project, the research community can expect a full catalogue of proteins including novel drug targets, new diagnostic biomarkers and a parts list of the isoforms of cellular regulators such as major signaling pathways. As described above, the structure that HUPO is putting in place will result in continued evolution in the field with improved methodologies, international coordination in data management, global data sharing and improved repositories.

Gyorgy Marko-Varga†
Gilbert S. Omenn‡
Young-Ki Paik*,§
William S. Hancock*,∥

† Lund University, Lund, Sweden
‡ Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States
§ Yonsei Proteome Research Center, Yonsei University, Seoul, Korea
∥ Northeastern University, Boston, Massachusetts, United States
■
AUTHOR INFORMATION
Corresponding Author
*William S. Hancock ([email protected]) at 1-617-8698458 (Tel) and 617-373-2855 (Fax) or Young-Ki Paik ([email protected]) at 82-2-2123-4242 (Tel) and 82-2-3936589 (Fax).
■
ACKNOWLEDGMENTS
We gratefully acknowledge guidance and assistance from Lydie Lane, Pascale Gaudet, and Amos Bairoch of neXtProt/Swiss Institute of Bioinformatics, Ron Beavis of GPMDB/Global Proteome Machine, Eric Deutsch, Terry Farrah, and Zhi Sun of PeptideAtlas/Institute for Systems Biology, and Emma Lundberg of Human Protein Atlas/SciLife for providing up-to-date chromosome-specific summary data and for confirming the respective elements of the Master Table of C-HPP Baseline Metrics.
■
REFERENCES
(1) Legrain, P.; Aebersold, R.; Archakov, A.; Bairoch, A.; Bala, K. The Human Proteome Project: current state and future direction. Mol. Cell. Proteomics 2011, 10, No. M111.009993.
(2) Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; et al. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30 (3), 221−3.
(3) Paik, Y. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; et al. Standard guidelines for the Chromosome-centric Human Proteome Project. J. Proteome Res. 2012, 11, 2005−13.
(4) ENCODE Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489, 57−74.
(5) Paik, Y. K.; Hancock, W. S. Uniting ENCODE with genome-wide proteomics. Nat. Biotechnol. 2012, 30, 1065−7.
(6) Farrah, T.; Deutsch, E. W.; Omenn, G. S.; Campbell, D. S.; Sun, Z. A high-confidence human plasma proteome reference set with estimated concentrations in PeptideAtlas. Mol. Cell. Proteomics 2011, No. M110.006353.
(7) Uhlen, M.; Oksvold, P.; Fagerberg, L.; Lundberg, E.; Jonasson, K.; et al. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 2010, 28 (12), 1248−50.
(8) Geiger, T.; Wehner, A.; Schaab, C.; Cox, J.; Mann, M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol. Cell. Proteomics 2012, 11, No. M111.014050.
(9) Schaab, C.; Geiger, T.; Stoehr, G.; Cox, J.; Mann, M. Analysis of high-accuracy, quantitative proteomics data in the MaxQB database. Mol. Cell. Proteomics 2012, 11, No. M111.014068.
(10) Nagaraj, N.; Wisniewski, J. R.; Geiger, T.; Cox, J.; Kircher, M.; et al. Deep proteome and transcriptome mapping of a human cancer line. Mol. Syst. Biol. 2011, 7, 548.
(11) Beck, M.; Schmidt, A.; Malmstroem, J.; Claassen, M.; Ori, A.; et al. The quantitative proteome of a human cell line. Mol. Syst. Biol. 2011, DOI: 10.1038/msb2011.82.