neXtProt: Organizing Protein Knowledge in the Context of Human

Technical Note pubs.acs.org/jpr

neXtProt: Organizing Protein Knowledge in the Context of Human Proteome Projects

Pascale Gaudet,† Ghislaine Argoud-Puy,† Isabelle Cusin,† Paula Duek,† Olivier Evalet,† Alain Gateau,† Anne Gleizes,† Mario Pereira,† Monique Zahn-Zabal,† Catherine Zwahlen,† Amos Bairoch,†,‡ and Lydie Lane*,†,‡

†CALIPHO group, SIB-Swiss Institute of Bioinformatics, ‡Department of Human Protein Sciences, Faculty of Medicine, University of Geneva, CMU-1, rue Michel Servet 1211 Geneva 4, Switzerland

Supporting Information

ABSTRACT: About 5000 (25%) of the ∼20400 human protein-coding genes currently lack any experimental evidence at the protein level. For many others, there is very little information regarding their abundance, distribution, subcellular localization, interactions, or cellular functions. The aim of the HUPO Human Proteome Project (HPP, www.thehpp.org) is to collect this information for every human protein. HPP is based on three major pillars: mass spectrometry (MS), antibody/affinity capture reagents (Ab), and a bioinformatics-driven knowledge base (KB). To meet this objective, the Chromosome-Centric Human Proteome Project (C-HPP) proposes to build this catalog chromosome-by-chromosome (www.c-hpp.org) by focusing primarily on proteins that currently lack MS evidence or Ab detection. These are termed “missing proteins” by the HPP consortium. The lack of observation of a protein can be due to various factors including incorrect and incomplete gene annotation, low or restricted expression, or instability. neXtProt (www.nextprot.org) is a new web-based knowledge platform specific for human proteins that aims to complement UniProtKB/Swiss-Prot (www.uniprot.org) with detailed information obtained from carefully selected high-throughput experiments on genomic variation, post-translational modifications, as well as protein expression in tissues and cells. This article describes how neXtProt helps prioritize C-HPP efforts and integrates C-HPP results with other research efforts to create a complete human proteome catalog.

KEYWORDS: Knowledgebase, C-HPP, mass spectrometry, human proteome, database, proteomics, ontologies, controlled vocabularies



INTRODUCTION

Proteins are the major actors of life involved in virtually all cell functions. Collecting extensive information about their properties is key for both clinical and fundamental research applications. The UniProt/Swiss-Prot group achieved a first round of manual annotation for the full set of about 20000 human gene products in September 2008.1 This annotation effort allowed them to estimate that ∼30% of human gene products had not been studied experimentally at all and that the information available for the remainder was often scarce. This estimate did not take into account the enormous diversity that is generated from these gene products through alternative mRNA splicing or post-translational modifications (PTMs). In total, it is estimated that up to 1 million different protein species can be found in the ∼230 cell types making up our body! The experimental characterization of the complexity of the human proteome at the molecular and functional level is challenging and requires international cooperation efforts such as the Human Proteome Project.2−4 In parallel, it is necessary to develop specific bioinformatics resources aimed at capturing, integrating, and keeping up to date the available knowledge.



Special Issue: Chromosome-centric Human Proteome Project
Received: August 31, 2012

dx.doi.org/10.1021/pr300830v | J. Proteome Res. XXXX, XXX, XXX−XXX

Journal of Proteome Research

NEXTPROT KNOWLEDGE PLATFORM

neXtProt (www.nextprot.org) is a web-based protein knowledge platform developed within the Swiss Institute of Bioinformatics (SIB, www.isb-sib.ch) to support research on human proteins. As such, its role is analogous to that of Model Organism Databases (MODs) for model species. The core data set in neXtProt is the whole corpus of manually curated annotations extracted from UniProtKB/Swiss-Prot5 for human proteins. This set is continuously being complemented with a wide range of quality-filtered data from high-throughput studies. Special attention is given to the quality of the data integrated in order to avoid flooding the system with noisy data. Whenever possible, we collaborate with experts in the field or directly with the data providers to establish quality criteria allowing the data to be sorted into the following categories:

− Gold: highest quality data, according to biocurator’s judgment. When it is possible to assess the data quality through quantitative criteria, we put the threshold for inclusion into the Gold category at ≤1% estimated error rate.
− Silver: good quality data, also according to biocurator’s judgment, and if quantitative criteria can be applied, the threshold is set at ≤5% error rate. Silver data are marked as such in the annotations.
− Bronze: data deemed of lower quality that we do not integrate in neXtProt.

Users can view the criteria adopted for each data set in metadata information records linked to the relevant experiments.

As shown in Figure 1, the neXtProt platform can be accessed through an intuitive and simple interface centered on a Google-like search functionality. Users can perform more complex queries by using different fields and filters. It is also possible to search and view only “Gold” data (the default option), or both “Gold and Silver” data. Users can create personal accounts to track their queries and results. Search results are displayed either as simple lists or as mini-summaries and can be exported in text or Excel formats. Sequences can be downloaded in FASTA or PEFF format. The complete set of annotations is available as an XML file. Protein entries are displayed from three different perspectives: the “Protein”, the underlying “Gene”, and the “References” used to annotate it. The Protein perspective is subdivided into thematic views relative to function, medical information, expression, interactions, localization, sequence, proteomics, structures, and protein identifiers. The Gene perspective contains a view of the exons, as well as gene identifiers. Whenever possible, specific information on splice isoforms is documented. For instance, in the sequence view, the features of the different splice isoforms can be compared graphically. neXtProt is designed as a web interface but also allows third party developers to make use of the data through an Application Programming Interface (API). This article focuses on neXtProt functionalities that are particularly relevant to the C-HPP project (www.c-hpp.org). Other functionalities have been described in detail in a recent publication.6

Figure 1. The neXtProt home page contains a search bar to access the database as well as links to documentation about the contents of the platform. Users can sign-in (top right) to access the personalized mode.

NEXTPROT AS A BASIS FOR THE C-HPP INITIATIVE

neXtProt Provides a Complete and Curated Mapping of Proteins to the Genome

A prerequisite for the C-HPP initiative, which aims to systematically catalog the human proteome chromosome by chromosome,2 is that information collected at protein level be correctly mapped to corresponding genomic locations. The genomic coordinate information is displayed in the “exon” view of each entry as shown in Figure 2.

Figure 2. The Exon view from the Gene perspective gives the exact coordinates of all protein isoforms that can be mapped to Ensembl transcripts. The length of each exon in nucleotides and their position on the gene are shown. The coding fragments are shown with a large green line, and noncoding ones with a thin gray line. The reading frame of each exon is indicated by red labeling of amino acids. For example, Asn49-Asp68 means that only the last nucleotide of the first amino acid (Asn) is encoded in that exon, whereas the last amino acid (Asp) is completely encoded within that exon.

Mapping proteins to genomic sequences is not a trivial task and often needs manual correction. Our mapping strategy starts with the mapping that Ensembl provides between protein-coding genes and UniProtKB/Swiss-Prot entries, whose sequences are 100% identical to those of neXtProt. These mappings are verified by aligning each of the Ensembl-mapped protein sequences to the translation of the different transcripts also provided by Ensembl. The vast majority of neXtProt entries (18858, i.e., ∼94%) have mappings to Ensembl7 proteins and their corresponding genes and transcripts. Close to 5% of the entries (921) are missing mappings to Ensembl protein because Ensembl did not predict them as proteins, emphasizing the value of the manual curation provided by UniProtKB/Swiss-Prot and neXtProt. Whenever there is a discrepancy between the mappings provided by Ensembl and the results of the alignments we perform, or where Ensembl does not provide any mapping to UniProtKB/Swiss-Prot, our biocurators perform a BLAT8 search against the human genome. In most cases this procedure allows the protein to be mapped to an existing Ensembl gene. When that fails, the curators directly enter into neXtProt the genomic coordinates of the region that spans the protein-coding exon(s) based on the BLAT alignment. This allowed 192 proteins to be manually mapped to the genome despite their lack of an Ensembl identifier. Even with the above procedure, a few proteins cannot be mapped on the genome either because they are not present in
the current genome build (GRCh37) or because the mapping is ambiguous. An example of the former case is ATXN8 (NX_Q156A1), a toxic polyglutamine protein that is only present in spinocerebellar ataxia type 8 patients. Examples of the latter case are some of the DUX proteins (NX_Q96PT3, NX_O75505, NX_Q96PT4), encoded by a family of 3.3-kilobase repeated elements dispersed in the human genome.9 Currently, all but 125 neXtProt entries display precise genomic coordinates for at least one isoform, and only 9 entries are not assigned to any chromosome.

neXtProt Provides an Extended Catalog of Human Protein Species that Takes into Account Validated Polymorphisms and PTMs

Proteins present in biological samples can differ from the canonical gene products from the current genome build due to sequence polymorphisms. In addition, most of the proteins are post-translationally modified, which often interferes with peptide identification. These two facts may contribute to the high rate of unattributed spectra in proteomic studies;10 thus, taking into account known variants and PTMs may extend the coverage of peptide and protein identification. To tackle this issue, neXtProt is placing a lot of emphasis on the import of variant and PTM data. We have uploaded about 312000 sequence variants from dbSNP11 (through Ensembl7 release 68) and COSMIC12 release v60_190712, in addition to the data integrated via UniProtKB/Swiss-Prot. We have also integrated a total of 8135 additional PTM sites on 3312 entries from high-quality published sets of mass spectrometry-detected N-glycosylation,13,14 phosphorylation,15,16 S-nitrosylation,17,18 ubiquitination19,20 and sumoylation21 sites. We are planning to add other types of PTM soon, such as arginine methylation.22

The neXtProt API (www.nextprot.org/rest/) allows users to retrieve the complete set of PTMs or variants for all isoforms of a protein, along with their experimental evidence and Gold/Silver data confidence assessment. Available output formats are HTML (default), JavaScript Object Notation (JSON) and XML. A new sequence format named PEFF (for “PSI extended FASTA format”) has recently been developed in the framework of the HUPO PSI initiative23 in order to facilitate the handling of PTM and variation information by identification software (www.psidev.info/index.php?q=node/317). To our knowledge, neXtProt is the first resource to offer export of annotated sequences in this format.

neXtProt Extends the Coverage of Identified Proteins in the Human Proteome

The primary targets of the C-HPP are the so-called “missing proteins” that have been neither identified by mass spectrometry nor detected by antibodies. An important aspect of the C-HPP work is to define which proteins have already been characterized, to focus on those that have yet to be detected. To help C-HPP in this task, neXtProt captures evidence for the existence of each protein based on criteria established by UniProtKB/Swiss-Prot in 2007. Five levels of evidence have been defined: (1) evidence at protein level (e.g., identification by mass spectrometry, detection by antibodies, sequencing by Edman degradation, or tridimensional structure resolved), (2) evidence at transcript level (e.g., ESTs or full length mRNA), (3) inferred by homology (strong sequence similarity to known proteins in related species), (4) predicted and (5) uncertain (e.g., dubious sequences that are likely the products of erroneous translations of pseudogenes). neXtProt applies the same criteria but since it contains more mass spectrometry data, a larger number of proteins have been validated.

Primary data concerning protein identification are dispersed in numerous repositories and publications, hidden in poorly indexed supplementary files in nonstandardized formats, and of heterogeneous quality. As mentioned before, neXtProt carefully goes through those data to select the most reliable sets. neXtProt has integrated peptide identification results extracted from PeptideAtlas24 (using 1% FDR at protein level as threshold), from papers, and from direct submissions (before the ProteomeXchange procedure was set up). Peptide sequences are mapped on neXtProt entries at each release in order to take into account possible changes in protein sequences. Currently, 261006 peptide-to-entry mappings have been loaded and are displayed in the Proteomics view of the Protein perspective. As shown in Figure 3, we display the positions of the identified peptides on the sequence, and indicate if a peptide is present in more than one entry. Of note, only peptides not shared with other entries are used to validate existence “at protein level”. Currently, 14955 entries (75%) are validated at the protein level in the neXtProt release 2012_10_07, versus 13670 in UniProtKB/Swiss-Prot.

Figure 3. The Proteomics view from the Protein perspective shows the positions of the identified peptides on the sequence. Methodological details for identification (biological sample, detection method, analysis procedure, quality filtering...) are available in the metadata information records.

neXtProt Can Be Used to Prioritize “Missing Proteins”

To support C-HPP projects, neXtProt provides chromosome reports that summarize available information on each protein on the following topics: chromosomal location, availability of antibodies in the Human Protein Atlas (HPA)25 and of mass spectrometry data, number of annotated variants, splice isoforms and PTMs, presence of associations with diseases, and existence of a 3D structure. We expect these reports will help to prioritize the “missing proteins” for mass spectrometry analysis and for designing/producing antibodies.
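The five protein-existence levels and the idea of shortlisting "missing proteins" per chromosome can be illustrated with a small Python sketch. It is a toy model under stated assumptions: the `ReportRow` fields, the `NX_EXAMPLE…` accessions, and the ranking rule are all hypothetical and do not reproduce the neXtProt chromosome report format or any HPP-endorsed scoring. "Missing" is approximated here as "not validated at protein level".

```python
from dataclasses import dataclass
from enum import IntEnum


class Evidence(IntEnum):
    """The five protein-existence levels listed in the text (1 is strongest)."""
    PROTEIN = 1      # e.g. mass spectrometry, antibodies, Edman, 3D structure
    TRANSCRIPT = 2   # e.g. ESTs or full-length mRNA
    HOMOLOGY = 3     # strong similarity to known proteins in related species
    PREDICTED = 4
    UNCERTAIN = 5    # e.g. likely erroneous translations of pseudogenes


@dataclass(frozen=True)
class ReportRow:
    """Hypothetical, simplified stand-in for one line of a chromosome report."""
    accession: str
    chromosome: str
    evidence: Evidence
    has_hpa_antibody: bool


def is_missing(row: ReportRow) -> bool:
    # Assumption: "missing" is approximated as "no evidence at protein level".
    return row.evidence != Evidence.PROTEIN


def shortlist(rows, chromosome):
    """Return missing proteins on one chromosome, best-supported first."""
    candidates = [r for r in rows if r.chromosome == chromosome and is_missing(r)]
    # Illustrative ranking only: stronger evidence level first, then entries
    # that already have an antibody, then accession for a stable order.
    return sorted(
        candidates,
        key=lambda r: (r.evidence, not r.has_hpa_antibody, r.accession),
    )


# Made-up rows, for illustration only.
rows = [
    ReportRow("NX_EXAMPLE1", "7", Evidence.PROTEIN, True),
    ReportRow("NX_EXAMPLE2", "7", Evidence.PREDICTED, False),
    ReportRow("NX_EXAMPLE3", "7", Evidence.TRANSCRIPT, True),
    ReportRow("NX_EXAMPLE4", "X", Evidence.TRANSCRIPT, False),
    ReportRow("NX_EXAMPLE5", "7", Evidence.TRANSCRIPT, False),
]
ranked = [r.accession for r in shortlist(rows, "7")]
```

A real prioritization would presumably weigh more of the report columns (variants, PTMs, disease associations, 3D structure); the two-field key above only shows the shape of such a filter.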

Expression Data Can Help Choose the Appropriate Biological Sample

Lack of evidence for protein existence can be due to low abundance or restricted expression of a protein. In order to facilitate the detection of missing proteins, one should select the right tissue or cell type, and/or enrich for the right organelle. Any hint regarding protein expression, abundance or localization can be useful in this task. neXtProt has integrated RNA-based expression data stored in ArrayExpress26 and UniGene11 as reanalyzed by the SIB’s Evolutionary Bioinformatics group and available from the Bgee resource.27 neXtProt has also integrated protein-based expression data obtained by immunohistochemistry from the Human Protein Atlas (HPA).25

The Gold/Silver data quality criteria used to filter the data are established in close collaboration with data providers whenever possible. The criteria established with the providers of Bgee are described in detail in the Bgee online documentation (bgee.unil.ch/bgee/bgee?page=documentation#sectionDataAnalysis). When there is a single experimental study (EST or microarray) available for a particular tissue and developmental stage, data are integrated as Silver if the Bgee quality is defined as “Low” and as Gold if the Bgee quality is defined as “High”. When several microarray experiments are available, a single experiment of high quality (“High”) as defined by Bgee is sufficient for the expression data to be considered of Gold quality in neXtProt.

For each immunohistochemically stained sample, HPA manually evaluates images and attributes a qualitative intensity (negative, weak, moderate or strong). For proteins with more than one antibody, these values are integrated into a final “protein expression” annotation where the intensities are labeled as none, low, medium or high. In neXtProt, we integrate these “protein expression” values when available. In this case, HPA makes a global assessment of the reliability and assigns a reliability score (www.proteinatlas.org/about/quality+scoring#re). A “low” reliability score is considered to be Silver in neXtProt, while “medium” and “high” reliability scores are Gold. For proteins for which there is a single antibody, HPA publishes the results of several validation measures. In this case, neXtProt integrates the expression data as described in Supplementary Table 1 (Supporting Information).
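The Bgee and HPA rules above amount to two small lookup functions. The following Python sketch is one possible reading of them, not neXtProt's actual loader: the function names are hypothetical, the case where several experiments are all "Low" is assumed to yield Silver (the text does not state this explicitly), and the single-antibody HPA case is left out because it depends on Supplementary Table 1.

```python
def bgee_to_nextprot(experiment_qualities):
    """Map Bgee quality labels for one tissue/stage to a neXtProt category.

    One "High" experiment is enough for Gold, as stated in the text.
    Assumption: if every experiment is "Low", the data are Silver.
    """
    labels = list(experiment_qualities)
    if not labels:
        raise ValueError("at least one experiment is required")
    unknown = set(labels) - {"High", "Low"}
    if unknown:
        raise ValueError(f"unexpected Bgee quality label(s): {sorted(unknown)}")
    return "Gold" if "High" in labels else "Silver"


# HPA reliability score (proteins with more than one antibody) -> category.
HPA_RELIABILITY_TO_NEXTPROT = {"low": "Silver", "medium": "Gold", "high": "Gold"}


def hpa_to_nextprot(reliability):
    """Map an HPA reliability score to a neXtProt quality category."""
    key = reliability.strip().lower()
    if key not in HPA_RELIABILITY_TO_NEXTPROT:
        raise ValueError(f"unexpected HPA reliability score: {reliability!r}")
    return HPA_RELIABILITY_TO_NEXTPROT[key]
```

For instance, `bgee_to_nextprot(["Low", "High"])` gives `"Gold"` and `hpa_to_nextprot("low")` gives `"Silver"`. Raising on unknown labels rather than defaulting keeps unreviewed data from silently entering either category.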
Due to the different levels of resolution of the different methodologies, Bgee and HPA deliver expression data at different anatomical levels (e.g., tissue versus cell type) that are not easy to reconcile and compare. Available ontologies and controlled vocabularies, including MeSH, eVoc,28 BRENDA tissue ontology (BTO)29 and the foundational model of anatomy (FMA)30 describe the human anatomy with different scopes, coverage and precision levels. Since none of these met the needs of integrating the data from Bgee and HPA at comparable depths, we developed our own tissue and cell-type ontology that is available by ftp (ftp://ftp.nextprot.org/pub/current_release/controlled_vocabularies/caloha.obo). It currently contains 762 anatomical terms, and has numerous cross-references to BTO,29 FMA,30 MeSH and UBERON.31

The Expression view from the Protein perspective presents an overview of mRNA and protein expression based on our human anatomy ontology, as shown in Figure 4. Data are captured and displayed at the most precise anatomical level, and propagated to parent levels so that they can be compared across different sources. The rationale is that once a protein has been detected in a particular tissue (e.g., hippocampus), all structures that contain this tissue (e.g., brain) by definition also contain this protein. In contrast, information is not propagated to children levels, because we cannot assume that a protein found in one organ is present in all its subparts.

neXtProt has integrated subcellular localization results from two different high-throughput projects: DKFZ GFP-cDNA localization;32,33 and Weizmann Institute of Science’s Kahn Dynamic Proteomics Database.34 This information is displayed in the Localization view of each entry.
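The upward propagation rule for expression data (a detection in hippocampus also counts for brain, but never the reverse) can be sketched in a few lines of Python. The three-term hierarchy below is a made-up toy, not an excerpt of the neXtProt anatomy ontology, and the `parents` mapping and function names are assumptions for illustration.

```python
def ancestors(term, parents):
    """Return every structure that contains `term`, following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_detection(detected_terms, parents):
    """Propagate detections to parent levels only, never to children."""
    result = set(detected_terms)
    for term in detected_terms:
        result |= ancestors(term, parents)
    return result


# Toy hierarchy (child -> containing structures); illustrative terms only.
toy_parents = {
    "hippocampus": ["brain"],
    "cerebellum": ["brain"],
    "brain": ["nervous system"],
}

from_hippocampus = propagate_detection({"hippocampus"}, toy_parents)
from_brain = propagate_detection({"brain"}, toy_parents)
```

Here a detection in hippocampus reaches brain and nervous system but not cerebellum, and a detection annotated only at the brain level does not reach hippocampus. Keeping a list of parents per term allows a term to sit under more than one containing structure.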

Figure 4. The Expression view from the NX_Q96RJ3 (TNFRSF13C) Protein perspective displays expression data from Bgee (at mRNA level) and from HPA (at protein level). Information is captured and displayed at the most precise anatomical level (magnifying glass symbols) and propagated to higher levels using the neXtProt human anatomy ontology to allow comparison between data sets. According to HPA (IHC column), the TNFRSF13C protein is present in lymph node germinal center cells (at medium levels) and in lymph node nongerminal center cells (at high levels). Bgee reports mRNA expression at the level of tissue (lymph node) only (microarray column).

Expression data can be retrieved using the XML export or the API. For example, FAM166B, an uncharacterized protein that has not yet been identified by mass spectrometry, has been detected by immunohistochemistry (IHC) at strong levels in bronchus epithelium and oviduct glandular cells, at moderate levels in heart muscle, skeletal muscle and nasopharyngeal epithelium, and at low levels in 8 other tissues. FAM166B has also been detected at mRNA level in 25 tissues, among which respiratory system, lung, oviduct, placenta, skin, adrenal cortex and adrenal gland at Gold quality level (www.nextprot.org/rest/entry/NX_A8MTA8/expression). This information can be used to design experiments aimed at increasing the odds of detecting a given protein: in this case, we would suggest designing proteomics experiments on lung or oviduct epithelial cells.
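As a rough illustration of API access, the Python sketch below builds the expression URL following the pattern of the FAM166B example and fetches it with the standard library. Several points are assumptions: the `http://` scheme, the helper names, and the `NX_` prefix check are not taken from the article, the article does not say how to request JSON or XML instead of the default HTML (so no format parameter is shown), and the endpoint as described here may have changed since publication.

```python
import urllib.request

# Base address as given in the text; the "http://" scheme is an assumption.
BASE_URL = "http://www.nextprot.org/rest"


def expression_url(accession):
    """Build the expression URL for an entry, e.g. NX_A8MTA8 (FAM166B)."""
    if not accession.startswith("NX_") or len(accession) <= 3:
        raise ValueError(f"expected a neXtProt accession like NX_A8MTA8, got {accession!r}")
    return f"{BASE_URL}/entry/{accession}/expression"


def fetch(url, timeout=30.0):
    """Download a URL and return the body as text (default output is HTML)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


url = expression_url("NX_A8MTA8")
# body = fetch(url)  # network call; left commented out on purpose
```

The network call is kept separate from URL construction so the latter can be checked offline.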



NEXTPROT AS AN INTEGRATION PLATFORM FOR C-HPP RESULTS

In the context of the C-HPP project, neXtProt serves as a data integration platform. Therefore, raw data generated within the framework of C-HPP will have to be submitted to and stored in appropriate repositories. As described in the HPP guidelines,4 mass spectrometry-derived data will be stored and analyzed using the ProteomeXchange resources.35 While it is not a public repository, the Human Protein Atlas25 is currently the primary resource of immunochemistry data for the HPP. neXtProt will continue to filter and integrate processed data from those resources so that they will stay synchronized with other relevant studies on human proteins.



CONCLUSION AND PERSPECTIVES

neXtProt is in constant evolution. One of the major future milestones of neXtProt is to support quantitative proteomics data. Technical developments in this field such as the selected reaction monitoring (SRM) methodology, as well as the creation of specialized repositories such as SRMAtlas,36 will soon permit integration of high quality quantitative data. Moreover, we plan to continue to add information directly relevant for the C-HPP, including proteomics data sets, PTMs, polymorphisms, splice variants, expression and subcellular localization data, as well as closely related information about protein interactions, structure and function. We are currently designing tools to support the analysis and comparison of lists of proteins and optimizing search functionalities. We look forward to the feedback from the whole human proteomics community.

ASSOCIATED CONTENT

Supporting Information

Supplemental Table 1. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

*Tel: +41 22 379 5841. Fax: +41 22 379 5858. E-mail: lydie. [email protected].

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

We thank the UniProt groups at SIB, EBI and PIR for their dedication in providing up-to-date high-quality annotations for the human proteins in Swiss-Prot, thus providing neXtProt with a solid foundation. We thank Laurent-Philippe Albou, Frédéric Bastian, Pierre-Alain Binz, Christine Carapito, Eric Deutsch, Marc Robinson-Rechiavi, Mathias Uhlen, and Christian von Mering for stimulating discussions, advice and/or providing data. From 2009 to 2011, neXtProt was jointly developed by the Swiss Institute of Bioinformatics (SIB) and GeneBio SA. We especially thank Alexandre Masselot and Nasri Nahas for their contributions to the project. neXtProt development has been funded by the SIB, GeneBio SA, and the Swiss Confederation’s Commission for Technology and Innovation (CTI, grant 10214.1 PFLS-LS). Eurostars grant 6715 (BioNextProt) allowed us to develop the Application Programming Interface. The neXtProt server is hosted by VitalIT, the bioinformatics competence center that supports and collaborates with life scientists in Switzerland.

REFERENCES

(1) The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 2009, 37, D169−D174.
(2) Paik, Y.-K.; Jeong, S.-K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H.-J.; Na, K.; Choi, E.-Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J.-Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E.-Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30, 221−223.
(3) Legrain, P.; Aebersold, R.; Archakov, A.; Bairoch, A.; Bala, K.; Beretta, L.; Bergeron, J.; Borchers, C. H.; Corthals, G. L.; Costello, C. E.; Deutsch, E. W.; Domon, B.; Hancock, W.; He, F.; Hochstrasser, D.; Marko-Varga, G.; Salekdeh, G. H.; Sechi, S.; Snyder, M.; Srivastava, S.; Uhlén, M.; Wu, C. H.; Yamamoto, T.; Paik, Y.-K.; Omenn, G. S. The Human Proteome Project: current state and future direction. Mol. Cell. Proteomics 2011, 10, M111.009993.
(4) Paik, Y.-K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; Aebersold, R.; Bairoch, A.; Yamamoto, T.; Legrain, P.; Lee, H.-J.; Na, K.; Jeong, S.-K.; He, F.; Binz, P.-A.; Nishimura, T.; Keown, P.; Baker, M. S.; Yoo, J. S.; Garin, J.; Archakov, A.; Bergeron, J.; Salekdeh, G. H.; Hancock, W. S. Standard guidelines for the Chromosome-Centric Human Proteome Project. J. Proteome Res. 2012, 11, 2005−2013.
(5) The UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011, 39, D214−D219.
(6) Lane, L.; Argoud-Puy, G.; Britan, A.; Cusin, I.; Duek, P. D.; Evalet, O.; Gateau, A.; Gaudet, P.; Gleizes, A.; Masselot, A.; Zwahlen, C.; Bairoch, A. neXtProt: a knowledge platform for human proteins. Nucleic Acids Res. 2011, 40, D76−D83.
(7) Flicek, P.; Amode, M. R.; Barrell, D.; Beal, K.; Brent, S.; Chen, Y.; Clapham, P.; Coates, G.; Fairley, S.; Fitzgerald, S.; Gordon, L.; Hendrix, M.; Hourlier, T.; Johnson, N.; Kähäri, A.; Keefe, D.; Keenan, S.; Kinsella, R.; Kokocinski, F.; Kulesha, E.; Larsson, P.; Longden, I.; McLaren, W.; Overduin, B.; Pritchard, B.; Riat, H. S.; Rios, D.; Ritchie, G. R. S.; Ruffier, M.; Schuster, M.; Sobral, D.; Spudich, G.; Tang, Y. A.; Trevanion, S.; Vandrovcova, J.; Vilella, A. J.; White, S.; Wilder, S. P.; Zadissa, A.; Zamora, J.; Aken, B. L.; Birney, E.; Cunningham, F.; Dunham, I.; Durbin, R.; Fernández-Suarez, X. M.; Herrero, J.; Hubbard, T. J. P.; Parker, A.; Proctor, G.; Vogel, J.; Searle, S. M. J. Ensembl 2011. Nucleic Acids Res. 2011, 39, D800−D806.
(8) Kent, W. J. BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12, 656−664.
(9) Beckers, M.; Gabriëls, J.; Van Der Maarel, S.; De Vriese, A.; Frants, R. R.; Collen, D.; Belayew, A. Active genes in junk DNA? Characterization of DUX genes embedded within 3.3 kb repeated elements. Gene 2001, 264, 51−57.
(10) Nesvizhskii, A. I.; Roos, F. F.; Grossmann, J.; Vogelzang, M.; Eddes, J. S.; Gruissem, W.; Baginsky, S.; Aebersold, R. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of posttranslational modifications, sequence polymorphisms, and novel peptides. Mol. Cell. Proteomics 2006, 5, 652−670.
(11) Sayers, E. W.; Barrett, T.; Benson, D. A.; Bolton, E.; Bryant, S. H.; Canese, K.; Chetvernin, V.; Church, D. M.; DiCuccio, M.; Federhen, S.; Feolo, M.; Fingerman, I. M.; Geer, L. Y.; Helmberg, W.; Kapustin, Y.; Landsman, D.; Lipman, D. J.; Lu, Z.; Madden, T. L.; Madej, T.; Maglott, D. R.; Marchler-Bauer, A.; Miller, V.; Mizrachi, I.; Ostell, J.; Panchenko, A.; Phan, L.; Pruitt, K. D.; Schuler, G. D.; Sequeira, E.; Sherry, S. T.; Shumway, M.; Sirotkin, K.; Slotta, D.; Souvorov, A.; Starchenko, G.; Tatusova, T. A.; Wagner, L.; Wang, Y.; Wilbur, W. J.; Yaschenko, E.; Ye, J. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012, 40, D13−D25.
(12) Bamford, S.; Dawson, E.; Forbes, S.; Clements, J.; Pettett, R.; Dogan, A.; Flanagan, A.; Teague, J.; Futreal, P. A.; Stratton, M. R.; Wooster, R. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br. J. Cancer 2004, 91, 355−358.
(13) Chen, Y.; Cao, J.; Yan, G.; Lu, H.; Yang, P. Two-step protease digestion and glycopeptide capture approach for accurate glycosite identification and glycoprotein sequence coverage improvement. Talanta 2011, 85, 70−75.
(14) Hofmann, A.; Gerrits, B.; Schmidt, A.; Bock, T.; Bausch-Fluck, D.; Aebersold, R.; Wollscheid, B. Proteomic cell surface phenotyping of differentiating acute myeloid leukemia cells. Blood 2010, 116, e26−e34.
(15) Olsen, J. V.; Vermeulen, M.; Santamaria, A.; Kumar, C.; Miller, M. L.; Jensen, L. J.; Gnad, F.; Cox, J.; Jensen, T. S.; Nigg, E. A.; Brunak, S.; Mann, M. Quantitative phosphoproteomics reveals widespread full phosphorylation site occupancy during mitosis. Sci. Signal. 2010, 3, ra3.
(16) Rigbolt, K. T. G.; Prokhorova, T. A.; Akimov, V.; Henningsen, J.; Johansen, P. T.; Kratchmarova, I.; Kassem, M.; Mann, M.; Olsen, J. V.; Blagoev, B. System-wide temporal characterization of the proteome and phosphoproteome of human embryonic stem cell differentiation. Sci. Signal. 2011, 4, rs3.
(17) Lam, Y. W.; Yuan, Y.; Isaac, J.; Babu, C. V. S.; Meller, J.; Ho, S.M. Comprehensive identification and modified-site mapping of S-nitrosylated targets in prostate epithelial cells. PLoS ONE 2010, 5, e9075.


(18) Liu, M.; Hou, J.; Huang, L.; Huang, X.; Heibeck, T. H.; Zhao, R.; Pasa-Tolic, L.; Smith, R. D.; Li, Y.; Fu, K.; Zhang, Z.; Hinrichs, S. H.; Ding, S.-J. Site-specific proteomics approach for study protein S-nitrosylation. Anal. Chem. 2010, 82, 7160−7168.
(19) Shi, Y.; Chan, D. W.; Jung, S. Y.; Malovannaya, A.; Wang, Y.; Qin, J. A data set of human endogenous protein ubiquitination sites. Mol. Cell. Proteomics 2011, 10, M110.002089.
(20) Danielsen, J. M. R.; Sylvestersen, K. B.; Bekker-Jensen, S.; Szklarczyk, D.; Poulsen, J. W.; Horn, H.; Jensen, L. J.; Mailand, N.; Nielsen, M. L. Mass spectrometric analysis of lysine ubiquitylation reveals promiscuity at site level. Mol. Cell. Proteomics 2011, 10, M110.003590.
(21) Matic, I.; Schimmel, J.; Hendriks, I. A.; Van Santen, M. A.; Van De Rijke, F.; Van Dam, H.; Gnad, F.; Mann, M.; Vertegaal, A. C. O. Site-specific identification of SUMO-2 targets in cells reveals an inverted SUMOylation motif and a hydrophobic cluster SUMOylation motif. Mol. Cell 2010, 39, 641−652.
(22) Uhlmann, T.; Geoghegan, V. L.; Thomas, B.; Ridlova, G.; Trudgian, D. C.; Acuto, O. A method for large-scale identification of protein arginine methylation. Mol. Cell. Proteomics 2012, 11, 1489−1499.
(23) Orchard, S.; Hoogland, C.; Bairoch, A.; Eisenacher, M.; Kraus, H.-J.; Binz, P.-A. Managing the data explosion. A report on the HUPO-PSI Workshop August 2008, Amsterdam, The Netherlands. Proteomics 2009, 9, 499−501.
(24) Deutsch, E. W.; Lam, H.; Aebersold, R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 2008, 9, 429−434.
(25) Uhlen, M.; Oksvold, P.; Fagerberg, L.; Lundberg, E.; Jonasson, K.; Forsberg, M.; Zwahlen, M.; Kampf, C.; Wester, K.; Hober, S.; Wernerus, H.; Björling, L.; Ponten, F. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 2010, 28, 1248−1250.
(26) Parkinson, H.; Sarkans, U.; Kolesnikov, N.; Abeygunawardena, N.; Burdett, T.; Dylag, M.; Emam, I.; Farne, A.; Hastings, E.; Holloway, E.; Kurbatova, N.; Lukk, M.; Malone, J.; Mani, R.; Pilicheva, E.; Rustici, G.; Sharma, A.; Williams, E.; Adamusiak, T.; Brandizi, M.; Sklyar, N.; Brazma, A. ArrayExpress update−an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 2011, 39, D1002−D1004.
(27) Bastian, F.; Parmentier, G.; Roux, J.; Moretti, S. Bgee: integrating and comparing heterogeneous transcriptome data among species. Data Integr. Life Sci. 2008, 5109, 124−131.
(28) Kelso, J.; Visagie, J.; Theiler, G.; Christoffels, A.; Bardien, S.; Smedley, D.; Otgaar, D.; Greyling, G.; Jongeneel, C. V.; McCarthy, M. I.; Hide, T.; Hide, W. eVOC: a controlled vocabulary for unifying gene expression data. Genome Res. 2003, 13, 1222−1230.
(29) Gremse, M.; Chang, A.; Schomburg, I.; Grote, A.; Scheer, M.; Ebeling, C.; Schomburg, D. The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources. Nucleic Acids Res. 2011, 39, D507−D513.
(30) Mejino, J. L. V.; Agoncillo, A. V.; Rickard, K. L.; Rosse, C. Representing complexity in part-whole relationships within the foundational model of anatomy. AMIA Annu. Symp. Proc. 2003, 2003, 450−454.
(31) Mungall, C. J.; Torniai, C.; Gkoutos, G. V.; Lewis, S. E.; Haendel, M. A. Uberon, an integrative multi-species anatomy ontology. Genome Biol. 2012, 13, R5.
(32) Liebel, U.; Starkuviene, V.; Erfle, H.; Simpson, J. C.; Poustka, A.; Wiemann, S.; Pepperkok, R. A microscope-based screening platform for large-scale functional protein analysis in intact cells. FEBS Lett. 2003, 554, 394−398.
(33) Simpson, J. C.; Wellenreuther, R.; Poustka, A.; Pepperkok, R.; Wiemann, S. Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep. 2000, 1, 287−292.
(34) Sigal, A.; Danon, T.; Cohen, A.; Milo, R.; Geva-Zatorsky, N.; Lustig, G.; Liron, Y.; Alon, U.; Perzov, N. Generation of a fluorescently labeled endogenous protein library in living human cells. Nat. Protoc. 2007, 2, 1515−1527.
(35) Orchard, S.; Albar, J.-P.; Deutsch, E. W.; Eisenacher, M.; Binz, P.-A.; Martinez-Bartolomé, S.; Vizcaíno, J. A.; Hermjakob, H. From proteomics data representation to public data flow: a report on the HUPO-PSI workshop September 2011, Geneva, Switzerland. Proteomics 2012, 12, 351−355.
(36) Rost, H. L.; Malmstrom, L.; Aebersold, R. A computational tool to detect and avoid redundancy in selected reaction monitoring. Mol. Cell. Proteomics 2012, 11, 540−549.
