Article pubs.acs.org/jpr

Functional Annotation of the Human Chromosome 7 “Missing” Proteins: A Bioinformatics Approach

Shoba Ranganathan,*,†,‡,§ Javed M. Khan,†,‡ Gagan Garg,†,‡ and Mark S. Baker*,†

†Department of Chemistry and Biomolecular Sciences and ‡ARC Centre of Excellence in Bioinformatics, Macquarie University, Sydney, NSW, Australia
§Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore

ABSTRACT: The chromosome-centric human proteome project aims to systematically map all human proteins, chromosome by chromosome, in a gene-centric manner through dedicated efforts from national and international teams. This mapping will lead to a knowledge-based resource defining the full set of proteins encoded in each chromosome and laying the foundation for the development of a standardized approach to analyze the massive proteomic data sets currently being generated. The neXtProt database lists 946 proteins as the human proteome of chromosome 7. However, 170 (18%) proteins of human chromosome 7 have no evidence at the proteomic, antibody, or structural levels and are considered “missing” in this study as they lack experimental support. We have developed a protocol for the functional annotation of these “missing” proteins by integrating several bioinformatics analysis and annotation tools, sequential BLAST homology searches, protein domain/motif and gene ontology (GO) mapping, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Using the BLAST search strategy, reviewed non-human mammalian homologues with protein evidence were identified for 90 “missing” proteins, while another 38 had reviewed non-human mammalian homologues. Putative functional annotations were assigned to 27 of the remaining 43 novel proteins. Proteotypic peptides have been computationally generated to facilitate rapid identification of these proteins. Four of the “missing” chromosome 7 proteins have been substantiated by the ENCODE proteogenomic peptide data.

KEYWORDS: human proteome project, human chromosome 7, missing proteins, sequential BLAST, functional annotation, proteotypic peptides



© 2013 American Chemical Society

Special Issue: Chromosome-centric Human Proteome Project
Received: November 19, 2012
Published: January 11, 2013

dx.doi.org/10.1021/pr301082p | J. Proteome Res. 2013, 12, 2504−2510

Journal of Proteome Research

INTRODUCTION

The human genome project has provided the “blueprint” of life, whose interpretation depends on detailed annotation, usually at the nucleotide, protein and process levels.1 Since 2008, the Human Proteome Organization (HUPO) has pursued the comprehensive identification and functional characterization of the human proteome via the Human Proteome Projects (HPPs),2 of which the chromosome-centric HPP (c-HPP) approach seeks to catalog the human proteome on the basis of chromosomes.3−5 Such an approach would address a key aim of the human genome project, viz. personalized medicine, by providing sensitive and highly specific protein biomarkers for early onset diagnosis, prognosis, and treatment of several diseases, providing clinical and translational proteomic solutions.6

Following on from the 2003 DNA sequencing of human chromosome 7 (hChr7)7 as the first metacentric chromosome completed, Scherer and Green suggested8 that hChr7 is ideally suited to serve as a model for human genome studies, based on the identification of over 350 biomedically important genes, including those coding for the T cell receptor, homeobox families, erythropoietin, and cystic fibrosis as well as the cytogenetic changes associated with some cancers.7,9 Although hChr7 represents only ∼160 Mb or 5.3% of the human genome, its over-representation of disease genes has made it an attractive target for c-HPP studies, and this chromosome has now been adopted by Australia and New Zealand (see http://chpp.org/working_groups).

The three pillars of HPP are mass spectrometric proteomics, antibody/affinity capturing agents, and a knowledge base,2 embodied by the neXtProt database,10 where detailed information on the human proteome is collated, curated, and organized for rapid access to information on a query protein. The main target of c-HPP4 is to establish the existence of “thousands of ‘missing’ proteins, proteins that should exist, based on the instructions in genes and other genetic evidence, but remain undiscovered.” To support the existence of each protein, neXtProt10 has compiled the availability of proteomic (i.e., mass spectrometric), antibody, and 3D structural evidence, extending the original c-HPP definition. For hChr7, neXtProt (release 2012-08-24) lists 946 proteins. However, ∼20% of the hChr7 proteome has no proteomic, antibody, or 3D structural evidence and thus belongs to the twilight zone of protein annotation, comprising the “missing” proteins. Also, we have identified a few repeat and viral sequences. Omitting these repeats and viral sequences, hChr7 has in effect ∼18% of its original proteome as truly “missing” proteins for analysis in this study.

Given that the gold standard database of reviewed and annotated proteins, the UniProt/Swiss-Prot database,11 currently has >538 000 entries, our quest in this study is to determine whether the existence of these missing proteins can be indirectly inferred by orthology to similar proteins in primates, mammals, or other species. Since homologous sequences typically share identical or similar functions, similarity searches, using the widely used and well-established BLAST12,13 suite of programs, are invariably the first port of call for functional annotation of an uncharacterized sequence. However, running a standard database similarity search for any novel protein against the default nonredundant protein database usually results in matches to a long list of proteins, which are in the main putative, unannotated, or translated coding regions identified by gene-finding programs. We have developed strategies for the annotation of transcript and putative protein sequences from neglected organisms such as helminth parasites and fungal pathogens,14−17 combining similarity searches with extensive functional annotation at the gene ontology (GO), biochemical pathway and functional domain and motif levels. For the hChr7 “missing” proteins, we have adapted our basic annotation workflow to include a BLAST search strategy by selecting databases of maximum functional characterization.

Additionally, we propose the “sequential BLAST” approach, involving serial similarity searching against different databases, rather than repeated searches against the same database as implemented in PSI-BLAST,12 a variant of the standard BLAST search engine. With this new approach, we were able to identify reviewed non-human mammalian homologues with protein evidence for 90 sequences (53%) while 37 proteins (22%) showed significant mapping to reviewed non-human mammalian homologues. Comprehensive in silico annotation of the remaining 43 (25%) of the 170 “missing” proteins, using stringent mapping parameters (detailed in Materials and Methods), has resulted in putative gene ontology, biochemical pathway, and domain/motif assignments for 23 proteins (50% of the “missing” proteins). In order to facilitate protein discovery of the “missing” proteins, we have generated in silico proteotypic peptides for these 43 novel proteins for the proteomic community to check their data archives. With the availability of high-quality proteogenomic data from the ENCODE project, we have identified several peptides as proteomic evidence for two of the “missing” proteins and single/double peptide matches to 10 more proteins, thus leaving 158 hChr7 proteins awaiting experimental confirmation.

MATERIALS AND METHODS

Data Sources

Protein sequences were extracted from the neXtProt database10 (release 2012-08-24) in FASTA format. Protein data sets for similarity searches (described in Database Similarity Search) were extracted from the UniProt/Swiss-Prot database (release 2012_08, September 5, 2012).11

The overall workflow developed for the in silico annotation of the hChr7 proteome is illustrated in Figure 1 and described in the following two sections.

Figure 1. Bioinformatics annotation and analysis workflow for human Chr 7 “missing” proteins.

Database Similarity Search

BLAST12,13 is one of the best known programs used for a database similarity search. If the query sequence matches a database sequence with high significance (i.e., a very low E value) over almost all its length (indicated by coverage of at least 60%), this match is deemed a strong indicator of homology. In order to extract maximum functional characterization of hChr7 “missing” proteins from similarity searches, we have downloaded and set up local databases for BLAST12,13 searches from the Swiss-Prot/UniProt database.11 The first BLAST search was against the reviewed non-human mammalian proteins with experimental evidence (14 450 sequences) to check whether our “missing” proteins had a close mammalian homologue that was already known experimentally. For proteins with no BLAST matches in the first round, a second BLAST search was carried out against reviewed non-human mammalian proteins (45 668 sequences) to see if there was a close mammalian homologue, albeit without experimental evidence. Proteins that still had no significant matches were then mapped to all human proteins (721 836 sequences), in case some of the proteins are human-specific. The final similarity search was against PDB proteins (213 310 sequences), in case we could obtain a match directly against a solved protein structure, as 3D structures are conserved even at very low sequence similarity values and provide valuable functional clues. Additional verification data sets used for BLAST include all mammalian proteins (1 029 491 sequences), all non-mammalian proteins with experimental protein evidence (60 857 sequences), and the nonredundant (NR) protein database (27 938 885 sequences).

For mapping a protein sequence against a database of protein sequences, BLASTP was run sequentially against the data sets described above using default parameters. Those sequences which reported “no match” from the first run were passed on to the next round of BLASTP to search the second data set, then the third and the fourth. Proteins with no significant database matches were considered novel and subjected to in silico annotation and analysis.
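The cascade described in Database Similarity Search can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the `Hit` record, the `search` callback, and the database labels are hypothetical names introduced here, and the E-value cutoff of 1e-6 is an assumption, since the text specifies only "a very low E value" together with the 60% coverage criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class Hit:
    """One alignment between a query and a database sequence (hypothetical record)."""
    subject_id: str
    evalue: float
    qstart: int  # first aligned query position (1-based, inclusive)
    qend: int    # last aligned query position (1-based, inclusive)
    qlen: int    # full length of the query sequence

    @property
    def coverage(self) -> float:
        """Fraction of the query spanned by the alignment."""
        return (self.qend - self.qstart + 1) / self.qlen


# Hypothetical search callback: (query_id, database_name) -> list of hits.
SearchFn = Callable[[str, str], List[Hit]]


def is_significant(hit: Hit, max_evalue: float = 1e-6,
                   min_coverage: float = 0.60) -> bool:
    """Coverage threshold from the text; the E-value cutoff is an assumption."""
    return hit.evalue <= max_evalue and hit.coverage >= min_coverage


def sequential_search(query_ids: Iterable[str], databases: Sequence[str],
                      search: SearchFn) -> Tuple[Dict[str, Tuple[str, Hit]], List[str]]:
    """Search databases in order; only unmatched queries move to the next round."""
    assigned: Dict[str, Tuple[str, Hit]] = {}
    remaining = list(query_ids)
    for db in databases:
        unmatched = []
        for qid in remaining:
            good = [h for h in search(qid, db) if is_significant(h)]
            if good:
                assigned[qid] = (db, min(good, key=lambda h: h.evalue))
            else:
                unmatched.append(qid)
        remaining = unmatched
    return assigned, remaining  # 'remaining' holds the novel sequences


# Toy data standing in for real BLASTP output.
_TOY = {
    ("P1", "round1"): [Hit("S1", 1e-50, 1, 95, 100)],
    ("P2", "round1"): [Hit("S2", 1e-40, 1, 30, 100)],  # coverage too low
    ("P2", "round2"): [Hit("S3", 1e-20, 5, 90, 100)],
}


def toy_search(qid: str, db: str) -> List[Hit]:
    return _TOY.get((qid, db), [])


assigned, novel = sequential_search(["P1", "P2", "P3"], ["round1", "round2"], toy_search)
```

In practice the `search` callback would wrap a local BLASTP run against each data set and parse its output into `Hit` records; that part is left out here because the exact invocation is not given in the text.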

RESULTS AND DISCUSSION

Preprocessing hChr7 Proteome

Our first step was to extract the hChr7 proteome details from neXtProt. In order to download the protein sequences, we converted the neXtProt identifiers to UniProt/Swiss-Prot identifiers and downloaded all the protein sequences in FASTA format using a batch download command. We then sorted the 946 hChr7 proteins by the availability of protein evidence; 178 proteins (19%) have no proteomic, antibody, or 3D structural information available (available as Supporting Information, Table S1). Three sequences are repeats (NX_A6NHP3, NX_B0FP48, NX_Q8NGT9), while five are clearly of viral origin (NX_P60608, NX_Q69383, NX_Q7LDI9, NX_Q9BXR3, NX_Q9Y6I0), with “HERV” (human endogenous retrovirus) in their descriptions. These eight sequences were therefore removed from our analysis and annotation list, leaving 170 “missing” proteins, as shown in Figure 2.

Functional Annotation of hChr7 Proteins

Assignment of protein function is strengthened by matching the query sequence to specific secondary databases containing information on protein domains, motifs, and signatures, as this step adds value to the annotation by pin-pointing a domain, a motif, or a region in a protein sequence characteristic of a particular protein family. One of the most comprehensive protein functional annotation tools is InterProScan,18 comprising 14 programs for matching a query sequence against 13 protein domain and functional site databases, essential for the prediction of protein functionality. The InterProScan applications (BlastProDom, FPrintScan, HMMPIR, HMMPfam, HMMSmart, HMMTigr, ProfileScan, HAMAP, PatternScan, SuperFamily, SignalPHMM, TMHMM, HMMPanther, and Gene3D) map a query protein sequence against the PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D, PANTHER, and HAMAP databases. While for some well-characterized proteins there are multiple sources of functional evidence, each of the component databases was developed to address a specific property while the search tools utilize a range of heuristic and intelligent search strategies. Therefore, for the best results, InterProScan is usually run in its entirety, instead of a select subset of programs using a few of the databases.

The novel “missing” proteins were characterized with high-quality in silico annotation via InterProScan protein functional domains and motifs incorporated within BLAST2GO19 (V 2.6.0; Build 14092012) along with GO annotation (EValue-Hit-Filter: 1.0 × 10⁻⁶). Prior to GO mapping, BLAST2GO compares the proteins, using BLASTP, with data in the nonredundant protein sequence database from the National Center for Biotechnology Information (NCBI). Subsequently, the proteins were mapped to their respective human pathways using the embedded enzyme code and the Kyoto Encyclopedia of Genes and Genomes (KEGG)20 pathway mapping tool in BLAST2GO. These pathway results were also compared with the results of an independent pathway mapping tool, KOBAS21 (KEGG orthology-based annotation system, KOBAS-2.0). These programs were identified as reliable for comprehensive annotation of novel and uncharacterized sequences14−17 and have been successfully applied to recent genome projects of novel, neglected organisms.22,23
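Two ideas from this section lend themselves to a short sketch: discarding hits above the E-value filter before any GO terms are carried over, and comparing pathway assignments from two independent tools. The Python below is a deliberate simplification and not BLAST2GO's or KOBAS's actual algorithm; the `AnnotatedHit` record and both function names are hypothetical, and the GO and KEGG identifiers are placeholders rather than results from the study.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Set


class AnnotatedHit(NamedTuple):
    """A similarity hit together with the GO terms attached to it (hypothetical record)."""
    subject_id: str
    evalue: float
    go_terms: FrozenSet[str]


def transfer_go_terms(hits: Iterable[AnnotatedHit],
                      max_evalue: float = 1e-6) -> Set[str]:
    """Pool GO terms from hits passing the E-value filter (1.0e-6, as quoted in the text)."""
    terms: Set[str] = set()
    for hit in hits:
        if hit.evalue <= max_evalue:
            terms.update(hit.go_terms)
    return terms


def compare_pathways(first: Set[str], second: Set[str]) -> Dict[str, Set[str]]:
    """Split pathway identifiers into those shared by both tools and those unique to each."""
    return {
        "both": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


# Placeholder identifiers, for illustration only.
toy_hits = [
    AnnotatedHit("hit_a", 1e-30, frozenset({"GO:0003677", "GO:0005634"})),
    AnnotatedHit("hit_b", 1e-8, frozenset({"GO:0005634", "GO:0046872"})),
    AnnotatedHit("hit_c", 1e-3, frozenset({"GO:0016020"})),  # fails the filter
]
go_terms = transfer_go_terms(toy_hits)
agreement = compare_pathways({"hsa04020", "hsa04110"}, {"hsa04020"})
```

Pathways reported by both tools would sit in `agreement["both"]`; how disagreements were resolved in the study is not stated, so the sketch only separates the three cases.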

Figure 2. Distribution of human Chr 7 proteome. Distribution of 946 proteins, comprising sets with proteomic, antibody, and 3D evidence and repeat and viral sequences.
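The counts behind Figure 2 follow from simple set arithmetic, shown here as a small Python check. The accession lists and totals are those quoted in the text; the `Entry` record, its boolean evidence fields, and the accession `NX_TOY001` are hypothetical and serve only to illustrate the filter.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    """Evidence flags for one neXtProt entry (hypothetical field names)."""
    accession: str
    has_proteomic: bool
    has_antibody: bool
    has_structure: bool


# Accessions quoted in the text.
REPEAT_IDS = frozenset({"NX_A6NHP3", "NX_B0FP48", "NX_Q8NGT9"})
VIRAL_IDS = frozenset({"NX_P60608", "NX_Q69383", "NX_Q7LDI9",
                       "NX_Q9BXR3", "NX_Q9Y6I0"})


def select_missing(entries: Iterable[Entry]) -> List[str]:
    """Keep entries with no evidence of any kind, excluding repeats and viral sequences."""
    excluded = REPEAT_IDS | VIRAL_IDS
    return [
        e.accession
        for e in entries
        if not (e.has_proteomic or e.has_antibody or e.has_structure)
        and e.accession not in excluded
    ]


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


TOTAL = 946          # hChr7 proteins in neXtProt release 2012-08-24
NO_EVIDENCE = 178    # no proteomic, antibody, or 3D structural evidence
missing = NO_EVIDENCE - len(REPEAT_IDS | VIRAL_IDS)

toy_entries = [
    Entry("NX_A0PJY2", False, False, False),  # kept
    Entry("NX_P60608", False, False, False),  # viral, excluded
    Entry("NX_TOY001", True, False, False),   # hypothetical entry with evidence
]
kept = select_missing(toy_entries)
```

With these inputs, `missing` is 170, and `percent(NO_EVIDENCE, TOTAL)` and `percent(missing, TOTAL)` give the 19% and 18% quoted in the text.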

Preliminary Search for Homologues

Prior to embarking on our sequential BLAST approach, we carried out a preliminary review of database matches using UniProt’s sequence clusters of at least 90% identity, stored as UniRef. As an example, the 90% matches to the very first sequence from the alphabetically ordered 170 “missing” proteins, NX_A0PJY2, are shown in Figure 3. We note that for human and primate matches (Figure 3A,B) there are no reviewed sequences other than the query itself, although some of the matches appear to be almost the same length as the query, providing very similar sequences with almost full coverage. When we expand the organism level to mammalian (Figure 3C), we can pick up the reviewed mouse sequence, Q0VDQ9 (with gold star status), which is known at the transcript level. Expanding the cluster to 50% (data not shown) picks up only one more reviewed protein, Q2TAR3, from the African clawed frog, for which only transcript level evidence is available. Mammalian homologues with protein level evidence are required to strongly support the existence of the hChr7 protein as well as to provide valuable functional clues. In the reviewed category, UniProt has identified sequences for which protein evidence is available. Hence, reviewed sequences with protein level evidence were our most reliable source of homologues. As there are only 14 450 non-human mammalian protein sequences with experimental protein evidence (2.6% of UniProt), we continued the similarity search for sequences reporting no matches, with databases with

In Silico Tryptic Digestion

Peptides unique to a protein, known as proteotypic peptides, are the key to its experimental identification using antibodies. Therefore, Protein Digestion Simulator24 was employed with default parameters (fragment mass range: 400−6000 Da; pI range: 0−14; mass tolerance: 5 ppm; hydrophobicity mode: Hopp and Woods25) to carry out in silico tryptic digestion of the novel “missing” proteins, in order to identify proteotypic peptides that would enable the synthesis of antibodies and support future in vitro studies leading to proteomic identification of the proteins. The program validates the input protein sequences, removing any duplicates. It then parses and digests the protein sequences to list the resultant peptide sequences. The program also computes monoisotopic masses, pI, and hydrophobicity values for the resultant peptides.
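The digestion step can be illustrated with a short Python sketch. It is not the Protein Digestion Simulator: the cleavage rule used here (cut after K or R unless the next residue is P, with no missed cleavages) is a common convention assumed for illustration, uniqueness is checked only within the supplied set of sequences, and pI, hydrophobicity, and elution-time predictions are omitted. The function names and toy sequences are hypothetical; the 400−6000 Da mass window is taken from the text.

```python
from collections import defaultdict
from typing import Dict, List

# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence: str) -> List[str]:
    """Cleave after K or R, except before P (assumed rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def monoisotopic_mass(peptide: str) -> float:
    """Sum of residue masses plus one water for the free termini."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def proteotypic_peptides(proteins: Dict[str, str], min_mass: float = 400.0,
                         max_mass: float = 6000.0) -> Dict[str, List[str]]:
    """Peptides in the mass window that occur in exactly one input protein."""
    owners = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            if min_mass <= monoisotopic_mass(peptide) <= max_mass:
                owners[peptide].add(protein_id)
    unique: Dict[str, List[str]] = defaultdict(list)
    for peptide, ids in owners.items():
        if len(ids) == 1:
            unique[next(iter(ids))].append(peptide)
    return dict(unique)


# Toy sequences: both share the peptide TAYIAK, so it is not proteotypic.
toy_proteins = {
    "A": "MSRTAYIAKELVISLIVESR",
    "B": "GGKTAYIAKWWNDEPLR",
}
result = proteotypic_peptides(toy_proteins)
```

A real analysis would test uniqueness against the whole proteome rather than a handful of sequences, since a peptide that is unique among 43 proteins may still occur elsewhere.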

decreasing confidence levels, as described in Materials and Methods.

Figure 3. UniProt’s 90% sequence clusters for NX_A0PJY2, filtered at the organism level by (A) human, (B) primate, and (C) mammalian. Dotted lines indicate that lists in B and C are truncated.

Sequential BLAST Similarity Search

In the first round of our sequential BLAST approach, we mapped the 170 “missing” proteins against reviewed non-human mammalian protein sequences with experimental protein evidence. Eighty-nine hChr7 proteins (52%) showed significant matches, with 54−100% sequence coverage and E values ranging from 1.0e-157 to 0 (available from Supporting Information, Table S2). All top hits were ranked 1, except in one case, where the fourth hit had better coverage despite the slightly lower % identity. Despite the recent research in primate genomes, the majority of matches were to mouse (78%). The distribution of matches by organism is shown in Figure 4. Thirty-one sequences showed marginal hits (partial coverage, low E values) or were short sequences ([...]

[...] 72 and 82 were the first Swiss-Prot entries after a list of trEMBL entries. Thirty-three hChr7 sequences showed marginal hits, while eight sequences mapped to no database sequence, leaving 43 novel proteins (25% of the 170 “missing” proteins). We then mapped these 43 novel sequences to the validation databases (results not shown) comprising all mammalian proteins, all non-mammalian proteins with experimental protein evidence, and the nonredundant (NR) protein database. However, our searches did not yield any significant matches to reviewed proteins with functional information. Mapping the novel proteins to a database of human proteins yielded three high-quality hits: NX_Q8IXZ3 (transcription factor Sp8) mapped to SP9_HUMAN (transcription factor Sp9), which has been inferred from homology; NX_Q9BZW2 (solute carrier family 13 member 1) mapped to S13A5_HUMAN (solute carrier family 13 member 5), known at the transcript level; and NX_Q96SN7 (protein orai-2) mapped to CRCM1_HUMAN (calcium release-activated calcium channel protein 1), which is known at the protein level. Mapping to non-mammalian proteins with experimental evidence yielded only two significant hits: NX_A0PJY2 (Fez family zinc finger protein 1) mapped to FEZF2_DANRE (Fez family zinc finger protein 2) and NX_A4D1F6 (leucine-rich repeat and death domain-containing protein 1) mapped to SHOC2_CAEEL (leucine-rich repeat protein soc-2). However, the second protein, NX_A4D1F6, appears to be a new mammalian construct as it maps to all 20 leucine-rich repeat domains of the Caenorhabditis elegans protein, but the latter has no death domain. Therefore, these 43 sequences were all retained in the novel “missing” protein set for in silico annotation and analysis.

Seven novel proteins show no matches to any of the databases. Of these, three appeared to be human-specific: NX_P0CJ73 (humanin-like protein 6), NX_Q86UQ8 (transcription factor NF-E4), and NX_Q13166 (CATR tumorigenic conversion 1 protein). While this might be true of the latter two, the humanin-like protein 6 (24 aa) belongs to a family of proteins named humanins, characterized at the transcript level and with mouse and nematode homologues, according to Zapała et al.,27 suggesting that database searching of very short sequences may not be productive. In fact, the UniProt 90% cluster for this protein lists mammalian sequences [...]

Figure 4. Significant BLAST hits sorted by organism. Blue bars: reviewed non-human mammalian proteins with experimental evidence in cyan; red bars: reviewed non-human mammalian proteins. Details in Supporting Information, Tables S2 and S3.

[...] 1 peptide) from the “missing” list for review and integration into the neXtProt lists. For the remaining eight proteins, with a single peptide match, further validation from high-resolution instruments is awaited. In the case of NX_B0FP48, a single peptide maps to 7q22.1:102,222,970-102,283,316:- while there is no peptide match to the other coding sequence 7q22.1:102,178,365-102,232,891:-, providing supporting evidence for one of the two possible coding sequences. Another 12 proteins had matches of 1−3 peptides; however, the peptides were in the reverse orientation, that is, coded by the complementary strand. NX_P0CL84 is common to both lists, with its coding region covering one ENCODE peptide as well as three reverse peptides, requiring further experimental validation. The hChr7 sequences and their ENCODE matches are listed in Table 2, with peptide details in Supporting Information, Table S6.

Figure 5. Summary of functional annotations for the 43 novel “missing” proteins. Gene ontology (GO), KEGG pathways (KEGG), and functional domains/motif (InterPro) annotations were obtained for 27 proteins.
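The percentages reported in Table 1 and in the Conclusions can be checked against the counts with a few lines of Python. The counts are those of Table 1 and Figure 5 (89, 38, and 27 of 170); the variable names are illustrative, and the check says nothing about the slightly different counts quoted in the Abstract and Introduction.

```python
TOTAL_MISSING = 170      # hChr7 "missing" proteins analysed

round1 = 89              # homologues with experimental protein evidence (Table 1)
round2 = 38              # reviewed non-human mammalian homologues (Table 1)
novel = TOTAL_MISSING - round1 - round2      # sequences left for in silico annotation
annotated_novel = 27     # novel proteins with GO, KEGG, or InterPro annotation (Figure 5)
unannotated = novel - annotated_novel


def percent(part: int, whole: int = TOTAL_MISSING) -> int:
    """Percentage of the 170 'missing' proteins, rounded to the nearest integer."""
    return round(100 * part / whole)


summary = {
    "round1": percent(round1),
    "round2": percent(round2),
    "novel": percent(novel),
    "annotated_novel": percent(annotated_novel),
    "unannotated": percent(unannotated),
    "with_putative_function": percent(round1 + round2 + annotated_novel),
}
```

Under these counts, 154 of the 170 proteins carry a homologue or a putative annotation, which rounds to the 91% quoted in the Conclusions, and the 16 unannotated sequences correspond to the quoted 9%.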



In Silico Tryptic Digestion

Proteotypic peptides were generated in silico using the Protein Digestion Simulator for trypsin digestion, resulting in a list of 3629 peptides with their proteomic properties: monoisotopic mass, pI, and hydrophobicity as well as the predicted normalized elution time (NET) under both LC retention time prediction and strong cation exchange (SCX) fraction prediction (using a 0 to 1 scale). The results are available in Supporting Information, Table S5.

Table 1. Summary of Similarity Search, Functional Annotation, and In Silico Tryptic Digestion Analysis for hChr7 “Missing” Proteins

| method | description | no. of input sequences | no. of “missing” protein sequences annotated |
|---|---|---|---|
| BLASTP using specific databases | 1. reviewed non-human mammalian proteins with experimental evidence | 170 | 89 (52%) |
| | 2. reviewed non-human mammalian proteins | 81 | 38 (22%) |
| | 3. human proteins | 43 | 3 (2%) |
| | 4. reviewed non-mammalian proteins with experimental evidence | 43 | 2 (1%) |
| | 5. nonredundant proteins | 43 | 0 |
| | 6. PDB proteins | 43 | 0 |
| BLAST2GO | 1. GO | 43 | 20 (12%) |
| | 2. InterProScan domains/signatures | 43 | 17 (10%) |
| | 3. enzyme code for pathway mapping | 43 | 1 (0.6%) |
| KOBAS | KEGG pathways | 43 | 4 (2%) |
| Protein Digestion Simulator | trypsin digestion | 43 | 43 (100%) |

Table 2. ENCODE Proteogenomic Peptides Mapped to hChr7 “Missing” Proteins (Proteins with Matches to Peptides and Reverse Peptides Are in Bold)

| S. no. | hChr7 neXtProt ID | no. of matching ENCODE peptides | S. no. | hChr7 neXtProt ID | no. of matching reverse ENCODE peptides |
|---|---|---|---|---|---|
| 1 | NX_A4D1F6 | 11 | 13 | **NX_P0CL84** | 3 |
| 2 | NX_Q96JA3 | 5 | 14 | NX_Q6ZWJ8 | 3 |
| 3 | NX_P28039 | 2 | 15 | NX_Q75LS8 | 2 |
| 4 | NX_Q2T9F4 | 2 | 16 | NX_Q6ZN68 | 2 |
| 5 | NX_O60896 | 1 | 17 | NX_Q75MW2 | 2 |
| 6 | NX_Q15672 | 1 | 18 | NX_Q6NXN4 | 1 |
| 7 | NX_Q96N11 | 1 | 19 | NX_Q6X4U4 | 1 |
| 8 | NX_Q6NUM6 | 1 | 20 | NX_Q96CH1 | 1 |
| 9 | **NX_P0CL84** | 1 | 21 | NX_P58743 | 1 |
| 10 | NX_Q8TEE9 | 1 | 22 | NX_Q9Y493 | 1 |
| 11 | NX_B0FP48 | 1 | 23 | NX_Q96SN7 | 1 |
| 12 | NX_Q7Z7C7 | 1 | 24 | NX_Q8NGU1 | 1 |

CONCLUSIONS

We have identified a nonredundant and clean set of “missing” proteins of hChr7 for which no proteomic, antibody, or 3D structural information is available. Using selected high-quality protein databases, sequential BLAST similarity searches identified homologues with experimental evidence for 52% of the “missing” proteins, with another 22% mapping to reviewed non-human mammalian proteins. With homologues identified from higher mammals, these proteins have a high probability of acquiring experimental evidence in the near future. Using a suite of bioinformatics tools, we have assigned putative biological functions in terms of gene ontology, biochemical pathway, and domain/motif signatures for 27 (16%) of the remaining “novel” sequences. Despite the current level of biological knowledge in the databases, 16 sequences (9%) remain unannotated. To assist proteomic and antibody identification of these novel proteins, we have generated proteotypic peptides from in silico tryptic digestion. Recent data from the ENCODE project provided proteogenomic peptides, with which we can establish proteomic evidence with high confidence for four “missing” proteins while another eight proteins await further proteomic confirmation. By using a combination of computational strategies, 91% of the human chromosome 7 “missing” proteins have been assigned putative biological functionality, providing valuable clues for experimental validation assays. The approach we have described is generic and can be used to annotate the proteome of any human chromosome or any novel organism, to bridge the gap between proteins with a known biological role and those described as “uncharacterized”.

ASSOCIATED CONTENT

Supporting Information

Tables S1−S6 as described in the main text are provided as MS Excel files (*.xls). The table descriptions are as follows.
Table S1: hChr7 proteins with no proteomic, antibody, or 3D structural evidence. Sequences sorted alphabetically. Repeat sequences highlighted in yellow; viral sequences highlighted in cyan.
Table S2: Significant BLAST hits against non-human reviewed mammalian proteins with experimental evidence. Hits sorted by % coverage.
Table S3: Significant BLAST hits against non-human reviewed mammalian proteins. Hits sorted by % coverage; first Swiss-Prot entry reported as top hit.
Table S4: Functional annotations of the 43 novel “missing” proteins. Hits sorted alphabetically. GO, InterProScan, and enzyme codes from BLAST2GO; KEGG mapping from KOBAS. Note: NX_Q6NUR7 has been removed from neXtProt Chr7 and is now in Swiss-Prot as Q6NUR7.
Table S5: Proteotypic peptides for the novel “missing” proteins. Hits sorted by Unique_ID. MP No. refers to the serial number of missing proteins in Table S4. NET: normalized elution time; SCX: strong cation exchange.
Table S6: ENCODE proteogenomic peptides mapping to the hChr7 “missing” proteins. Hits sorted by number of peptides. A. Peptides with the same strand orientation at the mRNA level. B. Peptides with the reverse strand orientation at the mRNA level.
This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone: +612-9850-6262. Fax: +612-9850-8313 (for bioinformatics analysis). E-mail: [email protected]. Phone: +612-9850-8211. Fax: +612-9850-8313 (for c-HPP7).

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

We are grateful to Dr. David E. James, Garvan Institute of Medical Research, Australia, for his suggestion to search for primate homologues. We are also grateful to Drs. Jainab Khatun and Brian Rice for providing us the proteogenomic ENCODE data produced by Boise State University, USA.

ABBREVIATIONS

c-HPP, Chromosome-centric Human Proteome Project; EC, enzyme code; GO, gene ontology; hChr7, human chromosome 7; KEGG, Kyoto Encyclopedia of Genes and Genomes; NET, normalized elution time; NR, nonredundant; SCX, strong cation exchange



REFERENCES

(1) Stein, L. Genome annotation: From sequence to biology. Nat. Rev. Genet. 2001, 2, 493−503.
(2) Legrain, P.; Aebersold, R.; Archakov, A.; Bairoch, A.; Bala, K.; et al. The human proteome project: Current state and future direction. Mol. Cell. Proteomics 2011, 10, M111.009993.
(3) Hancock, W.; Omenn, G.; Legrain, P.; Paik, Y. K. Proteomics, human proteome project, and chromosomes. J. Proteome Res. 2011, 10, 210.
(4) Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; et al. The Chromosome-centric human proteome project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30, 221−223.
(5) Paik, Y. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; et al. Standard guidelines for the chromosome-centric human proteome project. J. Proteome Res. 2012, 11, 2005−2013.
(6) Baker, M. S. Building the ‘practical’ human proteome project - the next big thing in basic and clinical proteomics. Curr. Opin. Mol. Ther. 2009, 11, 600−602.
(7) Hillier, L. W.; Fulton, R. S.; Fulton, L. A.; Graves, T. A.; Pepin, K. H.; et al. The DNA sequence of human chromosome 7. Nature 2003, 424, 157−164.
(8) Scherer, S. W.; Green, E. D. Human chromosome 7 circa 2004: a model for structural and functional studies of the human genome. Hum. Mol. Genet. 2004, 13 (special no. 2), R303−R313.
(9) Yang, S. Gene amplifications at chromosome 7 of the human gastric cancer genome. Int. J. Mol. Med. 2007, 20, 225−231.
(10) Lane, L.; Argoud-Puy, G.; Britan, A.; Cusin, I.; Duek, P. D.; et al. neXtProt: a knowledge platform for human proteins. Nucleic Acids Res. 2012, 40 (Database issue), D76−83. Epub 2011 Dec 11.
Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403−410.
(11) UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40 (Database issue), D71−75.
(12) Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J.; Zhang, Z.; et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389−3402.
(13) Nagaraj, S. H.; Gasser, R. B.; Ranganathan, S. A hitchhiker’s guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007, 8, 6−21.
(14) Ranganathan, S.; Menon, R.; Gasser, R. B. Advanced in silico analysis of expressed sequence tag (EST) data for parasitic nematodes of major socio-economic importance: fundamental insights toward biotechnological outcomes. Biotechnol. Adv. 2009, 27, 439−448.
(15) Garg, G.; Ranganathan, S. In silico secretome analysis approach using next generation sequencing transcriptomic data. BMC Genomics 2011, 12 (Suppl 3), S14.
(16) Menon, R.; Garg, G.; Gasser, R. B.; Ranganathan, S. TranSeqAnnotator: large-scale analysis of transcriptomic data. BMC Bioinform. 2012, 13 (Suppl 17), S24.
(17) Garg, G.; Ranganathan, S. High-throughput functional annotation and data mining of fungal genomes to identify therapeutic targets. In Laboratory Protocols in Fungal Biology: Current Methods in Fungal Biology; Gupta, V. K., Tuohy, M. G., Ayyachamy, M., Turner, K. M., O’Donovan, A., Eds.; Springer: New York, 2013; pp 569−574.
(18) Quevillon, E.; Silventoinen, V.; Pillai, S.; Harte, N.; Mulder, N.; et al. InterProScan: protein domains identifier. Nucleic Acids Res. 2005, 33, W116−W120.
(19) Conesa, A.; Gotz, S.; Garcia-Gomez, J. M.; Terol, J.; Talon, M.; et al. BLAST2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005, 21, 3674−3676.
(20) Kanehisa, M.; Goto, S.; Hattori, M.; Aoki-Kinoshita, K. F.; Itoh, M.; et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006, 34, D354−D357.
(21) Xie, C.; Mao, X.; Huang, J.; Ding, Y.; Wu, J.; et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Res. 2011, 39, W316−W322.
(22) Young, N. D.; Jex, A. R.; Li, B.; Liu, S.; Yang, L.; et al. Whole-genome sequence of Schistosoma haematobium. Nat. Genet. 2012, 44, 221−225.
(23) Jex, A. R.; Liu, S.; Li, B.; Young, N. D.; Hall, R. S.; et al. Ascaris suum draft genome. Nature 2011, 479 (7374), 529−33.
(24) Protein Digestion Simulator: http://omics.pnl.gov/.
(25) Hopp, T. P.; Woods, K. R. Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl. Acad. Sci. U.S.A. 1981, 78, 3824−8.
(26) Sander, C.; Schneider, R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991, 9, 56−68.
(27) Zapała, B.; Kaczyński, Ł.; Kieć-Wilk, B.; Staszel, T.; Knapp, A.; et al. Humanins, the neuroprotective and cytoprotective peptides with antiapoptotic and anti-inflammatory properties. Pharmacol. Rep. 2010, 62, 767−777.