GenomewidePDB 2.0: A Newly Upgraded Versatile Proteogenomic

Aug 14, 2015 - Department of Biochemistry, Department of Integrated Omics for Biomedical Science (World Class University Graduate Program), Yonsei Uni...
Article pubs.acs.org/jpr

GenomewidePDB 2.0: A Newly Upgraded Versatile Proteogenomic Database for the Chromosome-Centric Human Proteome Project

Seul-Ki Jeong,† William S. Hancock,‡ and Young-Ki Paik*,†,§

† Yonsei Proteome Research Center and Biomedical Proteome Research Center, 50 Yonsei-Ro, Seodaemun-gu, Seoul 120-749, Korea
‡ Barnett Institute and Department of Chemistry and Chemical Biology, Northeastern University, 12 Oxford Street, Boston, Massachusetts 02115, United States
§ Department of Biochemistry, Department of Integrated Omics for Biomedical Science (World Class University Graduate Program), Yonsei University, 50 Yonsei-Ro, Sudaemoon-ku, Seoul 120-749, Korea

Downloaded by UNIV OF NEBRASKA-LINCOLN on August 25, 2015 | http://pubs.acs.org Publication Date (Web): August 19, 2015 | doi: 10.1021/acs.jproteome.5b00541



ABSTRACT: Since the launch of the Chromosome-centric Human Proteome Project (C-HPP) in 2012, the number of “missing” proteins has fallen to 2932, down from ∼5932 when the number was first counted in 2011. We compared the characteristics of missing proteins with those of already annotated proteins with respect to transcriptional expression pattern and the time periods in which newly identified proteins were annotated. We learned that missing proteins commonly exhibit lower levels of transcriptional expression and are expressed in fewer tissues than already annotated proteins. As a result, the remaining missing proteins become more difficult to identify as time goes on. One of the C-HPP goals is to identify alternatively spliced products of proteins (ASPs), which are usually difficult to find by shotgun proteomic methods because of their sequence similarity to the representative proteins. To resolve this problem, it may be necessary to use a targeted proteomics approach (e.g., selected and multiple reaction monitoring [S/MRM] assays) and an innovative bioinformatics platform that enables the selection of target peptides for rarely expressed missing proteins or ASPs. Given that the success of efforts to identify missing proteins may rely on more informative public databases, it was necessary to upgrade the available integrative databases. To this end, we improved the features and utility of GenomewidePDB by integrating transcriptomic information (e.g., alternatively spliced transcripts), annotated peptide information, and an advanced search interface that can find proteins of interest when applying a targeted proteomics strategy. This upgraded version of the database, GenomewidePDB 2.0, may not only expedite identification of the remaining missing proteins but also enhance the exchange of information among the proteome community. GenomewidePDB 2.0 is publicly available at http://genomewidepdb.proteomix.org/.
KEYWORDS: Chromosome-Centric Human Proteome Project, database, proteomics, alternative splicing, GenomewidePDB, missing protein



INTRODUCTION

The Chromosome-Centric Human Proteome Project (C-HPP) was launched in 2012 to identify, annotate, localize, and characterize proteins present in various tissues.1 The ultimate goal of the C-HPP is to fill the gap between genomic information and proteomic evidence by mapping the representative proteins encoded by genes present in human tissues.1−3 Another goal of the C-HPP is to map the major post-translational modifications (PTMs) and identify alternative spliced products (ASPs) of proteins.2 One of the most obvious barriers is the presence of “missing” proteins: gene products yet to be annotated using biological samples with sufficient mass spectrometric (MS) evidence at the protein level.2,4 To accomplish this goal in a globally coordinated manner, 25 international chromosome teams are in charge of mapping proteins encoded by their assigned chromosomes under standard guidelines.2 During the past 3 years there has been good progress in terms of identifying missing proteins.5−9 However, despite these extensive collaborative works at the HPP community level, more than 2900 proteins are still missing according to the recent release of neXtProt (2015-01-01).4,10 At this juncture, it is also worth noting that, as documented by Deutsch et al. (this issue),11 data from the TCGA project and the Pandey and Kuster laboratories have each contributed to reducing the number of missing proteins by several hundred during 2014, although the original claims of the latter two groups7,9 were far greater. Recently, Savitski et al.12 also pointed out, after re-evaluating both groups' data sets, that these two data sets have serious false-positive problems. This

Special Issue: The Chromosome-Centric Human Proteome Project 2015
Received: June 10, 2015

DOI: 10.1021/acs.jproteome.5b00541 J. Proteome Res. XXXX, XXX, XXX−XXX


Journal of Proteome Research


Table 1. Current State (Based on neXtProt Release 2015-01-01) of the Human Protein Knowledgebase According to Protein Evidence(a)

Chr. |    PE1 |  PE2 | PE3 | PE4 | PE5 |  total | w/o PE5 | missing | % missing
1    |   1688 |  280 |  35 |   6 |  49 |   2058 |    2009 |     321 | 16.0
2    |   1077 |  126 |   8 |   1 |  18 |   1230 |    1212 |     135 | 11.1
3    |    911 |  132 |  11 |   2 |  14 |   1070 |    1056 |     145 | 13.7
4    |    648 |   69 |  21 |   1 |  22 |    761 |     739 |      91 | 12.3
5    |    742 |  103 |  10 |   2 |  10 |    867 |     857 |     115 | 13.4
6    |    951 |  113 |  10 |   4 |  32 |   1110 |    1078 |     127 | 11.8
7    |    743 |  131 |  10 |   4 |  48 |    936 |     888 |     145 | 16.3
8    |    575 |   70 |  13 |   4 |  39 |    701 |     662 |      87 | 13.1
9    |    637 |  124 |  10 |   4 |  36 |    811 |     775 |     138 | 17.8
10   |    628 |  107 |   5 |   3 |  17 |    760 |     743 |     115 | 15.5
11   |    965 |  288 |  19 |   5 |  42 |   1319 |    1277 |     312 | 24.4
12   |    887 |  115 |   5 |   2 |  21 |   1030 |    1009 |     122 | 12.1
13   |    277 |   32 |   2 |   6 |  10 |    327 |     317 |      40 | 12.6
14   |    517 |   80 |   7 |   2 |  17 |    623 |     606 |      89 | 14.7
15   |    485 |   66 |  13 |   0 |  38 |    602 |     564 |      79 | 14.0
16   |    705 |   89 |  15 |   1 |  26 |    836 |     810 |     105 | 13.0
17   |   1011 |  120 |  11 |   4 |  24 |   1170 |    1146 |     135 | 11.8
18   |    237 |   26 |   2 |   1 |  10 |    276 |     266 |      29 | 10.9
19   |   1129 |  244 |  14 |   5 |  36 |   1428 |    1392 |     263 | 18.9
20   |    457 |   78 |   0 |   2 |  13 |    550 |     537 |      80 | 14.9
21   |    180 |   42 |   4 |   0 |  25 |    251 |     226 |      46 | 20.4
22   |    380 |   54 |   3 |   1 |  21 |    459 |     438 |      58 | 13.2
X    |    656 |  125 |  12 |   3 |  29 |    825 |     796 |     140 | 17.6
Y    |     25 |   11 |   4 |   0 |   7 |     47 |      40 |      15 | 37.5
MT   |     14 |    0 |   0 |   0 |   0 |     14 |      14 |       0 |  0.0
Sum  | 16 525 | 2625 | 244 |  63 | 604 | 20 061 |  19 457 |    2932 | 15.1

(a) PE1: MS evidence at the protein level; PE2: evidence only at the transcript level; PE3: evidence based on homology; PE4: predicted only; PE5: uncertain/dubious; total: sum of proteins at PE1, PE2, PE3, PE4, and PE5; % missing: the percentage of missing proteins from the total number of predicted proteins (excluding PE5 proteins).
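The derived columns of Table 1 can be recomputed from the five PE counts. The short Python sketch below is only an illustration: the names `PECounts` and `summarize` are hypothetical, and the rule that "missing" equals PE2 + PE3 + PE4 is an assumption that is not spelled out in the footnote but agrees with the table rows (for example, chromosome 13 and the Sum row).

```python
from typing import NamedTuple, Tuple


class PECounts(NamedTuple):
    """Protein counts at each neXtProt evidence level (hypothetical container)."""
    pe1: int
    pe2: int
    pe3: int
    pe4: int
    pe5: int


def summarize(c: PECounts) -> Tuple[int, int, int, float]:
    """Return (total, total without PE5, missing, % missing)."""
    total = c.pe1 + c.pe2 + c.pe3 + c.pe4 + c.pe5
    without_pe5 = total - c.pe5
    # Assumption consistent with the table rows: missing = PE2 + PE3 + PE4.
    missing = c.pe2 + c.pe3 + c.pe4
    # Guard against an empty row to avoid dividing by zero.
    pct_missing = round(100.0 * missing / without_pe5, 1) if without_pe5 else 0.0
    return total, without_pe5, missing, pct_missing


# Counts copied from Table 1 (chromosome 13 row and Sum row).
chr13 = PECounts(pe1=277, pe2=32, pe3=2, pe4=6, pe5=10)
genome = PECounts(pe1=16525, pe2=2625, pe3=244, pe4=63, pe5=604)
print(summarize(chr13))   # (327, 317, 40, 12.6)
print(summarize(genome))  # (20061, 19457, 2932, 15.1)
```

Excluding PE5 entries from the denominator follows the footnote's definition of % missing.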

suggests that more stringent criteria for claiming a discovery of missing proteins need to be established in this field. Previously, it was thought that the proteomic profiling of various clinical and nonclinical samples (cell lines) coupled to high-resolution MS or MS/MS analytical methods would expand proteome coverage; however, this does not seem to be the case when it comes to identification of missing proteins.7,9,13 Although additional profiling results from body fluids and cell lines have helped us to identify rare or low-abundance proteins, this approach can increase proteome coverage by only a small amount.9 The use of a combination of several fractionation methods (e.g., hydrophilic interaction chromatography, strong cation exchange chromatography, and OFFGEL electrophoresis) and enrichment methods (such as TiO2 and lectins) with various MS/MS fragmentation modes (collision-induced dissociation, electron-transfer dissociation) was expected to improve proteomic detection, but the outcomes fell short of the expected proteome coverage.13,14

Another goal of the C-HPP is to identify at least one representative ASP per gene.2 ASPs have many important roles in biology, and they are known to be a major source of cell- and tissue-specific protein variation that can increase protein diversity without increasing the genome size.15,16 It is also well-documented that many ASPs are related to human diseases.17 For example, splicing variants of soluble fms-like tyrosine kinase-1 are up-regulated in preeclampsia, and they are known to inhibit vascular endothelial growth factor.18,19 According to neXtProt (release 2015-01-01), there are 32 346 ASPs derived from 10 507 proteins. In practice, identifying ASPs is difficult because they share sequence identities with the representative proteins encoded by the same gene. Thus, most peptides identified from an MS analysis of a representative protein are usually shared by its ASPs.20 Identification of ASPs using shotgun proteomics requires extreme care in handling false-positives, which should be distinguished from ASP-specific peptides.20,21

Targeted proteomics can complement shotgun (or bottom-up) proteomics with top-down proteomics or selected/multiple reaction monitoring (S/MRM) assays.22−24 The top-down approach may be a good way to detect large polypeptides spanning long regions of a given protein,25 maximizing the sequence coverage of identified proteins.22,25 The S/MRM approach is more sensitive than shotgun proteomics because it enables the identification not only of more low-abundance proteins but also of ASP-specific peptides.23,24 These targeted approaches might be a good complement to the routine methods of identifying missing proteins and ASPs.

In this study, we demonstrate a new way of assessing the characteristics of missing proteins by cross-comparing them with proteins that have already been annotated (having protein-level evidence). We also established new approaches to sample selection and protein detection, such as targeted proteomics for finding missing proteins and distinguishing ASPs. Finally, to support the C-HPP’s mission of identifying both missing proteins and ASPs, we upgraded GenomewidePDB by improving on the previous version,26 integrating proteomic data in a chromosome-by-chromosome manner.


Figure 1. Progress in identification of the missing proteins over time. (A) The number of proteins at each PE level according to the annual neXtProt status. (B) The percentage of each group of proteins according to the annual neXtProt status. (C) Changes in the number of missing proteins from 2011 to 2015.

Transcriptional expression data obtained from various tissues, recently annotated peptides, and an advanced protein search function were also added in GenomewidePDB 2.0.

MATERIALS AND METHODS

Integration of Information

To upgrade GenomewidePDB in a chromosome-by-chromosome manner, we obtained updated proteomic data, transcriptomic data, and other information from public databases. All proteins encoded by genes located on chromosomes were linked to other resources. These include neXtProt for protein information, GO annotations, and disease-related information.10 Twenty-four neXtProt releases (from neXtProt release 2011-03-23 to 2015-01-01) were used for analyzing the changes in the status of missing proteins over time. More detailed disease information was taken from Online Mendelian Inheritance in Man (OMIM),27 and oncogene product information was taken from the Cancer Gene Census28 (2012-03-15 release, provided by the Sanger Institute, Cambridge, U.K.). The Human Protein Atlas (HPA; release 13)8,29 was referenced for information regarding antibody availability and the expression levels of proteins in tissues. We gathered and integrated the gene expression information on proteins from NCBI UniGene.30,31

Statistical Analyses

To define the difference between the features of missing proteins and already annotated proteins, we used the Mann−Whitney U test to calculate p values. The receiver operating characteristic (ROC) curve, area under the ROC curve (AUC) value, and Youden’s index (Y-index) were analyzed using PanelComposer.32

Construction of the Web Interface

The web interface for GenomewidePDB was implemented using Apache (version 2.2.17, www.apache.org) and PHP (version 5.3.5, www.php.net) to enable query results from the database to be displayed in HTML format. The database was constructed with an entity−relationship model using MySQL (version 5.5.8, www.mysql.com), which is linked to Apache and PHP, enabling query results to be received and reported through the Web.

Peptide Data Processing

To annotate peptides that represent only one protein (unique peptides) or several proteins (shared peptides), we wrote a Java (version 1.8.0, https://java.com) program that constructs a tryptic peptide list from protein sequences and counts the number of proteins whose sequences contain each peptide. These two modules were used to prepare a list of peptides that appear in more than one protein and to annotate the peptides’ uniqueness. We considered trypsin as the only protease, and peptides with fewer than eight amino acids were ignored.
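The peptide-processing step described under Peptide Data Processing can be sketched in a few lines. The Python example below is a minimal illustration and not the authors' Java program. It assumes fully tryptic cleavage after K or R except before P with no missed cleavages (the text does not state the exact cleavage rule), it matches peptides by exact identity rather than by substring search, and it uses hypothetical names (`tryptic_peptides`, `annotate`) and toy sequences. The U/G/S codes follow the categories used by the database: uniquely mapped, gene-specific, and shared.

```python
import re
from collections import defaultdict
from typing import Dict, Set, Tuple

MIN_LENGTH = 8  # from the text: peptides with fewer than eight residues are ignored


def tryptic_peptides(sequence: str, min_length: int = MIN_LENGTH) -> Set[str]:
    """Fully tryptic peptides with no missed cleavages.

    Assumed rule: cut after K or R unless the next residue is P.
    Splitting on a zero-width pattern requires Python 3.7 or later.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_length}


def annotate(proteins: Dict[str, Tuple[str, str]]) -> Dict[str, str]:
    """Map each peptide to U (one isoform), G (one gene, several isoforms), or S (several genes).

    proteins: isoform accession -> (gene identifier, amino acid sequence).
    """
    owners = defaultdict(set)  # peptide -> {(gene, isoform), ...}
    for isoform_id, (gene_id, seq) in proteins.items():
        for pep in tryptic_peptides(seq):
            owners[pep].add((gene_id, isoform_id))
    codes = {}
    for pep, hits in owners.items():
        genes = {gene for gene, _ in hits}
        if len(hits) == 1:
            codes[pep] = "U"
        elif len(genes) == 1:
            codes[pep] = "G"
        else:
            codes[pep] = "S"
    return codes


# Toy sequences (hypothetical, for illustration only).
toy = {
    "P1-1": ("GENE1", "MAAAAAAAAKCCCCCCCCCR"),
    "P1-2": ("GENE1", "MAAAAAAAAKDDDDDDDDDR"),
    "P2-1": ("GENE2", "MEEEEEEEEKDDDDDDDDDR"),
}
for peptide, code in sorted(annotate(toy).items()):
    print(peptide, code)
```

In this toy input, the peptide common to both GENE1 isoforms is coded G, the peptide found in GENE1 and GENE2 is coded S, and the remaining two peptides are coded U.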


Figure 2. ROC of the annotated proteins and missing proteins. The table under each graph shows the first and third quartiles with median values. (A) Effect of all known expression levels on protein detectability. (B) Effect of average expression levels of each transcript on protein detectability. (C) Number of expressed tissues versus protein detectability.



RESULTS AND DISCUSSION

Current State of Missing Proteins

We adopted the neXtProt release 2015-01-01 as a reference protein database for assessing the current number of gene-matched human proteins and missing proteins with respect to each chromosome (www.nextprot.org). According to this reference database, there are 20 061 proteins (Table 1). These proteins can be categorized into five protein evidence (PE) levels4,10 according to the degree of MS evidence and other associated information (such as antibody detection, amino acid sequence analysis, and gene expression data). When total proteins were classified by these criteria, the numbers were as follows: 16 525 proteins at PE1, 2625 at PE2, 244 at PE3, 63 at PE4, and 604 at PE5. We did not use proteins with PE5 because they are most likely dubious proteins.4 After removing the proteins classified as PE5, the total number of gene-encoded human proteins was 19 457, of which 2932 (∼15%) were deemed missing proteins that need to be annotated.

We analyzed the number of missing proteins over time by collecting the numbers of newly identified proteins and missing proteins in 24 neXtProt releases covering 4 years (2011-03-23 to 2015-01-01) (Figure 1). In the first year of the C-HPP Consortium foundation (2011),2 the total number of predicted human proteins was 19 520, of which 5932 (30.4%) were regarded as missing proteins (neXtProt release 2011-08-23). Given that 2932 proteins are now regarded as missing (15.1% of 19 457 total predicted proteins; neXtProt release 2015-01-01), it appears that the proteomics community has identified ∼3000 new proteins by various methods.

Characteristics of Missing Proteins

We analyzed the characteristics of the missing proteins to determine whether they display differences in gene expression compared with those already annotated. We gathered the transcript expression information from NCBI UniGene31 and examined the relative expression levels of transcripts in 45 tissue samples (such as brain, kidney, liver, lung, and placenta) measured by expressed sequence tags (ESTs) provided by UniGene. In UniGene, the expression unit is defined as transcripts per million (TPM), which can be used as a measure of normalized transcript expression. Because the number of ESTs generated from different tissues varied, normalization was needed to make further comparisons more reliable. UniGene expression data were available for 16 257 of the 16 525 (98.4%) already annotated proteins and for 2196 of the 2932 (74.9%) missing proteins. In summary, 18 453 of 19 457 (94.8%) proteins have UniGene expression data.

The expression level of each transcript differed significantly between the missing proteins and those already annotated. Measured expression levels of all transcripts present in different tissues were used for this analysis. The transcripts of annotated proteins had an average expression level of 98.3 TPM, whereas the transcripts of missing proteins had an average of 39.2 TPM. The difference in expression levels between the two groups was 59.1 TPM (∼2.5-fold, p = 1.4 × 10−110) (Figure 2). When these data were subjected to ROC analysis, we obtained an AUC value of 0.718 (Figure 2A) and the best Y-index (32.53) at 25 TPM. The Y-index is calculated from sensitivity and specificity (Sensitivity + Specificity − 1) and is useful for selecting an optimal cutoff (the point at which the Y-index reaches its maximum value).33 This result shows that, in general, transcripts of missing proteins tend to have lower expression levels than transcripts of proteins that have already been annotated. The expression level appears to be an important feature of missing proteins, which may explain why some proteins are very hard to find.

The same analysis was performed by taking the average expression level in different tissues for comparison to minimize the effect on the ROC of the large difference between the expression levels of known proteins and missing proteins (410 538 vs 21 135). When we used an average value of expression level for each transcript, the difference in the average expression levels between the transcripts of annotated proteins (79.1) and missing proteins (24.5) was slightly reduced but still significant. Using these values, we also obtained results in which the two groups were significantly different (p = 1.5 × 10−79), making them easily distinguishable (AUC value = 0.819; Figure 2B).

We also tested whether the discovery rate of missing proteins would be increased if the transcripts of proteins were expressed in many types of tissues. We counted the number of expressed tissues for the transcripts of annotated proteins (16 257) and the missing proteins (2196) and then analyzed whether there was any significant difference between the two groups by performing simple statistical analysis and ROC analysis. The average numbers of expressed tissues for the transcripts of annotated and missing proteins were 25.3 and 9.6, respectively (p = 2.8 × 10−81). The AUC from the ROC analysis was 0.838. As shown in Figure 2C, there was a clear difference in the proportion of proteins expressed in more than six types of tissue between the annotated proteins (89.2%, 14 498 out of 16 257) and the missing proteins (47.5%, 1043 out of 2196). The cutoff selected by the Y-index for the number of expressed tissues was 17. A total of 1781 (81.1%) missing proteins’ transcripts were found to be expressed in fewer than 17 tissues, whereas 11 597 (71.3%) annotated proteins’ transcripts were expressed in more than 17 tissues.

Figure 3. Decreasing trend of average values for both transcript expression level and the number of expressed tissues from 2011 to 2015.
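The ROC quantities used in this section can be illustrated with a small, dependency-free Python sketch. It is not the PanelComposer implementation used by the authors. The function name `auc_and_youden` and the toy TPM values are hypothetical, and a protein is assumed to be called "annotated" when its value is at or above the cutoff. The AUC is obtained from the Mann−Whitney U statistic (U divided by the number of annotated/missing pairs), and the Y-index is Sensitivity + Specificity − 1. The sketch returns the Y-index as a fraction, whereas values such as 32.53 in the text appear to be percentages (an assumption).

```python
from typing import List, Optional, Tuple


def auc_and_youden(positives: List[float],
                   negatives: List[float]) -> Tuple[float, float, Optional[float]]:
    """Return (AUC, best Youden index, cutoff achieving it).

    positives: values for the group expected to score higher (annotated proteins).
    negatives: values for the other group (missing proteins).
    """
    n_pos, n_neg = len(positives), len(negatives)
    # Mann-Whitney U: pairs in which the positive value is larger; ties count 0.5.
    u = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    auc = u / (n_pos * n_neg)

    best_j, best_cutoff = -1.0, None
    for cutoff in sorted(set(positives) | set(negatives)):
        sensitivity = sum(p >= cutoff for p in positives) / n_pos
        specificity = sum(n < cutoff for n in negatives) / n_neg
        j = sensitivity + specificity - 1
        if j > best_j:  # keep the first cutoff that reaches the maximum
            best_j, best_cutoff = j, cutoff
    return auc, best_j, best_cutoff


# Toy TPM values (hypothetical, for illustration only).
annotated_tpm = [120, 80, 60, 30, 25]
missing_tpm = [40, 20, 8, 5, 2]
auc, j, cutoff = auc_and_youden(annotated_tpm, missing_tpm)
print(f"AUC={auc:.2f}  Y-index={j:.2f}  cutoff={cutoff}")
```

The pairwise U computation is quadratic in the number of samples, which is acceptable for a sketch; a rank-based formulation would be preferable for data sets of the size analyzed here.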

Are the Remaining Missing Proteins More Difficult to Identify?

We formed two groups of proteins and analyzed them to investigate whether there was any notable difference in patterns of transcriptional expression between the recently identified missing proteins and the remaining missing proteins. A total of 4907 proteins were chosen from neXtProt because they had transcriptional expression data in UniGene and were still missing in 2011 (neXtProt release 2011-08-23). These proteins were divided into two groups. Group 1 has 1967 proteins that are still missing as of January 1, 2015 (neXtProt release 2015-01-01). Group 2 has 2940 proteins that were newly identified between 2011-08-23 and 2015-01-01 (neXtProt release 2015-01-01). We then compared the expression levels and number of expressed tissues of the transcripts of these two groups of proteins. The level of transcript expression was significantly different between the two groups (p = 6.1 × 10−74), but the discriminating power was not strong (AUC = 0.61; Y-index = 16.94). There was a 2-fold difference (p = 1.3 × 10−56) in the average number of expressed tissues of the transcripts between the two groups (9 tissues for Group 1 proteins vs 18 tissues for Group 2 proteins). The AUC value of the ROC was 0.729, whereas the Y-index was 35.34.

Between 2011 and 2015, the average expression level and number of expressed tissues for each transcript gradually decreased (Figure 3). The average expression level of the missing proteins’ transcripts was 43.9 in 2011 (neXtProt release 2011-08-23), but in subsequent years the values fell to 42.6 (neXtProt release 2012-08-24), 38.5 (neXtProt release 2013-08-17), and 34.1 (neXtProt release 2015-01-01). The average number of expressed tissues for each transcript followed a similar pattern during this 4-year period: 14.6, 13.7, 10.5, and 9.5.

We further analyzed the difference in patterns of transcript expression in a single tissue. We selected testis because it has the highest number of expressed transcripts. We selected 3798 proteins in the same way as previously described and divided them into Group 1 (1418) and Group 2 (2380). Group 1 has an average of 24.9 TPM, whereas Group 2 has an average of 27.6 TPM. These two groups also differ significantly in transcript expression levels (p = 1.7 × 10−11).

From these results, we learned a few important lessons. First, proteins that are still missing tend to exhibit lower expression levels and a lower number of expressed tissues than newly identified proteins, and these values are decreasing over time. Second, considering this decrease in both transcript expression level and expressed tissue numbers, we may have to use targeted proteomics options to increase our chances of discovering the remaining missing proteins.

Upgraded Features of GenomewidePDB 2.0

To support the C-HPP’s mission of searching for the missing proteins and identifying their ASPs, we sought to improve the utility and function of GenomewidePDB by updating several features. This gene-centric proteome database, which can be expanded with new data sets on demand, was designed to integrate proteomic data from experimentally identified proteins encoded by human chromosomes.26 We improved the database by introducing three new features: (1) transcript expression data for various tissues, (2) a list of tryptic peptides of each protein and its isoforms with annotated information such as uniqueness, and (3) an advanced protein search interface that can find proteins for targeted proteomics.

Figure 4. Transcript and protein expression profiles (e.g., of Protein ADP-ribosylarginine hydrolase-like protein 1, NX_Q8NDY3). (A) Transcripts expressed in various tissues in different states (site, health, developmental stage) as measured by EST. (B) Transcript expression values (FPKM) by RNA-Seq. (C) The expression profiles for proteins in human tissues based on immunohistochemistry using tissue microarrays. (D) Staining profiles for proteins in human tumor tissue. Each box represents the staining level of protein present in patients’ tumor tissue.

Transcript and Protein Expression Profile. The expanded proteome database now includes transcript expression data obtained from two sources. From NCBI UniGene,31 we gathered normalized transcript expression levels measured by expressed sequence tags (ESTs) and the expression levels of transcripts from various tissues (45 types of tissue), developmental stages (such as fetus, infant, juvenile, and adult), and 26 types of tissues in various states of health (including normal and several types of tumor and carcinoma).


Figure 5. “Annotation of peptides” view in an individual protein view, peptide list, and codes. Code U is denoted as a green box, G as a yellow box, and S as a red box. S codes link to the peptide-sharing proteins list.

Figure 6. (A) Keyword search and (B) Advanced search.

We also integrated the RNA expression levels of 32 tissues measured by RNA-Seq from the Human Protein Atlas.8 From the Human Protein Atlas, we also gathered the expression profiles of proteins in human tissues and tumors based on immunohistochemistry. These expression profiles for transcripts and proteins are accessible through the individual protein view page (Figure 4). As shown in Figure 4A, the expression levels of transcripts among the various tissues, health status, and developmental stage measured by EST are represented as bar charts with their TPM values. The RNA-Seq profiles are shown in Figure 4B as a bar chart. Bars are colored according to their relative abundance levels. Highly expressed RNAs (FPKM > 50) are shown as red bars. Moderately expressed RNAs (10 < FPKM < 50) are represented by orange bars, and lower expressed RNAs (1 < FPKM < 10) are shown in yellow. The lowest expressed RNAs (FPKM value of 0 or 1) are shown in white. The expression profiles for proteins in human tissues based on immunohistochemistry using tissue microarrays are provided with tissue names and their expression level (Figure 4C). Staining profiles for proteins in human tumor tissue are shown in colored boxes (Figure 4D). Each box represents the staining level of protein present in patients’ tumor tissue (red: high, orange: moderate, yellow: low, gray: not detected).

Information on the Tryptic Peptides. The newly added list of annotated tryptic peptides can be accessed via the individual protein view page in the database. The peptides were annotated into three categories: uniquely mapped, gene-specific, and shared peptides. The uniquely mapped peptides come from only one protein, whereas the gene-specific and shared peptides come from several proteins. Therefore, the uniquely mapped peptides are the key to identifying their corresponding proteins correctly. Gene-specific peptides are mapped uniquely to a gene product but to multiple protein isoforms. Shared peptides are mapped to multiple proteins from several gene products. We integrated the information on the annotated tryptic peptides of all proteins into the upgraded database. This information may be useful for analyzing MS profiling results to determine how many missing proteins are identified with a high level of confidence. The “annotation of peptides” can be seen in the individual protein view page (Figure 5A). The peptides of a given protein are listed with peptide sequence and uniqueness codes. Code U denotes uniquely mapped peptides, G denotes peptides that are specific at the gene level, and S denotes shared peptides. This information is crucial for selecting peptides for S/MRM assays and distinguishing ASPs. We ignored peptides with sequence lengths shorter than eight amino acids because they are usually not recommended for protein identification.

Advanced Search Tool. Users can find target proteins from GenomewidePDB 2.0 via search interfaces. We added two search forms, “keyword search” and “advanced search” (Figure 6). The keyword search enables proteins to be found using simple keywords such as gene name, protein description, and chromosomal location (e.g., 13q12). The keyword search can query one or two words or sentences but cannot use conditional clauses. The advanced search is more complex. Using the advanced search interface, users can query using conditional clauses on specific terms and columns. The advanced search can query (1) level of protein evidence (PE), (2) number of known isoforms, (3) number of known variants, (4) number of known post-translational modifications, (5) the presence or absence of disease-related information, (6) genes expressed in “body site”/“health state”/“developmental stage” with over/under/exact expression level, (7) protein expression measured by immunohistochemistry in specific “tissues”/“cancers”, (8) specific chromosomes, (9) number of peptides depending on annotated type, and (10) range of molecular weights of a protein.

Figure 7. Examples of advanced search results. (A) Conditional clauses and number of results. (B) Examples of two Speedy E2B isoforms (SPDYE2B-1: NX_A6NHP3-1, SPDYE2B-2: NX_A6NHP3-2).




For example, using this search method, we retrieved 327 proteins from chromosome 13 (including PE5), from which we filtered out the 277 already annotated proteins and 10 PE5-level proteins. Out of the remaining 40 proteins, we selected 29 that have at least one uniquely mapped peptide and three gene-specific peptides. We then further selected two proteins that have transcript expression in the placenta (Figure 7A). These proteins are good targets for employing S/MRM to identify missing proteins. Another example is the proteins that may need a top-down approach to detect large polypeptides spanning long regions of a given protein.25 To this end, we took the 2932 missing proteins and then selected 92 proteins that contain no uniquely mapped peptides. Proteins with no uniquely mapped peptides may need to be digested with proteases other than trypsin to be annotated, or they can be subjected to a top-down MS approach for detecting their intact form. Nine of the ninety-two proteins had >1 isoform (ASP), among which five proteins had an expression level of >25 TPM in any tissue (Figure 7A). The Speedy E2B (SPDYE2B, NX_A6NHP3) protein is a good example of such a case. SPDYE2B has two isoforms, SPDYE2B-1 and SPDYE2B-2; the former (NX_A6NHP3-1) has 144 more amino acids than the latter (NX_A6NHP3-2) at the beginning of the sequence. In principle, this region could distinguish the two isoforms; however, all six tryptic peptides from this region are shared with at least two proteins. These kinds of proteins are hard to distinguish using bottom-up proteomics and S/MRM methods. Using this advanced search tool, users may be able to find target proteins (or peptides) of interest suitable for their proteomics work.
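The chromosome 13 example above amounts to a chain of conditional filters. The Python sketch below mimics that logic on in-memory records. It is an illustration only, not the database's query engine (which the text describes as built on PHP and MySQL): the `ProteinRecord` fields, the function `smrm_candidates`, and the toy accessions are all hypothetical, and "has transcript expression" is assumed to mean a TPM value greater than zero.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProteinRecord:
    """Hypothetical record holding the columns used in the example search."""
    accession: str
    chromosome: str
    pe_level: int                 # neXtProt protein evidence level, 1 to 5
    unique_peptides: int          # peptides with code U
    gene_specific_peptides: int   # peptides with code G
    tpm_by_tissue: Dict[str, float] = field(default_factory=dict)


def smrm_candidates(records: List[ProteinRecord], chromosome: str, tissue: str,
                    min_unique: int = 1, min_gene_specific: int = 3) -> List[ProteinRecord]:
    """Missing proteins (PE2 to PE4) on one chromosome with enough peptides and tissue expression."""
    return [
        r for r in records
        if r.chromosome == chromosome
        and r.pe_level in (2, 3, 4)                 # drop annotated (PE1) and dubious (PE5)
        and r.unique_peptides >= min_unique
        and r.gene_specific_peptides >= min_gene_specific
        and r.tpm_by_tissue.get(tissue, 0.0) > 0.0  # assumed meaning of "expressed"
    ]


# Toy records (hypothetical accessions and values).
records = [
    ProteinRecord("TOY_A", "13", 2, 2, 3, {"placenta": 12.0}),
    ProteinRecord("TOY_B", "13", 1, 5, 4, {"placenta": 80.0}),  # already annotated
    ProteinRecord("TOY_C", "13", 3, 1, 3, {"testis": 4.0}),     # not seen in placenta
    ProteinRecord("TOY_D", "13", 2, 0, 6, {"placenta": 9.0}),   # no unique peptide
    ProteinRecord("TOY_E", "1", 2, 3, 3, {"placenta": 5.0}),    # other chromosome
]
hits = smrm_candidates(records, chromosome="13", tissue="placenta")
print([r.accession for r in hits])  # ['TOY_A']
```

Keeping each condition as a separate clause mirrors the way the advanced search combines conditional clauses, so individual thresholds can be relaxed without touching the others.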




AUTHOR INFORMATION

Corresponding Author

*Phone: +82-2-2123-4242. Fax: +82-2-393-6589. E-mail: [email protected].

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work was supported by grants from the National Research Foundation of Korea (2011-0028112 to Y.-K.P.) and the Korean Ministry of Health and Welfare (HI13C2098 to Y.-K.P. and A112047 to S.-K.J.).

REFERENCES

(1) Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; et al. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30, 221−3.
(2) Paik, Y. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; et al. Standard guidelines for the Chromosome-centric Human Proteome Project. J. Proteome Res. 2012, 11 (4), 2005−13.
(3) Hancock, W.; Omenn, G. S.; Legrain, P.; Paik, Y. K. Proteomics, human proteome project, and chromosomes. J. Proteome Res. 2011, 10, 210.
(4) Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; et al. Metrics for the Human Proteome Project 2013−2014 and strategies for finding missing proteins. J. Proteome Res. 2014, 13, 15−20.
(5) Marko-Varga, G.; Omenn, G. S.; Paik, Y. K.; Hancock, W. S. A first step toward completion of a genome-wide characterization of the human proteome. J. Proteome Res. 2013, 12, 1−5.
(6) Paik, Y. K.; Omenn, G. S.; Thongboonkerd, V.; Marko-Varga, G.; Hancock, W. S.; et al. Genome-wide proteomics, Chromosome-Centric Human Proteome Project (C-HPP), part II. J. Proteome Res. 2014, 13, 1−4.
(7) Kim, M. S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; et al. A draft map of the human proteome. Nature 2014, 509, 575−81.
(8) Uhlén, M.; Fagerberg, L.; Hallström, B. M.; Lindskog, C.; Oksvold, P.; et al. Proteomics. Tissue-based map of the human proteome. Science 2015, 347, 1260419.
(9) Wilhelm, M.; Schlegl, J.; Hahne, H.; Gholami, A. M.; Lieberenz, M.; et al. Mass-spectrometry-based draft of the human proteome. Nature 2014, 509, 582−7.
(10) Gaudet, P.; Michel, P. A.; Zahn-Zabal, M.; Cusin, I.; Duek, P. D.; et al. The neXtProt knowledgebase on human proteins: current status. Nucleic Acids Res. 2015, 43, D764−70.
(11) Deutsch, E. W.; Sun, Z.; Campbell, D.; Kusebauch, U.; Chu, C. S.; et al. The State of the Human Proteome in 2014/2015 as viewed through PeptideAtlas: enhancing accuracy and coverage through the AtlasProphet. J. Proteome Res.
2015, 150724142438005. (12) Savitski, M. M.; WIlhelm, M.; Hahne, H.; Kuster, B.; Bantscheff, M.; et al. A scalable approach for protein false discovery rate estimation in large proteomic data sets. Mol. Cell. Proteomics 2015, mcp.M114.046995. (13) Lee, H. J.; Jeong, S. K.; Na, K.; Lee, M. J.; Lee, S. H.; et al. Comprehensive genome-wide proteomic analysis of human placental tissue for the Chromosome-Centric Human Proteome Project. J. Proteome Res. 2013, 12, 2458−66. (14) Brunner, E.; Ahrens, C. H.; Mohanty, S.; Baetschmann, H.; Loevenich, S.; et al. A high-quality catalog of the Drosophila melanogaster proteome. Nat. Biotechnol. 2007, 25, 576−83. (15) Nilsen, T. W.; Graveley, B. R. Expansion of the eukaryotic proteome by alternative splicing. Nature 2010, 463, 457−63. (16) Omenn, G. S.; Menon, R.; Zhang, Y. Innovations in proteomic profiling of cancers: Alternative splice variants as a new class of cancer

CONCLUSIONS

In this study, we compared missing proteins with already identified proteins in terms of transcript expression levels and the number of tissues in which they are expressed with reference to the time period during which those missing proteins were identified and annotated. We learned that missing proteins become very hard to detect as time goes on because they apparently have smaller numbers of expressed tissues and lower expression levels in general, which makes our quest to find the remaining missing proteins more difficult. To overcome such obstacles in identifying those indistinguishable proteins such as ASPs and low-abundance missing proteins, we may consider using top-down MS approaches and targeted proteomics using the S/MRM assays instead of traditional bottom-up MS profiling. In line with this proposal, we provided multiomics data sets in the GenomewidePDB 2.0, which facilitates the use of genomic data. We also improved the search function to enable target proteins with specific features such as molecular weight, number of unique peptides, and expression level in tissue to be found. One of our challenges in the C-HPP is stimulating and assisting other teams to utilize and compare the several databases and informatics tools developed by individual teams. GenomewidePDB 2.0 may not only expedite identification of the remaining missing proteins but also enhance the exchange of information among the proteome community. GenomewidePDB 2.0 is publicly available at http://genomewidepdb.proteomix.org/. I

DOI: 10.1021/acs.jproteome.5b00541 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Downloaded by UNIV OF NEBRASKA-LINCOLN on August 25, 2015 | http://pubs.acs.org Publication Date (Web): August 19, 2015 | doi: 10.1021/acs.jproteome.5b00541

Journal of Proteome Research biomarker candidates and bridging of proteomics with structural biology. J. Proteomics 2013, 90, 28−37. (17) Menon, R.; Zhang, Q.; Zhang, Y.; Fermin, D.; Bardeesy, N.; et al. Identification of novel alternative splice isoforms of circulating proteins in a mouse model of human pancreatic cancer. Cancer Res. 2009, 69, 300−9. (18) Heydarian, M.; McCaffrey, T.; Florea, L.; Yang, Z.; Ross, M. M.; et al. Novel splice variants of sFlt1 are upregulated in preeclampsia. Placenta 2009, 30, 250−5. (19) Sela, S.; Itin, A.; Natanson-Yaron, S.; Greenfield, C.; GoldmanWohl, D.; et al. A novel human-specific soluble vascular endothelial growth factor receptor 1: cell-type-specific splicing and implications to vascular endothelial growth factor homeostasis and preeclampsia. Circ. Res. 2008, 102, 1566−74. (20) Menon, R.; Im, H.; Zhang, E. Y.; Wu, S. L.; Chen, R.; et al. Distinct splice variants and pathway enrichment in the cell-line models of aggressive human breast cancer subtypes. J. Proteome Res. 2014, 13, 212−27. (21) Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 2014, 11, 1114−25. (22) Ahlf, D. R.; Thomas, P. M.; Kelleher, N. L. Developing top down proteomics to maximize proteome and sequence coverage from cells and tissues. Curr. Opin. Chem. Biol. 2013, 17, 787−94. (23) Hüttenhain, R.; Soste, M.; Selevsek, N.; Röst, H.; Sethi, A.; et al. Reproducible quantification of cancer-associated proteins in body fluids using targeted proteomics. Sci. Transl. Med. 2012, 4, 142ra94. (24) Boersema, P. J.; Kahraman, A.; Picotti, P. Proteomics beyond large-scale protein expression analysis. Curr. Opin. Biotechnol. 2015, 34C, 162−170. (25) Garcia, B. A, Top Down mass spectrometry. J. Am. Soc. Mass Spectrom. 2010, 21, 193−202. (26) Jeong, S. K.; Lee, H. J.; Na, K.; Cho, J. Y.; Lee, M. J.; et al. 
GenomewidePDB, a proteomic database exploring the comprehensive protein parts list and transcriptome landscape in human chromosomes. J. Proteome Res. 2013, 12, 106−11. (27) Hamosh, A.; Scott, A. F.; Amberger, J. S.; Bocchini, C. A.; McKusick, V. A.; et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005, 33, D514−7. (28) Futreal, P. A.; Coin, L.; Marshall, M.; Down, T.; Hubbard, T.; et al. A census of human cancer genes. Nat. Rev. Cancer 2004, 4, 177− 83. (29) Uhlen, M.; Ponten, F. Antibody-based proteomics for human tissue profiling. Mol. Cell. Proteomics 2005, 4, 384−93. (30) Wheeler, D. L.; Church, D. M.; Federhen, S.; Lash, A. E.; Madden, T. L.; et al. Database resources of the National Center for Biotechnology. Nucleic Acids Res. 2003, 31, 28−33. (31) NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2015, 43, D6−17. (32) Jeong, S. K.; Na, K.; Kim, K. Y.; Kim, H.; Paik, Y. K.; et al. PanelComposer: a web-based panel construction tool for multivariate analysis of disease biomarker candidates. J. Proteome Res. 2012, 11, 6277−81. (33) Youden, W. J. Index for rating diagnostic tests. Cancer 1950, 3, 32−5.
