Appraisal of the Missing Proteins Based on the ... - ACS Publications


Article

Appraisal of the missing proteins based on the mRNAs bound to ribosomes

Shaohang Xu, Ruo Zhou, Zhe Ren, Baojin Zhou, Zhilong Lin, Guixue Hou, Yamei Deng, Jin Zi, Liang Lin, Quanhui Wang, Xin Liu, Xun Xu, Bo Wen, and Siqi Liu

J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00476 • Publication Date (Web): 26 Oct 2015

Downloaded from http://pubs.acs.org on November 1, 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Appraisal of the missing proteins based on the mRNAs bound to ribosomes

Shaohang Xu1#, Ruo Zhou1#, Zhe Ren1, Baojin Zhou1, Zhilong Lin1, Guixue Hou1,2, Yamei Deng1,2, Jin Zi1, Liang Lin1, Quanhui Wang1,2, Xin Liu1, Xun Xu1, Bo Wen1*, Siqi Liu1,2*

1 BGI-Shenzhen, Shenzhen 518083, China;

2 CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 101318, China

#These authors contributed equally to this work.

*To whom correspondence should be addressed:

Siqi Liu, Ph.D Phone: 86-10-80485325 Fax: 86-10-80485324 E-mail: [email protected] Address: Airport Industrial Zone B-6, Shunyi, Beijing 101318, China

Bo Wen Tel and Fax: 86-0755-25273620 E-mail: [email protected] Address: BGI-Shenzhen, 11 Build, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

ACS Paragon Plus Environment


Abstract

Considering the technical limitations of mass spectrometry in protein identification, the mRNAs bound to ribosomes (RNC-mRNA) are assumed to reflect the mRNAs participating in the translational process. The RNC-mRNA data are therefore reasoned to be useful for appraising the missing proteins. A set of multi-omics data including free mRNAs, RNC-mRNAs, and proteomes was acquired from three liver cancer cell lines. Based on the missing proteins in neXtProt (release 2014-09-19), the bioinformatics analysis was carried out in three phases: 1) finding how many neXtProt missing proteins have or do not have RNA-seq and/or MS/MS evidence, 2) analyzing specific physicochemical and biological properties of the missing proteins that lack both RNA-seq and MS/MS evidence, and 3) analyzing the combined properties of these missing proteins. A total of 1501 missing proteins were found by neither RNC-mRNA nor MS/MS in the three liver cancer cell lines. For these missing proteins, some are expected to have higher hydrophobicity, unsuitable detection, or sensory functions as properties at the protein level, while some are predicted to have non-expressing chromatin structures on the corresponding gene level. With further integrated analysis, we could attribute 93% of them (1391/1501) to these causal factors, which result in the expression products scarcely detected by RNA-seq or MS/MS.

Keywords

Missing protein, ENCODE, mRNA, RNC-mRNA, RNA-seq, MS/MS, DHS, DNase I hypersensitivity site, undetectable proteins


Introduction

The primary goal of the international Chromosome-Centric Human Proteome Project (C-HPP) is to create an annotated proteomic catalog for each chromosome. In recent years, many groups have focused on using proteomics to bridge the major gaps in connecting the evidence of genomic, epigenomic, and transcriptomic variation with the diverse phenotypes1,2. In the first special C-HPP issue of the Journal of Proteome Research (JPR) in January 2013, the HPP executive committee and the investigators agreed on five standard baseline metrics for the whole proteome and for each chromosome3. The “protein evidence levels”, which are used in neXtProt4,5 (and in UniProt/SwissProt), were generally classified into five categories by Lane et al.6. Thus, the proteins that are not detected using an antibody or mass spectrometry are assigned as the “missing proteins”. In the first special issue of the JPR in 2013, Hancock et al. estimated that there were 6568 missing proteins by subtracting the average of the three mass spectrometry databases from the Ensembl genes3. In the second special issue of JPR in 2014, Omenn et al.6 further updated the number of missing proteins to a total of 3844. Recently, neXtProt deployed a new release (neXtProt release 2014-09-19) that included major data updates of interest to the proteomics community. With the new proteomics studies that were re-analyzed and integrated in PeptideAtlas7, neXtProt reported that more than 82% of the entries of the human genome genes were confirmed at the protein level, and the number of missing proteins decreased to 2948.

The significant improvements in proteomic technology were responsible for dramatically shrinking the number of missing proteins within a short period of time. For instance, in the paper published by Muraoka et al.8 in 2013, a total of 7092 proteins were identified with high-confidence MS/MS evidence using membrane fractions isolated from breast cancer tissues. Of the identified proteins, the authors claimed that 851 were missing proteins according to the neXtProt version of 2013. However, most of these proteins (840 out of 851) no longer belong to the missing proteins based on the recent list in neXtProt release 2014-09-19. Can all of the missing proteins be determined by developing the available proteomic techniques? The answer is not clear because there is a lack of systematic investigations into the reasons why proteins remain undetectable. Was the protein missed due to its specific features, such as its physicochemical properties, physiological features, or abundance? Was the gene encoding the missing protein poorly expressed because of a special genomic structure? Was the gene encoding the missing protein mistakenly annotated? Without knowing these answers, we cannot explain the missing proteins.

Total mRNAs were poorly correlated with the protein abundances. However, the mRNAs that were bound to ribosomes (RNC-mRNAs) were reported to be an important factor for the protein abundances as well as their functions. Ingolia claimed that studies using ribosome profiling had already provided new insights into the identity and the amount of proteins that were produced by cells, as well as detailed views into the mechanism of protein synthesis itself9. Considering the technical limitations of mass spectrometry for protein identification, Ingolia et al. pointed out that proteomics techniques have substantial limits on their ability to independently determine protein sequences and measure low-abundance proteins, whereas profiling of RNC-mRNAs greatly increased the ability to quantitatively monitor protein production10. Recently, Wang et al. systematically analyzed the relative abundances of mRNAs, RNC-mRNAs, and proteins on the genome-wide scale and thus proposed that the relative abundances of the translating mRNAs and proteins were strongly correlated in cells at steady-state conditions11,12. Taking the translatome as a background of protein expression, Chang et al. found that the protein abundance played a decisive role in the detectability of a protein13. Due to the limitations of the current technology, proteomics has difficulty offering an overall view of the expression of the genes that encode proteins. However, the RNC-mRNAs, which are acquired from full sequencing analysis, are expected to provide an alternative option for estimating the translated genes. We therefore suggest that achieving high-quality RNC-mRNA data is the foundation for the analysis of the missing proteins.

In this study, we adopted the evidence of the RNC-mRNA to evaluate the missing proteins. By combining the information from the RNC-mRNA and MS/MS, we narrowed down the number of the missing proteins to 1501. Furthermore, bioinformatics analysis was performed to determine the factors that lead to the undetectability. Our analysis revealed that 375 missing proteins without a detectable signal were relatively abundant in the sensory- and membrane-related processes and possessed higher hydrophobicity scores. In addition, there were 220 missing proteins that were predominantly located in the testis tissue, and 930 missing proteins had their corresponding genes in chromatin regions without a DNase I hypersensitivity site (DHS). Based on the theoretical analysis, we attributed approximately 93% of the missing proteins to known biological or physicochemical reasons that led to either poor expression of the genes or difficulty in detecting the expressed products.

Methods

1. Data resources

All of the data for the bioinformatics analysis were obtained from the Chinese Human Chromosome Proteome Consortium (CCPC) database. Briefly, three human hepatocellular carcinoma cell lines (MHCC97H, HCCLM3, and Hep3B) were cultured and equally divided into three groups for follow-up experiments that profiled the transcriptome, translatome, and proteome, respectively. To prepare high-quality samples for the transcriptome and translatome, we extracted the total mRNA and RNC-mRNA immediately after harvesting the cultured cells at 80% confluence. The RNC extraction was performed as described by Esposito et al.14. In brief, the cells were pre-treated with cycloheximide followed by washes with pre-chilled phosphate buffered saline. After an ice bath, the cell debris was removed by centrifugation and the resulting supernatants were transferred onto the sucrose buffer. The RNCs were pelleted by ultracentrifugation. Total RNA and RNC-RNA were each isolated using TRIzol® RNA extraction reagent (Ambion, Austin, TX). Equal amounts of total RNA or RNC-RNA from each preparation were pooled, respectively, for subsequent library construction and RNA-seq. The sequencing libraries were constructed following the TruSeq™ RNA Sample Preparation Guide (Illumina, San Diego, CA). First-strand cDNA was synthesized with SuperScript II Reverse Transcriptase (Invitrogen) using random primers, and Ampure XP beads (Beckman Coulter, Beijing, China) were used to isolate the double-stranded cDNA synthesized by Second Strand Master Mix. The purified libraries were quantified with a Qubit® 2.0 Fluorometer (Invitrogen) and validated with an Agilent 2100 bioanalyzer (Agilent, Beijing, China). The clusters were generated by cBot with the library diluted to 10 pM and were then sequenced on the Illumina HiSeq-2000 sequencer for 50 cycles11. High-quality reads that passed the Illumina quality filters were kept for the sequence analysis. For the proteome measurement, the proteins were extracted from the harvested cells and analyzed using mass spectrometers (the Q Exactive and the Triple TOF 5600).

2. Analysis of the transcriptomic and proteomic data

The data processing of the transcriptome and proteome was also described in the methods section of Chang et al.13. In brief, for the transcriptomic data, the sequencing reads were mapped to the Ensembl-v72 RNA reference sequences using the FANSe2 algorithm15. Genes with at least 10 mapped reads were considered to have a valid identification and quantification16. For the proteomic data, the peptide search was performed using a local Mascot v2.3.2 server17 against a database that contained the sequences of all human proteins from Swiss-Prot (release-2014_08). The search results were filtered through Mayu18 by setting the PSM-, peptide-, and protein-level FDR at 1%. The MS/MS spectra of the identified missing proteins were manually checked using the pLabel software developed by the pFind group19.
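The read-count criterion above can be sketched as follows. This is a minimal illustration, assuming the per-gene mapped-read counts are already available as a dictionary; the function name, gene names, and counts are hypothetical and are not taken from the study.

```python
from typing import Dict, Set

MIN_MAPPED_READS = 10  # threshold stated in the text


def identified_genes(read_counts: Dict[str, int],
                     min_reads: int = MIN_MAPPED_READS) -> Set[str]:
    """Return the gene IDs whose mapped-read count meets the threshold."""
    return {gene for gene, n in read_counts.items() if n >= min_reads}


# Hypothetical counts, for illustration only.
example_counts = {"GENE_A": 152, "GENE_B": 10, "GENE_C": 9, "GENE_D": 0}
print(sorted(identified_genes(example_counts)))
```

A gene with exactly 10 mapped reads is kept because the criterion is "at least 10".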

3. Categorization of the genes that encode proteins


All of the human genes that encode proteins and their relevant annotations were based on neXtProt (release 2014-09-19, http://www.nextprot.org/db). The missing proteins were defined by following the “protein evidence levels”, a method that has been previously described by Lane et al.6. For an efficient conversion from genome IDs to protein IDs, three steps were performed: 1) based on neXtProt (2014-09-19), a total of 19439 protein entries in Swiss-Prot format were obtained after removal of the “uncertain” or “dubious” protein entries; 2) using BioMart20, the RNA and RNC-mRNA data in Ensembl format were converted to Swiss-Prot format after removal of the entries not included in the 19439; and 3) although the MS/MS search results in Swiss-Prot format did not need format conversion, the entries not among the 19439 protein entries were deleted.
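The conversion and filtering steps above can be sketched as a lookup followed by a set membership test. This is only an illustration under stated assumptions: the Ensembl-to-Swiss-Prot mapping is assumed to have been exported beforehand (for example from BioMart) into a dictionary, and all identifiers below are made-up placeholders rather than real accessions.

```python
from typing import Dict, Iterable, List, Set


def to_reference_entries(ensembl_ids: Iterable[str],
                         ensembl_to_swissprot: Dict[str, List[str]],
                         reference_entries: Set[str]) -> Set[str]:
    """Map Ensembl IDs to Swiss-Prot accessions and keep only reference entries."""
    converted = set()
    for gene_id in ensembl_ids:
        # An Ensembl ID may map to zero, one, or several accessions.
        for accession in ensembl_to_swissprot.get(gene_id, []):
            if accession in reference_entries:
                converted.add(accession)
    return converted


# Hypothetical placeholders standing in for the 19439 reference entries.
reference = {"P_0001", "P_0002", "P_0003"}
mapping = {"ENS_A": ["P_0001"], "ENS_B": ["P_0002", "P_9999"], "ENS_C": []}
print(sorted(to_reference_entries(["ENS_A", "ENS_B", "ENS_C", "ENS_D"],
                                  mapping, reference)))
```

The MS/MS results, which are already in Swiss-Prot format, would only need the final membership test against the reference set.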

4. Characteristic analysis of the genes that encode the missing proteins

The physicochemical features of the proteins, such as the molecular weight (MW) and hydrophobicity (Hy), were analyzed using ProPAS21. The functions and tissue-dependent distribution of the missing proteins were estimated through enrichment analysis using DAVID 6.7 (http://david.abcc.ncifcrf.gov/)22,23. During the analysis, the categories were divided into four subsets: biological process (BP), cellular component (CC), molecular function (MF), and UniProt tissue (UP_TISSUE). The protein topology was surveyed using TMHMM 2.0, which is based on hidden Markov models24-26. The expected number of amino acids in the transmembrane helices was obtained to predict the membrane-spanning domains of the proteins. The DNase I hypersensitivity sites (DHSs) were derived from the DNase clusters (V3), which collect the DHS information from 125 cell types and are stored in the UW and Duke ENCODE data27 (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeRegDnaseClustered/wgEncodeRegDnaseClusteredV3.bed.gz). The DHSs from six liver cancer cell lines were specifically chosen for the related analysis. In addition, to better recognize the distribution patterns of features such as hydrophobicity, peptide size, and DHS, density curves were plotted using the R package “ggplot2” with a Gaussian smoothing kernel.
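Counting the DHS clusters that fall on each gene can be sketched as an interval-overlap count. This is a minimal illustration, assuming the gene regions and the DHS clusters have already been parsed into (chromosome, start, end) tuples with half-open coordinates as in BED files; the coordinates and gene names below are hypothetical, and the simple scan is not necessarily how the original analysis was implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open as in BED


def count_overlapping_dhs(genes: Dict[str, Interval],
                          dhs_sites: List[Interval]) -> Dict[str, int]:
    """Count, for each gene region, the DHS clusters that overlap it."""
    by_chrom = defaultdict(list)
    for chrom, start, end in dhs_sites:
        by_chrom[chrom].append((start, end))
    counts = {}
    for gene, (chrom, g_start, g_end) in genes.items():
        # Two half-open intervals overlap when each starts before the other ends.
        counts[gene] = sum(1 for s, e in by_chrom[chrom]
                           if s < g_end and e > g_start)
    return counts


# Hypothetical coordinates, for illustration only.
toy_genes = {"G1": ("chr1", 100, 200), "G2": ("chr1", 500, 600), "G3": ("chr2", 0, 50)}
toy_dhs = [("chr1", 150, 160), ("chr1", 190, 210), ("chr1", 200, 250), ("chr2", 60, 70)]
print(count_overlapping_dhs(toy_genes, toy_dhs))
```

For a genome-wide file, a sorted or indexed interval structure would be preferable to this quadratic scan.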

Results and Discussion

1. The missing proteins in three liver cancer cell lines

On the basis of the neXtProt release 2014-09-19, the total number of protein entries encoded in the human genome is 20055, which are divided into five categories (PE1-5) based on the previous identification evidence. If the “dubious” or “uncertain” proteins (PE5) are not considered, the number of proteins encoded by the human genome is expected to be 19439. According to the definition used by neXtProt, 2948 protein entries belong to the category of missing proteins, which consists of PE2, PE3, and PE4.

We performed profiling analysis of mRNA and protein expression in three liver cancer cell lines. To clearly define the missing proteins, we reanalyzed the raw data with the new version of neXtProt. With the support of the RNA-seq data, a total of 16240 protein entries were identified for mRNA, and 15924 protein entries were identified for RNC-mRNA. Based on the MS/MS data, a total of 8018 protein entries were identified through stringent criteria. Importantly, for most of these identified proteins (7869), their mRNAs or RNC-mRNAs were identified as well. This concordance implies that the genes encoding the proteins in these cell lines are efficiently transcribed and translated. Moreover, the genes in the two datasets, mRNA and RNC-mRNA, overlapped by approximately 97% in these cell lines. This overlap suggests that a large portion of the mRNAs were able to bind to the ribosomes in the three cell lines. Among the 2948 missing proteins defined by Lane et al.6, there are 1393 genes encoding proteins without mRNA evidence, 1509 genes encoding proteins without RNC-mRNA evidence, and 2925 proteins without MS/MS evidence (Figure 1). Furthermore, 23 proteins that were listed as missing proteins in neXtProt (2014-09-19) were identified by MS/MS in the three liver cancer cell lines (Supplementary Table S5). To ensure the data quality for the identification of these missing proteins, we manually checked all 86 MS/MS spectra related to the 23 missing proteins and found that these spectra were of high quality for protein identification (the MS/MS spectra are listed in Supplementary Figure S2). Importantly, the initial analysis of the protein matches generated evidence for 59 neXtProt missing proteins. However, applying the stringent FDR at the protein level and scrutinizing all the spectra reduced this finding to 23 higher-quality missing protein matches.

It is generally accepted that the mRNAs that are bound to the ribosome-nascent chain complex are undergoing translation. Therefore, the RNC-mRNA status may exhibit a much closer correlation with protein synthesis. Thus, an integrated view of RNC-mRNA should be able to better estimate the potential ability of genes to be translated. Due to technological limitations, the current mass spectrometric techniques and the antibody-based assays still cannot detect all of the proteins or the peptides that are generated from the protease digestion. Moreover, the detectability of a protein relies on the detectability of its corresponding peptides, as well as on biological features such as transcription efficiency, mRNA stability, or gene location. We therefore propose to appraise the missing proteins from a new angle. Because the binding of mRNAs to ribosomes is a necessary step towards protein synthesis, the RNC-mRNA of a gene can be taken to represent the corresponding synthesized protein. A stringent criterion for a missing protein should consider whether the RNC-mRNA is detectable. In the Venn plot in Figure 1, of the 2925 missing proteins without MS/MS evidence, approximately 49% (1424/2925) possess corresponding RNC-mRNA signals in the three liver cancer cell lines, while approximately 51% (1501/2925) lack corresponding RNC-mRNA signals in the same cell lines. Considering the missing protein list (2948), the transcripts and proteins identified and depicted in Figure 1 were therefore divided into four groups: PIM, RI-PE1, RI-PE234, and NR-PE234. These groups are defined in Table 1, and the transcripts and proteins in these groups were evaluated in several aspects as described below.
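The partition of the missing proteins by evidence type can be sketched with set operations, followed by a check of the counts quoted above. This is a minimal illustration; the function name and the toy identifiers are hypothetical, and only the three counts (2948, 23, 1501) come from the text.

```python
from typing import Set, Tuple


def split_by_evidence(missing: Set[str],
                      rnc_detected: Set[str],
                      msms_detected: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split missing proteins into: MS/MS-identified, RNC-mRNA only, neither."""
    with_msms = missing & msms_detected
    without_msms = missing - msms_detected
    rnc_only = without_msms & rnc_detected
    neither = without_msms - rnc_detected
    return with_msms, rnc_only, neither


# Toy identifiers, for illustration only.
toy = split_by_evidence({"a", "b", "c", "d"}, {"a", "b"}, {"a"})
print(toy)

# Arithmetic check using the counts reported in the text.
MISSING_TOTAL = 2948
MSMS_IDENTIFIED = 23
NEITHER = 1501  # no RNC-mRNA and no MS/MS evidence
without_msms = MISSING_TOTAL - MSMS_IDENTIFIED   # 2925
rnc_only = without_msms - NEITHER                # 1424
print(without_msms, rnc_only,
      round(100 * rnc_only / without_msms, 1),
      round(100 * NEITHER / without_msms, 1))
```

The two percentages round to the approximately 49% and 51% given in the text.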

2. Physicochemical and biological characteristics of the proteins

The physicochemical or biological properties of proteins may negatively affect protein or peptide detection28. We evaluated the proteins of the four groups in four aspects to determine why the missing proteins in NR-PE234 could not be identified using MS/MS.

a) Hydrophobicity


The hydrophobicity of the proteins in the above four groups was estimated using ProPAS, and the distribution of the hydrophobicity scores in each group was generated. Figure 2A reveals that the hydrophobicity curves for PIM, RI-PE1, and RI-PE234 were highly overlapped, with a peak value of -0.35. The hydrophobicity curve of NR-PE234 had two peaks. The first peak overlapped with the other three curves. The second peak was distinct and had a peak value of 0.75. By setting a threshold of 0.52 to distinguish the two hydrophobicity peaks in the NR-PE234 group, the 1501 missing proteins could be divided into two groups, and 424 proteins had high hydrophobicity scores (approximately 30%). This result suggests that a considerable number of the missing proteins were hydrophobic and could not be easily prepared in an aqueous solution. Furthermore, the 424 proteins with the higher hydrophobicity values were subjected to the TMHMM analysis. The expected number of amino acids in the transmembrane helices (Exp number) was plotted against the corresponding hydrophobicity scores, as shown in Figure 2B. With an Exp number of 18 set as the threshold for a transmembrane protein or a signal peptide, the bivariate histogram with hexagonal binning revealed that only 248 of the 1077 proteins with a hydrophobicity of less than 0.52 had an Exp number over 18. However, 420 of the 424 proteins with high hydrophobicity scores (>=0.52) possessed an Exp number over 18. This result means that most of the missing proteins with high hydrophobicity are located within a membrane.
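The thresholding step above can be sketched with a sequence-averaged hydropathy score. This is an illustration under a stated assumption: the score is computed here as the mean Kyte-Doolittle hydropathy (often called GRAVY), which may differ from the exact hydrophobicity definition used by ProPAS. Only the 0.52 threshold comes from the text; the sequences are hypothetical.

```python
# Kyte-Doolittle hydropathy index per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

HY_THRESHOLD = 0.52  # threshold between the two peaks, as stated in the text


def gravy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy of a protein sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def is_high_hydrophobicity(sequence: str, threshold: float = HY_THRESHOLD) -> bool:
    """True when the score is at or above the threshold (>= 0.52 in the text)."""
    return gravy(sequence) >= threshold


# Hypothetical toy sequences, for illustration only.
for seq in ("LLLLKK", "KDEKDE"):
    print(seq, round(gravy(seq), 2), is_high_hydrophobicity(seq))
```

With the counts quoted above, 420 of 424 high-hydrophobicity proteins (about 99%) exceed an Exp number of 18, against 248 of 1077 (about 23%) in the low-hydrophobicity group.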

b) Suitable peptides

A theoretical estimation of the molecular mass of the gene expression products identified in this study is illustrated in Supplementary Figure S1. The figure demonstrates similar distribution curves of the molecular mass in PIM, RI-PE1, and RI-PE234, with a large range from 4 to 5 and a peak at 4.75. The NR-PE234 group had a unique distribution curve with a narrow range from 4.47 to 4.65 (corresponding to a MW of 30-45 kDa) and a peak at 4.55. The average molecular mass in the NR-PE234 group was not significantly different from that in the other groups. Moreover, protein identification by MS/MS is determined by the sizes of the tryptic peptides derived from the complete digestion of a protein and does not depend on the protein size. Using MS/MS, Swaney et al. estimated the likelihood of peptide detection in a peptide pool from yeast proteins and proposed that tryptic peptides suitable for MS detection should have sizes of approximately 7–35 amino acid residues29. Farrah's group found that the 60S ribosomal protein L41 has many lysine and arginine residues and thus yields no tryptic peptides longer than two amino acids30. Therefore, we applied a theoretical prediction of the tryptic peptide sizes to the databases described above. With a tryptic peptide length of between 7 and 35 amino acid residues taken as the cutoff, the number of peptides meeting the cutoff was plotted against the corresponding peptide lengths, as illustrated in Figure 2C. The distribution curves of PIM, RI-PE1, and RI-PE234 were similar, with a peak value of approximately 1.35, whereas the distribution curve of NR-PE234 differed from the others, with a peak value of 0.99. Among the 1501 proteins in the NR-PE234 group, approximately 70% had few tryptic peptides suitable for detection using MS/MS.
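The theoretical tryptic digestion described above can be sketched as follows. This is a minimal in-silico digest, assuming the common rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P), with no missed cleavages; the text does not state which digestion rule was used, and the sequence below is a hypothetical toy example.

```python
from typing import List

MIN_LEN, MAX_LEN = 7, 35  # suitable peptide length range quoted in the text


def tryptic_peptides(sequence: str) -> List[str]:
    """Fully digest a sequence: cleave after K or R, but not before P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides


def suitable_peptide_count(sequence: str,
                           min_len: int = MIN_LEN,
                           max_len: int = MAX_LEN) -> int:
    """Number of tryptic peptides whose length lies within [min_len, max_len]."""
    return sum(min_len <= len(p) <= max_len for p in tryptic_peptides(sequence))


# Hypothetical toy sequence, for illustration only.
toy_sequence = "MKRAAAAAAAKPGGGGGR"
print(tryptic_peptides(toy_sequence))
print(suitable_peptide_count(toy_sequence))
```

A protein rich in K and R, like the ribosomal protein mentioned above, would produce only very short fragments and therefore a count of zero.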


c) DHSs

DHSs are chromatin regions sensitive to cleavage by the DNase I enzyme. The greater the number of DHSs, the greater the likelihood of gene expression. For instance, Fan et al. attributed the scarce expression of beta-defensins to genes with poor DHS scores31. We assumed that the distribution of DHSs on chromatin, as determined in ENCODE, was a proper fit for the liver cancer cell lines that were studied here. For the analysis of DHSs, we first collected the DHS information from the 6 liver cancer cell lines in the ENCODE database and then systematically examined the distribution of their DHSs. Based on the gene expression products, the corresponding genes were referenced back to the chromosome regions. The DHSs on these genes were acquired from the ENCODE data. In each group, the ratios of the genes with a given number of DHSs were plotted against the number of DHSs per gene, as depicted in Figure 2D. The ratios of the DHSs in PIM, RI-PE1, and RI-PE234 were, on average, higher compared with those in the NR-PE234 group. The X axis value of –Inf means that there was no DHS in that gene region. The genes within such a region contained a structural hindrance that prevents DNase attack. Clearly, the NR-PE234 group displayed a significantly higher ratio of –Inf (approximately 62%), indicating that the genes in this group had relatively low transcriptional activity (Table S1).
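The –Inf value on the X axis is what a logarithm returns for a count of zero, so a gene without any DHS lands in a separate bin. The sketch below illustrates this under an assumption: the text does not state the logarithm base, and base 2 is used here purely for illustration. The per-gene counts are hypothetical.

```python
import math
from typing import Iterable


def log2_dhs_count(n_dhs: int) -> float:
    """Log-transform a DHS count; a gene with no DHS maps to -inf."""
    return math.log2(n_dhs) if n_dhs > 0 else float("-inf")


def fraction_without_dhs(counts: Iterable[int]) -> float:
    """Fraction of genes whose region contains no DHS at all."""
    counts = list(counts)
    return sum(1 for n in counts if n == 0) / len(counts)


# Hypothetical per-gene DHS counts, for illustration only.
toy_counts = [0, 0, 3, 1, 0]
print([log2_dhs_count(n) for n in toy_counts])
print(fraction_without_dhs(toy_counts))
```

The counts quoted elsewhere in the text are consistent with the stated ratio: 930 of the 1501 proteins in the NR-PE234 group have genes without a DHS, which is about 62%.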

d) Gene ontology

The enrichment analysis of the expression products in this study was performed using the DAVID software and was focused on three categories: biological process (BP), cellular component (CC), and molecular function (MF). Because the proteins included in PIM, RI-PE1, and RI-PE234 account for 83% of the 19439 proteins, they are expected to achieve a wide coverage of the gene ontology. Therefore, the enrichment analysis of the gene ontology was focused on the proteins in the NR-PE234 group to determine their biological characteristics. All of the protein entries were input into DAVID, and approximately 85% were recognized by it. With significant enrichment set as an FDR of less than 0.01, the DAVID analysis results are presented in Figure 2E. In BP, a total of 12 process items were significantly enriched. Of the enriched items, 7 contained larger numbers of protein entries (over 350): 366 in sensory perception of smell, 379 in sensory perception of chemical stimulus, 383 in sensory perception, 383 in cognition, 399 in neurological system process, 443 in cell surface receptor linked signal transduction, and 438 in G-protein coupled receptor protein signaling pathway. The 7 items were readily categorized into two functional processes, sensory- and membrane-related. In CC, 5 components were significantly enriched, of which the 3 items with the largest numbers of protein entries were 619 in the integral membrane, 621 in the intrinsic membrane, and 478 in the plasma membrane. In MF, only one molecular function was enriched: 368 in olfactory receptor activity. According to the analysis of the gene ontology, we concluded that a large portion of the proteins in the NR-PE234 group were closely associated with sensory functions and membrane location. Muraoka et al.8 reported that 3282 membrane proteins were identified using membrane fractions isolated from pooled breast cancer tissues, including 851 missing proteins based on the neXtProt version of 2013. However, based on neXtProt release 2014-09-19, these missing proteins shrank to only 11, indicating that the identification of the missing proteins made great progress within a year. Furthermore, among the 1501 missing proteins according to our definition, almost 42% were predicted to be membrane proteins, whereas only 4 of the 851 missing proteins were found in this protein list. This fact led to the postulation that some genes encoding extremely hydrophobic proteins are difficult to transcribe. DAVID was also applied to a tissue-specific analysis of the proteins in the NR-PE234 group. These proteins were significantly enriched in two tissues: 24 in the hair and 220 in the testis. This result suggests that the NR-PE234 group proteins are relatively enriched in the testis tissue. The observation is in agreement with the report from Zhang et al.32.
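The idea behind an enrichment test can be sketched with a plain hypergeometric upper tail. This is not DAVID's computation: DAVID reports the EASE score, a modified Fisher exact test, together with multiple-testing corrections, whereas the sketch below shows only the unmodified one-sided probability for a single term. All numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population: int, annotated: int,
                         drawn: int, hits: int) -> float:
    """P(X >= hits) when `drawn` items are sampled without replacement from a
    population of which `annotated` items carry the term of interest."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    favorable = sum(comb(annotated, i) * comb(population - annotated, drawn - i)
                    for i in range(hits, upper + 1))
    return favorable / total


# Hypothetical toy numbers: 20 background proteins, 5 annotated with a term,
# 5 proteins in the list of interest, 3 of which carry the term.
p_value = hypergeom_upper_tail(20, 5, 5, 3)
print(round(p_value, 4))
```

In a real analysis, one such probability would be computed per term and then adjusted for multiple testing before applying a cutoff such as the FDR of less than 0.01 used above.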

3. Combination analysis of the missing proteins

Analysis of the gene ontology revealed that the missing proteins were relatively enriched in certain processes, functions, or tissues. In fact, a protein rarely possesses only a single feature. To obtain an overall view of the properties of the missing proteins, it is necessary to carry out a combination analysis that integrates the physicochemical and the biological features of a protein. Through this combination analysis, we can attribute the failure of detection to physical or biological causes.

The 1501 missing proteins in the NR-PE234 group were broadly divided into six categories according to the features described above, as listed in Table 2. To evaluate the missing proteins with unique or shared physical and biological features, we generated a hierarchically clustered heatmap. The 1501 missing proteins were placed in rows and the 6 categories in columns. Rows and columns were clustered using the Euclidean distance and the complete linkage method. In Figure 3A, the column dendrogram revealed three major clusters: 1) proteins in the sensory pathway, hydrophobicity, and membrane categories, 2) those in the DHS and unsuitable peptide categories, and 3) those in the testis tissue category. In cluster 1, the membrane category contained 632 proteins, of which 375 (59%) clustered with proteins in the other categories; in cluster 2, the unsuitable peptide category contained 1035 proteins, of which 707 (68%) clustered with the low-DHS proteins. Therefore, the proteins with sensory functions, high hydrophobicity, and membrane location showed the highest correlation among the three cluster groups. This is expected, because most sensory proteins are located in the membrane and membrane-spanning domains have higher hydrophobicity. Remarkably, the proteins with unsuitable peptides were clustered with those that contained few DHSs. There seems to be no obvious explanation or logical link between these two categories. Apart from these two clusters, the proteins specifically located in the testis formed a cluster of their own. This result indicates that tissue-dependent expression is a considerable factor for the missing proteins.
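The clustering rule named above (Euclidean distance with complete linkage) can be illustrated with a minimal Python sketch. It is not the software used for Figure 3A, which the text does not name; the 8-protein membership matrix is invented for illustration, and the function name `complete_linkage` and its `n_clusters` stopping rule are assumptions of this sketch.

```python
from itertools import combinations
from math import dist

# Hypothetical toy matrix: 8 made-up proteins (positions) x 6 feature categories.
# 1 means the protein carries the feature. These values are illustrative only
# and are NOT taken from the study.
features = {
    "SP": [1, 1, 1, 0, 0, 0, 0, 0],
    "MB": [1, 1, 1, 1, 0, 0, 0, 0],
    "HH": [1, 1, 0, 1, 0, 0, 0, 0],
    "ND": [0, 0, 0, 0, 1, 1, 1, 0],
    "UP": [0, 0, 0, 1, 1, 1, 1, 0],
    "TT": [0, 0, 0, 0, 0, 0, 0, 1],
}

def complete_linkage(vectors, n_clusters):
    """Agglomerative clustering: Euclidean distance, complete linkage.

    Repeatedly merges the two closest clusters until n_clusters remain.
    """
    clusters = [(name,) for name in vectors]

    def linkage(a, b):
        # Complete linkage: cluster distance is the LARGEST pairwise
        # Euclidean distance between members of the two clusters.
        return max(dist(vectors[x], vectors[y]) for x in a for y in b)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters

column_clusters = complete_linkage(features, n_clusters=3)
print(sorted(sorted(c) for c in column_clusters))
```

With this toy input the six columns fall into the same three groups described for Figure 3A (SP/MB/HH, ND/UP, and TT alone), but only because the invented matrix was built to show the mechanics of the method.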

If RNC-mRNA is regarded as a form of gene expression at the pre-translational stage, we propose that studying the proteins that lack identification evidence at both the protein and the RNC-mRNA level may offer valuable information for understanding the missing proteins. A reasonable deduction is that a gene without detection evidence from RNC-mRNA or MS/MS likely reflects either a transcription failure or an undetectable protein. The cluster analysis in Figure 3A appears to be a reasonable approach for an overall evaluation of which feature(s) can lead to this phenomenon. However, the figure does not directly count how many missing proteins are well explained, or why they failed to be detected as mRNA or as protein. Therefore, we used an approach that attributes these features to the missing proteins. First, the missing proteins were classified into 6 functional categories. Specifically, 386 proteins were in the sensory perception group (SP), 632 were in the membrane group (MB), 220 were in the testis group (TT), 930 were in the no DHS group (ND), 424 were in the high hydrophobicity group (HH), and 1035 were in the unsuitable peptide group (UP). Then, each missing protein was counted only once, regardless of how many of these categories it belongs to. The missing proteins in SP were counted first; the number was then expanded by adding the missing proteins in the MB group while removing those overlapping between SP and MB. The number was further enlarged by including the missing proteins in TT while removing those overlapping with SP or MB, and so on. As depicted in Figure 3B, the final number of missing proteins was 1391. This result means that approximately 93% of the missing proteins can be explained by their specific physical or biological properties. Because only 7% of the missing proteins could not be properly explained, this analysis indicates that most of these proteins were undetectable by MS or RNA-seq because of their specific protein features or chromatin structures.
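The counting procedure described above amounts to taking the union of the six category sets one category at a time. A minimal Python sketch follows; the protein identifiers (P1 to P10) and their category memberships are hypothetical placeholders, and the function name `cumulative_explained` is an assumption of this sketch rather than anything defined in the study.

```python
def cumulative_explained(categories):
    """Add categories in order, counting each protein only once.

    Returns (category, newly counted, running total) for every step.
    """
    seen = set()
    steps = []
    for name, members in categories:
        new = set(members) - seen  # drop proteins already counted earlier
        seen |= new
        steps.append((name, len(new), len(seen)))
    return steps

# Hypothetical memberships for 10 made-up proteins (P1..P10); P9 and P10
# belong to no category and therefore remain unexplained.
toy_categories = [
    ("SP", {"P1", "P2", "P3"}),
    ("MB", {"P2", "P3", "P4", "P5"}),
    ("TT", {"P6"}),
    ("ND", {"P5", "P7"}),
    ("HH", {"P2", "P4"}),
    ("UP", {"P1", "P7", "P8"}),
]
steps = cumulative_explained(toy_categories)
toy_explained = steps[-1][2]

# Figures reported in the text: category sizes and the final union.
reported_sizes = {"SP": 386, "MB": 632, "TT": 220, "ND": 930, "HH": 424, "UP": 1035}
size_sum = sum(reported_sizes.values())     # exceeds 1501 because categories overlap
explained_pct = round(100 * 1391 / 1501)    # share explained by at least one feature

for name, new, total in steps:
    print(f"{name}: +{new} new, {total} counted so far")
print(size_sum, explained_pct)
```

Because the result is a set union, the final count does not depend on the order in which the categories are added; only the per-step increments do.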

In addition, we employed a similar strategy to assess the proteins that belong to the PE234 category but have RNC-mRNA evidence. A comparison of the six features for these genes is summarized in Supplementary Table S2. In contrast to the genes in the NR-PE234 group, these 1447 genes behave quite differently in all of the physicochemical and biological features. The combination analysis described above showed that only 78% of these missing proteins (1131/1447) could be explained by their specific properties.

A question naturally arises, however, as to whether the deduction above also applies to other datasets in which the proteins have MS/MS identification evidence. As a negative control, we examined how many of the proteins detected by MS/MS fall into at least one of the six categories described in Table 2. As shown in Supplementary Table S3, only 38% of the 8018 detected proteins (3083/8018) fall into at least one of the six functional categories after combination analysis and redundancy removal, revealing that the category profile of the NR-PE234 group is very different from that of the proteins detected by MS/MS. As a comparable sample, we took the proteins that were neither detected by MS/MS in the three liver cancer cell lines nor included in the “Missing proteins”, conducted an analysis similar to that in Table 2, and generated Supplementary Table S4. Of the 8508 such proteins in Table S4, approximately 65% (5537/8508) belong to at least one of the six functional categories after combination analysis and redundancy removal, suggesting that the proteins listed in Table S4 are quite different from those in Figure 3B. A simple comparison of the category distributions between the two datasets shows that the proportions of proteins with unsuitable peptides or membrane location are basically comparable, whereas the proportions of proteins with high hydrophobicity, testis location, sensory function, or no DHS are much lower in Table S4 than in Table 2. This means that our classification of the missing proteins greatly enriches for protein entries with specific characteristics.
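The four coverage ratios quoted in this section can be recomputed side by side. The short Python sketch below uses only the counts reported in the text and in Supplementary Tables S2-S4; the dictionary labels are shorthand introduced here.

```python
# (proteins in at least one of the six categories, proteins in the dataset),
# as reported in the text.
datasets = {
    "NR-PE234 (Figure 3B)": (1391, 1501),
    "PE234 with RNC-mRNA (Table S2)": (1131, 1447),
    "Detected by MS/MS (Table S3)": (3083, 8018),
    "PE1 not detected by MS/MS (Table S4)": (5537, 8508),
}

coverage = {
    label: round(100 * explained / total, 1)
    for label, (explained, total) in datasets.items()
}

# List the datasets from highest to lowest coverage.
for label, pct in sorted(coverage.items(), key=lambda item: -item[1]):
    print(f"{label}: {pct}%")
```

Sorted this way, the NR-PE234 group ranks first and the MS/MS-detected proteins last, which is the contrast the negative control is meant to show.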

Conclusions

We hypothesize that, because mRNAs bound to ribosomes are in the process of being translated into protein, RNC-mRNA can be viewed as another criterion for estimating whether a protein is detectable. We propose that studying the proteins that lack identification evidence at both the protein and the RNC-mRNA level may offer valuable information for understanding the missing proteins. Based on the neXtProt version of 2014-09-19, there are 1501 such proteins. After extensive bioinformatics analysis of the gene- and protein-related features of these 1501 proteins, we demonstrated that for most of them (93%, 1391/1501) we could explain why their transcriptional or translational forms were undetectable. Our analysis strategy provides a clue for exploring the causal factors that leave protein products unidentified. Moreover, the results suggest that some missing proteins are likely to result from poor transcription of the corresponding genes.

Supporting Information

This material is available free of charge via http://pubs.acs.org/. It contains Supplementary Tables S1-S5 and Figures S1-S2.

Acknowledgements


This study was supported by the International Science & Technology Cooperation Program of China (2014DFB30020), the Chinese National Basic Research Programs (2014CBA02002, 2014CBA02005) and the National High-Tech Research and Development Program of China (2012AA020202).

Conflict of Interests Statement:

The authors declare no competing financial interest.

Figure Legends

Figure 1. Venn diagram for the trans-omics data for three liver cancer cell lines. The trans-omics data for the three cell lines (MHCC97H, HCCLM3, and Hep3B) were obtained by RNA-seq and proteomics analysis, as described in “Methods”. All of the numbers represent the sum of the mRNAs or the proteins that were identified in the three cell lines. RNA, RNC, MSMS, and MP indicate the total mRNAs, the mRNAs bound to ribosomes, the total proteins identified by LC-MSMS, and the missing proteins as defined by Lane et al.6, respectively.

Figure 2. Analysis of the physicochemical and biological characteristics to determine the causal factors for the missing proteins. To better represent the results, the data were divided into four groups. The detailed definitions of all the groups are listed in Table 1.


(A) Distribution of the hydrophobicity scores for the protein entries in the four groups defined above. The X-axis is the hydrophobicity score estimated using ProPAS, and the Y-axis is the density of the protein entries with a given hydrophobicity score. The red dashed line is located at the valley that clearly separates the two hydrophobicity distribution peaks. The peak on the right side of the dashed line covers 424 proteins with higher hydrophobicity scores, which were used for the transmembrane analysis.

(B) The joint distribution of the expected number of amino acids in transmembrane helices (Exp number) and the hydrophobicity scores, represented as a bivariate histogram with hexagonal binning. The dashed line is determined by the same data as in Figure 2A, and the total number of proteins on the right side of the dashed line is 424. The number in each hexagon is the number of proteins that share the corresponding hydrophobicity score and Exp number. The color bar indicates the protein counts.

(C) The distribution of the suitable peptides with different lengths in the four groups defined above. The X-axis is the number of suitable peptides, and the Y-axis is the density of the protein entries with a given number of suitable peptides. The red dashed line marks the point at which the curve of the NR-PE234 group crosses the other three curves: PIM, RI-PE1 and RI-PE234.

(D) The distribution of the DHSs per protein entry in the four groups. The X-axis indicates the DHSs per protein entry, in which –Inf means that there is no DHS in the gene region. The Y-axis represents the density of the protein entries with a given DHS value.

(E) The functional and process categories for the missing proteins defined by this study. The category analysis was conducted with DAVID, focusing on biological process (BP), cellular component (CC), molecular function (MF), and tissue location. All of the categories with significant enrichment at FDR ≤0.01 are presented. The Y-axis indicates the number of protein entries in each enriched category.

Figure 3. Combination analysis of the missing proteins. (A) The heatmap was generated from the cluster analysis by placing all of the missing proteins defined in this study in rows and the six features in columns: SP, sensory perception; MB, membrane; TT, testis; ND, no DHS; HH, high hydrophobicity; and UP, unsuitable peptides. (B) The estimation formula for attributing the missing proteins to the six features. The calculated ratio reveals how the physicochemical and biological features contribute to the missing proteins.

Supplementary Figure

Figure S1. The distribution of the molecular mass for the protein entries in the four groups defined in the text. The X-axis is the molecular mass, and the Y-axis is the density of protein entries with a certain molecular mass. The gray area between the two red dashed lines is the significant range in the NR-PE234 group. It corresponds to a MW of 30-45 kDa.

Figure S2. The 86 MS/MS spectra corresponding to the 23 missing proteins identified by MS/MS in three liver cancer cell lines.

Supplementary Tables

Table S1. The ratios of the protein entries with no DHS in the PIM, RI-PE1, RI-PE234 and NR-PE234 groups.


Table S2. The protein entries in the six categories for the proteins detected only by RNC-mRNA in three liver cancer cell lines and included in PE234. After combination analysis and redundancy removal similar to that for NR-PE234, 1131 proteins belonged to at least one of the six categories.

Table S3. The protein entries in the six categories for the proteins detected by MS/MS. After combination analysis and redundancy removal similar to that for NR-PE234, 3083 proteins belonged to at least one of the six categories.

Table S4. The protein entries in the six categories for the proteins not detected by MS/MS in three liver cancer cell lines and included in PE1. After combination analysis and redundancy removal similar to that for NR-PE234, 5537 proteins belonged to at least one of the six categories.

Table S5. Detailed information on the 23 missing proteins identified from MS/MS data in three liver cancer cell lines.

Table 1. Summary of the RNC-mRNAs and proteins identified in the three liver cancer cell lines presented in Figure 1.

Group abbreviation | Definition | Protein numbers in each category of Figure 1
PIM | Proteins identified by MS/MS | 141+2+8+7852+15=8018
RI-PE1 | RNC-mRNAs identified in PE1 but excluding PIM | 77+6556=6633
RI-PE234 | RNC-mRNAs identified in PE234 but excluding the 23 newly identified missing proteins | 1394+24=1424
NR-PE234 | No RNC-mRNA in PE234 but excluding the 23 newly identified missing proteins | 1361+140=1501

Note: according to the definitions of neXtProt, PE1 contains 16491 protein entries, and the PE234 category consists of PE2, PE3 and PE4, including 2948 protein entries.
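The group totals in Table 1 can be cross-checked against the neXtProt totals in the note. The Python sketch below uses only the reported totals; treating the 23 newly identified missing proteins as PE234 entries is an assumption drawn from the group definitions, and the variable names are introduced here.

```python
# Totals reported in Table 1 and its note.
pe1_entries = 16491
pe234_entries = 2948
group_totals = {"PIM": 8018, "RI-PE1": 6633, "RI-PE234": 1424, "NR-PE234": 1501}
newly_identified = 23  # assumed here to be PE234 entries excluded from both PE234 groups

all_entries = pe1_entries + pe234_entries

# Share of all entries covered by PIM, RI-PE1 and RI-PE234.
covered = group_totals["PIM"] + group_totals["RI-PE1"] + group_totals["RI-PE234"]
covered_pct = round(100 * covered / all_entries)

# PE234 should split into the two PE234 groups plus the newly identified proteins.
pe234_reconstructed = (
    group_totals["RI-PE234"] + group_totals["NR-PE234"] + newly_identified
)

print(all_entries, covered_pct, pe234_reconstructed == pe234_entries)
```

Under these assumptions the totals reproduce the 19439 entries and the 83% coverage quoted in the gene ontology analysis.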


Table 2. The protein entries in the NR-PE234 group categorized into six categories.

Categories | Protein entries
Missing proteins in the NR-PE234 group | 1501
High hydrophobicity (HH) | 424
Unsuitable peptides (UP) | 1035
No DHS (ND) | 930
Sensory perception (SP) | 386
Membrane (MB) | 632
Testis (TT) | 220

Note: the proteins in the NR-PE234 group were evaluated in each category individually. A protein may be classified into multiple functional categories.

References

(1) Zhang, B.; Wang, J.; Wang, X.; Zhu, J.; Liu, Q.; Shi, Z.; Chambers, M. C.; Zimmerman, L. J.; Shaddox, K. F.; Kim, S.; Davies, S. R.; Wang, S.; Wang, P.; Kinsinger, C. R.; Rivers, R. C.; Rodriguez, H.; Townsend, R. R.; Ellis, M. J.; Carr, S. A.; Tabb, D. L.; Coffey, R. J.; Slebos, R. J.; Liebler, D. C.; Nci, C. Proteogenomic characterization of human colon and rectal cancer. Nature 2014, 513, 382-387.

(2) Alfaro, J. A.; Sinha, A.; Kislinger, T.; Boutros, P. C. Onco-proteogenomics: cancer proteomics joins forces with genomics. Nature methods 2014, 11, 1107-1113.

(3) Marko-Varga, G.; Omenn, G. S.; Paik, Y. K.; Hancock, W. S. A first step toward completion of a genome-wide characterization of the human proteome. Journal of proteome research 2013, 12, 1-5.


(4) Gaudet, P.; Argoud-Puy, G.; Cusin, I.; Duek, P.; Evalet, O.; Gateau, A.; Gleizes, A.; Pereira, M.; Zahn-Zabal, M.; Zwahlen, C.; Bairoch, A.; Lane, L. neXtProt: organizing protein knowledge in the context of human proteome projects. Journal of proteome research 2013, 12, 293-298.

(5) Lane, L.; Argoud-Puy, G.; Britan, A.; Cusin, I.; Duek, P. D.; Evalet, O.; Gateau, A.; Gaudet, P.; Gleizes, A.; Masselot, A.; Zwahlen, C.; Bairoch, A. neXtProt: a knowledge platform for human proteins. Nucleic acids research 2012, 40, (Database issue), D76-83.

(6) Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S. Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. Journal of proteome research 2014, 13, 15-20.

(7) Deutsch, E. W.; Lam, H.; Aebersold, R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO reports 2008, 9, 429-434.

(8) Muraoka, S.; Kume, H.; Adachi, J.; Shiromizu, T.; Watanabe, S.; Masuda, T.; Ishihama, Y.; Tomonaga, T. In-depth membrane proteomic study of breast cancer tissues for the generation of a chromosome-based protein list. Journal of proteome research 2013, 12, 208-213.

(9) Ingolia, N. T. Ribosome profiling: new views of translation, from single codons to genome scale. Nature reviews. Genetics 2014, 15, 205-213.


(10) Ingolia, N. T.; Ghaemmaghami, S.; Newman, J. R.; Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 2009, 324, 218-223.

(11) Wang, T.; Cui, Y.; Jin, J.; Guo, J.; Wang, G.; Yin, X.; He, Q. Y.; Zhang, G. Translating mRNAs strongly correlate to proteins in a multivariate manner and their translation ratios are phenotype specific. Nucleic acids research 2013, 41, 4743-4754.

(12) Zhong, J.; Cui, Y.; Guo, J.; Chen, Z.; Yang, L.; He, Q. Y.; Zhang, G.; Wang, T. Resolving chromosome-centric human proteome with translating mRNA analysis: a strategic demonstration. Journal of proteome research 2014, 13, 50-59.

(13) Chang, C.; Li, L.; Zhang, C.; Wu, S.; Guo, K.; Zi, J.; Chen, Z.; Jiang, J.; Ma, J.; Yu, Q.; Fan, F.; Qin, P.; Han, M.; Su, N.; Chen, T.; Wang, K.; Zhai, L.; Zhang, T.; Ying, W.; Xu, Z.; Zhang, Y.; Liu, Y.; Liu, X.; Zhong, F.; Shen, H.; Wang, Q.; Hou, G.; Zhao, H.; Li, G.; Liu, S.; Gu, W.; Wang, G.; Wang, T.; Zhang, G.; Qian, X.; Li, N.; He, Q. Y.; Lin, L.; Yang, P.; Zhu, Y.; He, F.; Xu, P. Systematic analyses of the transcriptome, translatome, and proteome provide a global view and potential strategy for the C-HPP. Journal of proteome research 2014, 13, 38-49.

(14) Esposito, A. M.; Mateyak, M.; He, D.; Lewis, M.; Sasikumar, A. N.; Hutton, J.; Copeland, P. R.; Kinzy, T. G. Eukaryotic polyribosome profile analysis. J Vis Exp 2010, 40, pii: 1948.


(15) Zhang, G.; Fedyunin, I.; Kirchner, S.; Xiao, C.; Valleriani, A.; Ignatova, Z. FANSe: an accurate algorithm for quantitative mapping of large scale sequencing reads. Nucleic acids research 2012, 40, e83.

(16) Bloom, J. S.; Khan, Z.; Kruglyak, L.; Singh, M.; Caudy, A. A. Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC genomics 2009, 10, 221.

(17) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551-3567.

(18) Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Molecular & cellular proteomics 2009, 8, 2405-2417.

(19) Wang, L. H.; Li, D. Q.; Fu, Y.; Wang, H. P.; Zhang, J. F.; Yuan, Z. F.; Sun, R. X.; Zeng, R.; He, S. M.; Gao, W. pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2007, 21, 2985-2991.

(20) Smedley, D.; Haider, S.; Durinck, S.; Pandini, L.; Provero, P.; Allen, J.; Arnaiz, O.; Awedh, M. H.; Baldock, R.; Barbiera, G.; Bardou, P.; Beck, T.; Blake, A.; Bonierbale, M.; Brookes, A. J.; Bucci, G.; Buetti, I.; Burge, S.; Cabau, C.; Carlson, J. W.; Chelala, C.; Chrysostomou, C.; Cittaro, D.; Collin, O.; Cordova, R.; Cutts, R. J.; Dassi, E.; Genova, A. D.; Djari, A.; Esposito, A.; Estrella, H.; Eyras, E.; Fernandez-Banet, J.; Forbes, S.; Free, R. C.; Fujisawa, T.; Gadaleta, E.; Garcia-Manteiga, J. M.; Goodstein, D.; Gray, K.; Guerra-Assuncao, J. A.; Haggarty, B.; Han, D. J.; Han, B. W.; Harris, T.; Harshbarger, J.; Hastings, R. K.; Hayes, R. D.; Hoede, C.; Hu, S.; Hu, Z. L.; Hutchins, L.; Kan, Z.; Kawaji, H.; Keliet, A.; Kerhornou, A.; Kim, S.; Kinsella, R.; Klopp, C.; Kong, L.; Lawson, D.; Lazarevic, D.; Lee, J. H.; Letellier, T.; Li, C. Y.; Lio, P.; Liu, C. J.; Luo, J.; Maass, A.; Mariette, J.; Maurel, T.; Merella, S.; Mohamed, A. M.; Moreews, F.; Nabihoudine, I.; Ndegwa, N.; Noirot, C.; Perez-Llamas, C.; Primig, M.; Quattrone, A.; Quesneville, H.; Rambaldi, D.; Reecy, J.; Riba, M.; Rosanoff, S.; Saddiq, A. A.; Salas, E.; Sallou, O.; Shepherd, R.; Simon, R.; Sperling, L.; Spooner, W.; Staines, D. M.; Steinbach, D.; Stone, K.; Stupka, E.; Teague, J. W.; Dayem Ullah, A. Z.; Wang, J.; Ware, D.; Wong-Erasmus, M.; Youens-Clark, K.; Zadissa, A.; Zhang, S. J.; Kasprzyk, A. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic acids research 2015.

(21) Wu, S.; Zhu, Y. ProPAS: standalone software to analyze protein properties. Bioinformation 2012, 8, 167-169.

(22) Huang da, W.; Sherman, B. T.; Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols 2009, 4, 44-57.

(23) Huang da, W.; Sherman, B. T.; Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research 2009, 37, 1-13.


(24) Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E. L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of molecular biology 2001, 305, 567-580.

(25) Sonnhammer, E. L.; von Heijne, G.; Krogh, A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol 1998, 6, 175-182.

(26) Moller, S.; Croning, M. D.; Apweiler, R. Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 2001, 17, 646-653.

(27) Thurman, R. E.; Rynes, E.; Humbert, R.; Vierstra, J.; Maurano, M. T.; Haugen, E.; Sheffield, N. C.; Stergachis, A. B.; Wang, H.; Vernot, B.; Garg, K.; John, S.; Sandstrom, R.; Bates, D.; Boatman, L.; Canfield, T. K.; Diegel, M.; Dunn, D.; Ebersol, A. K.; Frum, T.; Giste, E.; Johnson, A. K.; Johnson, E. M.; Kutyavin, T.; Lajoie, B.; Lee, B. K.; Lee, K.; London, D.; Lotakis, D.; Neph, S.; Neri, F.; Nguyen, E. D.; Qu, H.; Reynolds, A. P.; Roach, V.; Safi, A.; Sanchez, M. E.; Sanyal, A.; Shafer, A.; Simon, J. M.; Song, L.; Vong, S.; Weaver, M.; Yan, Y.; Zhang, Z.; Zhang, Z.; Lenhard, B.; Tewari, M.; Dorschner, M. O.; Hansen, R. S.; Navas, P. A.; Stamatoyannopoulos, G.; Iyer, V. R.; Lieb, J. D.; Sunyaev, S. R.; Akey, J. M.; Sabo, P. J.; Kaul, R.; Furey, T. S.; Dekker, J.; Crawford, G. E.; Stamatoyannopoulos, J. A. The accessible chromatin landscape of the human genome. Nature 2012, 489, 75-82.

(28) Wu, S.; Li, N.; Ma, J.; Shen, H.; Jiang, D.; Chang, C.; Zhang, C.; Li, L.; Zhang, H.; Jiang, J.; Xu, Z.; Ping, L.; Chen, T.; Zhang, W.; Zhang, T.; Xing, X.; Yi, T.; Li, Y.; Fan, F.; Li, X.; Zhong, F.; Wang, Q.; Zhang, Y.; Wen, B.; Yan, G.; Lin, L.; Yao, J.; Lin, Z.; Wu, F.; Xie, L.; Yu, H.; Liu, M.; Lu, H.; Mu, H.; Li, D.; Zhu, W.; Zhen, B.; Qian, X.; Qin, J.; Liu, S.; Yang, P.; Zhu, Y.; Xu, P.; He, F. First proteomic exploration of protein-encoding genes on chromosome 1 in human liver, stomach, and colon. Journal of proteome research 2013, 12, 67-80.

(29) Swaney, D. L.; Wenger, C. D.; Coon, J. J. Value of using multiple proteases for large-scale mass spectrometry-based proteomics. Journal of proteome research 2010, 9, 1323-1329.

(30) Farrah, T.; Deutsch, E. W.; Hoopmann, M. R.; Hallows, J. L.; Sun, Z.; Huang, C. Y.; Moritz, R. L. The state of the human proteome in 2012 as viewed through PeptideAtlas. Journal of proteome research 2013, 12, 162-171.

(31) Fan, Y.; Zhang, Y.; Xu, S.; Kong, N.; Zhou, Y.; Ren, Z.; Deng, Y.; Lin, L.; Ren, Y.; Wang, Q.; Zi, J.; Wen, B.; Liu, S. Insights from ENCODE on Missing Proteins: Why beta-Defensin Expression Is Scarcely Detected. Journal of proteome research 2015, 14, 3635-3644.

(32) Zhang, Y.; Li, Q.; Wu, F.; Zhou, R.; Qi, Y.; Su, N.; Chen, L.; Xu, S.; Jiang, T.; Zhang, C.; Cheng, G.; Chen, X.; Kong, D.; Wang, Y.; Zhang, T.; Zi, J.; Wei, W.; Gao, Y.; Zhen, B.; Xiong, Z.; Wu, S.; Yang, P.; Wang, Q.; Wen, B.; He, F.; Xu, P.; Liu, S. Tissue-Based Proteogenomics Reveals that Human Testis Endows Plentiful Missing Proteins. Journal of proteome research 2015, 14, 3583-3594.
