Systematic Analyses of the Transcriptome, Translatome, and

Nov 20, 2013 - As shown in Figure 4B,C, under the same level of RNC-mRNA abundance, ..... P.; Hancock , W. S. The Chromosome-Centric Human Proteome Pr...
2 downloads 0 Views 2MB Size
Subscriber access provided by DUESSELDORF LIBRARIES

Article

Systematic analyses of the transcriptome, translatome, and proteome provide a global view and potential strategy for the C-HPP Cheng Chang, Liwei Li, Chengpu Zhang, Songfeng Wu, Kun Guo, Jin Zi, Zhipeng Chen, Jing Jiang, Jie Ma, Qing Yu, Fengxu Fan, Peibin Qin, Mingfei Han, Na Su, Tao Chen, Kang Wang, Linhui Zhai, Tao Zhang, Wantao Ying, Zhongwei Xu, Yang Zhang, Yinkun Liu, Xiaohui Liu, Fan Zhong, Huali Shen, Quanhui Wang, Guixue Hou, Haiyi Zhao, Guilin Li, Siqi Liu, Wei Gu, Guibin Wang, Tong Wang, Gong Zhang, Xiaohong Qian, Ning Li, Qing-Yu He, Liang Lin, Pengyuan Yang, Yunping Zhu, Fuchu He, and Ping Xu J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 20 Nov 2013 Downloaded from http://pubs.acs.org on December 4, 2013

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Qian, Xiaohong; Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Li, Ning; Beijing Proteome Research Center, He, Qing-Yu; Jinan University, Institute of Life and Health Engineering Lin, Liang; Chinese Academy of Sciences, Beijing Institute of Genomics Yang, Pengyuan; Fudan University, Department of Chemistry Zhu, Yunping; Beijing Institute of Radiation Medicine, Beijing Proteome Research Center He, Fuchu; Beijing Proteome Research Center,Beijing Institute Radiation Medicine, Xu, Ping; Beijing Proteome Research Center, Protein Post-translational Modification

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Systematic analyses of the transcriptome, translatome, and proteome provide a global view and potential strategy for the C-HPP Chinese Human Chromosome Proteome Consortium Cheng Chang1#, Liwei Li1#, Chengpu Zhang1#, Songfeng Wu1*, Kun Guo2#, Jin Zi3#, Zhipeng Chen5#, Jing Jiang1, Jie Ma1, Qing Yu1, Fengxu Fan1, Peibin Qin1, Mingfei Han1, Na Su1, Tao Chen1, Kang Wang1, Linhui Zhai1, Tao Zhang1, Wantao Ying1, Zhongwei Xu1, Yang Zhang2, Yinkun Liu2, Xiaohui Liu2, Fan Zhong2, Huali Shen2, Quanhui Wang3,4, Guixue Hou3,4, Haiyi Zhao3, Guilin Li3, Siqi Liu3,4, Wei Gu5, Guibin Wang5, Tong Wang5, Gong Zhang5, Xiaohong Qian1, Ning Li1*, Qing-Yu He5*, Liang Lin3*, Pengyuan Yang2*, Yunping Zhu1*, Fuchu He1*, and Ping Xu1*

1

State Key Laboratory of Proteomics, Beijing Proteome Research Center, National

Engineering Research Center for Protein Drugs, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing, 102206, P. R. China

2

Institutes of Biomedical Sciences, Department of Chemistry and Zhongshan Hospital,

Fudan University, Shanghai 200032, China

3

BGI-Shenzhen, Beishan Road, Yantian District, Shenzhen, 518083, China.

4

Beijing Institute of Genomics, CAS, No.1 BeiChen West Road, Beijing 100101,

China

1

ACS Paragon Plus Environment

Page 2 of 45

Page 3 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

5

Key Laboratory of Functional Protein Research of Guangdong Higher Education

Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China

#

These authors contributed equally to this work.

*

To whom correspondence should be addressed:

Ping Xu, Ph.D., Professor Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing 100850, China Tel and Fax: 8610-80705155; E-mail: [email protected]

Fuchu He, Ph.D., Professor Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing 100850, China Tel and Fax: 8610-68171208; E-mail: [email protected]

Yunping Zhu, Professor Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing 100850, China Tel: 8610-80705225; E-mail: [email protected]

Pengyuan Yang, Ph.D., Professor Institutes of Biomedical Sciences and Department of Chemistry, 130 DongAn Road, Fudan University, Shanghai 200032, China Tel and Fax: 8621-54237961; E-mail: [email protected]

Liang Lin, Ph.D. BGI-Shenzhen, Beishan Road, Yantian District, Shenzhen, 518083, China 2

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Tel/Fax: 86-755-25274284; E-mail: [email protected] Qingyu He, Ph.D., Professor Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China Tel/Fax: 86-20-85227039; E-mail: [email protected] Ning Li, Ph.D. Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing, 100850, China Tel and Fax: 8610-80705155; E-mail: [email protected] Songfeng Wu, Ph.D. Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing, 100850, China Tel and Fax: 8610-80727777-1141; E-mail: [email protected]

Abstract To estimate the potential of the state-of-the-art proteomics technologies on full coverage of the encoding gene products, the Chinese Human Chromosome Proteome Consortium (CCPC) applied a multi-omics strategy to systematically analyze the transciptome, translatome and proteome of the same cultured hepatoma cells with varied metastatic potential qualitatively and quantitatively. The results provide a global view of gene expression profiles. The 9,064 identified high confident proteins covered 50.2% of all gene products in the translatome. Those proteins with function

3

ACS Paragon Plus Environment

Page 4 of 45

Page 5 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

of adhesion, development, reproduction etc. are low abundance in transcriptome and translatome but absent in proteome. Taking the translatome as the background of protein expression, we found that the protein abundance plays a decisive role and hydrophobicity has a greater influence than molecular weight and isoelectric point on protein detectability. Thus, the enrichment strategy used for low-abundant transcription factors helped to identify missing proteins. In addition, those peptides with single amino acid polymorphisms negligibly contributed to new protein identification, although they might play significant role for the disease researches. The proteome raw and meta data of proteome were collected using the iProX submission

system

and

ProteomeXchange

(PXD000529,

PXD000533

and

PXD000535). All the detailed information in this study can be accessed from the Chinese Chromosome-Centric Human Proteome Database.

Key words: proteome, transcriptome, translatome, chromosome, missing protein, SAP

Introduction The Chromosome-Centric Human Proteome Project (C-HPP) aims to define the full set of proteins encoded by the human genome.1, 2 In contrast to genomic data that are relatively stable in a living organism, expression of encoded genes is time- and space-dependent, and the gene expression profiles at either the transcriptional or translational level vary widely in different cells and tissues. There is strong evidences,

4

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

on the other hand, that gene expression in certain tissues or cells is in homeostasis. For instance, Schaab et al. examined the proteomes from 11 different types of cancer cells and found that an average of 7,337 unique proteins were identified in one cell line alone. Analysis of all 11 cell lines resulted in a total of 10,183 non-redundant proteins, in which the identified protein number was saturated easily, even with the addition of more cell lines.3 A logical conclusion is that a cell or tissue somehow possesses a relatively stable amount of the expressed proteins. Estimation of the appropriate size of the proteome is important to understand the status of protein expression and the chromosomal proteome in a target cell or tissue. In other words, as with defining the size of a genome in genomics studies, the investigation of overall gene expression involves determining how many genes are transcribed or translated in a cell or tissue. The Chinese Human Chromosome Proteome Consortium (CCPC) initiated a collaborative study in mainland China to focus on proteomics in chromosome 1, 8 and 20.4-6 In 2012, the CCPC performed the chromosome proteomic research on three tissues: liver, colon and stomach, as well as on several corresponding cancer cell lines. The CCPC managed to generate a large dataset of chromosome proteomics, CCPD1.0 which contains 12,101 unique proteins (peptide and protein FDR50% occurrence) significantly differed from the reference sequence. All the sequencing data are available in the Gene Expression Omnibus database (accession number: GSE49994).

Protein and peptide fractionation & mass spectrometry analysis The fresh cells as described above were harvested and lysed in buffer containing 8 M urea, 50 mM NH4HCO3 and 5 mM IAA. The total cell lysates (TCLs) were centrifuged at 12,000 g for 10 min at 4 °C to remove cell debris, and 1.5 mg proteins from each cell line were then collected and diluted in 2 M urea followed by sequential in-solution protein digestion using Lys-C and trypsin at 37 °C. The tryptic peptides were purified on a C18 Sep-Pak column (Waters UK Ltd, Manchester, UK). The cleaned samples were split into three different aliquots (500 µg each). The paired samples were distributed to two different labs at Beijing Proteome Research Center and Fudan University (sites 1 and 2) for protein identification and quantification. The mass spectrometers used in this study were the Orbitrap Q Exactive and the Triple TOF 5600. MHCC97H and HCCLM3 cells were analyzed at site1 using the Q Exactive, whereas MHCC97H and Hep3B cells were analyzed at site2 using both instruments. All the three MS datasets were produced and analyzed following the 11

ACS Paragon Plus Environment

Page 12 of 45

Page 13 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

same procedures with two technical replicates and 24 fractions per repeat. To identify more low-abundant proteins, we used the same batch of cultured cells to enrich TFs with the same method as described.11 Briefly, we used 2 mg of TCL and 30 pmol of catTFRE DNA for each reaction to enrich TFs. The trypsin-digested peptide mixture was purified on a home-made C18 StageTip18 and then analyzed using the LTQ-Oribitrap Velos (Thermo Fisher Scientific, San Jose, CA). For detailed information about peptide mixture prefractionation and MS instrument settings for profiling and the TF-enriched datasets, see the Supporting Information and Methods.

Database searching and data processing The database searching procedure was performed as previously described.19 The search results were processed using PepDistiller20, and then only those results were retained with the threshold of protein FDR below 1%. For label-free quantification, the area under the extracted ion chromatograms (XICs) was calculated using the homemade tool SILVER21, and then the iBAQ index was determined22, and the median was normalized to equal (1) to eliminate biases from different experiments for subsequent quantitative proteomics analyses. To combine the results from different instrument and sites, the combined intensities of MHCC97H cells at site 2 were determined as the average of the results from the Q Exactive and Triple TOF 5600. Subsequently, a robust regression between the intensities of MHCC97H cells at both sites was implemented (Supplementary Figure S1) with the common cell line MHCC97H regarded as a bridge between the

12

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

quantification results at the two sites. According to the linear regression function, the results at site 2 were transformed into those at site 1. This method proved to be reliable as shown by its good reproducibility in each experiment (see Supplementary Figure S2, the correlation between replicates in each experiment was higher than 0.9).

Bioinformatics analysis of identified proteins Four publicly available and commonly used databases were selected for the comparison of protein identification, including the global proteome machine database23 (GPMDB, release 2013_7_1), PeptideAtlas24 (Release 2013_08), Human Protein Atlas25 (HPA, version 11.0) and neXtProt Gold26 (Release 2013-6-11). For GPMDB, the data set annotated as "Green" was used, with a threshold ">20 observations and log(e) < −5". For HPA, only those proteins with summary evidence of "high" or "medium" were retained. Only those proteins with "Gold" evidence were used for neXtProt. Based on protein sequences, physicochemical properties of molecular weight (MW), hydrophobicity (Hy), and isoelectric point (pI) were calculated using ProPAS27 software. Methods of hierarchical clustering of the three -omics data quantification profiles were described previously.19 For the gene annotation, gene ontology (GO) analysis was performed using DAVID28 software v.6.7. The terms with p-value less than 0.01 were considered significantly enriched. Overlap analyses of different datasets were performed using Venn diagram.29

13

ACS Paragon Plus Environment

Page 14 of 45

Page 15 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

SAP (single amino acid polymorphism) identification Using the SNV data of mRNAs that were translated from the Hep3B, MHCC97H and HCCLM3 cell lines, a sample-specific protein database was constructed according to the Ensembl database. For each hepatoma cell line, proteomic data were inputted to Mascot for database search against the combined original Ensembl and sample-specific databases. The target-decoy strategy was applied to maintain a unique peptide FDR less than 0.1% after filtering the mascot ionscore, which ensures the accuracy of almost all of the identified peptides. All of the spectra were manually verified to confirm their qualities.

Proteome data collection and database construction To ensure the standardization and streamline of proteome data collection, the iProX submission system (http://www.iprox.org/), which has been established following the data-sharing policy of the ProteomeXchange consortium, was used to collect and reposit proteomics raw data and meta-data generated in this second cooperation and all of the MS proteomics data have also been deposited to the ProteomeXchange Consortium

(http://proteomecentral.proteomexchange.org)

with

the

following

identifier: PXD000529, PXD000533 and PXD000535, to facilitate the data-sharing within the CCPC and other research groups (Supplementary Table S1). The data were also imported into the C-HPD database that was constructed and used last year.19 The detailed identification information of proteins, peptides and spectra, as well as their basic annotation for relevant encoded genes and chromosomes, can be found at the 14

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 45

C-HPD website (http://proteomeview.hupo.org.cn/chromosome).

Results and Discussion Strategies

of

the

trans-omics

profiling

of

human

protein-coding genes At the initiation of the C-HPP, chromosomes 1, 8 and 20 were studied by different research teams in mainland China. To carry out more efficient research and complement each other's expertise, the teams decided to collaborate and produce a large-scale dataset from 18 different samples for the first cooperation last year.19, 30, 31 In total, we identified 12,101 proteins with high confidence, covering 59.8% of protein entries in Swiss-Prot.32 Although this number was close to the record of high confident protein number in PeptideAtlas, it was still far away from the C-HPP goal of full coverage for all of the proteins encoded by chromosomal genes. Gene expression is the prerequisite for identification through proteomics and potentially re-annotation of chromosomes. To estimate the number of expressed protein-coding genes and the power of current state-of-the-art proteomics technologies for protein identification as well as potential solutions for missing proteins, we applied a three stage multi-omics strategy in cultured human cells. In the first stage, mRNA, RNC-mRNA, and protein were extracted from identical samples, and whole transcriptome and translatome libraries were sequenced to saturation using RNA-seq (Figure 1). Thereby, the number of actively transcribed and efficiently

15

ACS Paragon Plus Environment

Page 17 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

translated protein-coding genes in a given state could be estimated as the sum of protein-coding genes expressed in a specific cell. Based on such an optimal estimate, a high-coverage proteomics analysis was performed on both the extracted TCL and enriched TFs from the same cultured cells in the second stage. To prevent expression bias of the human genome and to achieve a higher coverage of the proteome, we used three HCC cell lines with no or gradually increasing metastatic potential that resulted from the varied patterns of gene expression. To learn more about the MS preference for protein identification, we tested the same samples on multiple proteomics platforms, including Q Extractive and Triple TOF 5600, to partially overcome the premature saturation of distinct protein identifications. In the third stage, systematic data analysis was performed on all identified proteins. This study was expected to provide useful insights into the achievable proteome coverage and potential solutions for missing proteins, differential protein expression, the relationship between the proteome and metastatic potential and mutations (Figure 1). In this cooperative research, the datasets for the three -omics were distributed to the teams that were studying chromosomes 1, 8, and 20 for further analysis, focusing on their respective chromosomes and topics of interest, including missing proteins33, liver cancer biology, their metastatic potential34, and mutation analysis.35 In this paper, we focused on general issues, including the qualitative and quantitative exploration of trans-omics data, quantitative analysis of missing proteins and sample-specific mutations using trans-omics data, which may provide valuable resources and guides for further C-HPP research. 16

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

High-quality transcriptome, translatome and proteome data from three hepatoma cell lines Both transcriptome (mRNA) and translatome (RNC-mRNA) data were produced from the RNA-seq experiments. On average, 170 million reads were obtained for each mRNA and RNC-mRNA sample, and 84% of them were mapped to Ensembl-v72 RNA reference sequences using the FANSe2 algorithm. Compared to previous studies of the transcriptome and translatome22,

36

, we obtained 3~7 times higher

throughput. The median of sequencing depth (the average number of times that each nucleotide acid was sequenced) was 22.6, and 62% of the genes reached an average sequencing depth of more than 10, suggesting that the datasets offered a solid basis for downstream analysis (Supplementary Figure S3A, S3B). Based on the high-quality RNA-seq data, the average gene numbers of the transcriptome and translatome for the three cell lines were 17,636 ± 114 and 17,127 ± 101, respectively (Supplementary Figure S4). Meanwhile, the current throughput of the transcriptome and translatome approached saturation (additional details are provided in the Supporting Information and Methods and in Supplementary Figure S5). In the proteome, based on the integration of multi-site MS data with technical repeats and low-abundant TF-enriched data, a large-scale dataset named CCPD2013 was established, which contained 7,541,744 mass spectra with 1,497,791 peptide-spectrum matches (19.9% against the total mass spectra, Supplementary Table S2) and 110,758 unique peptides, matching 9,064 proteins by the parsimony rule with 17

ACS Paragon Plus Environment

Page 18 of 45

Page 19 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

a protein FDR lower than 1% (Supplementary Figure S6 and Supplementary Tables S3 and S4). The CCPD2013 dataset was of high quality, as demonstrated by the large number of unique peptides per protein and high protein sequence coverage (Supplementary Figure S7). Based upon the high-quality -omics data, further comprehensive analysis focusing on the gene expression status was performed using a trans-omics comparison analysis.

Trans-omics data revealed translation control and technical limitation of proteomics In cells, information about expression is transferred from mRNA to RNC-mRNA and then to protein. To obtain a global view of differences in gene expression in the three different -omics, a systematic comparison was performed. As shown in Supplementary Table S5, translatome data were 4.1% smaller than transcriptome data in Hep3B cells. The percentage was 2.4% and 2.2% for that of MHCC97H and HCCLM3 cell lines, respectively, indicating that there were a small number of genes whose mRNAs could not be transcribed into RNC-mRNAs. In Hep3B, there were 17,023 genes identified in the RNC-mRNA of Hep3B, 50.3% of which were verified by proteome data in our study as shown in Supplementary Figure S4. For MHCC97H and HCCLM3 cells, the overlap between the proteome and translatome was 52.3% and 44.4%, respectively. By combining the results of the three cell lines, the coverage of the proteome compared with the translatome was 54.4%. There were still a large number of genes that could not be identified by proteomics 18

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

because of the sensitivity issue and bias of current proteomic technologies. With the three cell lines combined, there were 9,918 commonly identified genes in the transcriptome, translatome and proteome (Figure 2B). We also noticed that some unique genes could be found in each -ome with no clear explanation. For example, there were 134 genes (0.7%) found in the translatome but not in the transcriptome, and 183 genes (1.8%) were found in the proteome only (Figure 2A). A comparison between the identification rate and the whole genome for each chromosome revealed highly similar distributions of the three -omics datasets (Figure 2C). This high similarity of identification rates between the three -omics indicates that although different chromosomes have different percentages of missing proteins, the major reasons were similar including tissue specificity, sensitivity of proteomics and technical biases. This means that the teams studying different chromosomes could share their methods for finding missing proteins. Moreover, from a global perspective, the dynamic ranges of the detected intensities (indicating protein abundance) of those overlapping genes from the three -omes were different. It seemed that the intensity ranges expanded gradually from the transcriptome to the translatome and further to the proteome (Supporting Information and Methods, and Supplementary Figure S8). A similar observation was also made previously.37 The wider dynamic range of the proteome might hinder the goal of full coverage for the proteome in the C-HPP.

19

ACS Paragon Plus Environment

Page 20 of 45

Page 21 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Trans-omics analysis of the relationship between protein abundance and biological features To assess the reliability of our quantitation information in the three -omes, hierarchical clustering and the Spearman coefficient matrix were constructed for the three hepatoma cell lines, based on abundance. As shown in Figure 3A, the proteome data were disparate from the transcriptome and translatome data due to technological differences (Figure 3A). However, the transcriptome and translatome clustering results were similar, indicating that the expression information for the transcriptome could be conservatively transferred to the translatome. The clustering of the three cell lines showed that Hep3B cells obviously differed from the other two cell lines. Similar results were found from an analysis of the R2 coefficient matrix (Supplementary Figure S9). This result is consistent with the origin of these cell lines, as both MHCC97H and HCCLM3 cells were established from a single Chinese male patient who differed from the patient from which Hep3B cells were derived. An arguable hypothesis suggests that protein abundance might be related to molecular function.38, 39 In this study, a stringent experimental design allowed a more accurate determination of the abundance of identified genes. To examine the relationship between protein abundance and biological function, we divided all three -omics datasets into four equal subsets based on abundance: Part 1 for genes with a very low abundances, less than the first abundance quartile (Q1); Part 2 for genes with abundances between Q1 and the median (Q2); Part 3 for genes with a modest 20

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

abundance between Q2 and the third abundance quartile (Q3); and Part 4 for genes whose abundance was higher than Q3. Subsequently, enrichment analysis was performed using David software in each subset for three GO categories: biological process (BP), cellular component (CC), and molecular function (MF), and the properties of the main GO terms in each part were calculated as the count of genes in one GO term divided by the number of all genes in each part. As shown in Figure 3B, the percentage of different GO terms was calculated and is presented by different colors. Compared with the entire distribution of the transcriptome and translatome, the four parts of the proteome data were similar to Parts 2~4 of the transcriptome and translatome, because genes with very low abundance in Part 1 of the transcriptome and translatome were hard to find using MS. Based on the transcriptome and translatome data, these genes with the lowest abundances in Part 1 were enriched in terms of biological adhesion, response to stimulus, and biological regulation, whereas the high-abundant genes were enriched in metabolic processes, cellular processes and cellular components (Supplementary Figure S10). These results suggest that new strategies are necessary to identify missing proteins compared with mRNA and RNC-mRNA data in the future.

Quantitative exploration of missing proteins and their potential solutions The identification of missing proteins is one of the biggest challenges in the C-HPP study.40 Our previous research revealed that the physicochemical properties of 21

ACS Paragon Plus Environment

Page 22 of 45

Page 23 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

proteins were parts of the major reasons for protein detectability through MS.19 Furthermore, the level of gene expression is known to greatly influence the results. However, quantitatively, which factor and how much gene expression affects the ability to detect proteins remain unclear. The datasets generated in this study provide a way for us to investigate these issues in a quantitative way. First, RNC-mRNA abundance was taken as an estimate of protein expression level because these mRNAs were active and ready for translation, which was confirmed with a good correlation between the abundance of RNC-mRNAs and their proteins.7 To facilitate the analysis, three RPKM values were set as check points: RPKM=10 is the approximate peak of the translatome and proteome; RPKM=0.1 is the lowest abundance for detection of proteins encoded by those genes; and RPKM=0.01 is almost the lowest limit for the translatome through RNA-seq (Figure 4A). In this study, a total of 18,246 genes were identified in the translatome, and 9,922 genes (54.4%) were identified in the proteome. Among them, 6,543 genes had RPKM larger than 10, in which 86.9% were covered by our proteomic analysis; only 1,654 genes had RPKM less than 0.1, in which only 3.3% of their gene products could also be detected by MS. However, there were 3,494 genes with RPKM of 0.1~1 and 6,555 genes with RPKM of 1~10, in which 10.1% and 58.4% of their gene products, respectively, were identified in the proteome (Supplementary Table S7). These results suggest that protein abundance plays a major role in protein detectability in an MS-based proteomics study.

22

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

We then combined protein physicochemical properties with RNC-mRNA abundance to assess the effects on protein detectability. To prevent a statistical bias, we put at least 250 genes in each individual group (Supplementary Figure S11). As shown in Figure 4B and C, under the same level of RNC-mRNA abundance, proteins with a low MW (0.2) were hard to detect, but little effect could be found among different pIs (Figure 4D). Again, although these three physicochemical properties influenced protein detectability, protein abundance still played the most important role (Figure 4B-D). To further understand the significance of protein physicochemical properties on the proteins missing in our proteomics study, we calculated the standard deviations (SDs) for the number of genes belonging to different levels of each physicochemical property and drew the changing curves of the SDs with increasing translatome abundance (log10 RKPM) (Figure 4E). The higher the deviation was, the greater the influence of the physicochemical property. Interestingly, the results showed that Hy had the largest impact on the missing proteins, whereas pI had the least impact on detection. This finding may be explainable using a shotgun approach on current proteomics platforms, as some proteins with high Hy or pI tend to have poor solubility or low extraction efficiency, which makes them difficult to detect under current shotgun proteomics approach. As protein abundance had a maximal influence on protein identification, strategies that change the relative abundance of proteins in proteomics samples may greatly affect their ability to be identified. Indeed, many enrichment methods have been 23

ACS Paragon Plus Environment

Page 24 of 45

Page 25 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

widely used in the proteomics field to specifically identify proteins, which may also aid the identification of low-abundant proteins in the C-HPP. To test this idea, a new TF-enrichment strategy11 was also implemented in this cooperative study to identify low-abundant proteins such as TFs. Compared with the CCPD2013 dataset, 31 proteins were uniquely identified using this simple enrichment. Among them, 14 (45.2%) were TFs41, suggesting that this strategy allows a much greater probability of identifying TFs compared with other regular large-scale proteomics profiling strategies (Figure 5A). Again, the fact that intensity ranks of most TFs in the TF-enriched profile are higher than those in the proteome profile of Hep3B cells further confirms the idea that increasing the relative amounts of low-abundant proteins helps to identify missing proteins (Figure 5B).

SAP identification using sample-specific protein databases In this study, transcriptome, translatome and proteome data were produced using the same samples, providing an opportunities to explore sample-specific mutations. Here, translatome data alone were used to construct a sample-specific database to discover the number and percentage of sample-specific SAPs that could be identified using the proteome data. There were no more than 116, 129, and 50 unique peptide matches in the sample-specific databases of the Hep3B, MHCC97H, and HCCLM3 cell lines, respectively (Figure 6A). By removing redundant peptides, there were 219 unique peptides identified in the three cell lines, which only comprised 0.4% of all the 24

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

identified peptides. This percentage was even lower than that of theoretical SAPs, which was calculated from the sequenced single nucleotide polymorphisms (SNPs) based on our translatome data. We also found no obvious changes in any of our three cell lines. For example, in the MHCC97H cell line, the percentage of identified SAPs was 0.3%, and the theoretical value was 0.4% (Figure 6B). Interestingly, there were only 11 SAP peptides identified in the three cell lines (Figure 6A); of these, six mutated sites were not in the termini of the peptides, which could be caused by ambiguous terminal amino acids in the Ensembl database (Supplementary Figure S12). Some of these SAP peptides were identified in several datasets, indicating that rare SAPs might be widely used in these cell lines (Supporting Information and Methods and Supplementary). We also found that 60 peptide pairs (27%) had both wild-type and mutated sequences in our dataset (Supplementary Table S8). Of these, 20 peptide pairs (9%) had both wild-type and mutated SAPs identified from the same cell line. In addition, the 60 peptide pairs belonged to 27 different substituted amino acid pairs (Figure 6C). Among them, the D-E mutation was the most popular (eight peptide pairs, 13.3%), followed by the V-A and I-V mutations (seven and five peptide pairs, respectively). Figure 6D-F shows the three most popular cases of the SAP peptide pairs with very similar mass spectra. We compared our SAP data with the public database COSMIC, and found only 13 peptides (6%) that had been published previously and collected42 (Supplementary Table S9); these peptides were detected in breast carcinoma, stomach carcinoma, etc., 25

ACS Paragon Plus Environment

Page 26 of 45

Page 27 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

arguably suggesting that these mutations might be disease related. On the other hand, the higher number of SAPs present in the samples indicated the diversity of their sequences. We also noticed that there were some differences between unique SAPs in the varied samples (Figure 6A), but we still don’t know whether these SAPs are due to SNPs in the genome or if they were generated during the gene expression process, because we don’t have genomics sequences for these cell lines. Although the identification of SAPs had little influence on protein identification, it was arguable significance of these mutations on physiological or pathological proteomics study. The biological consequences of these mutations are also worth investigating.

Contribution, accumulation, and sharing of the proteome dataset (CCPD) As described above, 9,064 proteins were identified and analyzed in the CCPD2013 dataset. Combining datasets CCPD2013 and CCPD1.0 presented last year19, the CCPD 2.0 was constructed with 12,740 highly confident identified proteins in total (Figure 7A), covering 62.9% of all human proteins in the Swiss-Prot database (release 2013_06). Compared with CCPD1.0, CCPD2.0 contains 24,055 additional peptides and 641 new proteins (Figure 7B). Most of these newly identified proteins were annotated due to the presence of these proteins or their transcripts in the neXtProt database (Figure 7C), suggesting the reliability of our proteomics results. After mapping the additional 26

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

proteins to each chromosome, there were 60, 27, and 17 additional proteins for chromosomes 1, 8, and 20, respectively, thus increasing the coverage to 63.2%, 62.1%, and 60.5%, respectively (Figure 7D). Compared with the three public databases (GPMDB, PeptideAtlas, and HPA) using the aligned neXtProt IDs, CCPD2.0 contained 324 newly confirmed gene products (Figure 7E). At present, the total number of verified proteins in all of the four databases was 17,026. Although HPA contained fewer proteins than did GPMDB, the percentage of its unique proteins (816, 8.5%) was higher than that of GPMDB (710, 4.8%), indicating that antibody technology could play a complementary role in the search for missing proteins from the MS-based proteomics platform. Since 2012, the CCPC has produced large-scale datasets of 18 samples from 18 human tissues or cell lines, and built the C-HPD database for storing, analyzing and displaying these data. To further facilitate this collaborative study, the iProX submission system (http://www.iprox.org) was used for the collection, storage, and sharing of MS/MS raw files and experiment metadata. The iProX system will speed data analysis and the sharing of proteomics experiments, and will support the long-term cooperation of CCPC in a future C-HPP. The analysis of this collaborative research has been imported into C-HPD for further access and sharing (http://proteomeview.hupo.org.cn/chromosome/).

27

ACS Paragon Plus Environment

Page 28 of 45

Page 29 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Conclusions Using deep sequencing and a systematic analysis of transctriptome, translatome, and proteome data generated from the same samples of human HCC cell lines Hep3B, HCC97H, and HCCLM3 with varied metastatic potentials, we found that an average of 17,636 ± 114 and 17,127 ± 101genes were actively transcribed and translated, respectively; among them, a total of 9,064 gene products were identified using proteomics, which is about 53% of the translated genes, and would not be accumulated easily using a regular large-scale proteomics approach due to saturation effects resulting from sensitivity issues associated with current technologies. An feature analysis of missing proteins further revealed that protein abundance plays a major role in MS-based protein identification and missing proteins might be efficiently identified from specifically enriched samples, which was further confirmed after TF analysis, as 31 new low-abundant proteins were identified that were missed in previous studies. With a sample-specific database, we found that SAPs only contributed less than 0.4% of the total number of identified peptides, suggesting its negligible impact on protein identification. This first systematic multi-omics study not only revealed that targeted proteomics approaches combined with high-coverage transcriptome and translatome analyses using deep RNA sequencing greatly aided the identification of missing proteins, but also contributed 324 gene products with MS evidence to publically available proteomics databases. Although some detailed methodology may need to be modified for analysis of different chromosomes, the

28

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

conclusions here are highly instructive for conducting efficient proteomics analyses for future C-HPP studies.

Abbreviations CCPC, Chinese Human Chromosome Proteome Consortium; CCPD, Chinese Chromosome

Proteome Data set; C-HPD, Chinese Chromosome-Centric Human

Proteome Database; RNC, ribosome-nascent chain; pI, isoelectric point; SAP, single amino acid polymorphism; TF, transcription factor; MW, molecular weight; Hy, hydrophobicity; MS, mass spectrometry

Acknowledgements We acknowledge the entire Chinese Human Chromosome Proteome Consortium. We also thank Jun Qin and Chen Ding at Beijing Proteome Research Center for providing reagents for TF enrichment analysis. This work was supported by the Chinese National Basic Research Program (2011CB910600, 2010CB912700, 2013CB911200 and 2013CB910802), the National High-Tech Research and Development Program (2011AA02A114, 2012AA020502, 2012AA020201, and 2012AA020409), the National Natural Science Foundation of China (31070673, 31170780, 81123001, 21105121, 21275160 and 31000379), the China-EU International Collaboration Program (2011-1001), Key Projects in the National Science & Technology Pillar Program (2012BAF14B00), the National Science and Technology Major Project of the Ministry of Science and Technology of China (No. 2012ZX10004801-003-011 29

ACS Paragon Plus Environment

Page 30 of 45

Page 31 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

and 2012ZX10002012-006), National Mega Projects for Key Infectious Diseases (2013zx10003002), the Beijing Municipal Natural Science Foundation (5112012 and 5122013) and the State Key Lab Project (SKLP-Y201102), and the Shenzhen Key Laboratory of Transomics Biotechnologies (No. CXB2011O8250096A).

Conflict of Interests Statement: The authors declare no competing financial interest.

Supporting information. This work contains Supplementary Tables S1-S9 and Figures S1-S12. The supporting information is available free of charge via the Internet at http://pubs.acs.org

Figure Legends Abstract Figure

Trans-omics analysis and comparison of hepatoma cell lines in a qualitative and quantitative way. An example of protein 3D structure was from Wikipedia (http://en.wikipedia.org/wiki/Protein) only to illustrate the protein.

Figure 1 Experimental workflow based on the cooperative research of trans-omics experiments and bioinformatics analysis.

30

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2 Trans-omics analysis and comparison of hepatoma cell lines. (A) According to central dogma, genetic information flows from the genome to the proteome through the transcriptome and translatome. Differences between the adjacent -omics are shown in the figure and their possible causes are also listed in the figures. (B) Venn diagram of gene numbers in the three -omes. (C) The identification rate of the three -omes and a public dataset (named PublicDB2012) for different chromosomes. The blue bar represents the number of protein-coding genes in the Ensembl database. The PublicDB2012 represents the combination of neXtProt gold (Nov. 2012), Human PeptideAtlas (Dec 2012), GPMDB (Green) (26 Nov. 2012) and Human Protein Atlas Evidence high or medium (Dec. 2012) mentioned in Table 1 in a previous paper.43

Figure 3 A trans-omics quantitative map and its correlation with gene function. (A) Clustering results between the three datasets and the three cell lines. The color level in the matrix indicates the degree of correlation. Here, the Spearman coefficient represents the correlation between two datasets. (B) The -omics data were divided into four parts according to abundances and the GO analysis of each part was analyzed using David software. Genes with different abundances tended to have different biological functions. T, R, and P represent the transcriptome, translatome, and proteome, respectively. 31

ACS Paragon Plus Environment

Page 32 of 45

Page 33 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 4 Comparative analysis of missing proteins based on RNC-mRNAs abundance and the physicochemical properties of the protein. (A) The distribution of log10 RPKM for all of the genes identified in the translatome (blue line) and the genes identified in both the translatome and proteome (red line). B-D: Comparative analyses of protein percentages and RNC-mRNA abundance. The x-axis represents log10 RNC-mRNA abundance; the y-axis represents the number of genes identified in the proteome relative to the genes in the translatome with RPKM larger than the corresponding value. (B) Lines with different colors represent proteins with different MWs. (C) Lines with different colors represent proteins with different Hys. (D) Lines with different colors represent proteins with different pIs. (E) Comparative analysis of variations of MW, Hy, and pI. The x-axis represents RNC-mRNA abundance, and the y-axis represents the standard deviation of the percentages of genes identified in the proteome relative to the genes with RPKM larger than the corresponding value. The red line represents MW, the blue line represents Hy, and the green line represents pI.

Figure 5 The efficiency of the TF-enrichment strategy. (A) A comparison of the number of identified proteins and their percentage of TFs in the general proteome profile and the

32

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

TF-enrichment strategy. (B) Rank comparison of shared proteins in the TF-enrichment strategy and the general proteome profile; a low rank indicates high abundance.

Figure 6 Summary of the identified SAPs in this study. (A) Venn diagram of SAP peptide pairs identified in the three cell lines. (B) Percentages of the theoretical and measured SAP peptide pairs in Hep3B cells. (C) Pie graph of the numbers and percentages of different kinds of mutations in our data. D~F: Illustration of three SAP peptide pairs. The red lines and the blue lines represent a mutated peptide pair.

Figure 7 General descriptions of the identified proteins. (A) Venn diagram of the identified proteins in the CCPD1.0 and CCPD2013 datasets. (B) The number of identified proteins and peptides with the corresponding accumulative curves in the three hepatoma cell lines and the CCPD1.0 dataset. (C) Evidence distribution in the neXtProt database of the identified proteins. (D) Identified protein-coding genes in different chromosomes. The three chromosomes that the CCPC studied are highlighted. The identified gene number over the total gene number in different chromosomes is shown near the highlighted bar. The percentages of identified genes for different chromosomes are shown at the top. (E) Overlap of identified proteins

33

ACS Paragon Plus Environment

Page 34 of 45

Page 35 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

with HPA, PeptideAtlas, GPMDB, and CCPD 2.0.

References

1. Legrain, P.; Aebersold, R.; Archakov, A.; Bairoch, A.; Bala, K.; Beretta, L.; Bergeron, J.; Borchers, C.; Corthals, G. L.; Costello, C. E.; Deutsch, E. W.; Domon, B.; Hancock, W.; He, F.; Hochstrasser, D.; Marko-Varga, G.; Salekdeh, G. H.; Sechi, S.; Snyder, M.; Srivastava, S.; Uhlen, M.; Hu, C. H.; Yamamoto, T.; Paik, Y. K.; Omenn, G. S., The human proteome project: Current state and future direction. Mol Cell Proteomics 2011. 2. Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol 2012, 30, (3), 221-3. 3. Geiger, T.; Wehner, A.; Schaab, C.; Cox, J.; Mann, M., Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol Cell Proteomics 2012, 11, (3), M111 014050. 4. Weber, R. J.; Li, E.; Bruty, J.; He, S.; Viant, M. R., MaConDa: a publicly accessible Mass spectrometry Contaminants Database. Bioinformatics 2012. 5. Kilgour, D. P.; Mackay, C. L.; Langridge-Smith, P. R.; O'Connor, P. B., Appropriate degree of trust: deriving confidence metrics for automatic peak assignment in high-resolution mass spectrometry. Anal Chem 2012, 84, (17), 7431-5. 6. Alghamdi, W. M.; Gaskell, S. J.; Barber, J., Detection of low-abundance protein phosphorylation by selective (18)o labeling and mass spectrometry. Anal Chem 2012, 84, (17), 7384-92. 7. Wang, T.; Cui, Y.; Jin, J.; Guo, J.; Wang, G.; Yin, X.; He, Q. Y.; Zhang, G., Translating mRNAs strongly correlate to proteins in a multivariate manner and their translation ratios are phenotype specific. Nucleic Acids Res 2013, 41, (9), 4743-54. 8. Kwon, K. H.; Kim, J. Y.; Kim, S. Y.; Min, H. K.; Lee, H. J.; Ji, I. J.; Kang, T.; Park, G. W.; An, H. J.; Lee, B.; Ravid, R.; Ferrer, I.; Chung, C. K.; Paik, Y. K.; Hancock, W. S.; Park, Y. M.; Yoo, J. S., Chromosome 11-centric human proteome analysis of human brain hippocampus tissue. J Proteome Res 2013, 12, (1), 97-105. 9. Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E. L., Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001, 305, (3), 567-80. 10. Butter, F.; Davison, L.; Viturawong, T.; Scheibe, M.; Vermeulen, M.; Todd, J. A.; Mann, M., Proteome-Wide Analysis of Disease-Associated SNPs That Show Allele-Specific Transcription Factor Binding. PLoS Genet 2012, 8, (9), e1002982. 11. Ding, C.; Chan, D. W.; Liu, W.; Liu, M.; Li, D.; Song, L.; Li, C.; Jin, J.; Malovannaya, A.; Jung, S. Y.; Zhen, B.; Wang, Y.; Qin, J., Proteome-wide profiling of activated transcription factors 34

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

with a concatenated tandem array of transcription factor response elements. Proc Natl Acad Sci U S A 2013, 110, (17), 6771-6. 12. Knowles, B. B.; Aden, D. P. Human hepatoma derived cell line, process for preparation thereof, and uses therefor. 1983. 13. Li, Y.; Tian, B.; Yang, J.; Zhao, L.; Wu, X.; Ye, S. L.; Liu, Y. K.; Tang, Z. Y., Stepwise metastatic human hepatocellular carcinoma cell model system with multiple metastatic potentials established through consecutive in vivo selection and studies on metastatic characteristics. J Cancer Res Clin Oncol 2004, 130, (8), 460-8. 14. Zhang, G.; Fedyunin, I.; Kirchner, S.; Xiao, C.; Valleriani, A.; Ignatova, Z., FANSe: an accurate algorithm for quantitative mapping of large scale sequencing reads. Nucleic Acids Res 2012, 40, (11), e83. 15. Bloom, J. S.; Khan, Z.; Kruglyak, L.; Singh, M.; Caudy, A. A., Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC Genomics 2009, 10, 221. 16. Mortazavi, A.; Williams, B. A.; McCue, K.; Schaeffer, L.; Wold, B., Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5, (7), 621-8. 17. Robinson, M. D.; McCarthy, D. J.; Smyth, G. K., edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, (1), 139-40. 18. Zhai, L.; Chang, C.; Li, N.; Duong, D. M.; Chen, H.; Deng, Z.; Yang, J.; Hong, X.; Zhu, Y.; Xu, P., Systematic research on the pretreatment of peptides for quantitative proteomics using a C18 microcolumn. Proteomics 2013, 13, (15), 2229-37. 19. Wu, S.; Li, N.; Ma, J.; Shen, H.; Jiang, D.; Chang, C.; Zhang, C.; Li, L.; Zhang, H.; Jiang, J.; Xu, Z.; Ping, L.; Chen, T.; Zhang, W.; Zhang, T.; Xing, X.; Yi, T.; Li, Y.; Fan, F.; Li, X.; Zhong, F.; Wang, Q.; Zhang, Y.; Wen, B.; Yan, G.; Lin, L.; Yao, J.; Lin, Z.; Wu, F.; Xie, L.; Yu, H.; Liu, M.; Lu, H.; Mu, H.; Li, D.; Zhu, W.; Zhen, B.; Qian, X.; Qin, J.; Liu, S.; Yang, P.; Zhu, Y.; Xu, P.; He, F., First proteomic exploration of protein-encoding genes on chromosome 1 in human liver, stomach, and colon. J Proteome Res 2013, 12, (1), 67-80. 20. Li, N.; Wu, S.; Zhang, C.; Chang, C.; Zhang, J.; Ma, J.; Li, L.; Qian, X.; Xu, P.; Zhu, Y.; He, F., PepDistiller: A quality control tool to improve the sensitivity and accuracy of peptide identifications in shotgun proteomics. Proteomics 2012. 21. Chang, C.; Zhang, J.; Han, M.; Ma, J.; Zhang, W.; Wu, S.; Xie, H.; He, F.; Zhu, Y., SILVER: An efficient tool for stable isotope labeling LC-MS data quantitative analysis with quality control methods. Bioinformatics 2013 (revised). 22. Schwanhausser, B.; Busse, D.; Li, N.; Dittmar, G.; Schuchhardt, J.; Wolf, J.; Chen, W.; Selbach, M., Global quantification of mammalian gene expression control. Nature 2011, 473, (7347), 337-42. 23. Craig, R.; Cortens, J. P.; Beavis, R. C., Open source system for analyzing, validating, and storing protein identification data. J Proteome Res 2004, 3, (6), 1234-42. 24. Farrah, T.; Deutsch, E. W.; Hoopmann, M. R.; Hallows, J. L.; Sun, Z.; Huang, C. Y.; Moritz, R. L., The state of the human proteome in 2012 as viewed through PeptideAtlas. J Proteome Res 2013, 12, (1), 162-71. 25. Uhlen, M.; Oksvold, P.; Fagerberg, L.; Lundberg, E.; Jonasson, K.; Forsberg, M.; Zwahlen, M.; Kampf, C.; Wester, K.; Hober, S.; Wernerus, H.; Bjorling, L.; Ponten, F., Towards a 35

ACS Paragon Plus Environment

Page 36 of 45

Page 37 of 45

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

knowledge-based Human Protein Atlas. Nat Biotechnol 2010, 28, (12), 1248-50. 26. Lane, L.; Argoud-Puy, G.; Britan, A.; Cusin, I.; Duek, P. D.; Evalet, O.; Gateau, A.; Gaudet, P.; Gleizes, A.; Masselot, A.; Zwahlen, C.; Bairoch, A., neXtProt: a knowledge platform for human proteins. Nucleic Acids Res 2012, 40, (Database issue), D76-83. 27. Chen, L. C.; Liu, M. Y.; Hsiao, Y. C.; Choong, W. K.; Wu, H. Y.; Hsu, W. L.; Liao, P. C.; Sung, T. Y.; Tsai, S. F.; Yu, J. S.; Chen, Y. J., Decoding the disease-associated proteins encoded in the human chromosome 4. J Proteome Res 2013, 12, (1), 33-44. 28. Huang da, W.; Sherman, B. T.; Lempicki, R. A., Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009, 4, (1), 44-57. 29. Oliveros, J. C. VENNY. An interactive tool for comparing lists with Venn Diagrams. http://bioinfogp.cnb.csic.es/tools/venny/index.html 30. Zhang, Y.; Yan, G.; Zhai, L.; Xu, S.; Shen, H.; Yao, J.; Wu, F.; Xie, L.; Tang, H.; Yu, H.; Liu, M.; Yang, P.; Xu, P.; Zhang, C.; Li, L.; Chang, C.; Li, N.; Wu, S.; Zhu, Y.; Wang, Q.; Wen, B.; Lin, L.; Wang, Y.; Zheng, G.; Zhou, L.; Lu, H.; Liu, S.; He, F.; Zhong, F., Proteome atlas of human chromosome 8 and its multiple 8p deficiencies in tumorigenesis of the stomach, colon, and liver. J Proteome Res 2013, 12, (1), 81-8. 31. Wang, Q.; Wen, B.; Yan, G.; Wei, J.; Xie, L.; Xu, S.; Jiang, D.; Wang, T.; Lin, L.; Zi, J.; Zhang, J.; Zhou, R.; Zhao, H.; Ren, Z.; Qu, N.; Lou, X.; Sun, H.; Du, C.; Chen, C.; Zhang, S.; Tan, F.; Xian, Y.; Gao, Z.; He, M.; Chen, L.; Zhao, X.; Xu, P.; Zhu, Y.; Yin, X.; Shen, H.; Zhang, Y.; Jiang, J.; Zhang, C.; Li, L.; Chang, C.; Ma, J.; Yao, J.; Lu, H.; Ying, W.; Zhong, F.; He, Q. Y.; Liu, S., Qualitative and quantitative expression status of the human chromosome 20 genes in cancer tissues and the representative cell lines. J Proteome Res 2013, 12, (1), 151-61. 32. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res 2013, 41, (Database issue), D43-7. 33. Zhang, C.; Li, N.; Zhai, L.; Xu, S.; Liu, X.; Cui, Y.; Ma, J.; Han, M.; Jiang, J.; Yang, C.; Fan, F.; Li, L.; Qin, P.; Yu, Q.; Chang, C.; Su, N.; Zheng, J.; Zhang, T.; Wen, B.; Zhou, R.; Lin, L.; Lin, Z.; Zhou, B.; Zhang, Y.; Yan, G.; Liu, Y.; Yang, P.; Guo, K.; Gu, W.; Chen, Y.; Zhang, G.; He, Q.; Wu, S.; Wang, T.; Shen, H.; Wang, Q.; Zhu, Y.; He, F.; Ping, X., Systematic analysis of missing proteins provides clues to help define all of the protein-coding genes on human chromosome 1. J Proteome Res 2013 (revised). 34. Liu, Y.; Ying, W.; Ren, Z.; Gu, W.; Zhang, Y.; Yan, G.; Yang, P.; Liu, Y.; Yin, X.; Chang, C.; Jiang, J.; Fan, F.; Zhang, C.; Xu, P.; Wang, Q.; Wen, B.; Lin, L.; Wang, T.; Du, C.; Zhong, J.; Wang, T.; He, Q.; Qian, X.; Lou, X.; Zhang, G.; Zhong, F., Chromosome 8-Coded Proteome of Chinese Chromosome Proteome Data Set (CCPD) 2.0 with Partial Immunohistochemical Verifications. J Proteome Res 2013 (revised). 35. Wang, Q.; Wen, B.; Wang, T.; Xu, Z.; Yin, X.; Xu, S.; Ren, Z.; Hou, G.; Zhou, R.; Zhao, H.; Zi, J.; Zhang, S.; Gao, H.; Lou, X.; Sun, H.; Feng, Q.; Chang, C.; Qin, P.; Zhang, C.; Li, N.; Zhu, Y.; Gu, W.; Zhong, J.; Zhang, G.; Yan, G.; Shen, H.; Liu, X.; Lu, H.; Yang, P.; Zhong, F.; He, Q.; Lin, L.; Liu, S., The omics evidences: single nucleotide variants transmissions on Chromosome 20 in liver cancer cell lines. J Proteome Res 2013 (revised). 36. Ringrose, J. H.; van den Toorn, H. W.; Eitel, M.; Post, H.; Neerincx, P.; Schierwater, B.; Altelaar, A. F.; Heck, A. J., Deep proteome profiling of Trichoplax adhaerens reveals remarkable features at the origin of metazoan multicellularity. Nat Commun 2013, 4, 1408. 37. Lundberg, E.; Fagerberg, L.; Klevebring, D.; Matic, I.; Geiger, T.; Cox, J.; Algenas, C.; 36

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Lundeberg, J.; Mann, M.; Uhlen, M., Defining the transcriptome and proteome in three functionally different human cell lines. Mol Syst Biol 2010, 6, 450. 38. Beck, M.; Schmidt, A.; Malmstroem, J.; Claassen, M.; Ori, A.; Szymborska, A.; Herzog, F.; Rinner, O.; Ellenberg, J.; Aebersold, R., The quantitative proteome of a human cell line. Mol Syst Biol 2011, 7, 549. 39. Zhong, F.; Yang, D.; Hao, Y.; Lin, C.; Jiang, Y.; Ying, W.; Wu, S.; Zhu, Y.; Liu, S.; Yang, P.; Qian, X.; He, F., Regular patterns for proteome-wide distribution of protein abundance across species. PLoS One 2012, 7, (3), e32423. 40. Henriksen, P.; Wagner, S. A.; Weinert, B. T.; Sharma, S.; Bacinskaja, G.; Rehman, M.; Juffer, A. H.; Walther, T. C.; Lisby, M.; Choudhary, C., Proteome-wide analysis of lysine acetylation suggests its broad regulatory scope in Saccharomyces cerevisiae. Mol Cell Proteomics 2012. 41. Carvalho, P. C.; Fischer, J. S.; Xu, T.; Cociorva, D.; Balbuena, T. S.; Valente, R. H.; Perales, J.; Yates, J. R., 3rd; Barbosa, V. C., Search engine processor: Filtering and organizing peptide spectrum matches. Proteomics 2012, 12, (7), 944-9. 42. Forbes, S. A.; Tang, G.; Bindal, N.; Bamford, S.; Dawson, E.; Cole, C.; Kok, C. Y.; Jia, M.; Ewing, R.; Menzies, A.; Teague, J. W.; Stratton, M. R.; Futreal, P. A., COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res 2010, 38, (Database issue), D652-7. 43. Marko-Varga, G.; Omenn, G. S.; Paik, Y. K.; Hancock, W. S., A first step toward completion of a genome-wide characterization of the human proteome. J Proteome Res 2013, 12, (1), 1-5.

37

ACS Paragon Plus Environment

Page 38 of 45

Figure 1

Page 39 of 45

Samples

progressive metastatic abilities

1 HCC cell lines 2 3 4 Hep3B 5 6 7 8 9 10 11 MHCC97H 12 13 14 15 16 17 18 HCCLM3 19 20 21 22 23 24 25

Journal of Proteome Research

Omics experiments

mRNA RNA-seq RNC-mRNA

Proteins

LC-MS/MS

Bioinformatics analysis

Transcriptome profile Translatome profile Proteome profile

TF-enriched profile ACS Paragon Plus Environment

Missing Proteins Liver Cancer Biology Mutation Analysis

Figure 2

Journal of Proteome Research

Page 40 of 45

A

Transcriptome 18,667

Translatome 18,246 8,194

130

549 9,918

6

C

30

Ensembl Protein-coding Genes Transcriptome PublicDB2012

Proteome Translatome

183

90.0 80.0 70.0

20

60.0 50.0 40.0 10

30.0 20.0

4 Proteome 10,111

100.0

10.0 0

0.0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y

ACS Paragon Plus Environment

Chromosome

Identified Percent (%)

B

Gene Number (×100)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 41 of 45

A

B

Translatome Distribution

Transcriptome Distribution

Hep3B-T Hep3B-T-P

Proteome Distribution Hep3B-P Hep3B-P-T Hep3B-P-R

Hep3B-R Hep3B-R-P

T-Hep3B T1

T2

T3

T4

R1

R2

R3

R4

P1

P2 P3

P4

R-Hep3B 100%

100%

90%

90%

90%

T-HCCLM3

80%

80%

80%

R-MHCC97H

70%

70%

70%

P-Hep3B P-MHCC97H P-HCCLM3

Figure 3

60%

Percentage

R-HCCLM3

Percentage

100%

T-MHCC97H

Percentage

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49

Journal of Proteome Research

60%

50%

50%

40%

40%

60% 50% 40%

30%

30%

30%

20%

20%

20%

10%

10%

10%

0%

0%

0%

T1

T2

T3

T4

R1

cellular process metabolic process localization biological regulation response to stimulus ACS Paragon Plus Environmentreproduction other

R2

R3

R4

cellular component multicellular organismal process biological adhesion

P1

P2

P3

P4

establishment of localization developmental process immune system process

Journal of Proteome Research

B

Number of genes

Proportion of identified genes

A

C

Log10 RPKM

Proportion of identified genes

E

Log10 RPKM

Standard deviation

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 42 of 45

D

Log10 RPKM

Proportion of identified genes

Log10 RPKM

Figure 4

ACS Paragon Plus Environment

Log10 RPKM

B

Number of identified genes

45.16%

7148 1000

TF%

50%

40%

1885

30% 100 20% 31

10 4.00% 1

Figure 5

Profile only

7.27%

Share

TF-enrich only

10%

0%

TF

Not TF

400

600

1400

Rank of TF-enriched profile

10000

Number of genes

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Journal of Proteome Research

Percentage of transcription factor

A

Page 43 of 45

1200 1000 800 600 400 200 0 0

ACS Paragon Plus Environment

200

800

1000

Rank of proteome profile

1200

1400

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

A

D

Journal of Proteome Research

MHCC97H (129) 65 29 HCCLM3 (50)

11

9

24 Hep3B (116)

1 80

B

SAP peptide, 18,617 (0.39%)

Database

SAP peptide, 129 (0.32%)

Identification

Non-SAP peptide, 39,589 (99.68%)

Non-SAP peptide, 4,792,934 (99.61%)

C

E

Non-SAP peptide SAP peptide

Number of peptide pairs

F

D-E 8 (13%) V-A 7 (12%)

Others 28 (47%)

I-V 5 (8%)

Figure 6

S-A 3 (5%)

A-T 3 (5%)

G-E L - P 3 (5%) 3 (5%) ACS Paragon Plus Environment

Page 44 of 45

B CCPD2013 (9,064)

Protein Number(103)

12

Identified Protein Number Identified Peptide Number Cumulative Protein Number Cumulative Peptide Number

10

8,423

3,676

641

8

20

15

10

6 4

5

2 0

0

CCPD1.0

CCPD2.0 (12,740)

Hep3B

MHCC97H

63.2% 62.1%

60.5%

CCPD2.0 12,639

1248+60 2071

50.0 40.0 30.0

408+27 701

322+17 560

4

20.0

2

10.0

0

0.0

Identification Precent

60.0

8

Figure 7

4

0

HPA 9,576

70.0

10

6

8

HCCLM3

E

14 12

12

11,386/585 76.6%

Unidentified Proteins Increased Proteins by CCPD2.0 CCPD1.0 Identified Proteins

511/49 15.8% 18/2 10.9%

7/0 8.1%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y

Chromosomes

ACS Paragon Plus Environment

78/3 12.7%

Protein Transcript Homology Predicted Uncertain

Protein evidence in neXtProt

D 16

16

14

Number of Proteins (103)

CCPD1.0 (12,099)

Number of Identified Genes(×100)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

C

Journal of Proteome Research

Peptide Number(104)

A

Page 45 of 45

PeptideAtlas 13,996 GPMDB 14,688

Genome 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 Graphical 29

TranscriptomeJournal of Proteome Research Translatome

Page 46 of 45 Proteome

22.2% Missing

2.9% Missing

Tissue Specificity Genome Uncertainty

Translation Control RNA-seq missing

Technical Bias of MS Translation Control

RNA-seq missing

Mapping Confusion RNA-seq missing

0.7% unique

1.8% unique

DNA

Ensembl v72 Abstract

mRNA

49.8% Missing

RNC-mRNA

Hepatoma Cell Lines

ACS Paragon Plus Environment

Protein