Jul 6, 2015 - Figure 1. Workflow for the identification of missing proteins (MPs). Low molecular weight (LMW) proteins are enriched from serum samples...
Special Enrichment Strategies Greatly Increase the Efficiency of Missing Proteins Identification from Regular Proteome Samples

Na Su, Chengpu Zhang, Yao Zhang, Zhiqiang Wang, Fengxu Fan, Mingzhi Zhao, Feilin Wu, Yuan Gao, Yanchang Li, Lingsheng Chen, Miaomiao Tian, Tao Zhang, Bo Wen, Na Sensang, Zhi Xiong, Songfeng Wu, Siqi Liu, Pengyuan Yang, Bei Zhen, Yunping Zhu, Fuchu He, and Ping Xu

J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00481 • Publication Date (Web): 06 Jul 2015

Journal of Proteome Research

Special Enrichment Strategies Greatly Increase the Efficiency of Missing Proteins Identification from Regular Proteome Samples

Na Su1#, Chengpu Zhang1#, Yao Zhang1,3, Zhiqiang Wang1,2, Fengxu Fan1,4, Mingzhi Zhao1, Feilin Wu1,5, Yuan Gao1, Yanchang Li1, Lingsheng Chen1,6, Miaomiao Tian1, Tao Zhang1, Bo Wen, Na Sensang7, Zhi Xiong5, Songfeng Wu1, Siqi Liu8, Pengyuan Yang9, Bei Zhen1*, Yunping Zhu1*, Fuchu He1*, Ping Xu1,2,4*

1 State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Engineering Research Center for Protein Drugs, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 102206, China.

2 Key Laboratory of Combinatorial Biosynthesis and Drug Discovery (Wuhan University), Ministry of Education, and Wuhan University School of Pharmaceutical Sciences, Wuhan 430071, China.

3 Institute of Microbiology, Chinese Academy of Science, Beijing 100101, China.

4 Anhui Medical University, Hefei 230032, Anhui, China.

5 Life Science College, Southwest Forestry University, Kunming 650224, China.

6 State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangxi University, Nanning 530005, China.

7 Inner Mongolia Medical University, Hohhot 010110, Inner Mongolia, China.

8 BGI-Shenzhen, Shenzhen 518083, China.

9 Institute of Biomedical Sciences, Department of Chemistry, and Zhongshan Hospital, Fudan University, 130 DongAn Road, Shanghai 200032, China.

# These authors contributed equally to this work.

* To whom correspondence should be addressed:

Ping Xu, Beijing Proteome Research Center, 33 Science Park Road, Changping District, Beijing 102206, China. Tel and Fax: 8610-80705066. E-mail: [email protected]

Fuchu He, Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing 100850, China. Tel and Fax: 8610-68171208. E-mail: [email protected]

Yunping Zhu, Beijing Proteome Research Center, 33 Science Park Road, Changping District, Beijing 102206, China. Tel and Fax: 8610-80705066. E-mail: [email protected]

Bei Zhen, Beijing Proteome Research Center, 33 Science Park Road, Changping District, Beijing 102206, China. Tel and Fax: 8610-80705066. E-mail: [email protected]

ABSTRACT

As part of the Chromosome-Centric Human Proteome Project (C-HPP) mission, laboratories all over the world have been trying to map all of the missing proteins (MPs) since 2012. Based on the first and second Chinese Chromosome Proteome Database (CCPD 1.0 and 2.0) studies, we developed systematic enrichment strategies to identify MPs that fall into four classes: (1) low molecular weight (LMW) proteins, (2) membrane proteins, (3) proteins that contain various post-translational modifications (PTMs), and (4) nucleic acid-associated proteins. Of 8845 proteins identified in 7 datasets, 79 were classified as MPs. Among the datasets derived from different enrichment strategies, those for LMW proteins and PTMs yielded the most novel MPs. In addition, some MPs were identified in multiple datasets, which implies that tandem enrichment methods might improve the ability to identify MPs. Moreover, low expression at the transcription level was the major reason these MPs had been “missing”; however, MPs with higher expression levels also evaded identification, most likely because of other characteristics such as LMW, high hydrophobicity, and PTMs. By combining a stringent manual check of the MS2 spectra with verification against synthesized peptides, we confirmed 30 MPs (neXtProt PE2–PE4) and 6 potential MPs (neXtProt PE5) with authentic MS evidence. By integrating our large-scale datasets of CCPD 2.0, the number of identified proteins has increased considerably beyond simulation saturation. Here, we show that special enrichment strategies can break through the data saturation bottleneck, which could increase the efficiency of MP identification in future C-HPP studies. All 7 datasets have been uploaded to ProteomeXchange with the identifier PXD002255.

KEYWORDS Chromosome-Centric Human Proteome Project, Missing Protein, Proteome, Enrichment Strategies

INTRODUCTION

With collaborative efforts to map the proteins of individual chromosomes, the Chromosome-Centric Human Proteome Project (C-HPP)1,2 Consortium of the Human Proteome Project (HPP)3 has entered its third research round. According to the neXtProt4 database (release 1/1/2015), there are currently 2948 protein-coding genes whose protein products have not been identified at the proteome level. These proteins are termed “missing proteins” (MPs). A variety of approaches have been employed to find MPs, including assessing organ-specific5 or cell type-specific expression6, early developmental expression in the embryo or fetus7, silent genes and proteins activated under certain stresses (e.g., the beta-defensin gene family), and overlapping matches of proteins with high sequence homology8,9. Following the mission of C-HPP, the Chinese Human Chromosome Proteome Consortium (CCPC) found that protein abundance, hydrophobicity and molecular weight (MW) might influence the detectability of expected proteins absent in the proteome.10 Thus, enrichment strategies that target such proteins might help to identify more MPs.

In addition, C-HPP also aims to map specific protein variations such as post-translational modifications (PTMs),2 which may hinder the identification of MPs. Ubiquitination, for example, plays an important role in many cellular processes and is implicated in the development of many severe diseases.11–14 Protein phosphorylation plays a key role in signal transduction processes, and its dysregulation is linked to human disease as well.15–18 Using commonly studied cell lines, Tomonaga et al. detected 8305 phosphoproteins among 11278 identified proteins, with 28205 phosphorylation sites. Among them, as many as 12852 were unknown phosphorylation sites that had never been recorded in the PhosphoSitePlus database.19 More interestingly, they found that 3033 of their identified proteins were MPs.20 Therefore, enrichment of ubiquitinated and phosphorylated proteins may allow detection of MPs that carry post-translational modifications.

Proteins are the driving force for all cellular processes and regulate cellular events through binding to different partners in the cell, including other proteins, peptides, DNA and RNA. These interactions are essential in the regulation of cell fates. Previous research revealed that some MPs were missed in proteomic studies because these proteins showed tissue- or cell-specific expression. Therefore, we hypothesized that some MPs may similarly be enriched in certain subcellular compartments and decided to focus some of our efforts on the analysis of nuclear proteins and mRNA-binding proteins.

Here, we efficiently identify MPs following specific enrichment strategies. Protein samples from four hepatocellular carcinoma (HCC) cell lines, MHCC97H, MHCC97L, HCCLM3, and HCCLM6, were enriched for membrane proteins, nuclear extracts, and phosphorylated and ubiquitinated proteins. Serum samples were used to enrich LMW proteins. A total of 8845 proteins were identified, of which 79 were MPs. Using strict manual spectra curation and verifying the results against synthesized peptides, 30 of these 79 MPs were validated.

MATERIALS and METHODS

Samples Used in this Study

Human serum samples were obtained from six healthy volunteers (three males and three females; 38–80 years old, mean 61.5 ± 14.2 years) as described previously.21 The samples were pooled before enrichment for LMW proteins. Four human hepatocellular carcinoma (HCC) cell lines with increased metastatic potential (MHCC97L, MHCC97H, HCCLM3, and HCCLM6) were used in this study.10

Enrichment Strategies for Preparing Proteome Samples

LMW Proteins

To enrich serum LMW proteins, the serum samples were treated by differential solubilization, an in-house gel-filter system, glycine SDS-PAGE, and ProteoMiner methods as described previously.21

Membrane Proteins

To mine MPs from the membrane proteome, membrane proteins were prepared from the MHCC97L and HCCLM6 cell lines using ultracentrifugation as described previously.22,23 Briefly, the cell homogenate was centrifuged at 9600 g for 15 min at 4°C. The resulting supernatant was then centrifuged at 120000 g for 80 min at 4°C. Following centrifugation, the supernatant was removed, and the pellet was washed with 0.1 M ice-cold sodium carbonate (pH 11.5, 4°C) for 1 h and centrifuged at 120000 g for 80 min at 4°C. The resulting supernatant was defined as the first enriched membrane protein fraction. Further membrane protein extraction was performed on the pellet using Mem-PER eukaryotic membrane protein extraction reagents (Pierce Biotechnology, Inc.) following the standard protocol. The membrane proteins were resolved on a 0.2% SDS-PAGE gel prior to liquid chromatography-mass spectrometry (LC-MS) analysis.

Phosphoproteome

Phosphoproteome enrichment samples were prepared from the HCCLM6 cell line as described previously.24,25 Briefly, after an off-line high-pH HPLC separation, the phosphopeptides were enriched by a multi-step immobilized metal ion affinity chromatography (IMAC) method.

Ubiquitome

Ubiquitome enrichment samples were prepared from MHCC97H cells as described previously.26 Briefly, total MHCC97H cell lysates were isolated by centrifugation at 100000 g for 30 min at 4°C and then incubated with TUBE-conjugated agarose beads for 30 min at 4°C. After incubation, the beads were washed using native buffer, and the proteins were eluted by boiling the beads in SDS-PAGE loading buffer. Enriched proteins were separated by SDS-PAGE and in-gel trypsinized for further analysis.

Nuclear Extracts (NE) Proteome

NE of the MHCC97L and HCCLM6 cell lines were prepared using NE-PER Nuclear and Cytoplasmic Extraction Reagents (Thermo, Waltham, MA, USA).27,28 NE samples were separated by SDS-PAGE and in-gel trypsinized for further analysis.

mRNA Binding Proteome

mRNA binding proteins were enriched using comprehensive identification of RNA binding proteins by mass spectrometry (ChIRP-MS) as described previously.29

Mass Spectrometry Analysis and Database Searching

Peptides were analyzed using an ultra-performance LC-MS/MS platform. The LC separation was performed on a Waters nanoACQUITY ultra-performance liquid chromatography (UPLC) system (Waters, Milford, MA, USA) with an in-house packed capillary column (75 µm i.d. × 15 cm) containing 3 µm C18 reverse-phase fused-silica (Michrom Bioresources, Inc., Auburn, CA). The sample was eluted with a 60–140 min nonlinear gradient ramped from 8% to 40% mobile phase B (phase A: 0.1% formic acid (FA) + 2% acetonitrile (ACN) in water; phase B: 0.1% FA in ACN) at a flow rate of 0.3 µL/min. Eluted peptides were analyzed using an LTQ-Orbitrap Velos mass spectrometer (Thermo, San Jose, CA, USA). MS1 scans were acquired over a mass range of 300–1600 Da with a resolution of 30000 at m/z 400. The isolation width was 2 m/z for precursor ion selection. The automatic gain control (AGC) was set to 1 × 10⁶, and the maximum injection time (MIT) was 150 ms. MS2 scans were acquired in data-dependent mode, selecting the 20 most intense ions for fragmentation in the linear ion trap. For each scan, the AGC was set at 1 × 10⁴ and the MIT was 25 ms. The dynamic exclusion was set at 20–40 s to suppress repeated detection of the same fragment ion peaks. The relative collision energy for MS2 was 35% for CID and 40% for HCD.

For database searching, the 7 MS/MS datasets were processed in parallel using both Mascot30 (v2.3.2) and MaxQuant31 (v1.5.0.25). The MS/MS spectra were searched against the entries shared between the Swiss-Prot database (release 2014.05) and the neXtProt database (release 2015.01) (20,032 entries, including 2915 MPs), along with 245 common contaminant protein sequences (http://www.maxquant.org/contaminants.zip). The program parameters were as follows. Enzyme specificity was set to trypsin, and the search included cysteine carbamidomethylation as a fixed modification and methionine oxidation as a variable modification. For the phosphoproteome and ubiquitome datasets, phosphorylation of Ser, Thr, and Tyr residues (STY) and the GlyGly tail of lysine were additionally used, respectively. Up to two missed cleavages were allowed for protease digestion, and only fully tryptic peptides with no fewer than 7 amino acids were used. The union of nonconflicting peptide-spectrum matches (PSMs) from the two search engines was used for protein assembly. Only proteins with more than 1 unique peptide were considered. A target-decoy-based strategy was used to verify data quality for each dataset, and the results produced both PSM- and protein-level false discovery rates (FDRs) of less than 1% (ref 10).
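The two filtering ideas in the paragraph above, taking the union of nonconflicting PSMs from two search engines and applying a target-decoy FDR cutoff, can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary layout keyed by spectrum ID, and the simple decoys/targets FDR estimate are all assumptions made for the example, and tied scores are not treated specially.

```python
from typing import Dict, List, Optional, Tuple


def merge_nonconflicting(engine_a: Dict[str, str],
                         engine_b: Dict[str, str]) -> Dict[str, str]:
    """Union of PSMs keyed by spectrum ID (hypothetical layout).

    A spectrum assigned to different peptides by the two engines is
    treated as conflicting and dropped.
    """
    merged: Dict[str, str] = {}
    for spectrum_id in engine_a.keys() | engine_b.keys():
        pep_a = engine_a.get(spectrum_id)
        pep_b = engine_b.get(spectrum_id)
        if pep_a is not None and pep_b is not None and pep_a != pep_b:
            continue  # conflicting assignment: discard this spectrum
        merged[spectrum_id] = pep_a if pep_a is not None else pep_b
    return merged


def score_cutoff_at_fdr(psms: List[Tuple[float, bool]],
                        max_fdr: float = 0.01) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms holds (score, is_decoy) pairs; FDR is estimated here as
    decoys / targets among PSMs at or above the cutoff (an assumption).
    Returns None if no cutoff satisfies the threshold.
    """
    cutoff: Optional[float] = None
    targets = 0
    decoys = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff
```

With a cutoff in hand, PSMs scoring below it would be discarded before protein assembly; the same counting scheme can in principle be repeated at the protein level.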

Bioinformatics Analysis of Identified Proteins

ProPAS (v1.1) was used to calculate physicochemical properties, including MW and hydrophobicity (Hy), based on protein sequences.32 PeptideSieve was used to predict the peptides of each protein with high proteotypic propensity (threshold score ≥ 0.8) based on physicochemical properties.33 For transmembrane domain (TD) prediction, TMHMM 2.0 was used.34 To analyze the nucleic-acid-associated protein datasets, the UniProt keywords (http://www.uniprot.org/keywords) “DNA binding” and “RNA binding” were used. For analysis of the PTM datasets, experimentally observed phosphorylation and ubiquitination sites were looked up in PhosphoSitePlus.19 GPS 3.0 (ref 35) was used to predict the kinases that phosphorylated the MPs identified in the phosphoproteome datasets. Gene ontology (GO) and KEGG analyses were performed using DAVID (v6.7).36 Terms with a p-value ≤ 0.01 were considered significantly enriched. Dataset overlap analysis was performed using VENNY 2.0 (ref 37) and BioVenn38. Hierarchical clustering of the MPs identified in different datasets was performed using Cluster 3.0.39 Peptide sampling simulation of large-scale proteome profiling experiments was performed as described in CCPD 2.0.28

Verification of Missing Protein with Synthesized Peptides

To evaluate the authenticity of the MPs identified in our study, the quality of their spectra was manually evaluated by inspecting the base peak intensity and the matching b/y ions.


Relatively high-quality peptides were selected and synthesized for further analysis. The pFind and pBuild40–42 software tools were used to match spectra from the original peptides to those of the synthesized peptides and to display the spectra. Cosine similarity was calculated using only the b1+, b2+, y1+ and y2+ ions that matched the peptide. Both m/z and ion intensity were used to calculate the similarity score.

$$\mathrm{Score} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$

This formula calculates the cosine similarity score between spectra from the large-scale experiments and spectra from synthesized peptides. For a pair of spectra, A represents the spectrum from the large-scale experiments, and B represents the spectrum generated by the synthesized peptide. n represents the total number of matched ions across the two spectra. If ion i is present in both spectra, A_i and B_i are its intensities in the respective spectra. If ion i is observed only in spectrum A, then B_i is set to 0. If ion i is observed only in spectrum B, then A_i is set to 0.
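The score defined above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pFind/pBuild implementation: the function name and the representation of each spectrum as a mapping from a matched-ion label to its intensity are assumptions, and matching ions by m/z is taken to have been done beforehand.

```python
import math
from typing import Dict


def cosine_similarity(spec_a: Dict[str, float],
                      spec_b: Dict[str, float]) -> float:
    """Cosine similarity between two spectra of matched fragment ions.

    spec_a and spec_b map an ion label (hypothetical keys such as "b3"
    or "y5") to its intensity. An ion missing from one spectrum
    contributes an intensity of 0 there, as in the definition above.
    """
    ions = spec_a.keys() | spec_b.keys()
    dot = sum(spec_a.get(ion, 0.0) * spec_b.get(ion, 0.0) for ion in ions)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no matched ions in one spectrum: define the score as 0
    return dot / (norm_a * norm_b)
```

Because the score depends only on the direction of the intensity vectors, two spectra whose matched ions have the same relative intensities score 1 regardless of absolute signal, which suits a comparison between a low-abundance endogenous peptide and its synthesized counterpart.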

 RESULTS and DISCUSSION Protein

Identification from Enriched Sub-proteome Samples Through Varied Enrichment Strategies

Our previous work28 suggested that enrichment strategies could improve the proteome coverage of regular biological samples, resulting in more efficient MP detection. In this study we employed multiple enrichment strategies to identify additional MPs (Figure 1). After a unified database search and quality control with FDR < 1% at both the PSM and protein levels, more than 1 million spectra and 109746 peptides corresponding to 8845 proteins were identified from the seven datasets, covering 44.2% of human protein-coding genes according to neXtProt database annotations (Table 1, Supplementary Figure S1, Supplementary Table). Of these, 79 were identified as MPs based on the neXtProt database MP list (release 1/1/2015). The number of identified MPs did not correlate strongly (R² = 0.46) with the total number of identified proteins in these datasets, which suggests that the MPs were not discovered randomly as a result of increasingly large MS/MS datasets. Instead, the varied proteome sample preparation strategies helped to specifically enrich different groups of proteins for identification by MS. To verify the identity of the 79 newly identified MPs, their properties were systematically analyzed.
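As a quick arithmetic check on the figures in this paragraph, the reported coverage can be reproduced from the counts given in the text. The sketch below assumes the denominator is the 20,032-entry search database described in Materials and Methods; the text itself does not state which denominator was used, and the helper name is hypothetical.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


identified_proteins = 8845   # proteins identified across the seven datasets
search_db_entries = 20032    # assumed denominator: entries in the search database
identified_mps = 79          # proteins classified as MPs

# Share of the searched protein-coding entries that were identified.
coverage = percent(identified_proteins, search_db_entries)

# Share of the identified proteins that were MPs.
mp_share = percent(identified_mps, identified_proteins)

print(coverage, mp_share)
```

Under that assumption the first value agrees with the 44.2% coverage reported above, and the second shows that MPs make up a little under 1% of all identified proteins.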

Special Enrichment Strategies Help to Identify MPs from HCC Cell Lines

To compare the ability of different enrichment strategies to identify MPs, 5 datasets (membrane dataset, phosphoproteome datasets 1 & 2, ubiquitome dataset and NE dataset) were prepared from 4 HCC cell lines. The CCPD 2013 (ref 10) transcriptome/proteome datasets were employed as a control because that study used the same HCC cell lines. A total of 8190 proteins were identified in the 5 HCC cell line datasets, of which 64 were MPs (Supplementary Figure S2A, left panel). Previously, in the CCPD 2013 study, a total of 16273 protein-coding genes were mapped at the transcriptome level, 1561 of which were MPs (Supplementary Figure S2A, middle panel), and 9087 proteins were identified at the proteome level, 57 of which were MPs (Supplementary Figure S2A, left panel). A Venn diagram of the proteins identified in the CCPD 2013 proteome dataset and in the 5 HCC cell line datasets (Supplementary Figure S2B) showed that our datasets included 993 uniquely identified proteins, 50 of which were MPs, representing 5.0% of the uniquely identified proteins. In contrast, 1890 proteins were uniquely identified in the CCPD 2013 proteome dataset, including 43 MPs, which represent only 2.3% of the uniquely identified proteins. This result suggests that the enrichment strategies employed in this study may increase the efficiency of MP identification, especially for low-abundance proteins.

GO analysis of the uniquely identified proteins validated our enrichment strategies (Supplementary Figure S2C). We found that proteins identified in the RNA binding and membrane datasets were significantly enriched in nuclear and intracellular membrane components.

To determine whether the proteins uniquely identified in the 5 HCC cell line datasets were low-abundance proteins, we consulted the transcriptome data generated by the CCPD 2013 study. We found that some of our uniquely identified proteins were not represented in the CCPD 2013 transcriptome data, indicating that these proteins may be expressed at a low level under standard HCC cell line culture conditions. This low abundance would have hampered identification of these proteins by standard large-scale MS/MS experiments that do not employ enrichment strategies. Surprisingly, of the 1354 MPs that lacked transcriptome data (the 2915 MPs in the search database minus the 1561 mapped at the transcriptome level), we identified 19 in this study, whereas only eight were identified in the CCPD 2013 proteome dataset, even though that dataset contained more proteins overall. These results suggest that enrichment strategies facilitate identification of low-abundance MPs.

Because 97% of the proteins in both the CCPD 2013 proteome dataset and each enrichment dataset were covered by the CCPD 2013 transcriptome data (Supplementary Figure S2D), we used the quantitative transcriptome information to compare the MP identification efficiency of large-scale proteomics profiling with that of the enrichment strategy. The reads per kilobase per million mapped reads (RPKM) distribution of the 993 uniquely identified proteins and 64 MPs in the 5 HCC cell line datasets showed a similar pattern to that of the CCPD 2013 and included a relatively high proportion of proteins missed by transcriptome analysis (Figure 2A). Of the uniquely identified proteins and MPs in our datasets, 41.5% and 60.9% had an RPKM of