Tissue-Based Proteogenomics Reveals that Human Testis Endows


Article

The Tissue-Based Proteogenomics Reveals that Human Testis Endows Plentiful Missing Proteins Yao Zhang, Qidan Li, Feilin Wu, Ruo Zhou, Yingzi Qi, Na Su, Lingsheng Chen, Shaohang Xu, Tao Jiang, Chengpu Zhang, Gang Cheng, Xinguo Chen, Degang Kong, Yujia Wang, Tao Zhang, Jin Zi, Wei Wei, Yuan Gao, Bei Zhen, Zhi Xiong, Songfeng Wu, Pengyuan Yang, Quanhui Wang, Bo Wen, Fuchu He, Ping Xu, and Siqi Liu J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00435 • Publication Date (Web): 18 Aug 2015 Downloaded from http://pubs.acs.org on August 26, 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

The Tissue-Based Proteogenomics Reveals that Human Testis Endows Plentiful Missing Proteins

Yao Zhang1,4,5#, Qidan Li2,3,5#, Feilin Wu1,6#, Ruo Zhou3#, Yingzi Qi1#, Na Su1, Lingsheng Chen1,8, Shaohang Xu3, Tao Jiang3, Chengpu Zhang1, Gang Cheng3, Xinguo Chen9, Degang Kong10, Yujia Wang3, Tao Zhang1, Jin Zi3, Wei Wei1, Yuan Gao1, Bei Zhen1, Zhi Xiong6, Songfeng Wu1, Pengyuan Yang11, Quanhui Wang2,3,5, Bo Wen3*, Fuchu He1*, Ping Xu1,7*, Siqi Liu2,3,5*

1 State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Engineering Research Center for Protein Drugs, National Center for Protein Sciences, Beijing Institute of Radiation Medicine, Beijing 102206, China
2 CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 101318, China
3 BGI-Shenzhen, Shenzhen 518083, China
4 Institute of Microbiology, Chinese Academy of Science, Beijing 100101, China
5 Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
6 Life Science College, Southwest Forestry University, Kunming 650224, P. R. China
7 Key Laboratory of Combinatorial Biosynthesis and Drug Discovery (Wuhan University), Ministry of Education, and Wuhan University School of Pharmaceutical Sciences, Wuhan 430071, China
8 State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangxi University, Nanning 530005, China
9 Institute of Organ Transportation, General Hospital of Chinese People’s Armed Police Forces, Beijing 100039, China
10 General Surgery Dept., Capital Medical University Affiliated Beijing YouAn Hospital, Beijing 100069, China
11 Institutes of Biomedical Sciences, Department of Chemistry and Zhongshan Hospital, Fudan University, 130 DongAn Road, Shanghai 200032, China

# These authors contributed equally to this work.


ACS Paragon Plus Environment


* To whom correspondence should be addressed:
Siqi Liu, Beijing Institute of Genomics, CAS, 1 BeiChen West Road, Beijing 100101, China. Tel and Fax: 86-10-80485460; E-mail: [email protected]
Ping Xu, Beijing Proteome Research Center, 33 Science Park Road, Changping District, Beijing 102206, China. Tel and Fax: 86-10-80705066; E-mail: [email protected]
Fuchu He, Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing 100850, China. Tel and Fax: 86-10-68171208; E-mail: [email protected]
Bo Wen, BGI-Shenzhen, 11 Build, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China. Tel and Fax: 86-0755-25273620; E-mail: [email protected]

ABSTRACT

Investigations of missing proteins (MPs) are being pursued with many bioanalytical strategies. We proposed that proteogenomics of testis tissue was a feasible approach to identify more MPs because testis tissues have higher gene expression levels. Here, we combined proteomics and transcriptomics to survey gene expression in human testis tissues from three post-mortem individuals. Proteins were extracted and separated with glycine- and tricine-SDS-PAGE. A total of 9,597 protein groups were identified; of these, 166 protein groups were listed as MPs, including 138 groups (83.1%) with transcriptional evidence. A total of 2,948 proteins are designated as MPs, and 5.6% of these were identified in this study. The high incidence of MPs in testis tissue indicates that it is a rich resource for MPs. Functional category analysis revealed that the biological processes in which testis MPs are mainly involved are sexual reproduction and spermatogenesis. Some of the MPs are potentially involved in tumorigenesis in other tissues. Therefore, this proteogenomics analysis of individual testis tissues provides convincing evidence for the discovery of MPs. All mass spectrometry data from this study have been deposited in the ProteomeXchange (dataset identifier PXD002179, username: [email protected], password: NFEv8D8P).

KEYWORDS

Chromosome-Centric Human Proteome Project, testis, missing proteins, proteome, transcriptome, individual



INTRODUCTION

Following in the footsteps of the Human Proteome Project (HPP), the Chromosome-Centric Human Proteome Project (C-HPP) has entered its third productive year. The mission of C-HPP Phase I is to identify additional “missing proteins” (MPs); direct liquid chromatography-mass spectrometry (LC-MS) or antibody-based proteomics methods have not yet identified the full complement of MPs. The neXtProt database1 (2014-09-19) estimated that protein evidence is still lacking for 2,948 protein-coding genes (Protein Evidence 2-4). To find these MPs, in-depth proteomics studies have been performed on colon cancer cell lines2, liver cell lines3, brain tissues4, 5, and placenta tissues6. Three “human proteome drafts” have been published, which included all major organs, tissues, body fluid samples, and various cell lines7-10. Analysis of the location and abundance of cell transcripts could provide clues for gene expression of the unidentified MPs. Transcriptomics analysis using different tissues indicated that 18% of MP genes may be specifically enriched in certain tissues8. Therefore, in-depth proteomic analysis of these special tissues is a crucial remaining step for identification of MPs. The testis is the organ of male gamete development and is known to have high gene expression levels11. The number of genes expressed in other human tissues varies from 11,000 to 13,000, whereas more than 15,000 gene transcripts have been detected in testis12. An analysis of mouse testis revealed high transcriptome complexity, which was primarily due to meiotic spermatocytes and postmeiotic round spermatids13. The testis has more than 1,300 tissue-enriched genes with more than five-fold higher abundance, and 364 genes with up to 50-fold higher abundance, compared with the corresponding genes in 26 other organs14. Testis indeed expresses more specific proteins than other normal tissues, such as cancer-testis (CT) antigens, which are a group of immunogenic proteins encoded by genes that are normally expressed in the human germ line, but not in other normal tissues15, 16. Approximately 200 CT genes are reported in the CT database; 28 of these are immunohistochemically detected only in normal testis tissue and some cancer cells17. The testis proteome has been extensively analyzed using a pooling strategy for multiple samples. Based on our own re-analysis of public datasets, we found that the Pandey group prepared protein extracts from three individual adult and fetal testis tissues and identified 8,341 and 6,954 proteins, respectively7. The Kuster group identified 6,123 proteins using a pool of ten testis samples and extensive analysis8. By contrast, antibody-based protein identification methods may detect more low-abundance proteins and provide more information on cellular localization. Uhlen and colleagues used antibodies and detected 11,330 proteins in testis tissue microarrays, which is one of the largest datasets for the human proteome9. These results suggest that proteome analysis of individual tissues might detect more proteins than analysis of pooled tissue samples because the abundance of some low-abundance proteins varies among individuals.
The testis was reported to have the most transcribed genes11-14, and the transcriptional data from seven individual testis tissues were integrated with the antibody-based proteome data. Based on comparisons of mRNA abundance among different tissues, testis was reported to have the most diverse transcriptome and proteome for seeking MPs and missing RNAs. In this study, we analyzed the proteome and transcriptome of three individual human testis samples. To improve the proteome coverage of the three testis samples, we used a combination of glycine-SDS-PAGE and tricine-SDS-PAGE for protein separation. We identified a total of 9,597 proteins; of these, 166 protein groups were confirmed MPs. Transcriptome analysis also revealed that MPs might be highly enriched in testis tissue. Gene ontology (GO) analysis and Ingenuity Pathway Analysis (IPA) of MPs showed that 82.4% of MPs with known annotation information were strongly associated with various diseases, especially cancers. Some of these, such as ADORA3, HLA-C (confirmed) and ZNF23, have been designated as potential biomarkers that may be useful for disease diagnosis and treatment. Transcriptomic and proteomic analyses of individual testis tissues provide an efficient strategy and a valuable dataset for further studies of MPs.

MATERIALS AND METHODS

Testis Tissues Used in this Study

Human testis samples were collected from the General Hospital of Chinese People’s Armed Police Forces and Capital Medical University Affiliated Beijing YouAn Hospital. The tissues were collected post mortem from three adult donors as part of a rapid autopsy program by authors X.C. and D.K. The IRB approval number is BGI-IRB 15076. Tissues were washed with PBS three times and stored at -80°C until use. The tissues were histologically confirmed to be normal before analysis. This study was approved by BGI’s Institutional Review Board for the use of human tissues.

Proteome Sample Preparation and LC-MS/MS Analysis

Three testis tissue samples (50 mg) were ground in liquid nitrogen and sonicated in lysis buffer (8 M urea, 5 mM IAA, 50 mM NH4HCO3, 1× protease cocktail) on ice. Unbroken debris was removed by centrifugation (13,300×g) at 4°C for 15 min. The protein concentration was determined by a gel-assisted method as described previously18. Extracted proteins were resolved by SDS-PAGE (10%) and tricine-SDS-PAGE (12%), respectively, followed by staining with Coomassie Brilliant Blue. The gel lanes were then cut into multiple bands, as indicated, based on the molecular weight and the protein abundance in the specific region. Each gel band was digested with trypsin before being subjected to LC-MS/MS analysis. LC-MS/MS experiments were performed essentially as described previously19. Briefly, each peptide mixture was dissolved in sample loading buffer (1% acetonitrile and 1% formic acid in water). The peptides were then separated and analyzed by UPLC (nano Acquity Ultra Performance LC, Waters, Milford, MA, USA) and tandem MS (LTQ Orbitrap Velos, Thermo Fisher Scientific, Waltham, MA, USA). Survey scans were performed in the Orbitrap analyzer at a resolution of 30,000 and a target value of 1,000,000 ions over a mass range of 300-1600 m/z. The 10 most intense ions were subjected to fragmentation via collision-induced dissociation in the LTQ. For each scan, 5,000 ions were accumulated over a maximum allowed fill time of 25 ms and fragmented by wideband activation. Exclusion of precursor ion masses over a time window of 30 s was used to reduce repeated peak fragmentation.

MS/MS Data Analysis

The raw MS/MS data were converted into MGF format files by ProteoWizard (v3.0.4238). The
MS/MS data were then searched by three search engines (OMSSA v2.1.8, X!Tandem v2009.10.01.1 and MS-GF+ v9733) against the human Swiss-Prot database (20,050 sequences, release 2014-12-22) with decoy sequences and contaminant protein sequences (245 sequences)20-22. The following parameters were set for database searching: carbamidomethylation of cysteine was specified as a fixed modification, and deamidation of asparagine and glutamine and oxidation of methionine were specified as variable modifications. The precursor mass tolerance was 10 ppm, and the product ion tolerance was 0.6 Da. Full cleavage by trypsin was required, with one missed cleavage permitted. The results from the three search engines were then integrated by IPeak23, 24, a tool that combines multiple search engine results. Only identifications satisfying the following criteria were considered: 1) peptide length ≥7; 2) FDR ≤1% at the peptide level; 3) FDR ≤1% at the protein level; 4) for one-hit-wonder proteins, at least one peptide longer than 9 aa. The protein-level FDR was calculated using the picked FDR strategy25.

Bioinformatics Analysis of Identified Proteins

Protein identifications and their distribution in the glycine and tricine gels were compared to evaluate the contribution of each gel type to the total proteome dataset. Furthermore, individual contributions to large-scale identification were inspected at the level of protein variability and of non-redundant protein and peptide saturation curves. Three published testis datasets (from the Pandey, Kuster and Uhlen groups) and two public databases (PeptideAtlas26 2014-08 and CCPD 2.027) were chosen for comparison with our proteome data. Notably, we did not use the testis proteome results from the two Human Proteome drafts directly, but downloaded their raw data from the websites and re-analyzed them with the same pipeline as in this study.
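
The identification filters listed under MS/MS Data Analysis can be illustrated with a minimal sketch. It assumes a plain target-decoy FDR estimate (decoys divided by targets above a score cutoff) in place of IPeak's score integration, and it leaves out the protein-level picked FDR step entirely. The `PSM` record, its `score` field, and both function names are hypothetical and are not taken from the study's pipeline.

```python
from collections import defaultdict
from typing import NamedTuple


class PSM(NamedTuple):
    """Hypothetical peptide-spectrum match record (not the study's format)."""
    peptide: str
    protein: str
    score: float      # assumed: higher means a better match
    is_decoy: bool


def accept_at_fdr(psms, max_fdr=0.01):
    """Keep target PSMs above the deepest rank where decoys/targets <= max_fdr."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for rank, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    return [p for p in ranked[:cutoff] if not p.is_decoy]


def filter_identifications(psms, min_len=7, one_hit_min_len=10, max_fdr=0.01):
    """Apply the length filter, the FDR filter, and the one-hit-wonder rule."""
    # Criterion 1: peptide length >= 7.
    long_enough = [p for p in psms if len(p.peptide) >= min_len]
    # Criterion 2 (simplified): peptide-level FDR <= 1%.
    accepted = accept_at_fdr(long_enough, max_fdr)

    peptides_by_protein = defaultdict(set)
    for psm in accepted:
        peptides_by_protein[psm.protein].add(psm.peptide)

    kept = {}
    for protein, peptides in peptides_by_protein.items():
        # Criterion 4: a protein supported by a single peptide needs that
        # peptide to be longer than 9 aa (that is, at least 10 residues).
        if len(peptides) == 1 and len(next(iter(peptides))) < one_hit_min_len:
            continue
        kept[protein] = sorted(peptides)
    return kept
```

Calling `filter_identifications` on a list of `PSM` records returns a dictionary that maps each retained protein to its supporting peptides. A protein supported by one short peptide is dropped even when that peptide passes the FDR cutoff.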


Transcriptome Sample Preparation and Sequencing

RNA extraction from the three testis samples was performed as described previously28. The cDNA libraries were constructed using a published protocol29 with slight modification. Briefly, DNA was removed from 2 µg of total RNA with DNase I (NEB, Ipswich, MA, USA). The mRNA was then purified from the total RNA using the Dynabeads mRNA Purification Kit (Ambion, Carlsbad, CA, USA). Subsequently, the mRNA was randomly fragmented into short fragments of approximately 150 bp, followed by reverse transcription to cDNA. After ligation to Ion Proton adapters, the fragments (average length for the three samples, 239-247 bp) were diluted to 8 pM for emulsion PCR and sequenced using the PI™ Chip v2 on the Ion Proton™ Sequencer (Thermo Fisher Scientific, Waltham, MA, USA).

RNA-Seq Data Analysis

Raw sequencing reads were pre-processed by removing adapters and low-quality reads, and clean reads were obtained with the strict quality control steps of the standard Ion Proton RNA-seq protocol of BGI-Shenzhen. In general, 22-24 M clean reads with an average length of 135 bp were obtained for each of the three samples. The reference transcript set was based on the Ensembl GRCh37 reference genome, and clean reads were mapped to the reference genes using Tmap. No more than three mismatches were allowed in the alignment. In parallel, clean reads were mapped to the UCSC hg19 reference genome using TopHat. The reads aligned to the reference transcript sequences were used for transcript-level quantification. Three quantification values were reported: effective read counts, coverage, and reads per kilobase per million mapped reads (RPKM). Only genes with ≥10 uniquely mapped reads were used for downstream analysis. Effective read counts were normalized with respect to transcript length and overall lane yield to
give RPKM, which was used to represent transcript abundance.

Quality Checking and Functional Analysis of MPs

To evaluate the authenticity of the MPs identified in our study, the quality of the spectra assigned to MPs was manually inspected by examining base peak intensity and b/y ion matches with the aid of the pLabel software developed by the pFind group30-32. DAVID33 and IPA34 were used for biological function and disease-related biomarker analysis of the MPs.
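
The RPKM normalization described under RNA-Seq Data Analysis can be written out explicitly. The sketch below uses the standard RPKM definition (read count scaled by transcript length in kilobases and library size in millions of mapped reads) and applies the ≥10 read cutoff stated in the text. The function names are illustrative, and computing the library size from all counted reads before the cutoff is an assumption, since the text does not specify the order.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("transcript length and library size must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def quantify(read_counts, lengths_bp, min_reads=10):
    """Return RPKM for every gene with at least `min_reads` mapped reads.

    Assumption: the library size is the sum of all counts, taken before
    the read-count cutoff is applied.
    """
    total = sum(read_counts.values())
    return {
        gene: rpkm(count, lengths_bp[gene], total)
        for gene, count in read_counts.items()
        if count >= min_reads
    }


# Worked example: 500 reads on a 2,000 bp transcript in a 20 M read library.
example = rpkm(500, 2_000, 20_000_000)  # 12.5
```

Because RPKM divides by both transcript length and library size, values are comparable across genes of different lengths and across samples sequenced to different depths.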

RESULTS AND DISCUSSION

The complexity of a complete proteome makes it challenging to detect all expressed proteins. To achieve deep coverage of the human proteome, we selected testis tissue as our target sample because of its abundant gene expression. Technically, proteome coverage depends on the separation power of the proteomics platform. Our proteomics strategy started with high-resolution gel fractionation of testis total cell lysate (TCL) samples as the first-dimension separation to reduce proteome complexity, followed by LC-MS/MS analysis. To identify lower molecular weight (LMW) proteins, we further resolved the same testis TCL on tricine gels35, 36. Deep RNA-seq data were used to interpret the proteomics data, which made this a proteogenomics study. The detailed experimental designs are shown in Figure 1A & B and are discussed in more detail in the following section.

Protein Diversity Revealed by a Deep-Coverage Proteomics Study on Human Testis Tissue

We obtained testis tissue from three individuals for this study. After lysis, SDS-PAGE was used to resolve the TCL, resulting in similar patterns and sample loading with clear, sharp bands, indicating that TCL proteins were extracted and separated while preserving proteome integrity (Figure 2A, left panel). Each lane was excised into 28 gel bands based on their MW and the
protein abundance in specific regions. The proteins in these gel bands were in-gel digested with trypsin. LC-MS/MS analysis identified 9,064 proteins (Figure 2B). To further deepen the proteome coverage for these testis samples, the same TCLs were resolved on tricine gels. Each lane was excised into 22 gel bands as indicated (Figure 2A, right panel). These samples were in-gel digested with trypsin and analyzed by LC-MS/MS, resulting in a total of 8,419 identified proteins. A total of 1,178 and 533 proteins were uniquely identified in regular SDS-PAGE and tricine-SDS-PAGE, respectively, increasing the cumulative number of proteins identified to 9,597 (Figure 2B). The theoretical MW distribution of the identified proteins indicated that the proteins from tricine gels were enriched in the LMW region (Figure 2C). By contrast, the proteins uniquely identified from regular glycine-SDS-PAGE primarily appeared in a higher MW range, with a peak at 50 kDa. This result indicated that our protein separation strategy was valuable for a deeper proteomics study of human testis tissues.

Protein abundance may vary in individual samples, which can facilitate the identification of novel proteins. Therefore, we separately analyzed testis samples from three individuals instead of pooling the samples (Figure 2D). We identified 8,520, 8,362 and 8,434 proteins in the three individual testis samples, respectively. A total of 7,380 proteins were present in all three datasets, which accounts for approximately 87% of the identified proteins from each individual testis. A total of 403, 317 and 536 proteins were uniquely identified in the three samples, respectively. The chosen strategy thus substantially expanded the coverage of the testis proteome.

We then examined why substantial numbers of proteins were identified in only one of the three individuals. We analyzed the protein abundance distribution of the commonly identified 7,380
proteins and the uniquely identified proteins by extracting the intensities of the identified peptides. As shown in Figure 2E, the common and sample-specific proteins in the testis tissues had similar abundance distributions. However, the abundance distribution curves of the proteins specific to individual samples were lower than those of the common proteins. This result suggested that proteins detected in only one sample were likely to have relatively low abundance. Individual variation in the abundance of these proteins promoted their identification, which resulted in deeper coverage of the human proteome by complementing it with data from individual testis tissues.

Much deeper coverage of the mammalian proteome, approximately 8,000 to 10,000 gene products, has been achieved from a single human cell line owing to the development of high-resolution chromatography and highly sensitive mass spectrometry. However, data saturation could be limited by the detection sensitivity and dynamic range of available mass spectrometers. To determine the number of individual samples needed for saturation analysis of the testis proteome, we calculated accumulation curves for the identified peptides and proteins (Figure 2F & G). The number of non-redundant peptides or proteins increased at a similar rate with the addition of individuals A, B and C, and the total number of identified proteins rose to 9,597 when the two separation methods were combined. This result indicated that testis is one of the few tissues with almost 10,000 identified proteins7, 8.

Testis Proteome Is an Enriched Library for MPs

Testis tissue has been extensively analyzed in several recently published studies for the human proteome draft maps. Using IPeak with a 1% protein-level FDR, 9,597, 8,968 and 6,123 proteins were detected in the current dataset, the Pandey group dataset, and the Kuster group dataset,
respectively (Figure 3A). The total number of proteins identified from the testis samples is 10,827, which is currently the deepest coverage of a single human organ/tissue proteome. A total of 5,386 proteins were detected in all three datasets. Over 90% of the proteins in the Pandey and Kuster datasets were also detected in our dataset. The number of uniquely identified proteins was 1,319, 736 and 297 in our dataset, the Pandey group dataset, and the Kuster group dataset, respectively. The greater number of proteins identified in our dataset confirms that our strategy is suitable for the testis samples.

Antibody-based proteomics is another approach for the HPP. In the testis proteome dataset generated with antibodies, a total of 11,330 proteins were credibly detected (Figure 3B). Comparing the antibody-based protein dataset with our current MS-based dataset revealed that 6,652 proteins were identified in both datasets, whereas 2,945 and 4,678 proteins were detected exclusively by MS or antibodies, respectively. This result suggests that these two approaches are complementary.

To test the specificity of the testis proteome, we compared our current dataset with a protein list from the human draft map generated by Pandey’s group. A total of 584 proteins were uniquely identified in our study (Figure 3C). We also found that 66% of the MPs identified in our MS analysis were among these 584 unique proteins (Supplementary Figure 2). When we compared our current testis proteome dataset with PeptideAtlas (2014-08) and CCPD 2.0, which are publicly accessible proteomics databases, we still found that 233 proteins were uniquely identified in our testis samples (Figure 3D). This result confirms that our testis proteomics study identifies significant numbers of unique proteins.

Comparing the testis proteomics datasets with the missing protein list provided by the C-HPP
consortium, we identified 200, 66, and 39 MPs in our dataset and in the Pandey and Kuster testis datasets, respectively (Figure 3E, Supplementary Figure 3). Clearly, besides the uneven numbers of discovered MPs, the majority of the MPs found by each group differed from those found by the others. Because the labs adopted different sample preparation methods, enzymes for protein digestion, and peptide or protein separation strategies (summarized in Supplementary Table 3), this difference is understandable. We further counted the number of MPs derived from our three individual testis tissues (Figure 3F). A total of 145, 131, and 116 MPs were identified in samples A, B, and C, respectively (Supplementary Figure 3). About 70% of the MPs in each individual were shared, yet the ratio decreased to 50% when the MPs from all three individuals were added together. This result emphasizes that individual variations are important in proteomics studies.

Testis Transcriptome Is More Diverse than Proteome MPs

To estimate the number of expressed protein-coding genes in the testis samples, we applied a proteogenomics strategy to the same testis samples. Total proteins and mRNAs were extracted from the same samples, and whole transcriptome libraries were sequenced using RNA-seq (Figure 1A & B). On average, 23.1 million reads and 3.1 Gb of sequenced nucleotides were obtained for each mRNA sample; 99.5% of these were mapped to the reference genome, and 85.1% of the reads were mapped to the reference sequences. After genes with fewer than 10 reads were excluded, the average total number of reads was approximately 14.5 M in each testis tissue. This translates to sequencing of approximately 16,000 genes. Comparative analysis indicated that 15,491 genes were detected in all three samples, with an overlap of 90%, whereas 2% (100-300) of the genes were specifically detected in a single sample (Figure 4A). Thus, RNA-seq saturation was reached much more easily than proteome saturation with three individual samples
(Figure 4B). This result is consistent with the proteomics data (Figure 2E); the gene products uniquely sequenced in each individual sample also turned out to have low abundance, as estimated by RPKM (Figure 4C). Combining the three individual testis samples, 8,959 genes were commonly identified in the transcriptome and proteome (Figure 4D). Some genes were detected in only one of the omics datasets, with no clear explanation. For example, 7,424 genes (45%) were identified in the transcriptome but not the proteome, and 638 genes (7%) were identified only in the proteome (Figure 4D).

To investigate the correlation between the abundance of mRNA and protein, we analyzed genes commonly present in all six transcriptomics and proteomics datasets from the three individuals (Figure 4E). Generally, mRNA abundance was highly correlated among the three samples, with Pearson's correlation coefficients of 0.94-0.98. The correlation among the proteome datasets was slightly lower, with Pearson's correlation coefficients of 0.83-0.87. This difference might be caused by translational regulation or protein quality control, which are commonly present in cells. Compared with other proteomics studies, the higher correlation for the transcriptome may also indicate the high homogeneity of the testis samples and good reproducibility of our biological experiments. However, the correlation between transcriptome and proteome was lower in our dataset than in the datasets from other proteomics studies, which suggested complex post-transcriptional regulation or potential quality control processes in our samples.

We further checked the mRNA evidence for the identified MPs. For the MPs identified with both proteins and mRNAs in one tissue (80%), we found that 95% of mRNAs were detected in all three individuals (Figure 4F). The protein abundance of the uniquely identified MPs was globally lower than that of other proteins with MS evidence. Therefore, we were curious about their mRNA
abundance (Figure 4G). The mRNA abundance distribution curves for these two groups of genes essentially superposed. These results further demonstrated that the testis has highly diverse gene expression, which might different from that in other tissues in the human body. To confirm the identified missing proteins, the reliability of MPs identified with MS technology was further checked, and the mass spectra are presented in Supplementary Figure 4. A total of 166 groups (182 proteins) MPs were verified by stringently filtering the quality of their MS spectra with higher base peak intensity and better and continuous b/y ion matches. Most of the reliable MPs had higher mRNA levels (Figure 4H). Conversely, the majority of filtered MPs did not have mRNA evidence (Figure 4I). This confirmed list of MPs will be followed for further analysis in this study. To further understand the mechanism of MP gene expression regulation in different organs or tissues, we compared the abundance distribution of mRNAs in 27 tissues. The mRNA abundance distribution for total genes (gray) identified in various tissues and MPs (orange) were portrayed as box plots. The height of the box represented the dynamic range of mRNA abundance. As shown in Figure 5, mRNA abundance of total genes was generally greater than that of genes encoding MPs, except in testis. This result strongly indicated that MP identification resulted from their higher abundance. Although the mRNA abundances for the remaining genes encoding MPs is even lower than those for the already identified MPs, it is still higher in testis than that in most other tissues, suggesting that testis is an ideal target tissues for future MP studies. We also observed that skin, thyroid, ovary and brain also may serve as suitable sample resources for MPs studies (Supplementary Figure 5).
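The overlap percentages reported for Figure 4D and the pairwise Pearson correlations of Figure 4E can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: it assumes that the 45% and 7% figures use the transcriptome and proteome totals, respectively, as denominators, and that abundances are log2-transformed before correlation. The function names and the toy abundance vectors are hypothetical and are not taken from the study's pipeline.

```python
import math


def overlap_summary(common, transcript_only, protein_only):
    """Totals and 'only in one dataset' percentages for a two-set Venn diagram.

    Assumption: each percentage is relative to the total of its own dataset.
    """
    transcriptome_total = common + transcript_only
    proteome_total = common + protein_only
    return {
        "transcriptome_total": transcriptome_total,
        "proteome_total": proteome_total,
        "pct_transcript_only": 100.0 * transcript_only / transcriptome_total,
        "pct_protein_only": 100.0 * protein_only / proteome_total,
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Gene counts as reported in the text for Figure 4D.
summary = overlap_summary(common=8959, transcript_only=7424, protein_only=638)

# Toy abundance values for two samples (hypothetical, for illustration only).
sample_a = [1.0, 2.0, 4.0, 8.0]
sample_b = [1.1, 2.1, 3.9, 8.2]
r = pearson([math.log2(v) for v in sample_a],
            [math.log2(v) for v in sample_b])
```

Under the stated denominator assumption, the proteome total evaluates to 8,959 + 638 = 9,597, which matches the number of proteins reported for the testis proteome in this study, and the two percentages round to 45% and 7%.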


MPs Have Close Association with Various Diseases

The testis is a preferred resource for cancer biomarker screening because of the high correlation between gene expression in testis and in tumor tissues. To investigate the relationship between testis and tumor gene expression, we performed IPA analysis of the 166 protein groups containing 182 proteins uniquely identified in the current study. The results showed that 108 proteins (72%) had some correlation with the occurrence or development of tumor diseases (Figure 6A). Therefore, testis appears to be an ideal biological sample for screening biomarkers for early disease diagnosis and for identifying new drug and vaccine targets. GO analysis suggested that the biological functions of the identified MPs were primarily involved in sexual reproduction, gamete generation, spermatogenesis, male gamete generation, and modification-dependent macromolecule catabolic process (Figure 6B). The expression abundance of MPs enriched in testis was consistent with that of MPs associated with serious cancers, and was lower than the total MP expression abundance distribution (Figure 6C). For example, the membrane proteins ADORA3 and HLA-C (confirmed) and the nuclear protein ZNF23 have been suggested as biomarkers for early detection of multiple types of severe diseases, including cancer, infectious diseases, and cardiovascular diseases (Figure 6D and E). ADORA3 [37] has been reported to be a member of the Gi protein-coupled receptors, which inhibit adenylate cyclase activity upon binding their ligands. ADORA3 dysregulation may result in coronary heart disease [38, 39] and human colon cancer. Previous reports showed that ZNF23 inhibited cell cycle progression [40], and ZNF23 dysregulation was strongly associated with human ovarian cancer cells [41] and hepatocellular carcinoma [42]. HLA class I regulates natural killer (NK) cells [43] to determine the fate of hepatitis C virus (HCV) infection and modulates the pattern of rheumatoid arthritis expression [44]. The abundance distribution suggested that some of these newly identified disease-related MPs were relatively enriched in testis, although their abundance was still globally lower than that of other identified proteins in testis samples (Figure 6F).

CONCLUSIONS

We performed a systematic proteomics and transcriptomics analysis of three individual testis samples using gel separation, LC-MS/MS, and deep RNA-seq, and confirmed that the testis has diverse gene expression and a large number of expressed genes. To achieve deeper proteome coverage in these samples, we combined regular SDS-PAGE and tricine-SDS-PAGE to resolve proteins before LC-MS analysis. We identified a total of 9,597 proteins in the testis proteome by using a multi-search strategy based on the MS/MS data, which makes this currently one of the largest tissue proteome datasets. A total of 166 protein groups were uniquely identified as potential MPs based on comparison with the C-HPP MP list (neXtProt 2014-09-19). The same testis samples were subjected to deep RNA-seq and a stringent manual check of the MS spectra. The results confirmed that 182 proteins belonging to 166 protein groups are newly identified MPs. This is currently one of the largest datasets identifying new MPs in C-HPP. Transcriptome analysis indicated that MPs might be enriched in testis tissue, which can help to guide future MP investigation. This study shows that MPs can be efficiently identified using individual testis tissues, tandem protein separation methods, and high-resolution mass spectrometry. Individual testis samples have high protein diversity and abundance, and are a rich resource for MPs due to individual variations. The strategies and analytical methods used in this study can help to guide future proteomics studies of MPs.
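The comparison against the C-HPP MP list described above is, at its core, a set intersection followed by a coverage calculation. The following Python sketch is a minimal illustration under stated assumptions: the accession strings are hypothetical placeholders, the function names are invented for this example, and the actual study additionally applied spectrum quality checks and transcript evidence that are not modeled here. The counts 166 and 2,948 are the values given in this paper's graphical abstract.

```python
def candidate_missing_proteins(identified, missing_list):
    """Return identified accessions that also appear on the missing-protein list."""
    missing = set(missing_list)
    return sorted(acc for acc in set(identified) if acc in missing)


def coverage_percent(n_found, n_listed):
    """Percentage of listed missing proteins that were found."""
    if n_listed <= 0:
        raise ValueError("n_listed must be positive")
    return 100.0 * n_found / n_listed


# Hypothetical placeholder accessions, for illustration only.
identified = ["ACC_A", "ACC_B", "ACC_C"]
missing_list = ["ACC_B", "ACC_C", "ACC_D"]
candidates = candidate_missing_proteins(identified, missing_list)

# Counts taken from the graphical abstract: 166 MP groups out of 2,948 listed MPs.
coverage = coverage_percent(166, 2948)
```

With the counts from the graphical abstract, the coverage rounds to 5.6%, in agreement with the value stated there.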




ASSOCIATED CONTENT

Supporting Information

Testis proteins were identified using two methods (MaxQuant and IPeak). Approximately 98% of the proteins identified by MaxQuant were included in the IPeak result. The uniquely identified MPs in this study were compared based on our MS analysis. Venn diagram of MPs identified in three individual testis samples. High-resolution mass spectra of 166 MP groups (182 proteins) based on deep RNA-seq and stringent manual checks of MS spectra. Skin, thyroid, ovary, and brain tissues could be suitable resources for identification of unidentified MPs. Summary of sample preparation and raw data information for the three testis tissue proteome studies. Detailed information about the identified proteins, peptides, and MPs based on proteomics and transcriptomics is available free of charge via http://pubs.acs.org/.



ACKNOWLEDGMENTS

This study was supported by the Chinese National Basic Research Programs (2011CB910600, 2013CB911201, and 2014CBA02002-A), the National Natural Science Foundation of China (Grant Nos. 31400697, 31170780, 31470809, and 31400698), the International Collaboration Program (2014DFB30020), the National High-Tech Research and Development Program of China (SS2012AA020202, 2012AA020502, 2011AA02A114, 2014AA020900, and 2014AA020607), the National Natural Science Foundation of Beijing (Grant No. 5152008), the National Mega projects for Key Infectious Diseases (2013zx10003002), the Key Projects in the National Science & Technology Pillar Program (2012BAF14B00), and the Fundamental & Advanced Research Project of Chongqing, China (No. cstc2013jcyjC00001).




ABBREVIATIONS

HPP, Human Proteome Project; C-HPP, Chromosome-Centric Human Proteome Project; MPs, missing proteins; CT, cancer-testis; CCPD, Chinese Chromosome Proteome Database; MW, molecular weight; RPKM, reads per kilobase per million mapped reads; TCL, total cell lysate; LMW, low molecular weight proteins.



REFERENCES

1. Omenn, G. S.; Lane, L.; Lundberg, E. K.; Beavis, R. C.; Nesvizhskii, A. I.; Deutsch, E. W., Metrics for the Human Proteome Project 2015: Progress on the Human Proteome and Guidelines for High-Confidence Protein Identification. J Proteome Res 2015.
2. Shiromizu, T.; Adachi, J.; Watanabe, S.; Murakami, T.; Kuga, T.; Muraoka, S.; Tomonaga, T., Identification of missing proteins in the neXtProt database and unregistered phosphopeptides in the PhosphoSitePlus database as part of the Chromosome-centric Human Proteome Project. J Proteome Res 2013, 12, (6), 2414-21.
3. Zhang, C.; Li, N.; Zhai, L.; Xu, S.; Liu, X.; Cui, Y.; Ma, J.; Han, M.; Jiang, J.; Yang, C.; Fan, F.; Li, L.; Qin, P.; Yu, Q.; Chang, C.; Su, N.; Zheng, J.; Zhang, T.; Wen, B.; Zhou, R.; Lin, L.; Lin, Z.; Zhou, B.; Zhang, Y.; Yan, G.; Liu, Y.; Yang, P.; Guo, K.; Gu, W.; Chen, Y.; Zhang, G.; He, Q. Y.; Wu, S.; Wang, T.; Shen, H.; Wang, Q.; Zhu, Y.; He, F.; Xu, P., Systematic analysis of missing proteins provides clues to help define all of the protein-coding genes on human chromosome 1. J Proteome Res 2014, 13, (1), 114-25.
4. Kwon, K. H.; Kim, J. Y.; Kim, S. Y.; Min, H. K.; Lee, H. J.; Ji, I. J.; Kang, T.; Park, G. W.; An, H. J.; Lee, B.; Ravid, R.; Ferrer, I.; Chung, C. K.; Paik, Y. K.; Hancock, W. S.; Park, Y. M.; Yoo, J. S., Chromosome 11-centric human proteome analysis of human brain hippocampus tissue. J Proteome Res 2013, 12, (1), 97-105.
5. Martins-de-Souza, D.; Carvalho, P. C.; Schmitt, A.; Junqueira, M.; Nogueira, F. C.; Turck, C. W.; Domont, G. B., Deciphering the human brain proteome: characterization of the anterior temporal lobe and corpus callosum as part of the Chromosome 15-centric Human Proteome Project. J Proteome Res 2014, 13, (1), 147-57.
6. Lee, H. J.; Jeong, S. K.; Na, K.; Lee, M. J.; Lee, S. H.; Lim, J. S.; Cha, H. J.; Cho, J. Y.; Kwon, J. Y.; Kim, H.; Song, S. Y.; Yoo, J. S.; Park, Y. M.; Kim, H.; Hancock, W. S.; Paik, Y. K., Comprehensive genome-wide proteomic analysis of human placental tissue for the Chromosome-Centric Human Proteome Project. J Proteome Res 2013, 12, (6), 2458-66.
7. Kim, M. S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.; Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; Thomas, J. K.; Muthusamy, B.; Leal-Rojas, P.; Kumar, P.; Sahasrabuddhe, N. A.; Balakrishnan, L.; Advani, J.; George, B.; Renuse, S.; Selvan, L. D.; Patil, A. H.; Nanjappa, V.; Radhakrishnan, A.; Prasad, S.; Subbannayya, T.; Raju, R.; Kumar, M.; Sreenivasamurthy, S. K.; Marimuthu, A.; Sathe, G. J.; Chavan, S.; Datta, K. K.; Subbannayya, Y.; Sahu, A.; Yelamanchi, S. D.; Jayaram, S.; Rajagopalan, P.; Sharma, J.; Murthy, K. R.; Syed, N.; Goel, R.; Khan, A. A.; Ahmad, S.; Dey, G.; Mudgal, K.; Chatterjee, A.; Huang, T. C.; Zhong, J.; Wu, X.; Shaw, P. G.; Freed, D.; Zahari, M. S.; Mukherjee, K. K.; Shankar, S.; Mahadevan, A.; Lam, H.; Mitchell, C. J.; Shankar, S. K.; Satishchandra, P.; Schroeder, J. T.; Sirdeshmukh, R.; Maitra, A.; Leach, S. D.; Drake, C. G.; Halushka, M. K.; Prasad, T. S.; Hruban, R. H.; Kerr, C. L.; Bader, G. D.; Iacobuzio-Donahue, C. A.; Gowda, H.; Pandey, A., A draft map of the human proteome. Nature 2014, 509, (7502), 575-81.
8. Wilhelm, M.; Schlegl, J.; Hahne, H.; Moghaddas Gholami, A.; Lieberenz, M.; Savitski, M. M.; Ziegler, E.; Butzmann, L.; Gessulat, S.; Marx, H.; Mathieson, T.; Lemeer, S.; Schnatbaum, K.; Reimer, U.; Wenschuh, H.; Mollenhauer, M.; Slotta-Huspenina, J.; Boese, J. H.; Bantscheff, M.; Gerstmair, A.; Faerber, F.; Kuster, B., Mass-spectrometry-based draft of the human proteome. Nature 2014, 509, (7502), 582-7.
9. Uhlen, M.; Fagerberg, L.; Hallstrom, B. M.; Lindskog, C.; Oksvold, P.; Mardinoglu, A.; Sivertsson, A.; Kampf, C.; Sjostedt, E.; Asplund, A.; Olsson, I.; Edlund, K.; Lundberg, E.; Navani, S.; Szigyarto, C. A.; Odeberg, J.; Djureinovic, D.; Takanen, J. O.; Hober, S.; Alm, T.; Edqvist, P. H.; Berling, H.; Tegel, H.; Mulder, J.; Rockberg, J.; Nilsson, P.; Schwenk, J. M.; Hamsten, M.; von Feilitzen, K.; Forsberg, M.; Persson, L.; Johansson, F.; Zwahlen, M.; von Heijne, G.; Nielsen, J.; Ponten, F., Proteomics. Tissue-based map of the human proteome. Science 2015, 347, (6220), 1260419.
10. Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S., Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. J Proteome Res 2014, 13, (1), 15-20.
11. Fagerberg, L.; Hallström, B. M.; Oksvold, P.; Kampf, C.; Djureinovic, D.; Odeberg, J.; Habuka, M.; Tahmasebpoor, S.; Danielsson, A.; Edlund, K., Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol Cell Proteomics 2014, 13, (2), 397-406.
12. Ramskold, D.; Wang, E. T.; Burge, C. B.; Sandberg, R., An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol 2009, 5, (12), e1000598.
13. Soumillon, M.; Necsulea, A.; Weier, M.; Brawand, D.; Zhang, X.; Gu, H.; Barthes, P.; Kokkinaki, M.; Nef, S.; Gnirke, A.; Dym, M.; de Massy, B.; Mikkelsen, T. S.; Kaessmann, H., Cellular source and mechanisms of high transcriptome complexity in the mammalian testis. Cell Rep 2013, 3, (6), 2179-90.
14. Djureinovic, D.; Fagerberg, L.; Hallstrom, B.; Danielsson, A.; Lindskog, C.; Uhlen, M.; Ponten, F., The human testis-specific proteome defined by transcriptomics and antibody-based profiling. Mol Hum Reprod 2014, 20, (6), 476-88.
15. Simpson, A. J.; Caballero, O. L.; Jungbluth, A.; Chen, Y. T.; Old, L. J., Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer 2005, 5, (8), 615-25.
16. Wang, J.; Xia, Y.; Wang, G.; Zhou, T.; Guo, Y.; Zhang, C.; An, X.; Sun, Y.; Guo, X.; Zhou, Z.; Sha, J., In-depth proteomic analysis of whole testis tissue from the adult rhesus macaque. Proteomics 2014, 14, (11), 1393-402.
17. Almeida, L. G.; Sakabe, N. J.; deOliveira, A. R.; Silva, M. C.; Mundstein, A. S.; Cohen, T.; Chen, Y. T.; Chua, R.; Gurung, S.; Gnjatic, S.; Jungbluth, A. A.; Caballero, O. L.; Bairoch, A.; Kiesler, E.; White, S. L.; Simpson, A. J.; Old, L. J.; Camargo, A. A.; Vasconcelos, A. T., CTdatabase: a knowledge-base of high-throughput and curated data on cancer-testis antigens. Nucleic Acids Res 2009, 37, (Database issue), D816-9.
18. Xu, P.; Duong, D. M.; Peng, J., Systematical optimization of reverse-phase chromatography for shotgun proteomics. J Proteome Res 2009, 8, (8), 3944-3950.
19. Mann, K.; Mann, M., In-depth analysis of the chicken egg white proteome using an LTQ Orbitrap Velos. Proteome Sci 2011, 9, (1), 7.
20. Lander, E. S.; Linton, L. M.; Birren, B.; Nusbaum, C.; Zody, M. C.; Baldwin, J.; Devon, K.; Dewar, K.; Doyle, M.; FitzHugh, W., Initial sequencing and analysis of the human genome. Nature 2001, 409, (6822), 860-921.
21. UniProt, C., UniProt: a hub for protein information. Nucleic Acids Res 2015, 43, (Database issue), D204-12.
22. Cunningham, F.; Amode, M. R.; Barrell, D.; Beal, K.; Billis, K.; Brent, S.; Carvalho-Silva, D.; Clapham, P.; Coates, G.; Fitzgerald, S.; Gil, L.; Giron, C. G.; Gordon, L.; Hourlier, T.; Hunt, S. E.; Janacek, S. H.; Johnson, N.; Juettemann, T.; Kahari, A. K.; Keenan, S.; Martin, F. J.; Maurel, T.; McLaren, W.; Murphy, D. N.; Nag, R.; Overduin, B.; Parker, A.; Patricio, M.; Perry, E.; Pignatelli, M.; Riat, H. S.; Sheppard, D.; Taylor, K.; Thormann, A.; Vullo, A.; Wilder, S. P.; Zadissa, A.; Aken, B. L.; Birney, E.; Harrow, J.; Kinsella, R.; Muffato, M.; Ruffier, M.; Searle, S. M.; Spudich, G.; Trevanion, S. J.; Yates, A.; Zerbino, D. R.; Flicek, P., Ensembl 2015. Nucleic Acids Res 2015, 43, (Database issue), D662-9.
23. Wen, B.; Du, C.; Li, G.; Ghali, F.; Jones, A. R.; Käll, L.; Xu, S.; Zhou, R.; Ren, Z.; Feng, Q., IPeak: An open source tool to combine results from multiple MS/MS search engines. Proteomics 2015.
24. Wen, B.; Li, G.; Wright, J. C.; Du, C.; Feng, Q.; Xu, X.; Choudhary, J. S.; Wang, J., The OMSSAPercolator: an automated tool to validate OMSSA results. Proteomics 2014, 14, (9), 1011-4.
25. Savitski, M. M.; Wilhelm, M.; Hahne, H.; Kuster, B.; Bantscheff, M., A scalable approach for protein false discovery rate estimation in large proteomic data sets. Mol Cell Proteomics 2015.
26. Deutsch, E. W.; Sun, Z.; Campbell, D.; Kusebauch, U.; Chu, C. S.; Mendoza, L.; Shteynberg, D.; Omenn, G. S.; Moritz, R. L., State of the Human Proteome in 2014/2015 As Viewed through PeptideAtlas: Enhancing Accuracy and Coverage through the AtlasProphet. J Proteome Res 2015.
27. Zhang, C.; Li, N.; Zhai, L.; Xu, S.; Liu, X.; Cui, Y.; Ma, J.; Han, M.; Jiang, J.; Yang, C., Systematic analysis of missing proteins provides clues to help define all of the protein-coding genes on human chromosome 1. J Proteome Res 2013, 13, (1), 114-125.
28. Köhrer, K.; Domdey, H., [27] Preparation of high molecular weight RNA. Methods Enzymol 1991, 194, 398-405.
29. Mortazavi, A.; Williams, B. A.; McCue, K.; Schaeffer, L.; Wold, B., Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5, (7), 621-8.
30. Fu, Y.; Yang, Q.; Sun, R.; Li, D.; Zeng, R.; Ling, C. X.; Gao, W., Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry. Bioinformatics 2004, 20, (12), 1948-1954.
31. Li, D.; Fu, Y.; Sun, R.; Ling, C. X.; Wei, Y.; Zhou, H.; Zeng, R.; Yang, Q.; He, S.; Gao, W., pFind: a novel database-searching software system for automated peptide and protein identification via tandem mass spectrometry. Bioinformatics 2005, 21, (13), 3049-3050.
32. Wang, L. H.; Li, D. Q.; Fu, Y.; Wang, H. P.; Zhang, J. F.; Yuan, Z. F.; Sun, R. X.; Zeng, R.; He, S. M.; Gao, W., pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun Mass Spectrom 2007, 21, (18), 2985-2991.
33. Huang, D. W.; Sherman, B. T.; Lempicki, R. A., Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2008, 4, (1), 44-57.
34. Krämer, A.; Green, J.; Pollard, J.; Tugendreich, S., Causal Analysis Approaches in Ingenuity Pathway Analysis (IPA). Bioinformatics 2013, btt703.
35. Schagger, H., Tricine-SDS-PAGE. Nat Protoc 2006, 1, (1), 16-22.
36. Chen, L.; Zhai, L.; Li, Y.; Li, N.; Zhang, C.; Ping, L.; Chang, L.; Wu, J.; Li, X.; Shi, D.; Xu, P., Development of gel-filter method for high enrichment of low-molecular weight proteins from serum. PLoS One 2015, 10, (2), e0115862.
37. Stiles, G., Adenosine receptors. J Biol Chem 1992, 267, (10), 6451-6454.
38. Jenner, T. L.; Rose'meyer, R. B., Adenosine A(3) receptor mediated coronary vasodilation in the rat heart: changes that occur with maturation. Mech Ageing Dev 2006, 127, (3), 264-73.
39. Peculis, R.; Latkovskis, G.; Tarasova, L.; Pirags, V.; Erglis, A.; Klovins, J., A nonsynonymous variant I248L of the adenosine A3 receptor is associated with coronary heart disease in a Latvian population. DNA Cell Biol 2011, 30, (11), 907-11.
40. Huang, C.; Jia, Y.; Yang, S.; Chen, B.; Sun, H.; Shen, F.; Wang, Y., Characterization of ZNF23, a KRAB-containing protein that is downregulated in human cancers and inhibits cell cycle progression. Exp Cell Res 2007, 313, (2), 254-63.
41. Huang, C.; Yang, S.; Ge, R.; Sun, H.; Shen, F.; Wang, Y., ZNF23 induces apoptosis in human ovarian cancer cells. Cancer Lett 2008, 266, (2), 135-43.
42. Shi, Y.; Zheng, L.; Luo, G.; Wei, J.; Zhang, J.; Yu, Y.; Feng, Y.; Li, M.; Xu, N., Expression of zinc finger 23 gene in human hepatocellular carcinoma. Anticancer Res 2011, 31, (10), 3595-3599.
43. Khakoo, S. I.; Thio, C. L.; Martin, M. P.; Brooks, C. R.; Gao, X.; Astemborski, J.; Cheng, J.; Goedert, J. J.; Vlahov, D.; Hilgartner, M., HLA and NK cell inhibitory receptor genes in resolving hepatitis C virus infection. Science 2004, 305, (5685), 872-874.
44. Yen, J.-H.; Moore, B. E.; Nakajima, T.; Scholl, D.; Schaid, D. J.; Weyand, C. M.; Goronzy, J. J., Major histocompatibility complex class I–recognizing receptors are disease risk genes in rheumatoid arthritis. J Exp Med 2001, 193, (10), 1159-1168.



FIGURE LEGENDS

Figure 1. Flowchart illustrating the strategies for identifying MPs in three individual testis tissues by proteome and transcriptome profiling, and MP analysis based on the integration of information and mRNA abundance.

Figure 2. Proteome sample preparation and identification. (A) Separation of total proteins from three individual testis tissues using 10% glycine-SDS-PAGE and 12% tricine-SDS-PAGE. (B) Venn diagram of protein identification by glycine-SDS-PAGE and tricine-SDS-PAGE. (C) Distribution of proteins identified by glycine-SDS-PAGE and tricine-SDS-PAGE. (D) Venn diagram of the proteins identified in three individuals (individual A, B, and C). (E) Comparison of protein abundance variability in individual testis samples. (F-G) Proteome profiling saturation using two gel-separation methods and MS was evaluated by non-redundant peptide and protein identification.

Figure 3. Summary of testis proteome data and the contribution of individual testis to MP identification. (A) Venn diagram of testis proteome results for three MS identification datasets. (B) Venn diagram comparing testis proteomes from MS-based identification and an antibody-based profiling experiment. (C) Venn diagram comparison of two MS-based testis proteome datasets. Xulab_testis, current study; Pandey all tissues dataset, previously published study. (D) Venn diagram comparison of proteomes from Xulab testis, public CCPD 2.0, and PeptideAtlas. (E) Identification of MPs in pooled samples and individual samples using MS-based methods. Comparison of the contributions of the three individuals to MP identification.

Figure 4. Transcriptome data profiling and proteogenomics analysis of MPs. (A) Venn diagram of testis transcripts from three individual samples. (B) Transcriptome profiling saturation based on RNA-seq strategy. (C) Comparison of mRNA analysis in three individual testis samples. (D) Comparison of testis gene expression products at protein and mRNA levels. (E) Correlation of gene expression at the mRNA and protein abundance levels in three testis samples. (F) Identification of MP mRNAs in three individual testis samples. (G) Comparison of mRNA abundance of MPs and non-MPs in testis samples. (H) Enrichment comparison between confirmed MPs and filtered MPs from quality-checked spectra. (I) Comparison of confirmed MPs with filtered MPs at the mRNA expression level.

Figure 5. Testis tissue specificity of MPs identified based on mRNA abundance distribution in different organs and tissues. The mRNA abundance distribution of total genes (gray) identified in various tissues and the mRNA abundance distribution of MPs (orange) are shown as box plots. Box height represents the dynamic range of mRNA abundance.

Figure 6. Potential functions and disease association analysis of MPs. (A) Proportion of disease-related MPs out of the total identified MPs. (B) GO analysis of MP biological function. (C) Abundance comparison of disease-related and testis-enriched MPs with all proteins. (D) Potential disease relation of MPs by IPA analysis. (E) Cell localization and application of three biomarkers. (F) Testis-specific MP enrichment in our dataset.

Graphical Abstract

To enhance the proteome coverage of missing proteins (MPs), we combined glycine-SDS-PAGE and tricine-SDS-PAGE to separate proteins and then subjected them to LC-MS/MS analysis. We performed systematic proteomics and transcriptomics analysis on three individual testis samples to identify MPs. A total of 9,597 protein groups were identified; of these, 166 protein groups were listed as MPs, including 138 groups with transcriptional evidence. A total of 2,948 proteins are currently listed as MPs, and 5.6% of these were identified in this study. These results indicate that testis is a rich resource for investigation and identification of MPs.

Supplementary Table 1. Detailed information about the identified proteins/genes for three samples based on proteomics and transcriptomics.

Supplementary Table 2. Detailed information about the identified proteins, peptides, and confirmed MPs.

Supplementary Table 3. Summary of sample preparation and raw data information for the three testis tissue proteome studies.

Supplementary Figure 1. Protein identification in testis samples by two search engines, MaxQuant and IPeak, in parallel.

Supplementary Figure 2. Comparison of the uniquely identified proteins in this study and MPs based on our MS analysis.

Supplementary Figure 3. Venn diagram of potential MPs from the three individuals at the protein level.

Supplementary Figure 4. High-quality mass spectra of the 166 MP groups (182 proteins) based on deep RNA-seq and stringent manual check.

Supplementary Figure 5. Skin, thyroid, ovary, and brain might be potential resources for effectively seeking other MPs.
