*(F.H.) Tel and Fax: 8610-68171208. (P.X.) Tel and Fax: 8610-80705155.
Article pubs.acs.org/jpr

First Proteomic Exploration of Protein-Encoding Genes on Chromosome 1 in Human Liver, Stomach, and Colon

Songfeng Wu,†,‡,○ Ning Li,†,‡,○ Jie Ma,†,‡,○ Huali Shen,§,○ Dahai Jiang,∥,○ Cheng Chang,†,‡ Chengpu Zhang,†,‡ Liwei Li,†,‡ Hongxing Zhang,†,‡ Jing Jiang,†,‡ Zhongwei Xu,†,‡ Lingyan Ping,†,‡ Tao Chen,†,‡ Wei Zhang,†,‡ Tao Zhang,†,‡ Xiaohua Xing,†,‡ Tailong Yi,†,‡ Yanchang Li,†,‡ Fengxu Fan,†,‡ Xiaoqian Li,†,‡ Fan Zhong,§ Quanhui Wang,∥,⊥ Yang Zhang,§ Bo Wen,∥ Guoquan Yan,§ Liang Lin,∥ Jun Yao,§ Zhilong Lin,∥ Feifei Wu,§ Liqi Xie,§ Hongxiu Yu,§ Mingqi Liu,§ Haojie Lu,§ Hong Mu,¶ Dong Li,†,‡ Weimin Zhu,# Bei Zhen,†,‡ Xiaohong Qian,†,‡ Jun Qin,†,‡ Siqi Liu,*,∥,⊥ Pengyuan Yang,*,§ Yunping Zhu,*,†,‡ Ping Xu,*,†,‡ and Fuchu He*,†,‡,§

† State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, China
‡ National Engineering Research Center for Protein Drugs, Beijing 102206, China
§ Institutes of Biomedical Sciences and Department of Chemistry, 130 DongAn Road, Fudan University, Shanghai 200032, China
∥ BGI-Shenzhen, Shenzhen 518083, China
⊥ Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China
¶ State Key Laboratory of Molecular Oncology, Cancer Institute & Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100021, China
# Taicang Institute for Life Sciences Information, Taicang 215400, China

Supporting Information

ABSTRACT: The launch of the Chromosome-Centric Human Proteome Project provides an opportunity to gain insight into the human proteome. The Chinese Human Chromosome Proteome Consortium has initiated proteomic exploration of protein-encoding genes on human chromosomes 1, 8, and 20. Collaboration within the consortium has generated a comprehensive proteome data set using normal and carcinomatous tissues from human liver, stomach, and colon and 13 cell lines originating in these organs. We identified 12,101 proteins (59.8% coverage against Swiss-Prot human entries) with a protein false discovery rate of less than 1%. On chromosome 1, 1,252 proteins mapping to 1,227 genes, representing 60.9% of Swiss-Prot entries, were identified; however, 805 proteins remain unidentified, suggesting that analysis of more diverse samples using more advanced proteomic technologies is required. Genes encoding the unidentified proteins were concentrated in seven blocks, located at p36, q12-21, and q42-44, partly consistent with correlation of these blocks with cancers of the liver, stomach, and colon. Combined transcriptome, proteome, and cofunctionality analyses confirmed 23 coexpression clusters containing 165 genes. Biological information, including chromosome structure, GC content, and protein coexpression pattern, was analyzed using multilayered, circular visualization and tabular visualization. Details of data analysis and updates are available in the Chinese Chromosome-Centric Human Proteome Database (http://proteomeview.hupo.org.cn/chromosome/).

KEYWORDS: proteome, chromosome, C-HPP, coexpression, data visualization



Special Issue: Chromosome-centric Human Proteome Project
Received: September 3, 2012
Published: December 20, 2012
© 2012 American Chemical Society
dx.doi.org/10.1021/pr3008286 | J. Proteome Res. 2013, 12, 67−80

INTRODUCTION

The Human Genome Project provides fundamental sequence reference resources for large-scale omics research.1,2 The Ensembl human genome database (http://www.ensembl.org/Homo_sapiens/, version 68.37) contains 21,224 known genes, 568 novel genes, 266 putative genes, 14,427 pseudogenes, and 15,952 RNA genes. The field of proteomics benefits not only from improvements in chromatography and mass spectrometry but also from the availability of gene sequences and annotations derived from genomic research.3 For example, mass spectra can be matched to peptides more efficiently4,5 using protein sequence databases.6,7 The Human Plasma Proteome Project,8 Human Liver Proteome Project (HLPP),9 Human Brain Proteome Project,10 and others have used genomic data and generated the corresponding proteome data sets using different samples/tissues. With the advances in rapid, high-resolution protein sequencing technology, protein-coding genes can be identified and quantified more efficiently and precisely. It is time that proteomics “repays” genomics and contributes to the annotation of protein-coding genes.11

Over the past few years, the Chromosome-Centric Human Proteome Project (C-HPP), which aims to define the full set of protein-coding genes on each chromosome, has been launched.12,13 Twenty-four research groups worldwide have joined the project and contributed to the annotation of different chromosomes, based on research interest in a particular disease or gene cluster.13

Digestive cancers, such as hepatocellular carcinoma, gastric cancer, and colorectal cancer, are among the most common and severe diseases in China. To better understand the physiology and pathology of liver diseases, Chinese proteomics groups have focused their efforts, in the past decade, on the HLPP, including generation of proteome expression profiles of the human fetal liver14 and Chinese adult liver.9 The Chinese Human Proteome Organization (CNHUPO) engaged in extensive collaboration with the C-HPP and initiated relevant research in China while continuing efforts to advance proteomics projects for chromosomes 1, 8, and 20. The Chinese Human Chromosome Proteome Consortium (CCPC) is coordinated by CNHUPO and includes researchers from Beijing Proteome Research Center, Fudan University, Beijing Genome Institute, and Jinan University. CCPC has established several research groups that are leading proteomics research in China.
Like its foreign counterparts, CCPC has come to realize that there is currently no unique approach to a chromosome proteomics project. Under the guidance of C-HPP, we must become creative with experimental design and data analysis. CCPC has clearly outlined the primary goals of this project. First, C-HPP is not an ordinary project targeting proteomic analysis of a biological sample but is fueled by the conviction that all of the work should be focused on chromosomal features. Second, a major concern of C-HPP is the relationship between the protein-encoding genes on chromosomes and their expression status. A comprehensive analysis of this relationship should take into consideration each aspect of gene expression (e.g., transcription and translation) using qualitative and quantitative information. Third, during C-HPP implementation, we should make full use of the accumulated knowledge in chromosome research, especially knowledge of disease-related chromosomes. In this way, C-HPP’s findings will not be confined to theory but instead will extend to biological or medical applications. For this reason, CCPC used three gastrointestinal tumors and representative cell lines as the starting materials of chromosome proteomic analysis.

Here, we report the results of the first collaboration in CCPC’s effort to construct proteomic expression profiles of samples from human liver, colon, and stomach. Eighteen sets of samples were used, including normal and cancerous tissues and cell lines (eight liver, two colon, and three gastric cell lines). In total, 12,101 high-confidence proteins were identified with a protein false discovery rate (FDR) of less than 1%. Using this high-quality proteomic data set (the Chinese Chromosome Proteome Data set [CCPD]), Beijing Proteome Research Center began to make contributions to proteomic annotation of chromosome 1. The same data set was also used to annotate chromosomes 8 and 20, and the current state of these chromosomes will be described in this volume.

Chromosome 1 is the largest of the human chromosomes, containing seven bands and 249 million base pairs (bp), and is separated into short and long arms.15 Many disease-related genes are found on human chromosome 1. To date, approximately 460 genetic phenotypes such as Parkinson’s disease, Alzheimer’s disease, autosomal recessive deafness, recurrent pregnancy loss, and Duffy blood group system have been traced to chromosome 1 (http://www.ncbi.nlm.nih.gov/omim). Chromosome 1 also plays an important role in cancers of the liver, stomach, and colon. For example, because of genomic alterations, at least 785 genes on chromosome 1 are part of liver cancer-related signatures, according to published microarray and proteomic studies.16 It has been reported that most are located at 1q21.1-q32.1 and 1q32.1-q44.17−21 Several other regions, including 1p36.33-36.31 and 1q31.1-32.1, might harbor putative tumor suppressor genes related to gastric or colon cancers.22−30

In this work, we focused on the generation and systematic analysis of the CCPD. The protein-encoding genes on chromosome 1 were fully annotated using multiple public databases; some properties were calculated on the basis of protein sequences. Unidentified proteins were found to be those with unusual physicochemical properties or low abundance. Combined analysis of the transcriptome and proteome allowed us to deduce the unidentified blocks (a “block” refers to a particular region on chromosome 1 in which proteins of adjacent genes are all qualitatively identified or unidentified) and coexpression clusters (a “cluster” refers to a particular region on chromosome 1 in which adjacent genes are quantitatively coexpressed) on this chromosome.
Circular and tabular visualization were used to provide a global view of the data and results. Detailed data can be found in the Chinese Chromosome-Centric Human Proteome Database (C-HPD, http://proteomeview.hupo.org.cn/chromosome/) and will be available in Chromosome-Assembled human Proteome browsER (CAPER, http://61.50.138.124/index.php) in the future. As C-HPP investigators learned in Boston, it is essential that we all utilize the agreed-upon standard databases and criteria for the C-HPP chromosome baselines: Ensembl for number of genes (v68, release Jul 2012); neXtProt (release Sep 2012), GPMDB (release Jul 2012), and PeptideAtlas (1% FDR at protein level, release Sep 2012) for number and list of proteins confidently identified from MS data sets; and Human Protein Atlas (HPA, v10.0) for antibody-based protein identifications. The baselines of genes/proteins for chromosome 1 from Ensembl, GPMDB, PeptideAtlas, ProteinAtlas, and neXtProt as well as their corresponding identifications in CCPD are shown in Table 1.
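As a reading aid for Table 1, the following minimal sketch expresses the CCPD identifications as a percentage of each database baseline. The numbers are copied from Table 1; the names `TABLE_1` and `ccpd_fraction` are hypothetical, and the sketch assumes that each CCPD count is a subset of the corresponding baseline, which the table implies but does not state explicitly.

```python
# Chromosome 1 baselines and CCPD identifications, copied from Table 1.
# Each entry is (baseline count in the database, count also identified in CCPD).
TABLE_1 = {
    "Ensembl (genes)": (2043, 1227),
    "GPMDB": (1455, 1121),
    "PeptideAtlas": (1166, 1014),
    "HPA": (1441, 990),
    "neXtProt": (1419, 1081),
}


def ccpd_fraction(baseline: int, identified: int) -> float:
    """Return CCPD identifications as a percentage of a database baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * identified / baseline


# Percentages rounded to one decimal place (assumed presentation choice).
coverage = {
    name: round(ccpd_fraction(baseline, identified), 1)
    for name, (baseline, identified) in TABLE_1.items()
}

for name, pct in coverage.items():
    print(f"{name}: {pct}%")
```

Under these assumptions, the fraction of each baseline covered by CCPD ranges from roughly 60% (Ensembl genes) to roughly 87% (PeptideAtlas).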



Table 1. Number of Genes/Proteins in the Five Databases and Their Corresponding Identified Results for Chromosome 1 (a)

            no. of genes   no. of proteins
            Ensembl        GPMDB (b)   PeptideAtlas   HPA     neXtProt (c)
database    2,043          1,455       1,166          1,441   1,419
CCPD        1,227          1,121       1,014          990     1,081

(a) The five databases and their versions are Ensembl (v68, release Jul 2012), GPMDB (release Jul 2012), PeptideAtlas (release Sep 2012), ProteinAtlas (HPA, v 10.0), and neXtProt (release Sep 2012). (b) The data set annotated as “Green” was used, with the threshold “>20 Observations and log(e) < −5”. (c) The data set annotated as “protein evidence” was used.

MATERIALS AND METHODS

Sample Preparation

Liver samples were described in HLPP.9 The eight liver cell lines MHCC97L (97L), MHCC97H (97H), HCCLM3 (LM3), HCCLM6 (LM6), SNU398, SNU449, SNU475, and Hep3B31−33 were cultured in Dulbecco’s modified Eagle medium (Gibco, USA) supplemented with 10% fetal bovine serum (Life Technologies) at 37 °C in a humidified atmosphere containing 5% CO2. Cancerous and adjacent normal tissues of colon and stomach were collected at Beijing Cancer Institute & Hospital and Sun Yat-Sen University Cancer Center under the supervision of the institutions’ ethical review boards. Two colon cell lines, SW480 and HCT116,34,35 and three gastric cell lines, AGS, BGC823, and Sgc-7901,36−39 were also used in this work. Tissues were lysed as described previously.40 The quality of the extracted proteins was inspected following SDS-PAGE and Coomassie blue staining. The concentration of extracted protein was determined using the BCA assay. Equal amounts of protein from 3−5 patients were pooled prior to proteomics analysis to reduce individual variation. Detailed information is available in Supporting Information and Methods.

Proteomics Analysis

The protein mixtures were subjected either to gel-assisted digestion or to sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) separation followed by in-gel digestion with trypsin as described.41 The peptide mixtures from gel-assisted digestion were separated by high-pH reverse-phase chromatography.42 All peptide mixtures were analyzed by liquid chromatography tandem mass spectrometry (LC−MS/MS) in CCPC laboratories. The mass spectrometers used in this project were TripleTOF 5600 (AB SCIEX, Concord, ON), Orbitrap Q-Exactive or LTQ-Orbitrap Velos (Thermo Fisher Scientific). Details of protein digestion, extraction, peptide mixture prefractionation, and MS instrument settings can be found in the Supporting Information and Methods. All raw files of mass spectra were converted into mzXML and MGF files using the msconvert module43 in the Trans-Proteomic Pipeline (TPP v4.5.2).44 The MS/MS peak lists were searched using the Mascot v2.3.2 local server5 against the database containing sequences of all human proteins from Swiss-Prot (20,231 proteins, 2012_07 release) and common contaminants (115 proteins, ftp.thegpm.org/fasta/cRAP). The target-decoy strategy was applied to maintain both peptide and protein FDRs at less than 1%.45,46 A normalized label-free quantitation method based on the extracted ion chromatograms (XICs) was applied to all confidently identified peptides and proteins.47 Details of peptide/protein identification quality control methods and XIC-based protein quantification methods are available in Supporting Information and Methods.

Transcriptome Analysis

Transcriptome analysis was performed using the corresponding RNAs isolated from tissues and cell lines. RNA-Seq technology (Illumina HiSeq 2000) was used to generate the transcriptome data for normal and cancerous gastric tissues,48 gastric cell lines AGS and BGC823, normal and cancerous hepatic tissues,49 and normal and cancerous colorectal tissues.50 Affymetrix HG-U133 2.0 Plus oligonucleotide microarrays were used for five hepatic cell lines, MHCC97L, MHCC97H, HCCLM3, HCCLM6, and Hep3B. Microarray data for the other three hepatic cell lines (SNU398, SNU449, and SNU475) were downloaded from http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-37. Transcriptome data for colorectal cell lines were also generated by microarray.51 Gene expression levels were determined from RNA-Seq data based on the RPKM (reads per kilobase per million) value.52 Quantitative microarray data were utilized for further study based on MAS 5.0 (Affymetrix Expression Console Software, http://www.affymetrix.com/). When different samples were compared, RNA-Seq and microarray raw intensity values were divided by the median value in each sample for normalization. Details of the qualitative and quantitative analysis of transcriptome data can be found in Supporting Information and Methods.

Bioinformatic Annotation of Identified Proteins

To analyze the expression patterns of the proteome and transcriptome, proteins in Swiss-Prot were systematically annotated by combining the information from the Swiss-Prot, Online Mendelian Inheritance in Man (OMIM; NCBI, confirmed Mendelian phenotype), Cancer Gene Census (CGC, Sanger Center; cancer gene information), ProteinAtlas,53 and Ensembl databases. Evidence of protein expression, post-translational modifications (PTMs), and protein list were downloaded from Swiss-Prot directly. Antibody evidence was extracted from ProteinAtlas. Disease information was obtained from both OMIM and CGC. Gene position and chromosomal band information was downloaded from the Ensembl database (version 67.37) using BioMart (www.ensembl.org/biomart/martview). The gene symbols were utilized for mapping Ensembl to Swiss-Prot. For cofunction analysis, the Gene Ontology (GO) terms were downloaded from ftp.ensembl.org/pub/release-67/genbank/homo_sapiens/. The mass spectra evidence was obtained from the global proteome machine database (GPMDB),54 NIST spectra library (NIST_SL) (http://peptide.nist.gov/), and Human PeptideAtlas.55 Physicochemical properties and protein detectability were calculated on the basis of the protein sequences. The physicochemical properties of proteins, including molecular weight (MW), hydrophobicity (Hy), and isoelectric point (pI) were computed with ProPAS56 software. Peptide detectability (pepDetectability) was determined by a support vector machine-based tool, STEPP,57 and protein detectability (proDetectability) was deduced using the following formula:

$$\mathrm{proDetectability} = \log_{10}\left(\sum_{i=1}^{n} \mathrm{pepDetectability}_i\right)$$

Here, n is the number of all theoretical digested peptides in one protein.

Cluster and Enrichment Analysis

For cluster analysis of protein expression data, two-way hierarchical clustering of proteins was implemented using Cluster 3.058 and the default parameters coupled with centroid linkage. GO molecular function and biological process ontology analysis and pathway analysis were performed with the online tool DAVID59,60 coupled with MetaCore from GeneGo, Inc. A p-value of 0.001 was used as the cutoff for significant categorical annotation.
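The proDetectability formula given under Bioinformatic Annotation of Identified Proteins can be sketched in a few lines. This is only an illustration of the arithmetic: the function name `pro_detectability` and the example scores are hypothetical, the code does not call STEPP, and the scale of real pepDetectability values is not specified in the text, so the inputs are assumed to be non-negative scores with a positive sum.

```python
import math
from typing import Iterable


def pro_detectability(pep_detectabilities: Iterable[float]) -> float:
    """Return log10 of the summed peptide detectabilities of one protein.

    `pep_detectabilities` holds one score per theoretical digested peptide
    (i = 1..n in the formula). The scores here are illustrative placeholders,
    not STEPP output.
    """
    total = sum(pep_detectabilities)
    if total <= 0:
        # log10 is undefined for non-positive sums; how such proteins were
        # handled in the study is not stated, so this sketch simply refuses.
        raise ValueError("sum of peptide detectabilities must be positive")
    return math.log10(total)


# Hypothetical protein with two theoretical peptides whose scores sum to 10.
example_score = pro_detectability([4.0, 6.0])
print(example_score)
```

Because the sum is taken before the logarithm, a protein with many moderately detectable peptides can score as high as one with a single highly detectable peptide.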

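The per-sample median normalization described under Transcriptome Analysis (raw RNA-Seq or microarray values divided by the median value of the same sample) can be sketched as follows. The function name `median_normalize` and the example values are hypothetical, and the sketch assumes the median is taken over all values supplied for the sample; whether zero or missing values were excluded first is not stated in the text.

```python
from statistics import median
from typing import List, Sequence


def median_normalize(values: Sequence[float]) -> List[float]:
    """Divide every expression value of one sample by that sample's median."""
    if not values:
        raise ValueError("sample has no expression values")
    sample_median = median(values)
    if sample_median == 0:
        # A zero median would make the scaling undefined; the handling of
        # this case in the study is unknown, so the sketch raises instead.
        raise ValueError("sample median is zero")
    return [value / sample_median for value in values]


# Two hypothetical samples measured on different intensity scales.
sample_a = median_normalize([2.0, 4.0, 8.0])
sample_b = median_normalize([20.0, 40.0, 80.0])
print(sample_a, sample_b)
```

After this scaling every sample has a median of 1, so values from platforms with different raw intensity ranges can be placed on a common relative scale.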

Identification of Gene Coexpression/Cofunction Clusters

The chromosome 1 gene coexpression clusters were calculated on the basis of the quantitative data from both the proteome and the transcriptome, following Lercher’s method.61 The cofunction clusters were identified according to the shared functions from GO terms.61 Tandem duplicates were defined by protein BLAST62 (word size = 2). Protein pairs were considered tandem duplicates if the BLAST E-value was less than 0.2.63 A detailed description of the method can be found in Supporting Information and Methods.

Data Visualization and Database Construction

As in the gene display method of Paik et al.,13 tabular visualization of genes on chromosome 1 was carried out using conditional formatting of Excel spreadsheets and included the information on protein characterization and the indices of protein and mRNA abundance from this study for each identified gene. There are five levels in Supplementary Table S4. The top level is shown in green, which indicates the highest quality for protein characterization or top 25% protein/mRNA abundance, whereas yellow, red, and gray represent decreased quality or abundance. The dark box in the table indicates that the protein was not characterized or could not be identified in the proteome or transcriptome. The heatmap of hierarchical clustering of the protein quantitation values, sample correlation coefficient matrix, and gene coexpression clusters of chromosome 1 were displayed with Java TreeView-1.1.6r2 (http://jtreeview.sourceforge.net/). The global overview of chromosome annotations (from chromosome structure to protein evidence and modifications), detected abundance of protein and mRNA, and expression patterns (unidentified blocks and coexpression clusters) were layered in a circular layout for visualization using Circos.64 The database C-HPD was constructed using the MySQL relational database to store the data and JSP technology to implement the web interface. Currently, the database consists of three parts: basic information for chromosomes, protein and gene annotation for each chromosome, and transcriptome information for each chromosome.

RESULTS AND DISCUSSION

To identify as many proteins as possible for C-HPP projects, we designed the project in such a way as to use diverse samples of diseased tissue. Comprehensive proteome data sets, as well as the corresponding transcriptome data sets, were generated using both tissue and cell lines from three organs of the human digestive system [i.e., liver, colon, and stomach (Figure 1)]. The data were processed for assignment to the database entries with their normalized expression abundance. Strict quality controls were applied to each step of both the experiments and the data analysis process (Figure 1).

Figure 1. Flowchart of experimental strategies and data processing methods.

Qualitative Analysis of the Identified Proteins

Collaboration within the CCPC produced a large proteome data set, CCPD, containing 6.5 million spectra and 192,546 unique peptides for 12,101 proteins with a protein FDR of less than 1%, corresponding to 11,632 genes (Supplementary Tables S1 and S2). The high quality of protein identification was demonstrated by the large number of unique peptides (Supplementary Figure S1A) and high protein sequence coverage (Supplementary Figure S1B). There were 425 proteins with only one unique peptide in our final protein list after applying a 1% FDR cutoff. The high quality of the matches was also demonstrated by a higher Mascot score (Supplementary Figure S1C).

To compare tissue specificity and increase the coverage of identified proteins to all protein-coding genes in the human genome, three tissues (from the human liver, colon, and stomach) were used (Table 2). The results showed that use of samples from different organs increased the number of identified peptides and proteins, indicating that the combination of experimental results from different samples would be complementary and conducive to chromosome-based proteomics. The number of identified proteins appeared to increase more slowly than the number of peptides (Figure 2A), suggesting that the number of identified proteins tended to reach a plateau with the increase in numbers of tissues or cell lines. We identified, with high confidence, 9,346 proteins in the liver, 9,506 in the colon, and 7,941 in the stomach. Among these, 6,184 (51.1%) proteins were shared by all three data sets, suggesting a high degree of consistency and confidence of our proteomics platform and protein sequencing power (Figure 2B). Detailed information about the identified proteins and peptides can be found in Supplementary Tables S1 and S2.

Table 2. Overview of Samples and Data Sets in Proteomic Analysis

tissue    sample             no. of identified spectra   no. of identified peptides   no. of identified proteins
liver     normal tissue      789,201                     72,627                       7,092
liver     97H                397,380                     27,241                       3,738
liver     97L                289,831                     26,809                       3,568
liver     LM3                223,065                     15,728                       3,014
liver     LM6                278,896                     26,181                       3,736
liver     Hep3B              220,809                     25,941                       4,011
liver     SNU398             226,484                     25,034                       3,910
liver     SNU449             273,160                     24,768                       3,994
liver     SNU475             243,738                     23,387                       3,804
liver     cancerous tissue   562,787                     87,955                       7,749
colon     normal tissue      572,802                     103,182                      8,506
colon     SW480              540,271                     36,130                       4,429
colon     HCT116             345,981                     46,353                       5,398
colon     cancerous tissue   220,652                     17,529                       3,018
stomach   normal tissue      219,581                     17,676                       3,375
stomach   AGS                482,442                     50,805                       6,236
stomach   BGC823             447,780                     51,689                       6,130
stomach   Sgc-7901           219,224                     31,002                       3,734
total                        6,554,084                   192,546                      12,101

Compared with the Swiss-Prot database, the general coverage of identified proteins was approximately 59.8%. The coverage rate for different chromosomes was similar, ranging from 66.6% for chromosome 2 to 41.6% for chromosome 21. For the three chromosomes that CCPC is devoted to, the coverage was 60.9% (chr 1), 58.9% (chr 8), and 58.4% (chr 20) (Figure 2C).

Figure 2. General description of proteome identified results. (A) Number of identified peptides and proteins in the three tissues and the corresponding cell lines. (B) Overlap of identified proteins in the three tissues. (C) Identified proteins on different chromosomes. The three chromosomes that the CCPC is devoted to are highlighted. The number of identified proteins over the number of total proteins on different chromosomes are shown near the highlighted bar. The percentage of identified proteins on different chromosomes can be found in the line above. (D) Identified protein distribution in Swiss-Prot’s five different evidence categories. (E) Overlap of identified peptides with Human PeptideAtlas and NIST_SL. (F) Overlap of high-confidence identified proteins with GPMDB, NIST_SL, and PeptideAtlas.

Contributions to the Public Protein/Mass Spectra Databases

The publicly available protein databases enable the systematic annotation of protein entries from different evidence. For example, Swiss-Prot provides the protein evidence function, which assigns every protein to one of the following levels with decreasing confidence: protein, transcript, homology, prediction, and uncertain. Compared with the protein evidence annotated in Swiss-Prot, 10,264 of 12,101 proteins in our CCPD were annotated as genes with protein-level evidence in the database, accounting for 75.4% of the genes with protein-level evidence in Swiss-Prot. This result again confirmed the high confidence of our identified proteins. Furthermore, 1,837 identified proteins in our data set were annotated in Swiss-Prot as genes with evidence levels of transcript, homology, or even prediction or uncertain (Figure 2D). Therefore, our results provided new evidence for and insight into the functions of these genes.

The public proteomics databases are another valuable resource for protein level validation. As described in Materials and Methods, there are three major mass spectra databases (GPMDB, Human PeptideAtlas, and NIST_SL) that can be used as sources for comparison. Despite the smaller scale and volume of our proteomics data, we consider our contributions to these public resources significant. As shown in Figure 2E, 79,581 unique peptide sequences in our CCPD were not present either in the NIST peptide list or that of Human PeptideAtlas. These peptides were identified with high confidence in our study because 10,741 (1,181 for chromosome 1) were identified by at least one high-accuracy MS/MS spectrum with a Mascot ion score greater than 50. At the protein level, we contributed 564 newly identified proteins (58 for chromosome 1) with three or more unique peptides, which are not present in these public databases (Figure 2F). These results suggest the high coverage of our data set at the level of both the individual protein and whole proteome.

Quantitative Analysis of the Identified Proteins

The abundance distribution of identified proteins at the proteomic level shows that the proteins on our list span 7 orders of magnitude and that 90% of the proteins fall in the range of 3−4 orders of magnitude (Figure 3A). The Pearson’s correlation coefficient between different samples ranges from 0.21 to 0.96, with the gastric AGS cell line and colon cancer tissue having the lowest correlation coefficient and liver cell lines 97L and LM6 having the highest correlation coefficient (Figure 3B).

Figure 3. Protein abundance distribution and cluster analyses of proteome expression data for liver, stomach, and colon samples. (A) Distribution of protein abundance. (B) Clustering of different samples. (C) Cluster analysis of proteomic data from 17 different samples. The analyses of gene function in different clusters indicate potentially different functions and coordinated coexpression in different tissues.

To take a global view of the proteomic expression pattern, two-way hierarchical clustering was applied to the samples and identified proteins using the protein concentration. The clustering of 17 samples showed that the organ diversity was more significant than the difference between cancer and normal tissues of the same organ (Figure 3C), which is consistent with the coefficient analysis described above (Figure 3B). As expected, the liver samples were clustered together and appeared as an independent branch. There were some crossovers between the stomach and colon, indicating that to some extent, these two organs might share some physiological functions (Figure 3C). Cluster analysis of different proteins revealed four blocks in which the protein expression levels varied dramatically between different organs (as indicated in Figure 3C). Further research should be carried out for the validation of these function biases.
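The sample-to-sample Pearson correlation coefficients reported in this subsection can be illustrated with a minimal pure-Python sketch. The function name `pearson` and the abundance vectors are hypothetical; the text does not say whether abundances were log-transformed or how proteins missing from one sample were treated, so the sketch assumes two equal-length vectors of abundances for the same proteins.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two abundance vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centered cross-products and squares (unnormalized covariance
    # and variances; the common 1/n factor cancels in the ratio).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical abundances of four proteins in three samples.
sample_1 = [1.0, 2.0, 3.0, 4.0]
sample_2 = [2.0, 4.0, 6.0, 8.0]   # proportional to sample_1
sample_3 = [8.0, 6.0, 4.0, 2.0]   # reversed ranking

r_same_trend = pearson(sample_1, sample_2)
r_opposite_trend = pearson(sample_1, sample_3)
print(r_same_trend, r_opposite_trend)
```

Applying such a function to every pair of samples yields the kind of correlation matrix shown in Figure 3B; a library routine could be used instead, and the hand-written version is kept here only to make each term of the coefficient visible.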


Figure 4. Overview of the identified proteins on chromosome 1. (A) Distribution of protein-coding genes and identified genes on chromosome 1. The red line shows the densities of identified genes; the blue shows the densities of the total protein-coding genes. (B) Number of identified/unidentified genes for protein evidence in Swiss-Prot; high-quality mass spectra from GPMDB, NIST_SL, and HPA; antibody data; and disease data. (C) Overlap of the three PTMs and the identified proteins. (D) Number of identified proteins in the enriched pathways of chromosome 1. (E) Analyses of missing proteins on chromosome 1. Missing proteins were matched to the three mass spectra databases and separated into three data sets: 356 matched to highconfidence mass spectra data; 410 matched to low-confidence data; and 39 could not be matched to the databases. The four data sets are compared based on physicochemical properties and mRNA abundance.

Protein Identifications on Chromosome 1

very low protein identification rate, such as the region of the q terminal, 1q2. Compared with public data sets (including protein evidence from Swiss-Prot, high quality mass spectra, and antibody and disease data from OMIM and Sanger CGC database, respectively), 70%−80% of annotated proteins were covered by our CCPD (Figure 4B). The results indicate that CCPD is enriched in the known and well-annotated protein list of chromosome 1. The overlap of the three well-studied protein post-translational modification data sets and CCPD show that 857 of the 1,252 identified proteins on chromosome 1 contain one or more modifications (Figure 4C). There are only five overlapped proteins among those acetylated or glycosylated. The percentage of all modified proteins represented by the identified proteins varies with the type of modification. Of the 481 glycoproteins, only 47.8% were identified. However, 83.6% of the 758 phosphorylated proteins were identified, as were 96.7% of the acetylated proteins (Figure 4C).

On human chromosome 1, there are 2,057 Swiss-Prot entries (Release 2012_07) and 2,040 protein-coding genes in the Ensembl protein database (Release-67). In this project, we identified 1,252 Swiss-Prot entries, corresponding to 1,227 Ensembl genes. The baselines and identified genes/proteins for chromosome 1 from Ensembl, GPMDB, PeptideAtlas, ProteinAtlas, and neXtProt are shown in Table 1. Enrichment Analysis of Identified Proteins on Chromosome 1

The enrichment of identified proteins was analyzed in terms of chromosome structure, protein evidence, and functional pathway. According to the gene density distribution along chromosome 1 (Figure 4A), the areas with highest gene densities are at the two terminals of the p and q arms (1p3 and 1q4) and at 1q2 close to the centromere. We also noticed that proteins in several regions had been heavily identified, including those at 1p2 and 1q2-1q3. On the other hand, there were several regions with a 73

dx.doi.org/10.1021/pr3008286 | J. Proteome Res. 2013, 12, 67−80

Journal of Proteome Research

Article

Figure 5. Gene expression blocks on chromosome 1. Seven gene expression blocks are located close to the centromere or the termini of the two chromosome arms. Of these five blocks (1, 2, 4, 5, and 7), which are absent or expressed at a low level in the transcriptome and proteome, contain many homologous genes and might be the result of tandem duplication. The two remaining blocks (3 and 6) appear to be low-abundance or missing in the proteome data but have median or high abundance in the transcriptome data.

protein list of the public mass spectra databases, and 410 proteins were among the low-confidence results in these databases. However, 39 proteins remained that could not be found in any of the three public databases (Figure 4E). Herein, all proteins on chromosome 1 can be divided into four subgroups, Group A (1,252 identified proteins), Group B (356 unidentified proteins with high-confidence in public databases), Group C (410 unidentified proteins with low-confidence in public databases), and Group D (39 unidentified proteins without mass spectra evidence). The physicochemical properties including MW, Hy, pI, and protein detectability of the four groups were analyzed and compared. As shown in Figure 4E, the average MW of Group A proteins was higher than that of the three unidentified-protein groups. Further, the 39 proteins not detected in the three public databases had the lowest average MW, indicating that low-MW proteins tend to be more difficult to identify by mass spectrometry. Bias in Hy, pI, and protein detectability could also be seen with unidentified proteins. In conclusion, the identified proteins tend to be higher in molecular weight, hydrophilicity, acidity, and detectability. The bias in physicochemical properties might be caused by the proteomic technology and might also lead to absence of membrane proteins (Supplementary Figure S2A). In addition to technical bias, two additional factors may be associated with protein identification. Using mRNA abundance as an index to estimate the abundance of the corresponding proteins, we found that proteins are more difficult to identify

Further pathway enrichment analysis shows that the genes of eleven pathways were significantly enriched on chromosome 1. The ratio of the identified gene number to the total gene number on chromosome 1 varies among the different enriched pathways (Figure 4D). For example, only four proteins (7%) in the olfactory transduction pathway were identified, indicating the obvious absence of the pathway in the three tissues studied. The genes involved in glutathione metabolism and Fc gamma R-mediated phagocytosis were fully identified, indicating that these are highly active pathways in the tissues represented in our data set. These results were consistent with the gene density analysis described above and the unidentified blocks discussed below.

Analysis of Technical and Sampling Bias of Unidentified Proteins on Chromosome 1

The overlap between the CCPD and the public protein/mass spectra data indicates that there is still room for improvement of protein identification. Thus, the analysis of identified and unidentified proteins contributes to a better understanding of technical and sampling bias and might result in greater coverage of protein-coding genes in the future. Among the 2,057 proteins encoded by genes on chromosome 1 in the Swiss-Prot database, 1,252 proteins (60.87%) were identified and 805 were unidentified proteins in our proteome data set. Compared with the three public mass spectra databases (GPMDB, Human PeptideAtlas, and NIST_SL), 356 of the unidentified proteins were matched to the high-confidence


Figure 6. (A) Regions of gene coexpression and cofunction clusters on human chromosome 1 and heatmaps of gene coexpression clusters 1, 3, 18, and 20. The blue and red horizontal lines indicate the locations of coexpression clusters found using mRNA data (R) and proteomic data (P), respectively. The black horizontal lines indicate the locations of cofunction clusters found by cofunction analysis (F). Coexpression clusters with at least two kinds of evidence are shown. The heatmap was generated using the Pearson’s correlation coefficients of all gene pairs in a cluster. Clusters 1, 18, and 20 were generated using mRNA expression data, while cluster 3 was generated using protein expression data. The intensity of the color in the heatmap indicates the degree of correlation. Genes in red font are coexpressed. (B) An example of a cancer suppressor-related gene coexpression pair in cluster 20. CSRP1, located at 1q32.1, may be involved in the progression of colorectal cancer and is coexpressed with LAD1. Both genes were overexpressed in normal colon tissue but underexpressed in colon cancer tissue, on the basis of our proteome and transcriptome quantification results. The XIC of each high-confidence unique peptide was extracted using Xcalibur (v2.2).

when the abundance of the corresponding mRNA is low (Figure 4E). Tissue specificity is another factor for unidentified proteins, which tended not to be annotated in the liver, stomach, and colon, according to the ProteinAtlas database,53 accounting for the absence of these proteins in our data set (Supplementary Figure S2B).
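The four-way grouping used in the bias analysis (Groups A to D) can be illustrated with a minimal sketch. The function name, the record layout, and the toy records are assumptions made for illustration only; the group sizes are the ones reported in the text, and the check simply confirms that they partition the 2,057 chromosome 1 entries.

```python
from collections import Counter

def assign_group(identified, public_confidence):
    """Assign one protein to Group A, B, C, or D.

    identified: True if the protein was identified in this data set.
    public_confidence: 'high', 'low', or None (no public mass spectra evidence).
    Both argument names are hypothetical, chosen for this sketch.
    """
    if identified:
        return "A"  # identified in this data set
    if public_confidence == "high":
        return "B"  # unidentified here, high-confidence in public databases
    if public_confidence == "low":
        return "C"  # unidentified here, low-confidence in public databases
    return "D"      # unidentified here, no public mass spectra evidence

# Toy records (accession, identified, public confidence); not real data.
toy_proteins = [
    ("P1", True, "high"),
    ("P2", False, "high"),
    ("P3", False, "low"),
    ("P4", False, None),
    ("P5", True, None),
]
toy_counts = Counter(assign_group(i, c) for _, i, c in toy_proteins)

# Group sizes as reported in the text.
reported = {"A": 1252, "B": 356, "C": 410, "D": 39}
total_reported = sum(reported.values())
unidentified_reported = total_reported - reported["A"]
```

The reported sizes sum to 2,057, with 805 proteins in the three unidentified groups, matching the totals given for chromosome 1.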

close to the terminals of the long or short arms, which is also consistent with the observation shown in Figure 4A. In blocks 1, 2, 4, 5, and 7, most genes could not be identified by proteome evidence, and transcriptome evidence was either low or without detectable signal. This observation is consistent with the protein evidence in Swiss-Prot, most of which is from the corresponding transcript. The mass spectra evidence from GPMDB, the NIST spectra library, and the Human PeptideAtlas for these genes also tends to be low-confidence or absent. We further analyzed the sequences of the genes in these blocks and found that most were tandem duplicates. For example, block 7 contains 55 genes, 42 of which belong to the olfactory receptor family and are tandem duplicates. Block 5 is composed of three different gene families, including six CD5 antigen genes, six Fc receptor-like genes, and 16 olfactory receptor genes. The putative neuroblastoma breakpoint family repeat region could be found in Block 3, which was significantly enriched on chromosome 1.15 A detailed list of the tandem duplicated gene families in each block is shown in Figure 5A. Two blocks (3 and 6) contain genes with low/absent protein expression but with median or high mRNA expression (Figure 5A). However, analysis of the physicochemical properties and detectability of these proteins showed no obvious bias. More work should be carried out to understand these phenomena.
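The run-length criterion behind the unidentified protein blocks (a per-gene miss probability of 805/2,057, raised to the power of the run length and compared with 0.01) can be sketched as follows. This is only an illustration under the assumption that genes are missed independently; the function names and the toy identification vector are hypothetical and are not taken from the authors' pipeline.

```python
P_MISS = 805 / 2057  # per-gene chance of being unidentified (about 0.391)

def min_run_length(p_miss, alpha=0.01):
    """Smallest run length n for which p_miss**n falls below alpha."""
    n = 1
    while p_miss ** n >= alpha:
        n += 1
    return n

def find_unidentified_blocks(flags, min_len):
    """Find runs of consecutive unidentified genes.

    flags: booleans in chromosomal order, True = identified.
    Returns inclusive (start, end) index pairs for runs of length >= min_len.
    """
    blocks = []
    start = None
    for i, identified in enumerate(flags):
        if not identified:
            if start is None:
                start = i  # a new run of missed genes begins here
        else:
            if start is not None and i - start >= min_len:
                blocks.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(flags) - start >= min_len:
        blocks.append((start, len(flags) - 1))
    return blocks

cutoff = min_run_length(P_MISS)

# Toy identification vector (not real data): runs of 5, 2, and 6 missed genes.
toy_flags = [True] + [False] * 5 + [True] + [False] * 2 + [True] + [False] * 6
toy_blocks = find_unidentified_blocks(toy_flags, cutoff)
```

With the reported counts the cutoff evaluates to five genes, so only the runs of five and six missed genes in the toy vector qualify as blocks.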

Block Analysis of the Unidentified Proteins along Chromosome 1

We found that there are some regions on chromosome 1 marked with 'dark' in the tabular visualization, suggesting that few of the genes in those regions are expressed. Here, we call these regions “unidentified protein blocks”. The significant identified/unidentified blocks along chromosomes could be obtained from the number of continuously identified/unidentified genes. On chromosome 1, there are 2,057 proteins, with 1,252 identified and 805 unidentified. The chance that one gene would be missed is 805/2057 = 0.391. Therefore, assuming that genes are missed independently, the probability of n continuously missed genes is $p = (0.391)^{n}$. A region is estimated to be significantly unidentified if it contains at least five continuous missed genes ($p < 0.01$, since $0.391^{5} \approx 0.009$ whereas $0.391^{4} \approx 0.023$). Using this as the cutoff, 18 significant unidentified gene blocks have been identified (Supplementary Table S6). When these unidentified gene blocks are combined in the tabular visualization (Supplementary Table S4), seven expressing gene blocks are also detected along chromosome 1, with the gene number ranging from 19 to 69, according to the expression levels of the transcriptome and proteome (Figure 5A). Interestingly, of these seven blocks, three are close to the centromere and four are

Analysis of Gene Coexpression Patterns along Chromosome 1

In eukaryotes, the order of genes on the chromosomes is not random and genes with similar and/or coordinated expression are often clustered.65 Although transcriptomic analysis has been


Figure 7. Compilation of proteome information for chromosome 1. Information includes the public annotation (including the protein evidence, modifications, and disease), transcriptome and proteome expression data, and the expression patterns obtained from the analyses of qualitative unidentified blocks and quantitative coexpression/cofunction clusters. In the circular visualization, all of the information is surrounded by the chromosome, and protein characterization and protein/mass spectra evidence are shown in the rings adjacent to the chromosome. The qualitative and quantitative identification of the transcriptome and proteome are shown in the following rings. The inner rings, close to the center of the circle, are the results of analysis of gene expression patterns along chromosome 1.

used to establish gene coexpression profiles,66 research using proteomic data to establish gene coexpression clusters is still in a preliminary stage.66 Here, we investigated the gene coexpression clusters on human chromosome 1 by analyzing the transcriptome (R) and proteome (P) expression data generated from the collected samples. Further, gene cofunction analysis (F) using GO annotation was used as additional evidence of gene coexpression clusters. Clusters with at least two types of evidence are reported here as gene coexpression clusters with high confidence. Twenty-three high-quality gene coexpression clusters were found on human chromosome 1, four of which (17%) were confirmed by three types of evidence (R, P, and F). Twelve (52%) found by mRNA expression data were also confirmed by cofunction analysis (R and F), and five (22%) were confirmed by both protein expression data and cofunction analysis (P and F). The two (9%) remaining clusters were confirmed by mRNA and protein expression data (R and P). Detailed information on the coexpression clusters is presented in Supplementary Table S7. When investigating the details of the 23 coexpression clusters, we found that the four clusters (1, 6, 8, and 19) confirmed by three types of evidence were regions of chromosome 1 with tandem duplicate genes. For example, cluster 1 contains three adjacent genes, C1QA, C1QC, and C1QB, at chromosome 1p36.12 (Figure 6A), all of which are related to the phenotype of C1q deficiency, a rare autosomal recessive disorder associated with several diseases (http://omim.org/entry/613652).
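The two-of-three evidence rule and the reported breakdown of the 23 clusters can be checked with a short sketch. The helper name and label format are assumptions for illustration; the counts are those stated in the text.

```python
def evidence_label(evidence):
    """Return a label such as 'R+F' for a cluster, or None if it has fewer
    than two of the three evidence types (R = mRNA, P = protein, F = cofunction).
    """
    kinds = [k for k in "RPF" if k in evidence]
    return "+".join(kinds) if len(kinds) >= 2 else None

# Cluster counts per evidence combination, as reported in the text.
reported_counts = {"R+P+F": 4, "R+F": 12, "P+F": 5, "R+P": 2}
total_clusters = sum(reported_counts.values())

# Percentage of clusters per combination, rounded to whole numbers.
percentages = {
    label: round(100 * count / total_clusters)
    for label, count in reported_counts.items()
}
```

The four combinations sum to 23 clusters, and the rounded shares reproduce the 17%, 52%, 22%, and 9% quoted above.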

Twelve clusters were found by mRNA expression data and cofunction analysis. Genes in most of these clusters were also found to be tandem duplicates. For example, guanylate-binding proteins located at 1p22.2, amylase protein components at 1p21.1, late cornified envelope proteins at 1q21.3, and the olfactory receptor family at 1q23.1 and 1q44 were all found in such types of clusters. An interesting observation was that although 23 genes of the olfactory receptor family in cluster 23 were found with coexpression patterns using mRNA data, no such proteins were identified in this region, which is also consistent with our earlier analysis. Proteomic expression data identified five more significant coexpression clusters (2, 3, 10, 11, and 21), which were also verified by cofunction analysis. For example, cluster 3 contains four adjacent genes, NEXN, FUBP1, DNAJB4, and GIPC2 located at 1p31.1 (Figure 6A). The phenotype of NEXN expression is related to cardiomyopathy (http://omim.org/entry/613121). The adjacent FUBP1 is a cancer gene, and mutation of FUBP1 contributes to human oligodendroglioma.67 FUBP1 and DNAJB4 were also found on the list of genes coexpressed with NEXN in COXPRESdb, a database of coexpressed gene networks in mammals.68 Although genes in clusters 18 and 20 are not cofunctionally present, the two gene clusters located at 1q31.2 and 1q32.1 were verified by both protein and mRNA evidence, indicating a strong correlation in expression between the genes and gene product


three chromosomes. As the first effort, CCPD was constructed and 12,101 high-confidence proteins were identified. Although a large proportion of identifications could be found in the public mass spectra database, an additional 10,741 peptides and 564 proteins were identified in our data set. Beijing Proteome Research Center focuses on chromosome 1, for which 1,252 proteins mapping to 1,227 genes were identified in our data set. Using this large data set, multiple analyses were performed. First, the data characterizing protein-coding genes on chromosome 1 were collected, and enrichment analysis was performed, focusing on chromosomal arrangement, protein characterization, and pathway. The bias in protein identification was found to correlate with physicochemical properties and tissue specificity, indicating that additional technologies and tissues are needed in future analyses. The analysis of qualitative identified/unidentified blocks and quantitative coexpression clusters shows the nonrandom distribution of genes along the chromosome.

pairs in these regions. For example, four genes, UCHL5, TROVE2, GLRX2, and CDC73, are in cluster 18, located at 1q31.2 (Figure 6A). CDC73 is a disease-related gene with hyperparathyroidism phenotype. Based on the annotations in COXPRESdb, the other three genes are also on the list of genes coexpressed with CDC73. Cluster 20 contains four genes, LAD1, TNNI1, PHLDA3, and CSRP1. CSRP1, located at 1q32.1, may be a cancer suppressor gene involved in the progression of colorectal cancer.30 PHLDA3 was also found coexpressed with CSRP1 and LAD1 in COXPRESdb. Based on the analysis of mRNA expression data, the four genes in cluster 20 were indeed overexpressed in normal tissues of the colon and underexpressed in colon cancer tissues (Supplementary Table S3). This was also observed in the protein expression data. The abundance of CSRP1 and LAD1 in normal colon tissue was 1 order of magnitude higher than that in colon cancer tissue (Supplementary Table S1). Figure 6B shows the detailed investigation of XICs with their high-confidence unique peptides for these two gene products identified in colon cancer and normal tissues (see Supplementary Figure S3 for annotated spectra), as we expected. Although 23 clusters were found in this study, these do not represent all of the coexpression clusters on chromosome 1. With the accumulation of transcriptomic data, as well as proteomic data, more tissue- or disease-specific coexpression clusters may be identified. We expect that the proteome expression profile can help to confirm and adjust the gene coexpression clusters found using transcriptomic data, and the integration should shed light on the mechanisms of transcriptional and translational regulation.
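As a rough illustration of how pairwise coexpression within a cluster can be scored, the sketch below computes Pearson's correlation coefficients for all gene pairs, the statistic named in the Figure 6A legend for the heatmaps. The gene labels and expression values are invented for illustration, and the function names are hypothetical; this is not the authors' implementation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)

def correlation_matrix(profiles):
    """Correlation of every gene pair; profiles maps gene -> expression list."""
    genes = list(profiles)
    return {
        g: {h: pearson(profiles[g], profiles[h]) for h in genes}
        for g in genes
    }

# Invented expression values across four samples (not real measurements).
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
}
toy_matrix = correlation_matrix(toy_profiles)
```

In this toy input, geneA and geneB are perfectly correlated and geneA and geneC are perfectly anticorrelated; a heatmap would color each cell of the matrix by these values.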



ASSOCIATED CONTENT

Supporting Information

Supplementary materials and methods, figures, and Tables S5−S7. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*(F.H.) Tel and Fax: 8610-68171208. E-mail: [email protected]. cn. (P.X.) Tel and Fax: 8610-80705155. E-mail: xupingghy@gmail.com. (Y.Z.) Tel: 8610-80705225. E-mail: zhuyunping@gmail.com. (P.Y.) Tel and Fax: 8621-54237961. E-mail: [email protected]. (S.L.) Tel and Fax: 8610-80485324. E-mail: [email protected].

Data Visualization and Database Construction

The large number of protein-coding genes on chromosome 1 makes it impossible to display all the information in a single condensed graph. Here, two different visualization methods were applied to present the data and results for different purposes. The circular visualization provides the overview of data and results of analysis in a condensed graph. Here, we integrated the information on protein characterization, protein and mass spectra evidence, expression abundance of the transcriptome and proteome, and the results of analysis of unidentified blocks and coexpression clusters using the circular visualization, which is a good way to provide a panoramic view of chromosome 1 (Figure 7). The tabular visualization provides an easier way to check the detailed information for each protein (Supplementary Table S4). Most information that was integrated into the circular visualization, with the exception of the results of analysis, is also found in the tabular visualization. However, the table can also be used for data analysis and confirmation. More detailed information and external links can be accessed and queried in the C-HPD. Currently, the content of the database consists of three different types of information, i.e., gene and protein arrangement along chromosomes, powerful annotation of proteins and genes, and expression information for the transcriptome and proteome. C-HPD is now publicly available at http://proteomeview.hupo.org.cn/chromosome. More detailed information and updates of chromosome 1 proteome research can be found in the database.

Author Contributions

○These authors contributed equally to this work.

Notes

The authors declare no competing financial interest. Additional Data Available: Supplementary Tables S1−S4 associated with this manuscript may be downloaded from ProteomeCommons.org Tranche using the following hash: mrPsJD5yqIU5+l2f3VdjFRydqlQkIo4ItJWXyR74oyGm34LhmZ2fGJ03hDS9D9vUV1lCXTTuDIEwv +Ik9NUu05+cQ7MAAAAAAAAEwA==



ACKNOWLEDGMENTS

This work was supported by the Chinese National Basic Research Program (2011CB910600, 2013CB911200 and 2011CB910701), the National High-Tech Research and Development Program (2011AA02A114, 2012AA020201, 2012AA020409 and 2012AA020502), the National Natural Science Foundation of China (31070673, 31170780, 81123001, 21105121, 21275160 and 91131009), the National International Cooperation Project (2011-1001), the Beijing Municipal Natural Science Foundation (5122013), Key Projects in the National Science & Technology Pillar Program (2012BAF14B00), and the State Key Lab Project (SKLP-K200904 and SKLP-Y201102).





CONCLUSION

As a cooperative international project, C-HPP will be advanced by the multiple research groups that claim their chromosomes of interest. Accordingly, the CCPC was established and claimed

ABBREVIATIONS

CCPC, Chinese Human Chromosome Proteome Consortium; CCPD, Chinese Chromosome Proteome Data set; C-HPD, Chinese Chromosome-Centric Human Proteome Database;


Guan, P.; Heiman, T. J.; Higgins, M. E.; Ji, R.-R.; Ke, Z.; Ketchum, K. A.; Lai, Z.; Lei, Y.; Li, Z.; Li, J.; Liang, Y.; Lin, X.; Lu, F.; Merkulov, G. V.; Milshina, N.; Moore, H. M.; Naik, A. K.; Narayan, V. A.; Neelam, B.; Nusskern, D.; Rusch, D. B.; Salzberg, S.; Shao, W.; Shue, B.; Sun, J.; Wang, Z. Y.; Wang, A.; Wang, X.; Wang, J.; Wei, M.-H.; Wides, R.; Xiao, C.; Yan, C.; Yao, A.; Ye, J.; Zhan, M.; Zhang, W.; Zhang, H.; Zhao, Q.; Zheng, L.; Zhong, F.; Zhong, W.; Zhu, S. C.; Zhao, S.; Gilbert, D.; Baumhueter, S.; Spier, G.; Carter, C.; Cravchik, A.; Woodage, T.; Ali, F.; An, H.; Awe, A.; Baldwin, D.; Baden, H.; Barnstead, M.; Barrow, I.; Beeson, K.; Busam, D.; Carver, A.; Center, A.; Cheng, M. L.; Curry, L.; Danaher, S.; Davenport, L.; Desilets, R.; Dietz, S.; Dodson, K.; Doup, L.; Ferriera, S.; Garg, N.; Gluecksmann, A.; Hart, B.; Haynes, J.; Haynes, C.; Heiner, C.; Hladun, S.; Hostin, D.; Houck, J.; Howland, T.; Ibegwam, C.; Johnson, J.; Kalush, F.; Kline, L.; Koduru, S.; Love, A.; Mann, F.; May, D.; McCawley, S.; McIntosh, T.; McMullen, I.; Moy, M.; Moy, L.; Murphy, B.; Nelson, K.; Pfannkoch, C.; Pratts, E.; Puri, V.; Qureshi, H.; Reardon, M.; Rodriguez, R.; Rogers, Y.-H.; Romblad, D.; Ruhfel, B.; Scott, R.; Sitter, C.; Smallwood, M.; Stewart, E.; Strong, R.; Suh, E.; Thomas, R.; Tint, N. N.; Tse, S.; Vech, C.; Wang, G.; Wetter, J.; Williams, S.; Williams, M.; Windsor, S.; Winn-Deen, E.; Wolfe, K.; Zaveri, J.; Zaveri, K.; Abril, J. F.; Guigo, R.; Campbell, M. J.; Sjolander, K. V.; Karlak, B.; Kejariwal, A.; Mi, H.; Lazareva, B.; Hatton, T.; Narechania, A.; Diemer, K.; Muruganujan, A.; Guo, N.; Sato, S.; Bafna, V.; Istrail, S.; Lippert, R.; Schwartz, R.; Walenz, B.; Yooseph, S.; Allen, D.; Basu, A.; Baxendale, J.; Blick, L.; Caminha, M.; Carnes-Stine, J.; Caulk, P.; Chiang, Y.-H.; Coyne, M.; Dahlke, C.; Mays, A. 
D.; Dombroski, M.; Donnelly, M.; Ely, D.; Esparham, S.; Fosler, C.; Gire, H.; Glanowski, S.; Glasser, K.; Glodek, A.; Gorokhov, M.; Graham, K.; Gropman, B.; Harris, M.; Heil, J.; Henderson, S.; Hoover, J.; Jennings, D.; Jordan, C.; Jordan, J.; Kasha, J.; Kagan, L.; Kraft, C.; Levitsky, A.; Lewis, M.; Liu, X.; Lopez, J.; Ma, D.; Majoros, W.; McDaniel, J.; Murphy, S.; Newman, M.; Nguyen, T.; Nguyen, N.; Nodell, M.; Pan, S.; Peck, J.; Peterson, M.; Rowe, W.; Sanders, R.; Scott, J.; Simpson, M.; Smith, T.; Sprague, A.; Stockwell, T.; Turner, R.; Venter, E.; Wang, M.; Wen, M.; Wu, D.; Wu, M.; Xia, A.; Zandieh, A.; Zhu, X. The sequence of the human genome. Science 2001, 291 (5507), 1304−1351. (3) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198−207. (4) Eng, J. K.; McCormack, A. L.; Yates Iii, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976−989. (5) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551−67. (6) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 4 (7), 1985−1988. (7) Bairoch, A.; Apweiler, R.; Wu, C. H.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M. The universal protein resource (UniProt). Nucleic Acids Res. 2005, 33 (suppl 1), D154−D159. (8) Omenn, G. S.; States, D. J.; Adamski, M.; Blackwell, T. W.; Menon, R.; Hermjakob, H.; Apweiler, R.; Haab, B. B.; Simpson, R. J. 
Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database. Proteomics 2005, 5 (13), 3226−3245. (9) Chinese Human Liver Proteome Profiling Consortium. First Insight into the Human Liver Proteome from PROTEOMESKYLIVERHu 1.0, a Publicly Available Database. J. Proteome Res. 2010, 9 (1), 79−94. (10) Hamacher, M.; Apweiler, R.; Arnold, G.; Becker, A.; Bluggel, M.; Carrette, O.; Colvis, C.; Dunn, M. J.; Frohlich, T.; Fountoulakis, M.; van Hall, A.; Herberg, F.; Ji, J.; Kretzschmar, H.; Lewczuk, P.; Lubec, G.; Marcus, K.; Martens, L.; Palacios Bustamante, N.; Park, Y. M.; Pennington, S. R.; Robben, J.; Stuhler, K.; Reidegeld, K. A.; Riederer, P.; Rossier, J.; Sanchez, J. C.; Schrader, M.; Stephan, C.; Tagle, D.; Thiele, H.; Wang, J.; Wiltfang, J.; Yoo, J. S.; Zhang, C.; Klose, J.; Meyer,

C-HPP, Chromosome-Centric Human Proteome Project; OMIM, Online Mendelian Inheritance in Man; CGC, Cancer Gene Census; CAPER, Chromosome-Assembled human Proteome browsER; HPA, Human ProteinAtlas



REFERENCES

(1) Lander, E. S.; Linton, L. M.; Birren, B.; Nusbaum, C.; Zody, M. C.; Baldwin, J.; Devon, K.; Dewar, K.; Doyle, M.; FitzHugh, W.; Funke, R.; Gage, D.; Harris, K.; Heaford, A.; Howland, J.; Kann, L.; Lehoczky, J.; LeVine, R.; McEwan, P.; McKernan, K.; Meldrim, J.; Mesirov, J. P.; Miranda, C.; Morris, W.; Naylor, J.; Raymond, C.; Rosetti, M.; Santos, R.; Sheridan, A.; Sougnez, C.; Stange-Thomann, N.; Stojanovic, N.; Subramanian, A.; Wyman, D.; Rogers, J.; Sulston, J.; Ainscough, R.; Beck, S.; Bentley, D.; Burton, J.; Clee, C.; Carter, N.; Coulson, A.; Deadman, R.; Deloukas, P.; Dunham, A.; Dunham, I.; Durbin, R.; French, L.; Grafham, D.; Gregory, S.; Hubbard, T.; Humphray, S.; Hunt, A.; Jones, M.; Lloyd, C.; McMurray, A.; Matthews, L.; Mercer, S.; Milne, S.; Mullikin, J. C.; Mungall, A.; Plumb, R.; Ross, M.; Shownkeen, R.; Sims, S.; Waterston, R. H.; Wilson, R. K.; Hillier, L. W.; McPherson, J. D.; Marra, M. A.; Mardis, E. R.; Fulton, L. A.; Chinwalla, A. T.; Pepin, K. H.; Gish, W. R.; Chissoe, S. L.; Wendl, M. C.; Delehaunty, K. D.; Miner, T. L.; Delehaunty, A.; Kramer, J. B.; Cook, L. L.; Fulton, R. S.; Johnson, D. L.; Minx, P. J.; Clifton, S. W.; Hawkins, T.; Branscomb, E.; Predki, P.; Richardson, P.; Wenning, S.; Slezak, T.; Doggett, N.; Cheng, J. F.; Olsen, A.; Lucas, S.; Elkin, C.; Uberbacher, E.; Frazier, M.; Gibbs, R. A.; Muzny, D. M.; Scherer, S. E.; Bouck, J. B.; Sodergren, E. J.; Worley, K. C.; Rives, C. M.; Gorrell, J. H.; Metzker, M. L.; Naylor, S. L.; Kucherlapati, R. S.; Nelson, D. L.; Weinstock, G. M.; Sakaki, Y.; Fujiyama, A.; Hattori, M.; Yada, T.; Toyoda, A.; Itoh, T.; Kawagoe, C.; Watanabe, H.; Totoki, Y.; Taylor, T.; Weissenbach, J.; Heilig, R.; Saurin, W.; Artiguenave, F.; Brottier, P.; Bruls, T.; Pelletier, E.; Robert, C.; Wincker, P.; Smith, D. R.; Doucette-Stamm, L.; Rubenfield, M.; Weinstock, K.; Lee, H. 
M.; Dubois, J.; Rosenthal, A.; Platzer, M.; Nyakatura, G.; Taudien, S.; Rump, A.; Yang, H.; Yu, J.; Wang, J.; Huang, G.; Gu, J.; Hood, L.; Rowen, L.; Madan, A.; Qin, S.; Davis, R. W.; Federspiel, N. A.; Abola, A. P.; Proctor, M. J.; Myers, R. M.; Schmutz, J.; Dickson, M.; Grimwood, J.; Cox, D. R.; Olson, M. V.; Kaul, R.; Shimizu, N.; Kawasaki, K.; Minoshima, S.; Evans, G. A.; Athanasiou, M.; Schultz, R.; Roe, B. A.; Chen, F.; Pan, H.; Ramser, J.; Lehrach, H.; Reinhardt, R.; McCombie, W. R.; de la Bastide, M.; Dedhia, N.; Blocker, H.; Hornischer, K.; Nordsiek, G.; Agarwala, R.; Aravind, L.; Bailey, J. A.; Bateman, A.; Batzoglou, S.; Birney, E.; Bork, P.; Brown, D. G.; Burge, C. B.; Cerutti, L.; Chen, H. C.; Church, D.; Clamp, M.; Copley, R. R.; Doerks, T.; Eddy, S. R.; Eichler, E. E.; Furey, T. S.; Galagan, J.; Gilbert, J. G.; Harmon, C.; Hayashizaki, Y.; Haussler, D.; Hermjakob, H.; Hokamp, K.; Jang, W.; Johnson, L. S.; Jones, T. A.; Kasif, S.; Kaspryzk, A.; Kennedy, S.; Kent, W. J.; Kitts, P.; Koonin, E. V.; Korf, I.; Kulp, D.; Lancet, D.; Lowe, T. M.; McLysaght, A.; Mikkelsen, T.; Moran, J. V.; Mulder, N.; Pollara, V. J.; Ponting, C. P.; Schuler, G.; Schultz, J.; Slater, G.; Smit, A. F.; Stupka, E.; Szustakowski, J.; ThierryMieg, D.; Thierry-Mieg, J.; Wagner, L.; Wallis, J.; Wheeler, R.; Williams, A.; Wolf, Y. I.; Wolfe, K. H.; Yang, S. P.; Yeh, R. F.; Collins, F.; Guyer, M. S.; Peterson, J.; Felsenfeld, A.; Wetterstrand, K. A.; Patrinos, A.; Morgan, M. J.; de Jong, P.; Catanese, J. J.; Osoegawa, K.; Shizuya, H.; Choi, S.; Chen, Y. J. Initial sequencing and analysis of the human genome. Nature 2001, 409 (6822), 860−921. (2) Venter, J. C.; Adams, M. D.; Myers, E. W.; Li, P. W.; Mural, R. J.; Sutton, G. G.; Smith, H. O.; Yandell, M.; Evans, C. A.; Holt, R. A.; Gocayne, J. D.; Amanatides, P.; Ballew, R. M.; Huson, D. H.; Wortman, J. R.; Zhang, Q.; Kodira, C. D.; Zheng, X. H.; Chen, L.; Skupski, M.; Subramanian, G.; Thomas, P. D.; Zhang, J.; Gabor Miklos, G. 
L.; Nelson, C.; Broder, S.; Clark, A. G.; Nadeau, J.; McKusick, V. A.; Zinder, N.; Levine, A. J.; Roberts, R. J.; Simon, M.; Slayman, C.; Hunkapiller, M.; Bolanos, R.; Delcher, A.; Dew, I.; Fasulo, D.; Flanigan, M.; Florea, L.; Halpern, A.; Hannenhalli, S.; Kravitz, S.; Levy, S.; Mobarry, C.; Reinert, K.; Remington, K.; Abu-Threideh, J.; Beasley, E.; Biddick, K.; Bonazzi, V.; Brandon, R.; Cargill, M.; Chandramouliswaran, I.; Charlab, R.; Chaturvedi, K.; Deng, Z.; Francesco, V. D.; Dunn, P.; Eilbeck, K.; Evangelista, C.; Gabrielian, A. E.; Gan, W.; Ge, W.; Gong, F.; Gu, Z.;


H. E. HUPO Brain Proteome Project: summary of the pilot phase and introduction of a comprehensive data reprocessing strategy. Proteomics 2006, 6 (18), 4890−8. (11) Jaffe, J. D.; Berg, H. C.; Church, G. M. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics 2004, 4 (1), 59−77. (12) Paik, Y. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; Aebersold, R.; Bairoch, A.; Yamamoto, T.; Legrain, P.; Lee, H. J.; Na, K.; Jeong, S. K.; He, F.; Binz, P. A.; Nishimura, T.; Keown, P.; Baker, M. S.; Yoo, J. S.; Garin, J.; Archakov, A.; Bergeron, J.; Salekdeh, G. H.; Hancock, W. S. Standard guidelines for the chromosome-centric human proteome project. J. Proteome Res. 2012, 11 (4), 2005−13. (13) Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S. The ChromosomeCentric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30 (3), 221−3. (14) Ying, W.; Jiang, Y.; Guo, L.; Hao, Y.; Zhang, Y.; Wu, S.; Zhong, F.; Wang, J.; Shi, R.; Li, D.; Wan, P.; Li, X.; Wei, H.; Li, J.; Wang, Z.; Xue, X.; Cai, Y.; Zhu, Y.; Qian, X.; He, F. A dataset of human fetal liver proteome identified by subcellular fractionation and multiple protein separation and identification technology. Mol. Cell. Proteomics 2006, 5 (9), 1703−7. (15) Gregory, S. G.; Barlow, K. F.; McLay, K. E.; Kaul, R.; Swarbreck, D.; Dunham, A.; Scott, C. E.; Howe, K. L.; Woodfine, K.; Spencer, C. C.; Jones, M. C.; Gillson, C.; Searle, S.; Zhou, Y.; Kokocinski, F.; McDonald, L.; Evans, R.; Phillips, K.; Atkinson, A.; Cooper, R.; Jones, C.; Hall, R. E.; Andrews, T. D.; Lloyd, C.; Ainscough, R.; Almeida, J. P.; Ambrose, K. D.; Anderson, F.; Andrew, R. W.; Ashwell, R. 
I.; Aubin, K.; Babbage, A. K.; Bagguley, C. L.; Bailey, J.; Beasley, H.; Bethel, G.; Bird, C. P.; Bray-Allen, S.; Brown, J. Y.; Brown, A. J.; Buckley, D.; Burton, J.; Bye, J.; Carder, C.; Chapman, J. C.; Clark, S. Y.; Clarke, G.; Clee, C.; Cobley, V.; Collier, R. E.; Corby, N.; Coville, G. J.; Davies, J.; Deadman, R.; Dunn, M.; Earthrowl, M.; Ellington, A. G.; Errington, H.; Frankish, A.; Frankland, J.; French, L.; Garner, P.; Garnett, J.; Gay, L.; Ghori, M. R.; Gibson, R.; Gilby, L. M.; Gillett, W.; Glithero, R. J.; Grafham, D. V.; Griffiths, C.; Griffiths-Jones, S.; Grocock, R.; Hammond, S.; Harrison, E. S.; Hart, E.; Haugen, E.; Heath, P. D.; Holmes, S.; Holt, K.; Howden, P. J.; Hunt, A. R.; Hunt, S. E.; Hunter, G.; Isherwood, J.; James, R.; Johnson, C.; Johnson, D.; Joy, A.; Kay, M.; Kershaw, J. K.; Kibukawa, M.; Kimberley, A. M.; King, A.; Knights, A. J.; Lad, H.; Laird, G.; Lawlor, S.; Leongamornlert, D. A.; Lloyd, D. M.; Loveland, J.; Lovell, J.; Lush, M. J.; Lyne, R.; Martin, S.; Mashreghi-Mohammadi, M.; Matthews, L.; Matthews, N. S.; McLaren, S.; Milne, S.; Mistry, S.; Moore, M. J.; Nickerson, T.; O’Dell, C. N.; Oliver, K.; Palmeiri, A.; Palmer, S. A.; Parker, A.; Patel, D.; Pearce, A. V.; Peck, A. I.; Pelan, S.; Phelps, K.; Phillimore, B. J.; Plumb, R.; Rajan, J.; Raymond, C.; Rouse, G.; Saenphimmachak, C.; Sehra, H. K.; Sheridan, E.; Shownkeen, R.; Sims, S.; Skuce, C. D.; Smith, M.; Steward, C.; Subramanian, S.; Sycamore, N.; Tracey, A.; Tromans, A.; Van Helmond, Z.; Wall, M.; Wallis, J. M.; White, S.; Whitehead, S. L.; Wilkinson, J. E.; Willey, D. L.; Williams, H.; Wilming, L.; Wray, P. W.; Wu, Z.; Coulson, A.; Vaudin, M.; Sulston, J. E.; Durbin, R.; Hubbard, T.; Wooster, R.; Dunham, I.; Carter, N. P.; McVean, G.; Ross, M. T.; Harrow, J.; Olson, M. V.; Beck, S.; Rogers, J.; Bentley, D. R.; Banerjee, R.; Bryant, S. P.; Burford, D. C.; Burrill, W. D.; Clegg, S. M.; Dhami, P.; Dovey, O.; Faulkner, L. M.; Gribble, S. M.; Langford, C. F.; Pandian, R. 
D.; Porter, K. M.; Prigmore, E. The DNA sequence and biological annotation of human chromosome 1. Nature 2006, 441 (7091), 315−21. (16) Lee, L.; Wang, K.; Li, G.; Xie, Z.; Wang, Y.; Xu, J.; Sun, S.; Pocalyko, D.; Bhak, J.; Kim, C.; Lee, K. H.; Jang, Y. J.; Yeom, Y. I.; Yoo, H. S.; Hwang, S. Liverome: a curated database of liver cancer-related gene signatures with self-contained context information. BMC Genomics 2011, 12 (Suppl 3), S3. (17) Tomlinson, G. E.; Douglass, E. C.; Pollock, B. H.; Finegold, M. J.; Schneider, N. R. Cytogenetic evaluation of a large series of hepatoblastomas: numerical abnormalities with recurring aberrations involving 1q12-q21. Genes Chromosomes Cancer 2005, 44 (2), 177−84.

(18) Sy, S. M.; Wong, N.; Lai, P. B.; To, K. F.; Johnson, P. J. Regional over-representations on chromosomes 1q, 3q and 7q in the progression of hepatitis B virus-related hepatocellular carcinoma. Mod. Pathol. 2005, 18 (5), 686−92.
(19) Kim, T. M.; Yim, S. H.; Shin, S. H.; Xu, H. D.; Jung, Y. C.; Park, C. K.; Choi, J. Y.; Park, W. S.; Kwon, M. S.; Fiegler, H.; Carter, N. P.; Rhyu, M. G.; Chung, Y. J. Clinical implication of recurrent copy number alterations in hepatocellular carcinoma and putative oncogenes in recurrent gains on 1q. Int. J. Cancer 2008, 123 (12), 2808−15.
(20) Ma, N. F.; Hu, L.; Fung, J. M.; Xie, D.; Zheng, B. J.; Chen, L.; Tang, D. J.; Fu, L.; Wu, Z.; Chen, M.; Fang, Y.; Guan, X. Y. Isolation and characterization of a novel oncogene, amplified in liver cancer 1, within a commonly amplified region at 1q21 in hepatocellular carcinoma. Hepatology 2008, 47 (2), 503−10.
(21) Inagaki, Y.; Yasui, K.; Endo, M.; Nakajima, T.; Zen, K.; Tsuji, K.; Minami, M.; Tanaka, S.; Taniwaki, M.; Itoh, Y.; Arii, S.; Okanoue, T. CREB3L4, INTS3, and SNAPAP are targets for the 1q21 amplicon frequently detected in hepatocellular carcinoma. Cancer Genet. Cytogenet. 2008, 180 (1), 30−6.
(22) Peng, Z.; Zhang, F.; Zhou, C.; Ling, Y.; Bai, S.; Liu, W.; Qiu, G.; He, L.; Wang, L.; Wei, D.; Lin, E.; Xie, K. Genome-wide search for loss of heterozygosity in Chinese patients with sporadic colorectal cancer. Int. J. Gastrointest. Cancer 2003, 34 (1), 39−48.
(23) Zhou, C. Z.; Qiu, G. Q.; Zhang, F.; He, L.; Peng, Z. H. Loss of heterozygosity on chromosome 1 in sporadic colorectal carcinoma. World J. Gastroenterol. 2004, 10 (10), 1431−5.
(24) Silverberg, M. S.; Cho, J. H.; Rioux, J. D.; McGovern, D. P.; Wu, J.; Annese, V.; Achkar, J. P.; Goyette, P.; Scott, R.; Xu, W.; Barmada, M. M.; Klei, L.; Daly, M. J.; Abraham, C.; Bayless, T. M.; Bossa, F.; Griffiths, A. M.; Ippoliti, A. F.; Lahaie, R. G.; Latiano, A.; Pare, P.; Proctor, D. D.; Regueiro, M. D.; Steinhart, A. H.; Targan, S. R.; Schumm, L. P.; Kistner, E. O.; Lee, A. T.; Gregersen, P. K.; Rotter, J. I.; Brant, S. R.; Taylor, K. D.; Roeder, K.; Duerr, R. H. Ulcerative colitis-risk loci on chromosomes 1p36 and 12q15 found by genome-wide association study. Nat. Genet. 2009, 41 (2), 216−20.
(25) Fijneman, R. J.; Carvalho, B.; Postma, C.; Mongera, S.; van Hinsbergh, V. W.; Meijer, G. A. Loss of 1p36, gain of 8q24, and loss of 9q34 are associated with stroma percentage of colorectal cancer. Cancer Lett. 2007, 258 (2), 223−9.
(26) Haimi, M.; Iancu, T. C.; Shaffer, L. G.; Lerner, A. Severe lysosomal storage disease of liver in del(1)(p36): a new presentation. Eur. J. Med. Genet. 2011, 54 (3), 209−13.
(27) Carvalho, R.; Milne, A. N.; Polak, M.; Corver, W. E.; Offerhaus, G. J.; Weterman, M. A. Exclusion of RUNX3 as a tumour-suppressor gene in early-onset gastric carcinomas. Oncogene 2005, 24 (56), 8252−8.
(28) Bae, S. C.; Choi, J. K. Tumor suppressor activity of RUNX3. Oncogene 2004, 23 (24), 4336−40.
(29) Goel, A.; Arnold, C. N.; Tassone, P.; Chang, D. K.; Niedzwiecki, D.; Dowell, J. M.; Wasserman, L.; Compton, C.; Mayer, R. J.; Bertagnolli, M. M.; Boland, C. R. Epigenetic inactivation of RUNX3 in microsatellite unstable sporadic colon cancers. Int. J. Cancer 2004, 112 (5), 754−9.
(30) Zhou, C. Z.; Qiu, G. Q.; Wang, X. L.; Fan, J. W.; Tang, H. M.; Sun, Y. H.; Wang, Q.; Huang, F.; Yan, D. W.; Li, D. W.; Peng, Z. H. Screening of tumor suppressor genes on 1q31.1−32.1 in Chinese patients with sporadic colorectal cancer. Chin. Med. J. (Engl. Ed.) 2008, 121 (24), 2479−86.
(31) Park, J. G.; Lee, J. H.; Kang, M. S.; Park, K. J.; Jeon, Y. M.; Lee, H. J.; Kwon, H. S.; Park, H. S.; Yeo, K. S.; Lee, K. U.; et al. Characterization of cell lines established from human hepatocellular carcinoma. Int. J. Cancer 1995, 62 (3), 276−82.
(32) Park, J. G.; Lee, S. K.; Hong, I. G.; Kim, H. S.; Lim, K. H.; Choe, K. J.; Kim, W. H.; Kim, Y. I.; Tsuruo, T.; Gottesman, M. M. MDR1 gene expression: its effect on drug resistance to doxorubicin in human hepatocellular carcinoma cell lines. J. Natl. Cancer Inst. 1994, 86 (9), 700−5.
(33) Li, Y.; Tian, B.; Yang, J.; Zhao, L.; Wu, X.; Ye, S. L.; Liu, Y. K.; Tang, Z. Y. Stepwise metastatic human hepatocellular carcinoma cell model system with multiple metastatic potentials established through


consecutive in vivo selection and studies on metastatic characteristics. J. Cancer Res. Clin. Oncol. 2004, 130 (8), 460−8.
(34) Trainer, D. L.; Kline, T.; McCabe, F. L.; Faucette, L. F.; Feild, J.; Chaikin, M.; Anzano, M.; Rieman, D.; Hoffstein, S.; Li, D. J.; et al. Biological characterization and oncogene expression in human colorectal carcinoma cell lines. Int. J. Cancer 1988, 41 (2), 287−96.
(35) Brattain, M. G.; Fine, W. D.; Khaled, F. M.; Thompson, J.; Brattain, D. E. Heterogeneity of malignant cells from a human colonic carcinoma. Cancer Res. 1981, 41 (5), 1751.
(36) Barranco, S. C.; Townsend, C. M., Jr.; Casartelli, C.; Macik, B. G.; Burger, N. L.; Boerwinkle, W. R.; Gourley, W. K. Establishment and characterization of an in vitro model system for human adenocarcinoma of the stomach. Cancer Res. 1983, 43 (4), 1703−9.
(37) Lin, C. H.; Fu, Z. M.; Liu, Y. L.; Yang, J. L.; Xu, J. F.; Chen, Q. S.; Chen, H. M. Investigation of SGC-7901 cell line established from human gastric carcinoma cells. Chin. Med. J. (Engl. Ed.) 1984, 97 (11), 831−4.
(38) Li, N.; Guo, R.; Li, W.; Shao, J.; Li, S.; Zhao, K.; Chen, X.; Xu, N.; Liu, S.; Lu, Y. A proteomic investigation into a human gastric cancer cell line BGC823 treated with diallyl trisulfide. Carcinogenesis 2006, 27 (6), 1222−1231.
(39) Xing, R.; Li, W.; Cui, J.; Zhang, J.; Kang, B.; Wang, Y.; Wang, Z.; Liu, S.; Lu, Y. Gastrokine 1 induces senescence through p16/Rb pathway activation in gastric cancer cells. Gut 2012, 61 (1), 43−52.
(40) Kang, B.; Hao, C.; Wang, H.; Zhang, J.; Xing, R.; Shao, J.; Li, W.; Xu, N.; Lu, Y.; Liu, S. Evaluation of hepatic-metastasis risk of colorectal cancer upon the protein signature of PI3K/AKT pathway. J. Proteome Res. 2008, 7 (8), 3507−3515.
(41) Xu, P.; Duong, D. M.; Seyfried, N. T.; Cheng, D.; Xie, Y.; Robert, J.; Rush, J.; Hochstrasser, M.; Finley, D.; Peng, J. Quantitative proteomics reveals the function of unconventional ubiquitin chains in proteasomal degradation. Cell 2009, 137 (1), 133−45.
(42) Song, C.; Ye, M.; Han, G.; Jiang, X.; Wang, F.; Yu, Z.; Chen, R.; Zou, H. Reversed-phase-reversed-phase liquid chromatography approach with high orthogonality for multidimensional separation of phosphopeptides. Anal. Chem. 2010, 82 (1), 53−6.
(43) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24 (21), 2534−6.
(44) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Farrah, T.; Lam, H.; Tasman, N.; Sun, Z.; Nilsson, E.; Pratt, B.; Prazen, B.; Eng, J. K.; Martin, D. B.; Nesvizhskii, A. I.; Aebersold, R. A guided tour of the Trans-Proteomic Pipeline. Proteomics 2010, 10 (6), 1150−1159.
(45) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207−14.
(46) Li, N.; Wu, S.; Zhang, C.; Chang, C.; Zhang, J.; Ma, J.; Li, L.; Qian, X.; Xu, P.; Zhu, Y.; He, F. PepDistiller: A quality control tool to improve the sensitivity and accuracy of peptide identifications in shotgun proteomics. Proteomics 2012, 12 (11), 1720−5.
(47) Schwanhausser, B.; Busse, D.; Li, N.; Dittmar, G.; Schuchhardt, J.; Wolf, J.; Chen, W.; Selbach, M. Global quantification of mammalian gene expression control. Nature 2011, 473 (7347), 337−342.
(48) Kim, Y. H.; Liang, H.; Liu, X.; Lee, J. S.; Cho, J. Y.; Cheong, J. H.; Kim, H.; Li, M.; Downey, T. J.; Dyer, M. D.; Sun, Y.; Sun, J.; Beasley, E. M.; Chung, H. C.; Noh, S. H.; Weinstein, J. N.; Liu, C. G.; Powis, G. AMPKalpha modulation in cancer progression: multilayer integrative analysis of the whole transcriptome in Asian gastric cancer. Cancer Res. 2012, 72 (10), 2512−21.
(49) Huang, Q.; Lin, B.; Liu, H.; Ma, X.; Mo, F.; Yu, W.; Li, L.; Li, H.; Tian, T.; Wu, D.; Shen, F.; Xing, J.; Chen, Z. N. RNA-Seq analyses generate comprehensive transcriptomic landscape and reveal complex transcript patterns in hepatocellular carcinoma. PLoS One 2011, 6 (10), e26168.
(50) The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012, 487 (7407), 330−7.
(51) Khamas, A.; Ishikawa, T.; Shimokawa, K.; Mogushi, K.; Iida, S.; Ishiguro, M.; Mizushima, H.; Tanaka, H.; Uetake, H.; Sugihara, K. Screening for epigenetically masked genes in colorectal cancer using 5-Aza-2’-deoxycytidine, microarray and gene expression profile. Cancer Genomics Proteomics 2012, 9 (2), 67−75.
(52) Mortazavi, A.; Williams, B. A.; McCue, K.; Schaeffer, L.; Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 2008, 5 (7), 621−8.
(53) Uhlen, M.; Oksvold, P.; Fagerberg, L.; Lundberg, E.; Jonasson, K.; Forsberg, M.; Zwahlen, M.; Kampf, C.; Wester, K.; Hober, S.; Wernerus, H.; Bjorling, L.; Ponten, F. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 2010, 28 (12), 1248−50.
(54) Craig, R.; Cortens, J. P.; Beavis, R. C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004, 3 (6), 1234−42.
(55) Deutsch, E. W. The PeptideAtlas Project. Methods Mol. Biol. 2010, 604, 285−96.
(56) Wu, S.; Zhu, Y. ProPAS: standalone software to analyze protein properties. Bioinformation 2012, 8 (3), 167−9.
(57) Webb-Robertson, B. J.; Cannon, W. R.; Oehmen, C. S.; Shah, A. R.; Gurumoorthi, V.; Lipton, M. S.; Waters, K. M. A support vector machine model for the prediction of proteotypic peptides for accurate mass and time proteomics. Bioinformatics 2010, 26 (13), 1677−83.
(58) Eisen, M. B.; Spellman, P. T.; Brown, P. O.; Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 1998, 95 (25), 14863−8.
(59) Huang, D. W.; Sherman, B. T.; Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2008, 4 (1), 44−57.
(60) Huang, D. W.; Sherman, B. T.; Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37 (1), 1−13.
(61) Lercher, M. J.; Blumenthal, T.; Hurst, L. D. Coexpression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes. Genome Res. 2003, 13 (2), 238−243.
(62) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215 (3), 403−10.
(63) Lercher, M. J.; Urrutia, A. O.; Hurst, L. D. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat. Genet. 2002, 31 (2), 180−3.
(64) Krzywinski, M.; Schein, J.; Birol, I.; Connors, J.; Gascoyne, R.; Horsman, D.; Jones, S. J.; Marra, M. A. Circos: an information aesthetic for comparative genomics. Genome Res. 2009, 19 (9), 1639−45.
(65) Hurst, L. D.; Pal, C.; Lercher, M. J. The evolutionary dynamics of eukaryotic gene order. Nat. Rev. Genet. 2004, 5 (4), 299−310.
(66) Florens, L.; Washburn, M. P.; Raine, J. D.; Anthony, R. M.; Grainger, M.; Haynes, J. D.; Moch, J. K.; Muster, N.; Sacci, J. B.; Tabb, D. L.; Witney, A. A.; Wolters, D.; Wu, Y.; Gardner, M. J.; Holder, A. A.; Sinden, R. E.; Yates, J. R.; Carucci, D. J. A proteomic view of the Plasmodium falciparum life cycle. Nature 2002, 419 (6906), 520−6.
(67) Bettegowda, C.; Agrawal, N.; Jiao, Y.; Sausen, M.; Wood, L. D.; Hruban, R. H.; Rodriguez, F. J.; Cahill, D. P.; McLendon, R.; Riggins, G.; Velculescu, V. E.; Oba-Shinjo, S. M.; Marie, S. K.; Vogelstein, B.; Bigner, D.; Yan, H.; Papadopoulos, N.; Kinzler, K. W. Mutations in CIC and FUBP1 contribute to human oligodendroglioma. Science 2011, 333 (6048), 1453−5.
(68) Obayashi, T.; Kinoshita, K. COXPRESdb: a database to compare gene coexpression in seven model animals. Nucleic Acids Res. 2011, 39 (Database issue), D1016−22.


dx.doi.org/10.1021/pr3008286 | J. Proteome Res. 2013, 12, 67−80