Perspective pubs.acs.org/jpr
Decoding the Disease-Associated Proteins Encoded in the Human Chromosome 4
Lien-Chin Chen,†,○ Mei-Ying Liu,‡,○ Yung-Chin Hsiao,∥ Wai-Kok Choong,† Hsin-Yi Wu,§ Wen-Lian Hsu,† Pao-Chi Liao,*,¶ Ting-Yi Sung,*,† Shih-Feng Tsai,*,#,‡,⊥ Jau-Song Yu,*,∇ and Yu-Ju Chen*,§,×
† Institute of Information Science, Academia Sinica, Taipei, Taiwan
‡ Genome Research Center, National Yang-Ming University, Taipei, Taiwan
∥ Molecular Medicine Research Center, Chang Gung University, Tao-Yuan, Taiwan
§ Institute of Chemistry, Academia Sinica, Taipei, Taiwan
× Department of Chemistry, National Taiwan University, Taipei, Taiwan
¶ Department of Environmental and Occupational Health, College of Medicine, National Cheng Kung University, Tainan, Taiwan
⊥ Institute of Molecular and Genomic Medicine, National Health Research Institutes, Hsin-Chu, Taiwan
# Department of Life Sciences and Institute of Genome Sciences, National Yang-Ming University, Taipei, Taiwan
∇ Molecular Medicine Research Center; Graduate Institute of Biomedical Sciences, College of Medicine, Chang Gung University, Tao-Yuan, Taiwan
Supporting Information
ABSTRACT: Chromosome 4 is the fourth largest chromosome, containing approximately 191 megabases (∼6.4% of the human genome) with 757 protein-coding genes. A number of disease marker genes have been found on this chromosome, including genes associated with genetic diseases (e.g., hepatocellular carcinoma) and genes relevant to biomedical research on the cardiac system, aging, metabolic disorders, the immune system, cancer, and stem cells (e.g., oncogenes and growth factors). As a pilot study for the chromosome 4-centric human proteome project (Chr 4-HPP), we present here a systematic analysis of the disease associations, protein isoforms, and coding single nucleotide polymorphisms of these 757 protein-coding genes and of their experimental evidence at the protein level. We also describe how the findings from the chromosome 4 project might be used to drive biomarker discovery and validation studies in disease-oriented projects, using secretomic and membrane proteomic approaches in cancer research as examples. By integrating cancer cell secretomes with several other existing public databases, we identified 141 chromosome 4-encoded proteins as cancer cell-secretable/sheddable proteins. We also identified 54 chromosome 4-encoded proteins that have been classified as cancer-associated proteins and for which selected or multiple reaction monitoring (SRM/MRM) assays have been successfully developed. From literature annotation and topology analysis, 271 proteins were recognized as membrane proteins, while 27.9% of the 757 proteins do not have any experimental evidence at the protein level. In summary, the analysis revealed that chromosome 4 is a rich resource of cancer-associated proteins for biomarker verification projects and for drug target discovery projects.
KEYWORDS: chromosome-centric human proteome project, chromosome 4, cancer biomarker, MRM, SRM
Special Issue: Chromosome-centric Human Proteome Project
Received: August 31, 2012
Published: December 20, 2012
© 2012 American Chemical Society
dx.doi.org/10.1021/pr300829r | J. Proteome Res. 2013, 12, 33−44

With the successful completion of the Human Genome Project and advancement of the emerging proteomics technologies, the Human Proteome Organization (HUPO) launched a global Human Proteome Project (HPP) with attempts to map the entire set of the estimated 20300 human protein-coding genes, approximately 30% of which do not have experimental evidence at the protein-level (up to 2011).1 The Chromosome-Centric Human Proteome Project (C-HPP) was launched with the long-term goal of comprehensive annotation of the expression, subcellular localization, interaction network and function of the full set of proteins encoded on each chromosome.2 Many national chromosome-centric initiatives have been undertaken using complementary proteomics approaches for the intensive investigation of how many proteins are encoded by the human genome and what the functions of these proteins are. Specifically, in this study we have integrated the following biological resources for analysis: Ensembl3 v68 (release July 2012) for the number of genes, neXtProt4 (release September 11, 2012), GPMdb5 Guide to the Human Proteome (release October 1, 2012), PeptideAtlas6 latest human build (release October 2012) for the number and the list of proteins confidently identified from MS data sets with 1% FDR at protein level, and Human Protein Atlas (HPA)7 version 10.0 (release September 12, 2012) for antibody-based protein identifications.
■
FEATURES OF HUMAN CHROMOSOME 4

Human chromosome 4 is the fourth-largest human chromosome and, according to Ensembl, contains over 191 Mb, which represents about 6% of the human genome sequence. A total of 2488 genes on chromosome 4 were predicted by Ensembl, including 757 protein-coding genes, 1036 genes of noncoding RNAs, and 695 pseudogenes. Table 1 shows the gene count of each gene biotype on chromosome 4.
Figure 1. Summary of protein isoforms encoded by chromosome 4. The 757 protein-coding genes were mapped to their corresponding protein isoforms in the neXtProt protein database.
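The isoform distribution summarized in Figure 1 can be checked with a few lines of arithmetic. The sketch below is only illustrative: the gene counts (445, 160, 73, 43, and 36 genes with 1, 2, 3, 4, and ≥5 isoforms, and 159 single-isoform genes without protein-level evidence) are the values reported in the text, while the variable and function names are assumptions introduced for this example.

```python
# Number of chromosome 4 protein-coding genes by isoform count,
# using the figures quoted in the text (names here are illustrative).
ISOFORM_GENE_COUNTS = {"1": 445, "2": 160, "3": 73, "4": 43, ">=5": 36}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# The five isoform classes should account for every protein-coding gene.
total_genes = sum(ISOFORM_GENE_COUNTS.values())

# Share of the protein-coding genes falling in each isoform class.
isoform_shares = {
    label: percent(count, total_genes)
    for label, count in ISOFORM_GENE_COUNTS.items()
}

# Single-isoform genes that lack experimental evidence at the protein level.
single_isoform_without_evidence = percent(159, ISOFORM_GENE_COUNTS["1"])

if __name__ == "__main__":
    print(total_genes, isoform_shares, single_isoform_without_evidence)
```

Keeping the counts in one dictionary makes it easy to confirm that the classes add up to the 757 protein-coding genes before any percentage is quoted.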
Table 1. Summary of the Protein-Coding Genes, Noncoding RNAs, and Pseudogenes on the Human Chromosome 4 (Ensembl release 68)

gene biotype                no. of genes on chromosome 4
protein coding              757
pseudogene                  695
miRNA                       149
rRNA                        24
snRNA                       118
snoRNA                      56
misc_RNA                    105
polymorphic pseudogene      2
sense overlapping           7
sense_intronic              16
processed_transcript        52
antisense                   151
lincRNA                     356
total                       2488
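As a consistency check on Table 1, the per-biotype counts can be summed and compared with the totals quoted in the text. This is a minimal sketch: the numbers are those of Table 1, the dictionary and variable names are assumptions made for illustration, and the grouping of "other" biotypes is inferred from the arithmetic rather than stated by the authors.

```python
# Gene counts per Ensembl (release 68) biotype on chromosome 4, from Table 1.
CHR4_BIOTYPE_COUNTS = {
    "protein_coding": 757,
    "pseudogene": 695,
    "miRNA": 149,
    "rRNA": 24,
    "snRNA": 118,
    "snoRNA": 56,
    "misc_RNA": 105,
    "polymorphic_pseudogene": 2,
    "sense_overlapping": 7,
    "sense_intronic": 16,
    "processed_transcript": 52,
    "antisense": 151,
    "lincRNA": 356,
}

# Total number of predicted genes on chromosome 4.
total_genes = sum(CHR4_BIOTYPE_COUNTS.values())

# Everything that is neither protein coding nor a (plain) pseudogene.
# Assumption: this grouping reproduces the noncoding figure quoted in the text.
other_biotypes = sum(
    count
    for biotype, count in CHR4_BIOTYPE_COUNTS.items()
    if biotype not in ("protein_coding", "pseudogene")
)

if __name__ == "__main__":
    print(total_genes, other_biotypes)
```

With these values the biotypes sum to 2488, and the categories other than protein coding and pseudogene sum to 1036, matching the noncoding RNA gene count given in the text when the two polymorphic pseudogenes are counted in that group.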
Genetic variation and post-translational modification of proteins greatly enhance the diversity of the human proteome. Despite advances in proteomic technologies, specific protein isoforms remain seriously under-represented because of the analytical limitations of shotgun-based protein identification and the lack of isoform-specific antibodies. To explore the complexity of these protein variants, we also mapped the 757 protein-coding genes to their corresponding protein isoforms in the neXtProt protein database. Figure 1 shows the distribution of genes with different numbers of isoforms. Although genes with a single isoform form the largest group (n = 445), 35.7% (159/445) of them have no experimental evidence at the protein level. The analysis reveals that as many as 21.1% (n = 160), 9.6% (n = 73), and 5.7% (n = 43) of the 757 protein-coding genes have 2, 3, or 4 isoforms per gene, respectively, which highlights the need to develop proper tools to identify these protein isoforms experimentally. Only a limited number of genes have a very large number of protein isoforms (≥5 isoforms, n = 36).

■
DISEASE VIEW OF CHROMOSOME 4 AND PREVIOUS EFFORTS IN TAIWAN

On the basis of the “adopt-a-chromosome” strategy, a group of scientists from the Taiwan Proteomics Society (www.proteomics.org.tw) officially initiated the chromosome 4 project at the council meeting in 2012. The decision was made on the basis of two considerations. First, there has been substantial contribution from Taiwan on sequencing bacterial artificial chromosome (BAC) clones for the chromosome 4q region. Second, diseases associated with chromosome 4 have been targets of intensive research by Taiwanese scientists, for example, the genetic susceptibility to hepatocellular carcinoma.8,9

A number of important genes have been identified on chromosome 4, and these genes are associated with a variety of diseases, including genetic diseases, cancers, stem cells, and other biological functions.10 Among these genes, AFP, located on the q arm of chromosome 4 (4q25), is the clinically used tumor marker for hepatocellular carcinoma.11 The genetic diseases can be further classified into different types, including metabolic diseases (e.g., ADH gene clusters involved in alcohol metabolism),12 neurologic diseases (e.g., Huntington disease13 and Parkinson disease),14 musculoskeletal disorders (e.g., limb-girdle muscular dystrophy type 2E and achondroplasia15), and others. Defects in the same gene may also produce heterogeneous phenotypes. For example, the phenotypes of different mutations in the FGFR3 gene range from skeletal lesions (such as achondroplasia, hypochondroplasia, and Muenke syndrome) and dysplasia (e.g., thanatophoric dysplasia) to a variety of somatic cancers (such as colorectal and bladder cancers). Alterations in chromosome 4 have been identified in some cancers, including bladder cancer, colorectal cancer, leukemias of various types, and notably hepatocellular carcinoma (HCC).

Information from OMIM, a continuously updated public database of human genes, genetic disorders, and traits that focuses on the molecular relationship between genetic variation and phenotypic expression, and from other databases, such as the Cancer Genome Project16 and the Genetic Association Database,17,18 accelerates research on the molecular basis of these genes. Table 2 lists the summary of genes associated with diseases, including genetic disease, cancer, and stem cell on chromosome 4. As a complement of C-HPP to the biology- and disease-driven projects (B/D-HPP), the summary allows us to focus on selected disease/biology-driven targets for the Human Proteome Project. For example, the KIT gene19 encodes the human homologue of the proto-oncogene c-kit, also known as mast/stem cell growth factor receptor, which is a tyrosine kinase receptor. The c-kit isoforms are expressed in the hematopoietic system, the gastrointestinal system, melanocytes, and germ cells. The c-kit protein plays an important role in the regulation of cell survival and proliferation, hematopoiesis, melanogenesis, gametogenesis, stem cell maintenance, and mast cell development, migration, and function.20,21 Defects in the KIT gene are associated with a number of tumors, such as gastrointestinal stromal tumors (GISTs), melanoma, and leukemias. Most GISTs have been found to harbor gain-of-function mutations.22 Therefore, the c-kit protein is currently used as a tumor marker for the diagnosis of GISTs using immunohistochemical methods. Thus, the KIT gene is a good target for developing new therapeutic approaches.23−25

Among these diseases and genes, the functions and mechanisms of certain genes in disease, such as HCC-associated genes and the CISD2 gene, are under intensive investigation by several teams in Taiwan using multiple approaches, including genetics, genomics, and animal models. In Taiwan, HCC is one of the most common cancers and the main focus of national research projects. Studies on etiology showed that loss of heterozygosity of chromosome 4q is frequently found in human HCC, and these findings suggest that this region might be critical for HCC carcinogenesis.8,9,26,27 Through genomic and genetic approaches, several candidate genes involved in hepatocarcinogenesis, such as the PAPSS1 and PTPN13 genes, have been identified in this region.28,29 Continued efforts on HCC-related proteomics studies, using either the targeted gene-centric approach or the shotgun proteomics approach on HCC specimens, may uncover the molecular mechanisms of hepatocarcinogenesis and provide preventive and therapeutic approaches for future clinical applications.
Table 2. Summary of Genes Associated with Diseases, Including Genetic Diseases and Cancer, and Stem Cell on Chromosome 4 (a)

disease class: genetic diseases
genes: ADD1, ADH1A, ADH1B, ADH1C, ADH4, ADH5, ADH6, ADH7, AFP, AGA, AIMP1, ALB, ANK2, ANTXR2, BBS12, BBS7, BMPR1B, CC2D2A, CFI, CISD2, CNGA1, COQ2, CORIN, CYP4V2, DMP1, DOK7, DSPP, EDNRA, ENAM, ETFDH, EVC, EVC2, F11, FGB, FGFR3, FGG, FRAS1, GABRA2, GLRB, GNRHR, GRXCR1, HADHSC, HMX1, HPGD, HTT, IDUA, IGFBP7, IL2, KLKB1, LIAS, LRAT, MANBA, MAPK10, MDCMP, MFSD8, MMAA, MSX1, MTP, MYOZ2, NEK1, NKX3−2, NR3C2, PDE6B, PITX2, PKD2, PMX2B, PRDM5, PROM1, PRSS12, QDPR, SCARB2, SEPSECS, SGCB, SH3BP2, SLC25A4, SLC2A9, SLC34A2, SLC4A4, SNCA, SOD3, TACR3, TLL1, TLR3, WDR19, WFS1

disease class: cancer-related genes
genes: AFF1, AFP, AMBN, AREG, ARHH, CD38, CHIC2, CXCL10, DCK, DKK2, DUX4, EGF, EIF4E, FAT1, FBXW7, FGF2, FGF5, FGFR3, FIP1L1, FRYL, HNRNPD, HTRA3, IGFBP7, ING2, IL2, KDR, KIT, LEF1, MAD2L1, MAPK10, MLLT2, NFKB1, NPY1R, NR3C2, NUDT6, PAPSS1, PDGFRA, PHOX2B, PTPN13, RAP1GDS1, RASL11B, RASSF6, RCHY1, REST, RHOH, S100P, SLC34A2, SORBS2, SPP1, SYNPO2, TACC3, TET2, TLR2, WHSC1

disease class: stem cell-related genes
genes: ABCG2, BMP3, CCNA2, DUX4, ELOVL6, FGF2, KDR, KIT, LEF1, MSX1, NEUROG2, PITX2, POU4F2, PROM1, SFRP2

disease class: tumor markers currently being used
genes: AFP (Alpha-fetoprotein), FBA, FBB, FBG (Fibrin/Fibrinogen), KIT (KIT)

(a) The classification of genes associated with diseases and biological functions was based on the information from the OMIM database, the Cancer Genome Project, and the Genetic Association Database. Genetic diseases include metabolic, neurological, musculoskeletal, developmental, hematological, and other diseases.
■
SINGLE NUCLEOTIDE POLYMORPHISMS IN CHROMOSOME 4

Single-nucleotide polymorphisms (SNPs) have been explored as a high-resolution marker set for accelerating the mapping of disease genes.30 One of the 5-year midterm goals within the C-HPP is the mapping of splice variants and single nucleotide polymorphisms. On the basis of the Ensembl database, multiple structural variants and sequence variants, including SNPs, insertions, deletions, and mutations, have been identified and mapped to chromosome 4. SNPs may be located within genes, in protein-coding or noncoding regions, or in intergenic regions. Those located in protein-coding regions might produce synonymous as well as nonsynonymous (missense) amino acid changes. Some nonsynonymous changes might increase disease susceptibility or alter responsiveness to drug treatments. To study the effects of nonsynonymous changes on gene function, we selected coding SNPs (cSNPs) in genes on chromosome 4 as potential targets. Nonsynonymous cSNPs with an allelic frequency greater than 1% in the population were compiled. As an example, Figure 2 shows some cSNPs located within the 87−125 Mb region on chromosome 4, which is the candidate region for the investigation of hepatocarcinogenesis. This region contains a total of 304 nonsynonymous cSNPs in 98 genes, including genes associated with carcinogenesis and genes involved in other biological processes. A list of nonsynonymous cSNPs spanning the entire chromosome 4 can be found in Supporting Information Table S1.

Several ADH genes, which contain functional SNPs important to alcohol metabolism, are also located within this region (Figure 2).12 One of those genes is the ADH1B gene, which encodes the beta subunit of class I alcohol dehydrogenase (ADH), the enzyme responsible for metabolizing alcohol to acetaldehyde. A nonsynonymous SNP (rs1229984), which causes a substitution of histidine for arginine at codon 48 (p.Arg48His), was identified in the ADH1B gene. The frequency of the His48 allele varies among ethnic populations; it is high in eastern Asian populations and low in European, Sub-Saharan African, and Native American populations.31 The His48 variant is characterized by altered electrophoretic behavior and an increased Vmax of the enzyme.32 Compared to individuals with the Arg48 allele, individuals with the His48 allele have faster alcohol metabolism and increased formation of acetaldehyde, which causes uncomfortable symptoms. Thus, individuals with the His48 allele are more sensitive to alcohol and may be protected against alcohol dependence and alcohol-related diseases.33,34 The difference in the distribution of the His48 allele between populations might partially account for differences in the incidence of alcohol dependence between populations. This example of the His48 variant of ADH1B demonstrates the possible impact of a cSNP on the function of a protein and its potential pharmacogenetic implications. Thus, population-specific information on genetic variation may provide a rationale for the functional study of nonsynonymous SNPs (nsSNPs).

Figure 2. Diseases of interest on chromosome 4 and coding single nucleotide polymorphisms (cSNPs) within the selected region of chromosome 4 (87−125 Mb). Selected diseases of interest are shown at the left side of chromosome 4. The right panel shows selected genes located within the 87−125 Mb region, which are candidates for investigating hepatocarcinogenesis. Representative data are shown for nonsynonymous cSNP ID, SNP position, amino acid change, and allele frequency. The information on genotype and allele frequency was taken from dbSNP (build 135). The diagram of chromosome 4 was adapted from NCBI MapViewer.
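The selection rule described in this section (nonsynonymous coding SNPs with an allele frequency above 1%, restricted here to the 87−125 Mb candidate region) can be sketched as a simple filter. This is a minimal illustration rather than the authors' pipeline: the `Variant` record, the function name, the `"missense"` consequence label, and all example variants are hypothetical placeholders, not data from dbSNP or Supporting Information Table S1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one SNP on chromosome 4."""
    rsid: str
    gene: str
    position_bp: int          # chromosomal position in base pairs
    consequence: str          # e.g. "missense" or "synonymous"
    minor_allele_freq: float  # allele frequency as a fraction (0.01 = 1%)


def select_nonsynonymous_csnps(variants, start_bp, end_bp, min_freq=0.01):
    """Keep missense SNPs inside [start_bp, end_bp] with frequency > min_freq."""
    return [
        v for v in variants
        if v.consequence == "missense"
        and v.minor_allele_freq > min_freq
        and start_bp <= v.position_bp <= end_bp
    ]


# Illustrative placeholder variants (not real dbSNP entries).
EXAMPLE_VARIANTS = [
    Variant("rs_hypothetical_1", "GENE_A", 100_200_000, "missense", 0.20),
    Variant("rs_hypothetical_2", "GENE_A", 100_250_000, "synonymous", 0.35),
    Variant("rs_hypothetical_3", "GENE_B", 110_000_000, "missense", 0.004),
    Variant("rs_hypothetical_4", "GENE_C", 60_000_000, "missense", 0.15),
]

# Restrict to the 87-125 Mb candidate region discussed in the text.
selected = select_nonsynonymous_csnps(EXAMPLE_VARIANTS, 87_000_000, 125_000_000)

if __name__ == "__main__":
    for variant in selected:
        print(variant.rsid, variant.gene, variant.minor_allele_freq)
```

Only the first placeholder passes all three conditions: the second is synonymous, the third falls below the 1% frequency cutoff, and the fourth lies outside the region.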
■
EVIDENCE OF THE CHROMOSOME 4 ENCODED PROTEINS

As a first step to map the entire protein set on chromosome 4, all of the 2488 genes were mapped to the neXtProt, GPMdb, and PeptideAtlas databases to obtain updated information on the experimental evidence at the protein level. neXtProt defines five evidence levels for the existence of a protein: level 1, evidence at protein level, i.e., with clear experimental evidence; level 2, evidence at transcript level, i.e., with expression data indicating the existence of a transcript but without strict proof of the existence of a protein; level 3, inferred from homology, i.e., probable existence of a protein due to the existence of clear orthologs in closely related species; level 4, predicted, i.e., entries without evidence at the protein, transcript, or homology levels; and level 5, uncertain, i.e., the existence of the protein is uncertain. In addition, we defined genes that do not have any reviewed data in neXtProt as level 0. GPMdb provides four types of evidence: high quality evidence of translation, medium quality evidence of translation, low quality evidence of translation, and out of database or no credible evidence of translation. PeptideAtlas provides evidence of peptide identifications obtained from MS-based experiments, using the PeptideProphet probability to measure the confidence of peptide identifications. A higher PeptideProphet probability indicates higher confidence.

Figure 3 presents the genes in the region of interest, related to hepatocarcinogenesis and Wolfram syndrome (approximately from 87 to 125 Mb). To compare the protein evidence from different databases, each gene box is divided into left and right parts whose color codes represent the protein evidence provided by neXtProt and GPMdb, respectively.35 The color codes of neXtProt/GPMdb are as follows: green = evidence at protein level (level 1) in neXtProt/high quality evidence of translation in GPMdb; yellow = evidence at transcript level (level 2) in neXtProt/medium quality evidence of translation in GPMdb; red = inferred from homology (level 3) or predicted (level 4) in neXtProt/low quality evidence of translation in GPMdb; black = uncertain (level 5) or no reviewed data available in neXtProt (level 0)/no credible evidence of translation or out of database in GPMdb. Among the 757 protein-coding genes, about 95% of the evidence is identical between neXtProt and UniProt (Release 2012_08 of 5 September 2012). There are several well-known genes in this region, including CISD2 for Wolfram syndrome and ADH5, ADH4, ADH6, ADH1A, ADH1B, ADH1C, and ADH7 for alcohol metabolism. Among these ADH genes, the tissue-specific and temporal expression of the class I ADH genes has been shown to be regulated through a DNA looping mechanism governed by the conserved HNF1 binding site.36 These proteins are highlighted with an asterisk in Figure 3.

Among the 757 protein-coding genes, 72.1% (546 genes) have experimental evidence at the protein level (level 1), while an additional 21.8% (165 genes) and 0.9% (7 genes) have evidence only at the transcript level in human (level 2) or by homology to closely related species (level 3), respectively. A further 0.3% (2 genes) are only predicted (level 4) and 0.3% (2 genes) are uncertain (level 5). Moreover, 4.6% (35 genes) have no reviewed data in the neXtProt database (level 0). In summary, 27.9% (211/757) of the chromosome 4 protein-coding genes lack any protein-level evidence, which highlights the very limited knowledge of these under-represented chromosome 4 proteins and the urgent need for a systematic investigation by the C-HPP.

Figure 3. Overview and evidence level of the putative protein-coding genes on the selected region related to hepatocarcinogenesis and Wolfram syndrome (87−125 Mb) on human chromosome 4. The 300 genes have been color coded according to the experimental evidence of the corresponding protein in the Ensembl (release 68). The size of each box corresponds to the number of amino acids of the largest splice variant of the corresponding gene. Approximately 15 genes are shown across each horizontal line; the distances between genes are not drawn to scale. The chromosomal position from the left arm of the chromosome is shown to the left. For each gene, the left and right color codes represent the protein evidence provided by neXtProt and GPMdb, respectively. The color codes of neXtProt/GPMdb are as follows: green = evidence at protein level (level 1) in neXtProt/high quality evidence of translation in GPMdb; yellow = evidence at transcript level (level 2) in neXtProt/medium quality evidence of translation in GPMdb; red = inferred from homology (level 3) and predicted (level 4) in neXtProt/low quality evidence of translation in GPMdb; black = uncertain (level 5) and no reviewed data available in neXtProt (level 0)/no credible evidence of translation or out of database in GPMdb. Genes mentioned in the text are indicated with asterisks.
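The neXtProt evidence levels and the color coding used in Figure 3 amount to a small lookup table, and the percentages quoted in the text follow from the per-level gene counts. The sketch below is a hedged illustration: the level-to-color mapping and the counts (546, 165, 7, 2, 2, and 35 genes) are taken from the text, whereas the dictionary and function names are assumptions introduced here.

```python
# neXtProt evidence level -> color used for the neXtProt half of each gene box
# in Figure 3 (level 0 = no reviewed data in neXtProt, as defined in the text).
NEXTPROT_LEVEL_COLOR = {
    1: "green",   # evidence at protein level
    2: "yellow",  # evidence at transcript level
    3: "red",     # inferred from homology
    4: "red",     # predicted
    5: "black",   # uncertain
    0: "black",   # no reviewed data available
}

# Number of chromosome 4 protein-coding genes at each level, from the text.
GENES_PER_LEVEL = {1: 546, 2: 165, 3: 7, 4: 2, 5: 2, 0: 35}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


total_genes = sum(GENES_PER_LEVEL.values())

# Genes without protein-level evidence are all those not at level 1.
genes_without_protein_evidence = total_genes - GENES_PER_LEVEL[1]

# Tally genes by Figure 3 color.
genes_per_color = {}
for level, count in GENES_PER_LEVEL.items():
    color = NEXTPROT_LEVEL_COLOR[level]
    genes_per_color[color] = genes_per_color.get(color, 0) + count

if __name__ == "__main__":
    print(total_genes, genes_without_protein_evidence, genes_per_color)
    print(percent(genes_without_protein_evidence, total_genes))
```

Summing levels 2, 3, 4, 5, and 0 gives the 211 genes (27.9% of 757) reported as lacking protein-level evidence.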
■
CHROMOSOME 4 ENCODED PROTEINS AS POTENTIAL CANCER BIOMARKERS

Along with the C-HPP, the Biology/disease-driven Human Proteome Project, termed the B/D-HPP, is concomitantly being conducted to map the existence of each protein in specific cells, tissues, organs, or biofluids as well as to annotate their disease-associated content. Cancer is one of the most urgent topics in the Taiwan community. In the forthcoming era of genomic and proteomic medicine, a panel of protein markers may be expected to become a routine cancer diagnostic kit. Several national priority programs, such as the National Research Program for Genomic Medicine (NRPGM, 2002−2010) and the National Research Program for Biopharmaceuticals (NRPB, 2011−present), have been launched to focus on cancer-oriented topics, including genomic and proteomic studies of liver cancer. With the long-term objective of creating a cost-effective clinical care pathway, Taiwan has recently launched a national project called “Taiwan Biosignatures” through the joint efforts of medical centers, research institutes, and universities. To date, the most commonly applied pipeline for biomarker development involves a global discovery phase followed by validation of the potential biomarker candidates before a biomarker is eventually adopted as a clinical tool. However, biomarker discovery studies have produced few protein marker candidates suitable for clinical use because of the difficulty of large-scale verification and validation on patient specimens. In 2012, the Institute of Medicine (IOM, National Academy of Sciences, USA) published a report providing best practices and a guide along the entire pathway, from discovery to clinical trials, to assist the development of omics-based tests.37 The Taiwan Biosignatures project aims to discover, develop, and validate new biomarkers and to enable molecular technologies for prevention, early detection, and effective therapeutic interventions for diseases.

Typically, a multidisciplinary team with expertise in health economics, disease modeling, clinical research, bioinformatics, and analytical technology is assembled to first evaluate the most cost-effective node in the clinical care pathway, followed by comprehensive literature mining for biomarker candidate development and subsequent verification and validation on a large cohort of clinical specimens. It is expected that the disease biomarker and therapeutic target candidates, especially for cancer, identified from these cancer-related programs will provide rich information on the cancer proteome to complement the chromosome 4 project.

■
SECRETED PROTEINS AND MEMBRANE PROTEINS ON CHROMOSOME 4

It has long been recognized that living cells can secrete/shed proteins into the extracellular space, either constitutively or in response to environmental signals. In the human genome, secreted proteins have been estimated to account for about one-tenth of the human proteome, including molecules involved in diverse biological functions such as extracellular signaling, blood coagulation, immune defense, and carcinogenesis.38 Proteins can be secreted/shed from living eukaryotic cells (the secretome) in several different ways, and algorithms have been developed to predict protein secretion via either a signal peptide-triggered mechanism (the SignalP program with hidden Markov models)39,40 or a nonsignal peptide-triggered mechanism (the SecretomeP program).41 The protease-mediated shedding of fragment(s) of proteins with transmembrane helices, predictable using the TMHMM program,42 also contributes to the formation of the secretome. Body fluid-accessible biomarkers are important clinical tools for the management of various cancers. However, the number of body fluid-accessible cancer biomarkers currently approved by health agencies is very small. Very few biomarkers have actually been tested, and the sensitivity and/or specificity of those, as individual biomarkers, is in general not good enough for practical use in clinical settings.43 This reference also provides a list of 1261 protein candidate biomarkers and indicates that 5 of the 9 existing FDA-approved cancer biomarkers showed specificities of 90−98% but sensitivities ranging from 40 to 74.5%. Therefore, the authors of that study recommended looking at panels of proteins as biomarkers. Thus, the identification of good candidate biomarkers and panels of candidate biomarkers is urgently needed for the further development of more clinically useful, body fluid-accessible cancer biomarkers, which can help to curb the growing burden of cancer worldwide. Although good candidates for body fluid-accessible cancer biomarkers can be identified directly using body fluids from control and diseased subjects, the success of this strategy has been greatly hampered by the presence of numerous abundant proteins in most types of human body fluids, especially in serum/plasma.44 Cancer cells grown inside the human body are likely to secrete/shed proteins into the extracellular space (and ultimately into the circulation), and several cancer biomarkers currently used in clinics, including prostate-specific antigen, α-fetoprotein, carcinoembryonic antigen, CA125, and CA15-3, are proteins secreted/shed from cancer cells. A systematic targeted analysis of the cancer cell secretome has therefore been shown to be an alternative and promising strategy for finding good body fluid-accessible biomarker candidates for a variety of cancer types, which are distinct from host-response biomarker candidates.45−48 Our strategy was to identify potential cancer biomarkers among the 757 proteins encoded by chromosome 4 by using the cancer cell secretome and several other existing databases in the public domain.

We focused on the search for potential body fluid-accessible cancer biomarkers because these candidates represent ideal targets for the future development of effective tools for in vitro diagnosis/prognosis of cancer using noninvasive samples. To this end, we first identified chromosome 4-encoded proteins that can also be secreted/released by cancer cells by integrating the 757 protein data set with our previously published secretome data set (containing 4584 nonredundant proteins) from 23 cancer cell lines derived from 11 cancer types (1−3 cell lines per cancer type).49 This analysis revealed that 141 (18.6%, 141/757) chromosome 4-encoded proteins could be detected in the secretome of at least one cancer cell line (Supporting Information Table S2). Among these, 49 proteins (34.7%, 49/141) were detectable in the Human Plasma Proteome Project (HPPP). Moreover, 34 proteins (24.1%, 34/141) could be detected commonly in the conditioned media of at least 12 cell lines. By further integrating the list of 1381 cancer type-specific secreted/released proteins generated in our previous work,49 we identified 40 chromosome 4-encoded proteins that were uniquely detected in the secretome of one specific cancer type (Table 3). The expression levels of these 40 chromosome 4-encoded proteins in corresponding cancer tissues versus normal tissues were then retrieved from the Human Protein Atlas (HPA), and the results showed that 14 proteins are highly expressed (moderate to strong intensity) in ≥50% of the cancer tissues examined in the HPA (Table 3). For example, 92% of the HPA-examined bladder cancer tissues showed moderate to strong expression of 15-hydroxyprostaglandin dehydrogenase (HPGD), which was uniquely detected in the bladder cancer cell secretome;49 100% of the HPA-examined head and neck cancer tissues harbored moderate to strong expression of SEC24 related gene family member D (SEC24D), which was
cancer type: Bladder cancer; Breast cancer; Cervix cancer; Liver cancer; Lung cancer; Malignant lymphoma

accession no.    protein name
IPI00022448.4    C-X-C motif chemokine 10 precursor
IPI00305286.1    15-hydroxyprostaglandin dehydrogenase
IPI00383645.2    CGI-151 protein
IPI00007834.1    Isoform 1 of Ankyrin-2
IPI00155389.2    Isoform 2 of O-phosphoseryl-tRNA(Sec) selenium transferase
IPI00166468.2    Transmembrane anterior posterior transformation 1
IPI00019943.1    Afamin precursor
IPI00045106.3    Hedgehog-interacting protein precursor
IPI00012269.2    Multimerin-1 precursor
IPI00172580.4    Nucleoporin 54 kDa variant (Fragment)
IPI00024107.1    Isoform 1 of Alpha-synuclein
IPI00218180.2    Macrophage inflammatory protein 2-beta precursor
IPI00292936.4    C-X-C motif chemokine 5 precursor
IPI00027174.1    Isoform 1 of Fibroblast growth factor receptor 3 precursor
IPI00296645.3    Microsomal triglyceride transfer protein large subunit precursor
IPI00012540.1    Prominin-1 precursor
IPI00215621.1    Isoform 2 of Interleukin-8 precursor
IPI00296777.3    SPARC-like protein 1 precursor
IPI00016532.3    Uncharacterized protein C4orf27
IPI00022865.1    Cyclin-A2
IPI00004655.1    Protein FRG1
IPI00739940.4    Isoform 1 of Protein furry homologue-like
IPI00298949.1    Cyclin G-associated kinase
IPI00022239.7    Methionine aminopeptidase 1
IPI00303063.7    SCC-112 protein
IPI00016377.2    Isoform S of Ras-related protein Rab-28
IPI00008422.5    Isoform 2 of SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily A containing
IPI00002135.1    Transforming acidic coiled-coil-containing protein 3
IPI00164610.4    TBC1 domain family member 1
IPI00060419.3    TRAF2-binding protein
IPI00394676.1    Isoform 1 of Negative elongation factor A
IPI00004942.1    Zinc finger protein 330
negative (% of cases) positive rate (%) intensity
quantity (%)
detected in the HPPP
25 0 83 0 50
HPA031106 CAB025485 HPA036662 HPA044575 HPA016737 HPA022039 HPA041816 HPA045889 CAB009960 HPA015705
TACC3a TBC1D1 TIFAa WHSC2a ZNF330a
17 0 9 58 50
33
0 0
17
0
9 0 0
0
CAB000114
CAB026225 HPA043467
CAB011525
CAB004231
HPA012616 HPA035769 HPA035929
HPA017006
42
CCNA2a FRG1 FRYLa GAK METAP1a PDS5Aa RAB28 SMARCAD1a
PROM1 IL8 SPARCL1 C4orf27
MTTP
HHIP MMRN1 NUP54 SNCA CXCL3 CXCL5 FGFR3
TAPT1 AFM
SEPSECS
HPA004919
33 33 55 42 33
58 17 0 17
33
67
0 8
25
25
27 9 0
42
50
33 18 27 0 8
25 0 17 17
8
0
0 25
33
17
55 36 36
42
8
17 46 9 0 9
17 0 83 16
34
0
100 67
25
58
9 55 64
16
0
83 54 91 100 91
83 100 17 84
66
100
0 33
75
42
91 45 36
84
100
moderate moderate moderate moderate strong
moderate strong moderate strong
moderate
strong
negative weak
moderate
weak
moderate negative negative
moderate
strong
75 75−25
75−25 >75