Article pubs.acs.org/jpr

Identification and Characterization of Proteins Encoded by Chromosome 12 as Part of Chromosome-centric Human Proteome Project

Srikanth Srinivas Manda,†,‡ Raja Sekhar Nirujogi,†,‡ Sneha Maria Pinto,†,§ Min-Sik Kim,∥,⊥ Keshava K. Datta,†,# Ravi Sirdeshmukh,† T. S. Keshava Prasad,† Visith Thongboonkerd,▽ Akhilesh Pandey,∥,⊥,† and Harsha Gowda*,†

† Institute of Bioinformatics, International Technology Park, Bangalore 560066, India
‡ Centre of Excellence in Bioinformatics, Bioinformatics Centre, School of Life Sciences, Pondicherry University, Puducherry 605014, India
§ Manipal University, Madhav Nagar, Manipal 576104, India
∥ Department of Biological Chemistry and ⊥ McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, United States
# School of Biotechnology, KIIT University, Bhubaneswar, Odisha 751024, India
▽ Medical Proteomics Unit, Office for Research and Development, Faculty of Medicine Siriraj Hospital, and Center for Research in Complex Systems Science, Mahidol University, Bangkok 10700, Thailand

Supporting Information

ABSTRACT: The Chromosome-centric Human Proteome Project (C-HPP) is a global initiative to comprehensively characterize proteins encoded by genes across all human chromosomes, with teams focusing on individual chromosomes. Here, we report mass spectrometry-based identification and characterization of proteins encoded by genes on chromosome 12. Our study is based on proteomic profiling of 30 different histologically normal human tissues and cell types using high-resolution mass spectrometry. In our analysis, we identified 1,535 proteins encoded by 836 genes on human chromosome 12. This includes 89 genes that are designated as “missing proteins” by neXtProt, as they did not have any prior evidence either by mass spectrometry or by antibody-based detection methods. We identified several variant peptides that reflected coding SNPs annotated in the dbSNP database. We also confirmed the start sites of ∼200 proteins by identifying protein N-terminal acetylated peptides, and we identified alternative start sites for 11 proteins that were not annotated in public databases until now. Most importantly, we identified 12 novel protein coding regions on chromosome 12 using our proteogenomics strategy. All of the 12 regions have been annotated as pseudogenes in public databases. This study demonstrates that there is scope for significantly improving annotation of protein coding genes in the human genome using mass-spectrometry-derived data. Individual efforts as part of the C-HPP initiative should significantly contribute toward enriching human protein annotation. The data have been deposited to ProteomeXchange with identifier PXD000561.

KEYWORDS: proteomics, proteogenomics, non-coding RNA, pseudogenes, open reading frame



INTRODUCTION

The Chromosome-centric Human Proteome Project (C-HPP) is a global initiative to identify all of the proteins encoded by the human genome by teams focusing on individual chromosomes.1 The proposed goals include identification of at least one representative protein encoded by each gene along with characterization of its localization, alternate splice variants, nonsynonymous variant-containing peptides, and major posttranslational modifications using mass spectrometry and antibody-based methods.2 Under the C-HPP initiative, 25 different groups from across the world, working as part of the international consortium, would systematically annotate the human proteome. Chromosome 12 is being annotated by an international team consisting of members from Thailand, India, Singapore, and Taiwan.

Chromosome 12 has a length of ∼134 Mb, similar in size to chromosomes 10 and 11. It accounts for 4−5% of the entire DNA content of a cell. NCBI annotation release 104 lists 1,738 genes annotated on this chromosome, with 1,015 annotated as protein-coding, 490 annotated as pseudogenes, and the remaining annotated as genes for noncoding RNAs. neXtProt, a web-based resource developed by the Swiss Institute of Bioinformatics that integrates 13 different resources to provide a knowledge base for proteins,3 lists 188 proteins from chromosome 12 as “missing proteins” as there is no evidence from mass spectrometry or antibody-based detection methods. In this study, we report mass spectrometry-based identification of 1,535 proteins from 836 genes encoded by chromosome 12. As our analysis included multiple human tissues, it also provides insights into the expression pattern of these proteins. In addition, we identified several novel protein coding genes and also curated existing annotations using a proteogenomics approach.

Chromosome 12 contains several genes associated with various clinical conditions.4 One of the largest blocks of linkage disequilibrium in the human genome can be seen on the q-arm of this chromosome. It harbors a number of oncogenes and genes associated with various genetic disorders (Figure 1a). The Online Mendelian Inheritance in Man (OMIM) catalog reports more than 300 genes on chromosome 12 that are associated with various diseases. Supplementary Table 1 provides a list of genes from chromosome 12 and their associated disorders based on data from OMIM (http://www.omim.org). For example, mutations in von Willebrand factor (VWF) are known to be associated with von Willebrand disease,5 mutations in protein tyrosine phosphatase, nonreceptor type 11 (PTPN11) with Noonan syndrome, mutations in phenylalanine hydroxylase (PAH) with phenylketonuria, and mutations in glycogen synthase 2 (GYS2) with glycogen storage disease. Similarly, there are a number of genes on this chromosome that are associated with cancers. These include the KRAS oncogene, which is known to be mutated in various cancers,6 as well as several leukemias and lymphomas in which translocations of chromosome 12 to other chromosomes are observed.7,8 A number of gene clusters are reported on this chromosome, including the type II keratin gene cluster with 14 genes, the natural killer cell gene cluster with 9 genes, and the homeobox C gene cluster with 9 genes.8 Figure 1b shows the top 25 gene families on chromosome 12 based on HGNC gene families (http://www.genenames.org/genefamily.html). This makes chromosome 12 one of the chromosomes of prime interest for cancer biologists and geneticists.

© 2014 American Chemical Society
Received: November 19, 2013
Published: June 24, 2014
dx.doi.org/10.1021/pr401123v | J. Proteome Res. 2014, 13, 3166−3177



Figure 1. (a) Chromosome 12 ideogram showing loci of some of the genes known to be associated with genetic disorders and cancers. (b) Major gene families encoded by chromosome 12. The X-axis represents top 25 gene families on chromosome 12 as per HGNC classification, and the Y-axis represents the number of genes in each gene family. Each bar in the graph shows the proportion of identified proteins from each gene family.

EXPERIMENTAL SECTION

Sample Preparation and Analysis

As part of a comprehensive human proteome profiling study,9 we sampled 17 adult tissues, 7 fetal tissues, and 6 hematopoietic cell types (fetal tissues: heart, liver, gut, ovary, testis, brain, placenta; adult tissues: frontal cortex, spinal cord, retina, heart, liver, ovary, testis, lung, adrenal gland, gallbladder, pancreas, kidney, esophagus, colon, rectum, urinary bladder, prostate; hematopoietic cells: B cells, CD4+ T cells, CD8+ T cells, NK cells, monocytes, platelets). The tissues were histologically confirmed to be normal. This study was approved by the Johns Hopkins University’s Institutional Review Board for use of human tissues. Samples were pooled from three individuals per tissue type to account for heterogeneity and lysed using a filter-aided sample preparation protocol.10 Briefly, samples were lysed in 4% SDS, 0.1 M DTT, and 0.1 M Tris pH 7.4. Tissues were homogenized in lysis buffer, and the protein content of cleared lysates was estimated using the BCA assay. Protein samples were resolved on SDS-PAGE and processed for in-gel digestion as described earlier.11 In addition, lysates were buffer exchanged with 9 M urea lysis buffer using Millipore 30 kDa cutoff filters, alkylated, diluted 6 times with 20 mM TEABC, and subjected to trypsin digestion using a 1:20 enzyme-to-substrate ratio. Digested peptides were cleaned using Sep-Pak C18 cartridges, vacuum-dried, and fractionated using the basic reversed-phase liquid chromatography (bRPLC) fractionation method.12 The overall workflow employed is shown in Supplementary Figure 1.

LC−MS/MS Analysis

For each tissue, liquid chromatography and tandem mass spectrometry analysis was carried out using LTQ Orbitrap Velos and/or Orbitrap Elite mass spectrometers coupled with Easy nanoLC II. The peptide samples from each fraction were reconstituted in 0.1% formic acid and loaded onto a precolumn (2 cm, 5 μm particle and 300 Å pore size) using a flow rate of 5 μL/min. They were introduced into the mass spectrometer after being resolved on an analytical column using a flow rate of 350 nL/min and a gradient of 5% to 30% solvent B (0.1% formic acid and 90% ACN) for 70 min and 30% to 90% for 15 min. All of the columns were made in-house using C18 material (Michrom Bioscience, 5 μm, 100 Å). Peptides were introduced into the mass spectrometer using a Pico-tip emitter 10 ± 1 μm, and the heated capillary source was operated at 200 °C. Data were acquired in a data-dependent manner using Xcalibur 2.1 acquisition software. The top 15 precursor ions from the survey scans were targeted for MS/MS in the full scan range of m/z 350 to 2000. MS data were acquired at a resolution of 60,000 at m/z 400 on the Orbitrap Velos and 120,000 at m/z 400 on the Orbitrap Elite. Each precursor ion was fragmented by higher-energy collisional dissociation (HCD) using a normalized collision energy of 35% on the Orbitrap Velos and 32% on the Orbitrap Elite. MS/MS data were acquired at a resolution of 15,000 at m/z 400 on the Orbitrap Velos and 30,000 at m/z 400 on the Orbitrap Elite. Automatic gain control (AGC) was set to 1.0 × 10^6 ions for full scan and 1.0 × 10^5 for MS/MS. Monoisotopic precursor ion selection was enabled, and dynamic exclusion was set to 30 s with a 10 ppm mass window. Singly charged ions were rejected from fragmentation. Isolation width was set to 2.0 m/z. Real-time internal calibration was carried out using the polycyclodimethylsiloxane ion (m/z 445.120024)13 from ambient air.

Protein Database and Searching

Raw data were searched against the RefSeq 50 human protein database containing 33,833 proteins using Sequest and Mascot (version 2.2) search algorithms through the Proteome Discoverer 1.3 platform (Thermo Scientific, Bremen, Germany). Trypsin was used as the protease, allowing a maximum of two missed cleavages. Carbamidomethylation of cysteine was specified as a fixed modification, and oxidation of methionine, acetylation of protein N-termini, and cyclization of N-terminal glutamine and alkylated cysteine were included as variable modifications. The minimum peptide length was specified as 6 amino acids. The mass error of parent ions was set to 10 ppm, and for fragment ions it was set to 0.05 Da. The data were also searched against a decoy database to calculate the false discovery rate.14 Peptides that passed the 1% FDR threshold were used for protein identification. Protein inference was based on the rule of parsimony and required one or more unique peptides. Proteins identified only by ambiguous/non-unique peptides were not considered.

Custom Databases for Proteogenomics

We used six different databases for all of the proteogenomics analyses to search unassigned spectra from protein searches. The databases used were: (1) human genome translated in six frames, (2) three-frame translated noncoding RNAs from NONCODE,15 (3) three-frame translated RefSeq mRNA sequences, (4) three-frame translated pseudogene sequences from NCBI and the pseudogene database from the Gerstein laboratory, (5) a protein N-terminal peptide database based on RefSeq annotated sequences, and (6) a variant peptide database that incorporated cSNPs based on dbSNP. For each of the databases, a decoy database was created by reversing the peptide sequences from parent databases. Custom databases (e.g., 6-frame genome translation) created for proteogenomics analyses were quite large, and these could not be indexed using Proteome Discoverer and Mascot. Because of this, X!Tandem was used for all searches related to proteogenomics analysis, with the following search parameters: precursor mass error 10 ppm, fragment mass error 0.05 Da. Modifications were similar to those used in the protein searches. Proteome Discoverer 1.3 was used to export the unassigned peak lists in Mascot Generic Format (MGF). The MGF files were searched against the custom databases using X!Tandem and, for SNPs, using the SEQUEST search algorithm. Peptides that passed the 1% FDR threshold were considered for further analysis. Peptide identifications that unambiguously mapped to a single region in the genome were considered to perform proteogenomics-based annotation of novel coding regions. In addition to filtering using the statistical threshold, we manually verified and retained peptides for which MS/MS fragmentation could explain the identification. Synthetic peptides were used to verify MS/MS spectra of several novel peptides identified as part of our proteogenomics analysis.9

All of the databases were created in-house using Python scripts. The human reference genome assembly hg19 was downloaded from NCBI and translated into six reading frames. Peptide sequences from stop codon to stop codon greater than 6 amino acids were retained in the database. All of the mRNA sequences were downloaded from NCBI RefSeq (RefSeq version 56 containing 33,580 sequences) and translated in three reading frames. Similarly, pseudogene sequences from NCBI (11,160 sequences) and Gerstein’s pseudogene database (16,881 sequences from http://pseudogene.org/, version 68) and noncoding RNA sequences from NONCODE (91,687 sequences, version 3) were translated in three reading frames. The database containing cSNPs was based on dbSNP version 138.16

Functional Analysis of Identified Proteins

Molecular functions and primary localization information for all of the proteins were obtained from the Human Protein Reference Database, HPRD (http://www.hprd.org), a Gene Ontology compliant database containing manually curated protein annotations along with protein−protein interactions related to human proteins.17

Normalization of Spectral Counts

The following steps were followed for spectral counting:
1. The number of peptide spectrum matches for unique peptides mapping to a protein coding gene were summed.
2. Spectral counts from step 1 were divided by the total number of MS/MS spectra acquired in each experiment (e.g., bRPLC, SDS-PAGE).
3. The spectral counts per protein coding gene were then averaged across multiple experiments (e.g., bRPLC, SDS-PAGE) per tissue type (e.g., adult esophagus) to normalize.
4. Spectral counts from step 3 were used to plot the heat map.
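The spectral-count normalization steps listed under Normalization of Spectral Counts can be sketched in Python as follows. This is only an illustrative sketch, not the authors' in-house script: the function name, the input layout, the example numbers, and the choice to count a gene as zero in an experiment where it was not observed are all assumptions.

```python
from collections import defaultdict


def normalized_spectral_counts(experiments):
    """Return {tissue: {gene: normalized spectral count}}.

    `experiments` is a list of dicts (hypothetical layout) with keys:
      'tissue'     - tissue or cell type name
      'total_msms' - total MS/MS spectra acquired in that experiment
      'psm_counts' - {gene: PSMs from unique peptides, already summed (step 1)}
    """
    per_tissue = defaultdict(list)
    for exp in experiments:
        # Step 2: divide each gene's summed PSMs by the total MS/MS spectra.
        fractions = {
            gene: count / exp["total_msms"]
            for gene, count in exp["psm_counts"].items()
        }
        per_tissue[exp["tissue"]].append(fractions)

    result = {}
    for tissue, runs in per_tissue.items():
        genes = set().union(*runs)
        # Step 3: average across experiments of the same tissue.
        # Assumption: a gene not seen in an experiment contributes 0.
        result[tissue] = {
            gene: sum(run.get(gene, 0.0) for run in runs) / len(runs)
            for gene in genes
        }
    return result


# Made-up numbers, for illustration only.
example = [
    {"tissue": "adult esophagus", "total_msms": 1000,
     "psm_counts": {"KRT5": 50, "KRT6A": 20}},   # e.g., a bRPLC experiment
    {"tissue": "adult esophagus", "total_msms": 500,
     "psm_counts": {"KRT5": 10}},                # e.g., an SDS-PAGE experiment
]
normalized = normalized_spectral_counts(example)
```

The resulting per-tissue values are what step 4 would feed into a heat map; any plotting library could be used for that final step.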

RESULTS AND DISCUSSION

Identification of Proteins Encoded by Chromosome 12

We carried out comprehensive proteomic profiling of 30 histologically normal human tissues/cell types.9 For each tissue type, we pooled samples from three different healthy subjects to account for heterogeneity/inter-individual differences. MS/MS searches were carried out using both Mascot and Sequest search algorithms. Mapping these data to chromosome 12 resulted in identification of proteins encoded by 836 genes on chromosome 12. This accounts for more than 80% of protein coding genes currently annotated on chromosome 12. Eighty-eight percent of all peptides were identified by both search algorithms, while 7.3% were identified only by Mascot and 4.7% only by Sequest. More than 40% of the identified proteins have sequence coverage of greater than 60%, and about 40 of these proteins have 100% sequence coverage (Figure 2a). Sixty-six protein coding genes on chromosome 12 were identified based on single peptide evidence. A list of all of the proteins identified in this study can be found in Supplementary Table 2. Relative expression levels of all of the identified proteins across 30 tissues/cell types are depicted as normalized spectral counts in Supplementary Figure 2.

Protein expression profiles provide interesting insights into their potential function. For example, many of the type II keratins such as KRT2, KRT3, KRT5, KRT6A, KRT6B, and KRT6C are relatively highly expressed in esophagus, which is known to contain keratin as a major component in the epithelium. Proteins including ANO4, SYCP3, LTBR, and PTGES3 show relatively abundant expression in fetal tissues compared to corresponding adult tissues. Many of these genes have been previously described as developmentally regulated.18−20 Similarly, several proteins such as HVCN1, RAP1B, NCKAP1L, ANO6, and PTPN6 show a distinct expression pattern in hematopoietic cell types as compared to fetal and adult tissues (Figure 2b). HVCN1, a voltage-gated proton channel, is highly expressed in immune cells, as is RAP1B, which has been shown to play a distinct role in signaling in monocytes.21,22 PTPN6, a protein tyrosine phosphatase that also showed relatively abundant expression in hematopoietic cells in our data set, is primarily known to function as an important regulator of various signaling pathways in hematopoietic cells, and its altered expression plays a key role in leukemogenesis.23
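Sequence coverage of the kind summarized in Figure 2a can be computed by marking every residue of a protein that is covered by at least one identified peptide. The text does not describe the exact procedure used, so the following is a minimal sketch that assumes exact substring matching of peptides against the protein sequence; the function name and the toy sequence and peptides are hypothetical.

```python
def sequence_coverage(protein, peptides):
    """Percentage of residues in `protein` covered by at least one peptide.

    Peptides are located by exact substring matching (an assumption);
    every occurrence of a peptide is counted, and overlaps are not
    double counted.
    """
    if not protein:
        raise ValueError("protein sequence must not be empty")
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Toy 16-residue sequence and two toy peptides, for illustration only.
toy_protein = "MKTAYIAKQRQISFVK"
toy_peptides = ["TAYIAK", "QISFVK"]
coverage = sequence_coverage(toy_protein, toy_peptides)  # 12 of 16 residues
```

In this toy case 12 of the 16 residues are covered, giving 75% coverage; a protein reaches 100% only when every residue falls inside at least one identified peptide.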

Figure 2. (a) Sequence coverage for proteins identified on chromosome 12. All of the peptides identified for each protein were used to calculate the coverage. The X-axis shows protein coverage, and the Y-axis shows number of proteins. (b) Relative expression level of a subset of proteins across all tissues/cell types analyzed in the study. Red represents relatively higher expression, and yellow represents relatively lower expression. Proteins not detected in a tissue are represented in gray.

Identification of Missing Proteins in C-HPP

One of the major goals of C-HPP is to annotate all protein-coding genes in the human genome. neXtProt release 2013-08-17 lists approximately 5,000 human genes with no experimental evidence at the protein level, which it classifies as “missing proteins”.3 By sampling the proteome from 30 different human tissues, we identified 89 proteins classified as “missing proteins” on chromosome 12. Table 1 lists the genes on this chromosome that are currently in the “missing proteins” category and were identified in the current study. Figure 3 shows the expression pattern of these proteins across the human tissues sampled in this study. It is evident from the figure that many of these proteins are largely expressed in fetal tissues and immune cells. Failure to identify these proteins in the past is likely due to undersampling of these cell and tissue types. This warrants deep proteomic profiling to obtain better coverage of proteins expressed in these tissues/cell types.

Biological Functions Carried out by Proteins Encoded by Chromosome 12

One of the major classes of protein-coding genes on this chromosome belongs to solute carrier proteins, which include the “classical” transporter families such as ion-coupled transporters, exchangers, and passive transporters. There are 30 different solute carrier proteins that are encoded by this chromosome. The next major family of genes includes KRT1−5, KRT6A, KRT6B, and others that constitute the basic keratins of type II intermediate filaments. These form the major component of all epithelial cells and epidermal cells.24 A well-known NMDA receptor subunit (GRIN2B) and other synaptonemal complex proteins that are known to be involved in synaptic plasticity and higher brain function25 are also encoded by this chromosome. Chromosome 12 also harbors 9 genes from the HOXL family, which code for transcription factors that regulate the body plan of the embryo. A number of G-protein-coupled receptors (GPCRs) such as olfactory receptors and taste receptors are also encoded by this chromosome.26

We carried out bioinformatics analysis of all of the identified proteins on chromosome 12 to identify the major gene families among them. We could identify proteins from most of the major gene families on this chromosome, including the type II cytokeratin family (26 of 26), the solute carriers (26 of 30), and RNA binding motif containing proteins (10 of 11) (Figure 1b). The coverage was particularly poor for olfactory receptors and taste receptors, most likely because we did not sample the olfactory system. This again highlights the need for sampling tissues from the olfactory system. Figure 4a and b shows the distribution of identified proteins on chromosome 12 on the basis of their subcellular localization and molecular function. As is evident, chromosome 12 encodes predominantly nuclear proteins, followed by cytoplasmic and membrane-bound proteins. A large number of proteins encoded by this chromosome show enzymatic activity, for example, hydrolases, transferases, kinases, and phosphatases, among others. The cellular localization and function of a significant number of proteins remain unknown, which underscores the need for more studies to functionally characterize them.

Table 1. List of Proteins Designated “Missing Proteins” by neXtProt Identified in Our Study

No. | neXtProt ID | Entrez gene ID | gene symbol | description
1 | NX_A1L157 | 441631 | TSPAN11 | tetraspanin 11
2 | NX_A6NCE7 | 643246 | MAP1LC3B2 | microtubule-associated protein 1 light chain 3 beta 2
3 | NX_A6NE01 | 121006 | FAM186A | family with sequence similarity 186, member A
4 | NX_A6NFE2 | 341346 | SMCO2/C12orf70 | single-pass membrane protein with coiled-coil domains 2
5 | NX_A6NFT4 | 387885 | CCDC42B | coiled-coil domain containing 42B
6 | NX_A6NL08 | 390323 | OR6C75 | olfactory receptor, family 6, subfamily C, member 75
7 | NX_A6NMB9 | 401720 | FIGNL2 | fidgetin-like 2
8 | NX_A8MV81 | 613227 | HIGD1C | HIG1 hypoxia inducible domain family, member 1C
9 | NX_E9PGG2 | 647589 | ANHX | anomalous homeobox
10 | NX_G3V0H7 | 338821 | SLCO1B7 | solute carrier organic anion transporter family, member 1B7 (nonfunctional)
11 | NX_O15218 | 11318 | GPR182 | G protein-coupled receptor 182
12 | NX_O43908 | 8302 | KLRC4 | killer cell lectin-like receptor subfamily C, member 4
13 | NX_O75908 | 8435 | SOAT2 | sterol O-acyltransferase 2
14 | NX_O95626 | 23519 | ANP32D | acidic (leucine-rich) nuclear phosphoprotein 32 family, member D
15 | NX_P09630 | 3223 | HOXC6 | homeobox C6
16 | NX_P0C7M7 | 341392 | ACSM4 | acyl-CoA synthetase medium-chain family member 4
17 | NX_P21506 | 7556 | ZNF10 | zinc finger protein 10
18 | NX_P31275 | 3228 | HOXC12 | homeobox C12
19 | NX_P52738 | 7699 | ZNF140 | zinc finger protein 140
20 | NX_P59538 | 259290 | TAS2R31 | taste receptor, type 2, member 31
21 | NX_Q00444 | 3222 | HOXC5 | homeobox C5
22 | NX_Q06055 | 517 | ATP5G2 | ATP synthase, H+ transporting, mitochondrial Fo complex, subunit C2 (subunit 9)
23 | NX_Q16538 | 27239 | GPR162 | G protein-coupled receptor 162
24 | NX_Q2MV58 | 79600 | TCTN1 | tectonic family member 1
25 | NX_Q32M45 | 121601 | ANO4 | anoctamin 4
26 | NX_Q3ZCN5 | 283310 | OTOGL | otogelin-like
27 | NX_Q502W7 | 120935 | CCDC38 | coiled-coil domain containing 38
28 | NX_Q52MB2 | 387856 | C12orf68 | chromosome 12 open reading frame 68
29 | NX_Q53EV4 | 10233 | LRRC23 | leucine rich repeat containing 23
30 | NX_Q5BKT4 | 84920 | ALG10 | ALG10, alpha-1,2-glucosyltransferase
31 | NX_Q5BKY1 | 376132 | LRRC10 | leucine rich repeat containing 10
32 | NX_Q6IE36 | 144203 | OVOS2 | ovostatin 2
33 | NX_Q6PF18 | 283385 | MORN3 | MORN repeat containing 3
34 | NX_Q6XD76 | 121549 | ASCL4 | achaete-scute family bHLH transcription factor 4
35 | NX_Q6XYQ8 | 341359 | SYT10 | synaptotagmin X
36 | NX_Q6ZN79 | 440077 | ZNF705A | zinc finger protein 705A
37 | NX_Q6ZP65 | 92558 | CCDC64 | coiled-coil domain containing 64
38 | NX_Q6ZR37 | 440107 | PLEKHG7 | pleckstrin homology domain containing, family G (with RhoGef domain) member 7
39 | NX_Q75WM6 | 341567 | H1FNT | H1 histone family, member N, testis-specific
40 | NX_Q7RTY7 | 341350 | OVCH1 | ovochymase 1
41 | NX_Q7Z769 | 55508 | SLC35E3 | solute carrier family 35, member E3
42 | NX_Q86T29 | 1E+08 | ZNF605 | zinc finger protein 605
43 | NX_Q86WS5 | 283471 | TMPRSS12 | transmembrane (C-terminal) protease, serine 12
44 | NX_Q86YD7 | 55138 | FAM90A1 | family with sequence similarity 90, member A1
45 | NX_Q8IWA6 | 160777 | CCDC60 | coiled-coil domain containing 60
46 | NX_Q8IXR9 | 115749 | C12orf56 | chromosome 12 open reading frame 56
47 | NX_Q8IYJ0 | 196500 | PIANP | PILR alpha associated neural protein
48 | NX_Q8N2C3 | 120863 | DEPDC4 | DEP domain containing 4
49 | NX_Q8N309 | 254050 | LRRC43 | leucine rich repeat containing 43
50 | NX_Q8N3J9 | 144348 | ZNF664 | zinc finger protein 664
51 | NX_Q8N4U5 | 255394 | TCP11L2 | t-complex 11, testis-specific-like 2
52 | NX_Q8N4V2 | 55530 | SVOP | SV2 related protein homologue (rat)
53 | NX_Q8N812 | 400073 | C12orf76 | chromosome 12 open reading frame 76
54 | NX_Q8N967 | 654429 | LRTM2 | leucine-rich repeats and transmembrane domains 2
55 | NX_Q8N9Z9 | 160492 | IFLTD1 | intermediate filament tail domain containing 1
56 | NX_Q8NA47 | 160762 | CCDC63 | coiled-coil domain containing 63
57 | NX_Q8NA57 | 160419 | C12orf50 | chromosome 12 open reading frame 50
58 | NX_Q8NEG0 | 196472 | FAM71C | family with sequence similarity 71, member C
59 | NX_Q8NEX9 | 121214 | SDR9C7 | short chain dehydrogenase/reductase family 9C, member 7
60 | NX_Q8NG04 | 65012 | SLC26A10 | solute carrier family 26, member 10
61 | NX_Q8NGE1 | 341418 | OR6C4 | olfactory receptor, family 6, subfamily C, member 4
62 | NX_Q8TAP4 | 55885 | LMO3 | LIM domain only 3 (rhombotin-like 2)
63 | NX_Q8TBY9 | 144406 | WDR66 | WD repeat domain 66
64 | NX_Q8TDB8 | 144195 | SLC2A14 | solute carrier family 2 (facilitated glucose transporter), member 14
65 | NX_Q8WUB2 | 29902 | FAM216A | family with sequence similarity 216, member A
66 | NX_Q96DN6 | 114785 | MBD6 | methyl-CpG binding domain protein 6
67 | NX_Q96DY2 | 115811 | IQCD | IQ motif containing D
68 | NX_Q96HM7 | 91523 | PCED1B | PC-esterase domain containing 1B
69 | NX_Q96JM4 | 84125 | LRRIQ1 | leucine-rich repeats and IQ motif containing 1
70 | NX_Q96LU7 | 196446 | MYRFL | myelin regulatory factor-like
71 | NX_Q96MD2 | 144577 | C12orf66 | chromosome 12 open reading frame 66
72 | NX_Q96MS3 | 144423 | GLT1D1 | glycosyltransferase 1 domain containing 1
73 | NX_Q96N23 | 144535 | C12orf55 | chromosome 12 open reading frame 55
74 | NX_Q96NZ1 | 121643 | FOXN4 | forkhead box N4
75 | NX_Q96RD1 | 390321 | OR6C1 | olfactory receptor, family 6, subfamily C, member 1
76 | NX_Q99645 | 1833 | EPYC | epiphycan
77 | NX_Q9H1C0 | 57121 | LPAR5 | lysophosphatidic acid receptor 5
78 | NX_Q9H2C1 | 64211 | LHX5 | LIM homeobox 5
79 | NX_Q9H628 | 79785 | RERGL | RERG/RAS-like
80 | NX_Q9H765 | 140461 | ASB8 | ankyrin repeat and SOCS box containing 8
81 | NX_Q9HCQ5 | 50614 | GALNT9 | UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase 9 (GalNAcT9)
82 | NX_Q9NRX3 | 56901 | NDUFA4L2 | NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 4-like 2
83 | NX_Q9NY28 | 26290 | GALNT8 | UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase 8 (GalNAcT8)
84 | NX_Q9NZP0 | 254786 | OR6C3 | olfactory receptor, family 6, subfamily C, member 3
85 | NX_Q9UBM8 | 25834 | MGAT4C | mannosyl (alpha-1,3-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase, isozyme C (putative)
86 | NX_Q9ULD8 | 23416 | KCNH3 | potassium voltage-gated channel, subfamily H (eag-related), member 3
87 | NX_Q9UPP2 | 440073 | IQSEC3 | IQ motif and Sec7 domain 3
88 | NX_C9JQL5 | not present | IFITM3 | Putative dispanin subfamily A member 2d
89 | NX_Q8N1T3 | not present | MYO1H | Unconventional myosin-Ih

Identification of Peptides Containing Coding SNPs

Tandem MS/MS-based proteomics experiments are capable of identifying large number of proteins. Because the search is often limited to reference protein sequences in public databases, there are always many spectra that remain unassigned. The spectra may remain unassigned for a number of reasons, including the absence of corresponding proteins in the database, modifications not specified in searching, alternate splice variants, or even because of variants arising due to single nucleotide changes (SNPs). In the past, some studies have reported searching unassigned spectra against protein databases where mutant peptides or SNPs have been incorporated. A study by Bunger et al. used a similar approach to search unassigned MS/MS spectra against dbSNP database from NCBI for breast cancer cells, and they could identify 629 coding SNPs.27 Recently, a web-based platform SysPIMP was created that uses the X!Tandem search results to identify human disease-related mutant sequences from shotgun proteomics data.28 We employed a similar strategy by searching the unassigned spectra from protein searches against a custom database that included SNP containing peptides. This resulted in identification of 337 peptides corresponding to 214 genes that reflected coding SNPs reported in dbSNP. Some other recent studies have used a similar strategy to identify variant peptides.29,30Isoleucine to leucine or vice versa (isobaric) and asparagine to aspartic acid, which accounts for the loss of an amide group, were removed as these changes cannot be 3171

dx.doi.org/10.1021/pr401123v | J. Proteome Res. 2014, 13, 3166−3177

Journal of Proteome Research

Article

Disease-Associated Genes on Chromosome 12

Chromosome 12 contains several genes associated with various diseases. For example, ETV6 that encodes the ETS-like transcription factor, plays a key role in leukemias (ALL and AML) and myelodysplastic syndromes (MDS). The ETV6 gene is often rearranged or fused in several human cancers including leukemias7 thyroid cancer, mesoblastic nephroma, congenital fibrosarcoma, and secretory breast carcinoma.31−33 Genes involved in oncogenic gene fusions include DDIT3 (DNA Damage-Inducible Transcript 3) and HMGA2 (High Mobility Group At-Hook 2). DDIT3 is a member of the CCAAT/ enhancer-binding protein family of transcription factors that functions as dominant-negative inhibitor by forming heterodimers with other C/EBP family members. It is often rearranged in myxoid liposarcomas. HMG2 is known to be fused with various genes including LHFP, RAD51L1, HEI10, and ALDH2 in lipomas, salivary adenomas, uterine leiomyomas, and multiple lipomatosis.34,8 Cancer associated genes on chromosome 12 also include YEATS4 in glioma, CDK4 in melanoma, ALK4 in pancreatic cancer, and KRAS in pancreatic cancers and colorectal cancers.35,6 Chromosome 12 also harbors the human CD4 locus, which encodes the main receptor for human immunodeficiency virus (HIV). Some of the other well-known genes associated with genetic diseases include CMT2G (Charcot-Marie-Tooth disease), PKS (Pallister-Killian syndrome), and the blood glycoprotein encoding gene von Willebrand factor (VWF), which is deficient in the von Willebrand disease. Loci on the q-arm on this chromosome have also been associated with asthma and asthma-related phenotypes.36 There are several other chromosomal disorders that harbor mutations in genes or show the presence of abnormal chromosome segments. 
These include PallisterKillian mosaic syndrome where there are two p-arms of the chromosome,37 Kabuki syndrome caused by mutations in MLL2, which is characterized by distinctive facial features,38 and Noonan syndrome caused by mutations in the PTPN11 and KRAS. Many other genes in conjunction with genes on other chromosomes have been reported to be associated with genetic and metabolic disorders.4 Proteogenomic Analysis To Identify Novel Protein Coding Regions on Chromosome 12

In the past, we have carried out several proteogenomics studies on various organisms. These studies resulted in identification of several novel protein coding regions and also revision of gene and exon boundaries in these previously annotated organisms.39−43 Using a similar strategy, we identified several proteogenomic events on chromosome 12 (Table 2). We identified 12 novel protein coding regions on chromosome 12. All of the 12 regions have been annotated as pseudogenes in public databases. We identified N-terminal acetylated peptides and confirmed start sites for ∼200 proteins. We also identified alternate start sites for 11 proteins encoded by this chromosome. In addition, we identified 3 novel exons for NCOR2 and NACA1 (Supplementary Table 4). We revised the gene boundary for HECTD4 at the N-terminus based on peptides identified by proteogenomic analysis (Figure 5a) and orthologous evidence from other mammals. We identified peptides mapping to an upstream region in the 5′ UTR of ARNTL2, signifying the presence of an open reading frame (ORF) (Figure 5b). Supplementary Figure 3 shows an example of pseudogene LOC255308 for which we found protein coding evidence. The parent gene EIF2S3 is involved in the

Figure 3. Heat map showing relative expression level of proteins designated as “missing proteins” by neXtProt.

differentiated using a mass spectrometer. A representative example of a cSNP is shown in Figure 4c where we identified a variant peptide for LPCAT3 gene along with the annotated spectra. A complete list of all identified peptides with corresponding variations is provided in Supplementary Table 3. These results clearly demonstrate that it is indeed possible to identify cSNPs using high-resolution mass spectrometry.
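As an illustration of why a cSNP-derived variant peptide can be told apart from its reference counterpart, the following minimal Python sketch computes the monoisotopic mass shift caused by a single amino acid substitution. The peptide sequences and function names are hypothetical and are not taken from this study; only standard monoisotopic residue masses are assumed.

```python
# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one water is added for the free N- and C-termini


def peptide_mass(sequence):
    """Return the neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER


def variant_mass_shift(reference, variant):
    """Return the mass difference (variant minus reference) in Da."""
    return peptide_mass(variant) - peptide_mass(reference)


def shift_in_ppm(reference, variant):
    """Express the absolute mass shift relative to the reference mass in ppm."""
    shift = abs(variant_mass_shift(reference, variant))
    return shift / peptide_mass(reference) * 1e6


# Hypothetical reference peptide and a variant carrying an Ala -> Val change.
REFERENCE_PEPTIDE = "LAEGK"
VARIANT_PEPTIDE = "LVEGK"

if __name__ == "__main__":
    print(round(variant_mass_shift(REFERENCE_PEPTIDE, VARIANT_PEPTIDE), 4))
    print(round(shift_in_ppm(REFERENCE_PEPTIDE, VARIANT_PEPTIDE), 1))
```

A substitution between isobaric residues such as leucine and isoleucine produces no mass shift, so a variant of that kind cannot be resolved by mass alone.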

dx.doi.org/10.1021/pr401123v | J. Proteome Res. 2014, 13, 3166−3177

Figure 4. (a) Subcellular localization of all identified proteins. (b) Molecular functions of all identified proteins. (c) A representative example of a cSNP-containing peptide from lysophosphatidylcholine acyltransferase 3 (LPCAT3). The red arrow marks the location of the SNP, and the MS/MS spectrum serves as supporting evidence.

translated the DNA sequence encompassing this region in three reading frames, and the protein sequences containing the identified peptides were analyzed using the SMART domain prediction tool.44 Domain architecture of proteins provides significant clues to their likely function. SMART predicted functional domains for some of the sequences. For example,

recruitment of methionyl-tRNA to the 40S ribosomal subunit. Supplementary Table 4 provides all of the proteogenomic annotation on chromosome 12 supported by identified peptides. We performed bioinformatics analysis on putative novel protein coding regions to determine putative function. We

thought to be nonfunctional due to loss of promoter regions and mutations in the coding regions, several reports over the years have demonstrated that many of these regions are actively transcribed.45,46 Similar to mRNAs, they show ubiquitous as well as tissue-restricted expression patterns. Our studies reveal that many of these transcribed pseudogenes also code for proteins. Focused efforts are needed to determine the biological roles of this potentially novel set of protein coding genes.
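The three-frame translation step used for the domain analysis above can be sketched as follows. This is a minimal illustration in Python assuming the standard genetic code; the DNA sequence, the peptide, and the function names are hypothetical and do not reproduce the pipeline used in this study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame):
    """Translate one strand starting at offset `frame` (0, 1, or 2).

    Stop codons are written as '*' and codons with unknown bases as 'X'.
    """
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translations(dna):
    """Return the three forward-strand translations keyed by frame offset."""
    dna = dna.upper()
    return {frame: translate(dna, frame) for frame in range(3)}


def frames_containing(dna, peptide):
    """Return the frame offsets whose translation contains the peptide."""
    translations = three_frame_translations(dna)
    return [frame for frame, protein in translations.items() if peptide in protein]


# Hypothetical genomic region and a hypothetical identified peptide.
REGION = "GCATGGCTGAAAAACTGTAAGC"
PEPTIDE = "AEKL"

if __name__ == "__main__":
    for frame, protein in three_frame_translations(REGION).items():
        print(frame, protein)
    print(frames_containing(REGION, PEPTIDE))
```

The translation in the frame that contains the identified peptide would then be the sequence submitted to a domain prediction tool.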

Table 2. Summary of Novel Protein Features Identified on Chromosome 12 Using a Proteogenomics Approach

category                                           no. of cases
novel coding region (μORFs and alternate frame)    4
translated pseudogene                              12
novel coding exons                                 3
extension of gene                                  3
extension of exon                                  4
novel N-termini                                    11



CONCLUSIONS

It is more than a decade since the first draft of the human genome was published. Annotation of protein coding genes was carried out based on gene prediction programs and full length mRNA and EST sequences that were being deposited to public databases. Protein sequences encoded by these genes were inferred based on longest ORFs resulting from conceptual translation of mRNA sequences or the ORF predicted by gene prediction programs. These sequences served as a basis for raising antibodies that became one of the most useful tools for direct measurement of many proteins. The advent of mass spectrometry further transformed identification and quantitation of proteins and enabled measurement of thousands of proteins in a single experiment. It has also revolutionized our ability to characterize post-translational modifications. While

putative protein encoded by annotated pseudogene LOC255308 revealed 3 distinct domains: 2 elongation factor Tu GTP binding domains and an eIF2 gamma C-terminal domain (Supplementary Figure 2). Similarly, analysis of VDAC1P5 revealed a Porin 3 domain. Porins that form voltage-dependent anion channels play a role in general diffusion of small molecules across the membrane. These channels are widely expressed and are highly sensitive to changes in membrane potential. VDAC1P5 peptides were identified in 5 different tissues. Pseudogenes are nonfunctional copies of protein coding genes that arise due to gene duplication events or due to reverse transcription and reintegration into the genome (processed pseudogenes). Although these were initially

Figure 5. (a) Novel coding exons identified upstream of the HECT domain containing E3 ubiquitin protein ligase 4 (HECTD4). The figure shows the chromosomal location of HECTD4, reference gene model by NCBI RefSeq as well as AceView. Peptides identified upstream of the annotated gene model in NCBI are represented in their respective location. Sequence alignment shows protein level conservation between humans and closely related species. (b) A novel ORF identified in the untranslated region of Aryl hydrocarbon receptor nuclear translocator-like 2 (ARNTL2). The figure shows the chromosomal location of ARNTL2 along with the reference gene model by NCBI RefSeq as well as AceView. Identified peptides that support μORF are also represented.

(UGC), India. S.M.P. and R.S.N. are recipients of Senior Research Fellowships from the Council for Scientific and Industrial Research (CSIR), India. K.K.D. is a recipient of a Junior Research Fellowship from the University Grants Commission (UGC), India. H.G. is a recipient of an Early Career Fellowship from the Wellcome Trust−DBT India Alliance.

mass spectrometry itself is unbiased, the methods used for inferring the identity of proteins depend on underlying protein databases that are often taken from public repositories. This approach precludes identification of novel proteins even though they may have been sampled by the mass spectrometer. Proteogenomics studies in a number of organisms have clearly demonstrated that it is possible to identify many novel features using alternative strategies. In this study, we employed similar strategies and identified several novel protein coding regions on chromosome 12. We were able to achieve better coverage of the human proteome by sampling a wide range of tissues. It appears that underrepresentation of several proteins in proteomics data sets is partly due to poor sampling of some of the cell/tissue types. Deep proteomic profiling of more specialized tissues/cell types in the future may further improve human proteome coverage. We could identify novel protein coding regions in the human genome by employing a unique proteogenomics analysis strategy. This should encourage developers of commercially available search algorithms that are widely used in the community to incorporate and facilitate proteogenomics analysis. This will allow researchers generating proteomics data from various tissue types to utilize these tools and likely uncover novel protein coding regions in the human genome. This process can be quite efficient if RNA-Seq data are also available from these tissues. Generating a conceptually translated protein database using RNA-Seq data would significantly reduce potential false positives expected from six-frame translated genome databases. It would also provide access to peptides that span exon−exon junctions, which are systematically missed in a translated genome. One of the main goals of the Human Genome Project was to identify and annotate all of the protein coding genes in the human genome. Chromosome-centric efforts as part of C-HPP are a major step towards fulfilling that goal. 
Future studies aimed at functional characterization of these proteins using both antibody and mass spectrometry based methods should significantly contribute toward our understanding of protein function, which is fundamental to biomedical sciences.
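The point about exon−exon junctions can be illustrated with a toy example. The sketch below, in Python, assumes the standard genetic code and uses a hypothetical two-exon gene (all sequences and names are invented for illustration). It checks whether a junction-spanning peptide is recoverable from a six-frame translation of the genomic sequence versus a translation of the spliced transcript.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=0):
    """Translate one strand from offset `frame`; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate three frames on the forward strand and three on the reverse."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate(strand, frame)
            for strand in (forward, reverse)
            for frame in range(3)]


def found_in(peptide, proteins):
    """Return True if the peptide occurs in any of the protein sequences."""
    return any(peptide in protein for protein in proteins)


# Hypothetical gene: two coding exons separated by a GT...AG intron.
EXON1 = "ATGGCTGAAAAA"               # encodes M A E K
INTRON = "GTAAGT" + "T" * 9 + "CAG"
EXON2 = "CTGTGGCATTAA"               # encodes L W H, then a stop codon
GENOME = EXON1 + INTRON + EXON2
TRANSCRIPT = EXON1 + EXON2

# Peptide whose residues come from both sides of the splice junction.
JUNCTION_PEPTIDE = "EKLW"

genome_hit = found_in(JUNCTION_PEPTIDE, six_frame_translations(GENOME))
transcript_hit = found_in(JUNCTION_PEPTIDE, [translate(TRANSCRIPT)])

if __name__ == "__main__":
    print("six-frame genome database:", genome_hit)
    print("transcript-derived database:", transcript_hit)
```

In this toy case the junction peptide is matched only in the transcript-derived sequence, because the intron interrupts the peptide in every genomic reading frame.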



(1) Paik, Y. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; Aebersold, R.; Bairoch, A.; Yamamoto, T.; Legrain, P.; Lee, H. J.; Na, K.; Jeong, S. K.; He, F.; Binz, P. A.; Nishimura, T.; Keown, P.; Baker, M. S.; Yoo, J. S.; Garin, J.; Archakov, A.; Bergeron, J.; Salekdeh, G. H.; Hancock, W. S. Standard guidelines for the chromosome-centric human proteome project. J. Proteome Res. 2012, 11 (4), 2005−13. (2) Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30 (3), 221−3. (3) Gaudet, P.; Argoud-Puy, G.; Cusin, I.; Duek, P.; Evalet, O.; Gateau, A.; Gleizes, A.; Pereira, M.; Zahn-Zabal, M.; Zwahlen, C.; Bairoch, A.; Lane, L. neXtProt: organizing protein knowledge in the context of human proteome projects. J. Proteome Res. 2013, 12 (1), 293−8. (4) Gilbert, F.; Kauff, N. Disease genes and chromosomes: disease maps of the human genome. Chromosome 12. Genet. Test. 2000, 4 (3), 319−33. (5) Starke, R. D.; Paschalaki, K. E.; Dyer, C. E.; Harrison-Lavoie, K. J.; Cutler, J. A.; McKinnon, T. A.; Millar, C. M.; Cutler, D. F.; Laffan, M. A.; Randi, A. M. Cellular and molecular basis of von Willebrand disease: studies on blood outgrowth endothelial cells. Blood 2013, 121 (14), 2773−84. (6) Kranenburg, O. The KRAS oncogene: past, present, and future. Biochim. Biophys. Acta 2005, 1756 (2), 81−2. (7) Haferlach, C.; Bacher, U.; Schnittger, S.; Alpermann, T.; Zenger, M.; Kern, W.; Haferlach, T. ETV6 rearrangements are recurrent in myeloid malignancies and are frequently associated with other genetic events. Genes, Chromosomes Cancer 2012, 51 (4), 328−37. (8) Scherer, S. 
E.; Muzny, D. M.; Buhay, C. J.; Chen, R.; Cree, A.; Ding, Y.; Dugan-Rocha, S.; Gill, R.; Gunaratne, P.; Harris, R. A.; Hawes, A. C.; Hernandez, J.; Hodgson, A. V.; Hume, J.; Jackson, A.; Khan, Z. M.; Kovar-Smith, C.; Lewis, L. R.; Lozado, R. J.; Metzker, M. L.; Milosavljevic, A.; Miner, G. R.; Montgomery, K. T.; Morgan, M. B.; Nazareth, L. V.; Scott, G.; Sodergren, E.; Song, X. Z.; Steffen, D.; Lovering, R. C.; Wheeler, D. A.; Worley, K. C.; Yuan, Y.; Zhang, Z.; Adams, C. Q.; Ansari-Lari, M. A.; Ayele, M.; Brown, M. J.; Chen, G.; Chen, Z.; Clerc-Blankenburg, K. P.; Davis, C.; Delgado, O.; Dinh, H. H.; Draper, H.; Gonzalez-Garay, M. L.; Havlak, P.; Jackson, L. R.; Jacob, L. S.; Kelly, S. H.; Li, L.; Li, Z.; Liu, J.; Liu, W.; Lu, J.; Maheshwari, M.; Nguyen, B. V.; Okwuonu, G. O.; Pasternak, S.; Perez, L. M.; Plopper, F. J.; Santibanez, J.; Shen, H.; Tabor, P. E.; Verduzco, D.; Waldron, L.; Wang, Q.; Williams, G. A.; Zhang, J.; Zhou, J.; Allen, C. C.; Amin, A. G.; Anyalebechi, V.; Bailey, M.; Barbaria, J. A.; Bimage, K. E.; Bryant, N. P.; Burch, P. E.; Burkett, C. E.; Burrell, K. L.; Calderon, E.; Cardenas, V.; Carter, K.; Casias, K.; Cavazos, I.; Cavazos, S. R.; Ceasar, H.; Chacko, J.; Chan, S. N.; Chavez, D.; Christopoulos, C.; Chu, J.; Cockrell, R.; Cox, C. D.; Dang, M.; Dathorne, S. R.; David, R.; Davis, C. M.; Davy-Carroll, L.; Deshazo, D. R.; Donlin, J. E.; D’Souza, L.; Eaves, K. A.; Egan, A.; Emery-Cohen, A. J.; Escotto, M.; Flagg, N.; Forbes, L. D.; Gabisi, A. M.; Garza, M.; Hamilton, C.; Henderson, N.; Hernandez, O.; Hines, S.; Hogues, M. E.; Huang, M.; Idlebird, D. G.; Johnson, R.; Jolivet, A.; Jones, S.; Kagan, R.; King, L. M.; Leal, B.; Lebow, H.; Lee, S.; LeVan, J. M.; Lewis, L. C.; London, P.; Lorensuhewa, L. M.; Loulseged, H.; Lovett, D. A.; Lucier, A.;

Data Availability

The mass spectrometry proteomics data have been deposited to the ProteomeXchange consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository47 with the data set identifier PXD000561.



ASSOCIATED CONTENT

Supporting Information

Supplementary figures and tables as discussed in text. This material is available free of charge via the Internet at http://pubs.acs.org.



REFERENCES

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We thank the Department of Biotechnology (DBT) of the Government of India for research support to the Institute of Bioinformatics, Bangalore. S.S.M. is recipient of a Senior Research Fellowship from the University Grants Commission

Lucier, R. L.; Ma, J.; Madu, R. C.; Mapua, P.; Martindale, A. D.; Martinez, E.; Massey, E.; Mawhiney, S.; Meador, M. G.; Mendez, S.; Mercado, C.; Mercado, I. C.; Merritt, C. E.; Miner, Z. L.; Minja, E.; Mitchell, T.; Mohabbat, F.; Mohabbat, K.; Montgomery, B.; Moore, N.; Morris, S.; Munidasa, M.; Ngo, R. N.; Nguyen, N. B.; Nickerson, E.; Nwaokelemeh, O. O.; Nwokenkwo, S.; Obregon, M.; Oguh, M.; Oragunye, N.; Oviedo, R. J.; Parish, B. J.; Parker, D. N.; Parrish, J.; Parks, K. L.; Paul, H. A.; Payton, B. A.; Perez, A.; Perrin, W.; Pickens, A.; Primus, E. L.; Pu, L. L.; Puazo, M.; Quiles, M. M.; Quiroz, J. B.; Rabata, D.; Reeves, K.; Ruiz, S. J.; Shao, H.; Sisson, I.; Sonaike, T.; Sorelle, R. P.; Sutton, A. E.; Svatek, A. F.; Svetz, L. A.; Tamerisa, K. S.; Taylor, T. R.; Teague, B.; Thomas, N.; Thorn, R. D.; Trejos, Z. Y.; Trevino, B. K.; Ukegbu, O. N.; Urban, J. B.; Vasquez, L. I.; Vera, V. A.; Villasana, D. M.; Wang, L.; Ward-Moore, S.; Warren, J. T.; Wei, X.; White, F.; Williamson, A. L.; Wleczyk, R.; Wooden, H. S.; Wooden, S. H.; Yen, J.; Yoon, L.; Yoon, V.; Zorrilla, S. E.; Nelson, D.; Kucherlapati, R.; Weinstock, G.; Gibbs, R. A. The finished DNA sequence of human chromosome 12. Nature 2006, 440 (7082), 346− 51. (9) Kim, M. S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.; Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; Thomas, J. K.; Muthusamy, B.; Leal-Rojas, P.; Kumar, P.; Sahasrabuddhe, N. A.; Balakrishnan, L.; Advani, J.; George, B.; Renuse, S.; Selvan, L. D. N.; Patil, A. H.; Nanjappa, V.; Radhakrishnan, A.; Prasad, S.; Subbannayya, T.; Raju, R.; Kumar, M.; Sreenivasamurthy, S. K.; Marimuthu, A.; Sathe, G. J.; Chavan, S.; Datta, K. K.; Subbannayya, Y.; Sahu, A.; Yelamanchi, S. D.; Jayaram, S.; Rajagopalan, P.; Sharma, J.; Murthy, K. R.; Syed, N.; Goel, R.; Khan, A. K.; Ahmad, S.; Dey, G.; Mudgal, K.; Chatterjee, A.; Huang, T.; Zhong, J.; Wu, X.; Shaw, P. G.; Freed, D.; Zahari, M. S.; Mukherjee, K. 
K.; Shankar, S.; Mahadevan, A.; Lam, H.; Mitchell, C. J.; Shankar, S. K.; Satishchandra, P.; Schroeder, J. T.; Sirdeshmukh, R.; Maitra, A.; Leach, S. D.; Drake, C. G.; Halushka, M. K.; Prasad, T. S. K.; Hruban, R. H.; Kerr, C. L.; Bader, G. D.; Iacobuzio-Donahue, C. H.; Gowda, H.; Pandey, A. A draft map of the human proteome. Nature 2014, 509, 575−581. (10) Wisniewski, J. R.; Zougman, A.; Nagaraj, N.; Mann, M. Universal sample preparation method for proteome analysis. Nat. Methods 2009, 6 (5), 359−62. (11) Harsha, H. C.; Molina, H.; Pandey, A. Quantitative proteomics using stable isotope labeling with amino acids in cell culture. Nat. Protoc. 2008, 3 (3), 505−16. (12) Selvan, L. D.; Renuse, S.; Kaviyil, J. E.; Sharma, J.; Pinto, S. M.; Yelamanchi, S. D.; Puttamallesh, V. N.; Ravikumar, R.; Pandey, A.; Prasad, T. S.; Harsha, H. C. Phosphoproteome of Cryptococcus neoformans. J. Proteomics 2014, 97, 287−95. (13) Olsen, J. V.; de Godoy, L. M.; Li, G.; Macek, B.; Mortensen, P.; Pesch, R.; Makarov, A.; Lange, O.; Horning, S.; Mann, M. Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol. Cell Proteomics 2005, 4 (12), 2010−21. (14) Spivak, M.; Weston, J.; Bottou, L.; Kall, L.; Noble, W. S. Improvements to the percolator algorithm for Peptide identification from shotgun proteomics data sets. J. Proteome Res. 2009, 8 (7), 3737− 45. (15) Bu, D.; Yu, K.; Sun, S.; Xie, C.; Skogerbo, G.; Miao, R.; Xiao, H.; Liao, Q.; Luo, H.; Zhao, G.; Zhao, H.; Liu, Z.; Liu, C.; Chen, R.; Zhao, Y. NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Res. 2012, 40 (Database issue), D210−5. (16) Sherry, S. T.; Ward, M. H.; Kholodov, M.; Baker, J.; Phan, L.; Smigielski, E. M.; Sirotkin, K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1), 308−11. (17) Mishra, G. R.; Suresh, M.; Kumaran, K.; Kannabiran, N.; Suresh, S.; Bala, P.; Shivakumar, K.; Anuradha, N.; Reddy, R.; Raghavan, T. 
M.; Menon, S.; Hanumanthu, G.; Gupta, M.; Upendran, S.; Gupta, S.; Mahesh, M.; Jacob, B.; Mathew, P.; Chatterjee, P.; Arun, K. S.; Sharma, S.; Chandrika, K. N.; Deshpande, N.; Palvankar, K.; Raghavnath, R.; Krishnakanth, R.; Karathia, H.; Rekha, B.; Nayak, R.; Vishnupriya, G.; Kumar, H. G.; Nagini, M.; Kumar, G. S.; Jose, R.; Deepthi, P.; Mohan, S. S.; Gandhi, T. K.; Harsha, H. C.; Deshpande, K. S.; Sarker, M.;

Prasad, T. S.; Pandey, A. Human protein reference database--2006 update. Nucleic Acids Res. 2006, 34 (Database issue), D411−4. (18) Suzuki, J.; Umeda, M.; Sims, P. J.; Nagata, S. Calcium-dependent phospholipid scrambling by TMEM16F. Nature 2010, 468 (7325), 834−8. (19) Silva-Santos, B.; Pennington, D. J.; Hayday, A. C. Lymphotoxin-mediated regulation of gammadelta cell differentiation by alphabeta T cell progenitors. Science 2005, 307 (5711), 925−8. (20) Meadows, J. W.; Eis, A. L.; Brockman, D. E.; Myatt, L. Expression and localization of prostaglandin E synthase isoforms in human fetal membranes in term and preterm labor. J. Clin. Endocrinol. Metab. 2003, 88 (1), 433−9. (21) Ramsey, I. S.; Moran, M. M.; Chong, J. A.; Clapham, D. E. A voltage-gated proton-selective channel lacking the pore domain. Nature 2006, 440 (7088), 1213−6. (22) Zhang, G.; Xiang, B.; Ye, S.; Chrzanowska-Wodnicka, M.; Morris, A. J.; Gartner, T. K.; Whiteheart, S. W.; White, G. C., 2nd; Smyth, S. S.; Li, Z. Distinct roles for Rap1b protein in platelet secretion and integrin alphaIIbbeta3 outside-in signaling. J. Biol. Chem. 2011, 286 (45), 39466−77. (23) Plutzky, J.; Neel, B. G.; Rosenberg, R. D.; Eddy, R. L.; Byers, M. G.; Jani-Sait, S.; Shows, T. B. Chromosomal localization of an SH2-containing tyrosine phosphatase (PTPN6). Genomics 1992, 13 (3), 869−72. (24) Bragulla, H. H.; Homberger, D. G. Structure and functions of keratin proteins in simple, stratified, keratinized and cornified epithelia. J. Anat. 2009, 214 (4), 516−59. (25) Lee, H. K.; Barbarosie, M.; Kameyama, K.; Bear, M. F.; Huganir, R. L. Regulation of distinct AMPA receptor phosphorylation sites during bidirectional synaptic plasticity. Nature 2000, 405 (6789), 955−9. (26) Matsunami, H.; Montmayeur, J. P.; Buck, L. B. A family of candidate taste receptors in human and mouse. Nature 2000, 404 (6778), 601−4. (27) Bunger, M. K.; Cargile, B. J.; Sevinsky, J. R.; Deyanova, E.; Yates, N. A.; Hendrickson, R. C.; Stephenson, J. L., Jr. 
Detection and validation of non-synonymous coding SNPs from orthogonal analysis of shotgun proteomics data. J. Proteome Res. 2007, 6 (6), 2331−40. (28) Schandorff, S.; Olsen, J. V.; Bunkenborg, J.; Blagoev, B.; Zhang, Y.; Andersen, J. S.; Mann, M. A mass spectrometry-friendly database for cSNP identification. Nat. Methods 2007, 4 (6), 465−6. (29) Halvey, P. J.; Wang, X.; Wang, J.; Bhat, A. A.; Dhawan, P.; Li, M.; Zhang, B.; Liebler, D. C.; Slebos, R. J. Proteogenomic analysis reveals unanticipated adaptations of colorectal tumor cells to deficiencies in DNA mismatch repair. Cancer Res. 2014, 74, 387−97. (30) Sheynkman, G. M.; Shortreed, M. R.; Frey, B. L.; Scalf, M.; Smith, L. M. Large-scale mass spectrometric detection of variant peptides resulting from nonsynonymous nucleotide differences. J. Proteome Res. 2014, 13 (1), 228−40. (31) Leeman-Neill, R. J.; Kelly, L. M.; Liu, P.; Brenner, A. V.; Little, M. P.; Bogdanova, T. I.; Evdokimova, V. N.; Hatch, M.; Zurnadzy, L. Y.; Nikiforova, M. N.; Yue, N. J.; Zhang, M.; Mabuchi, K.; Tronko, M. D.; Nikiforov, Y. E. ETV6-NTRK3 is a common chromosomal rearrangement in radiation-associated thyroid cancer. Cancer 2014, 120, 799−807. (32) Tognon, C.; Knezevich, S. R.; Huntsman, D.; Roskelley, C. D.; Melnyk, N.; Mathers, J. A.; Becker, L.; Carneiro, F.; MacPherson, N.; Horsman, D.; Poremba, C.; Sorensen, P. H. Expression of the ETV6-NTRK3 gene fusion as a primary event in human secretory breast carcinoma. Cancer Cell 2002, 2 (5), 367−76. (33) Knezevich, S. R.; Garnett, M. J.; Pysher, T. J.; Beckwith, J. B.; Grundy, P. E.; Sorensen, P. H. ETV6-NTRK3 gene fusions and trisomy 11 establish a histogenetic link between mesoblastic nephroma and congenital fibrosarcoma. Cancer Res. 1998, 58 (22), 5046−8. (34) Kubo, T.; Matsui, Y.; Naka, N.; Araki, N.; Goto, T.; Yukata, K.; Endo, K.; Yasui, N.; Myoui, A.; Kawabata, H.; Yoshikawa, H.; Ueda, T. 
Expression of HMGA2-LPP and LPP-HMGA2 fusion genes in lipoma: identification of a novel type of LPP-HMGA2 transcript in four cases. Anticancer Res. 2009, 29 (6), 2357−60.

(35) Su, G. H.; Bansal, R.; Murphy, K. M.; Montgomery, E.; Yeo, C. J.; Hruban, R. H.; Kern, S. E. ACVR1B (ALK4, activin receptor type 1B) gene mutations in pancreatic carcinoma. Proc. Natl. Acad. Sci. U.S.A. 2001, 98 (6), 3254−7. (36) Raby, B. A.; Silverman, E. K.; Lazarus, R.; Lange, C.; Kwiatkowski, D. J.; Weiss, S. T. Chromosome 12q harbors multiple genetic loci related to asthma and asthma-related phenotypes. Hum. Mol. Genet. 2003, 12 (16), 1973−9. (37) Dufke, A.; Walczak, C.; Liehr, T.; Starke, H.; Trifonov, V.; Rubtsov, N.; Schoning, M.; Enders, H.; Eggermann, T. Partial tetrasomy 12pter-12p12.3 in a girl with Pallister-Killian syndrome: extraordinary finding of an analphoid, inverted duplicated marker. Eur. J. Hum. Genet. 2001, 9 (8), 572−6. (38) Miyake, N.; Koshimizu, E.; Okamoto, N.; Mizuno, S.; Ogata, T.; Nagai, T.; Kosho, T.; Ohashi, H.; Kato, M.; Sasaki, G.; Mabe, H.; Watanabe, Y.; Yoshino, M.; Matsuishi, T.; Takanashi, J.; Shotelersuk, V.; Tekin, M.; Ochi, N.; Kubota, M.; Ito, N.; Ihara, K.; Hara, T.; Tonoki, H.; Ohta, T.; Saito, K.; Matsuo, M.; Urano, M.; Enokizono, T.; Sato, A.; Tanaka, H.; Ogawa, A.; Fujita, T.; Hiraki, Y.; Kitanaka, S.; Matsubara, Y.; Makita, T.; Taguri, M.; Nakashima, M.; Tsurusaki, Y.; Saitsu, H.; Yoshiura, K.; Matsumoto, N.; Niikawa, N. MLL2 and KDM6A mutations in patients with Kabuki syndrome. Am. J. Med. Genet. A 2013, 161 (9), 2234−43. (39) Volkening, J. D.; Bailey, D. J.; Rose, C. M.; Grimsrud, P. A.; Howes-Podoll, M.; Venkateshwaran, M.; Westphall, M. S.; Ane, J. M.; Coon, J. J.; Sussman, M. R. A proteogenomic survey of the Medicago truncatula genome. Mol. Cell Proteomics 2012, 11 (10), 933−44. (40) Pawar, H.; Sahasrabuddhe, N. A.; Renuse, S.; Keerthikumar, S.; Sharma, J.; Kumar, G. S.; Venugopal, A.; Sekhar, N. R.; Kelkar, D. S.; Nemade, H.; Khobragade, S. N.; Muthusamy, B.; Kandasamy, K.; Harsha, H. C.; Chaerkady, R.; Patole, M. S.; Pandey, A. 
A proteogenomic approach to map the proteome of an unsequenced pathogen - Leishmania donovani. Proteomics 2012, 12 (6), 832−44. (41) Prasad, T. S.; Harsha, H. C.; Keerthikumar, S.; Sekhar, N. R.; Selvan, L. D.; Kumar, P.; Pinto, S. M.; Muthusamy, B.; Subbannayya, Y.; Renuse, S.; Chaerkady, R.; Mathur, P. P.; Ravikumar, R.; Pandey, A. Proteogenomic analysis of Candida glabrata using high resolution mass spectrometry. J. Proteome Res. 2012, 11 (1), 247−60. (42) Kelkar, D. S.; Kumar, D.; Kumar, P.; Balakrishnan, L.; Muthusamy, B.; Yadav, A. K.; Shrivastava, P.; Marimuthu, A.; Anand, S.; Sundaram, H.; Kingsbury, R.; Harsha, H. C.; Nair, B.; Prasad, T. S.; Chauhan, D. S.; Katoch, K.; Katoch, V. M.; Chaerkady, R.; Ramachandran, S.; Dash, D.; Pandey, A. Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry. Mol. Cell Proteomics 2011, 10 (12), M111 011627. (43) Chaerkady, R.; Kelkar, D. S.; Muthusamy, B.; Kandasamy, K.; Dwivedi, S. B.; Sahasrabuddhe, N. A.; Kim, M. S.; Renuse, S.; Pinto, S. M.; Sharma, R.; Pawar, H.; Sekhar, N. R.; Mohanty, A. K.; Getnet, D.; Yang, Y.; Zhong, J.; Dash, A. P.; MacCallum, R. M.; Delanghe, B.; Mlambo, G.; Kumar, A.; Keshava Prasad, T. S.; Okulate, M.; Kumar, N.; Pandey, A. A proteogenomic analysis of Anopheles gambiae using high-resolution Fourier transform mass spectrometry. Genome Res. 2011, 21 (11), 1872−81. (44) Schultz, J.; Milpetz, F.; Bork, P.; Ponting, C. P. SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. U.S.A. 1998, 95 (11), 5857−64. (45) Pei, B.; Sisu, C.; Frankish, A.; Howald, C.; Habegger, L.; Mu, X. J.; Harte, R.; Balasubramanian, S.; Tanzer, A.; Diekhans, M.; Reymond, A.; Hubbard, T. J.; Harrow, J.; Gerstein, M. B. The GENCODE pseudogene resource. Genome Biol. 2012, 13 (9), R51. (46) Kalyana-Sundaram, S.; Kumar-Sinha, C.; Shankar, S.; Robinson, D. R.; Wu, Y. M.; Cao, X.; Asangani, I. A.; Kothari, V.; Prensner, J. R.; Lonigro, R. 
J.; Iyer, M. K.; Barrette, T.; Shanmugam, A.; Dhanasekaran, S. M.; Palanisamy, N.; Chinnaiyan, A. M. Expressed pseudogenes in the transcriptional landscape of human cancers. Cell 2012, 149 (7), 1622−34. (47) Vizcaino, J. A.; Cote, R. G.; Csordas, A.; Dianes, J. A.; Fabregat, A.; Foster, J. M.; Griss, J.; Alpi, E.; Birim, M.; Contell, J.; O’Kelly, G.; Schoenegger, A.; Ovelleiro, D.; Perez-Riverol, Y.; Reisinger, F.; Rios,

D.; Wang, R.; Hermjakob, H. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res. 2013, 41 (Database issue), D1063−9.
