Article pubs.acs.org/jpr

Surfing Transcriptomic Landscapes. A Step beyond the Annotation of Chromosome 16 Proteome

Víctor Segura,† Juan Alberto Medina-Aunon,‡ Maria I. Mora,† Salvador Martínez-Bartolomé,‡ Joaquín Abian,§ Kerman Aloria,∥ Oreto Antúnez,■ Jesús M. Arizmendi,∥ Mikel Azkargorta,⊥ Silvia Barceló-Batllori,▼,⬠ Jabier Beaskoetxea,∥ Joan J. Bech-Serra,▽ Francisco Blanco,# Mariana B. Monteiro,† David Cáceres,◆ Francesc Canals,▽ Monserrat Carrascal,§ José Ignacio Casal,○ Felipe Clemente,◆ Nuria Colomé,▽ Noelia Dasilva,□ Paula Díaz,□ Félix Elortza,⊥ Patricia Fernández-Puente,# Manuel Fuentes,□ Oscar Gallardo,§ Severine I. Gharbi,‡ Concha Gil,◆ Carmen González-Tejedo,‡ María Luisa Hernáez,◆ Manuel Lombardía,‡ Maria Lopez-Lucendo,○ Miguel Marcilla,‡ José M. Mato,⊥ Marta Mendes,○ Eliandre Oliveira,▲ Irene Orera,▼ Alberto Pascual-Montano,‡ Gorka Prieto,∥ Cristina Ruiz-Romero,# Manuel M. Sánchez del Pino,■ Daniel Tabas-Madrid,‡ Maria L. Valero,■ Vital Vialas,◆ Joan Villanueva,§ Juan Pablo Albar,◇,‡ and Fernando J. Corrales*,†,◇

† ProteoRed-ISCIII, Center for Applied Medical Research (CIMA), University of Navarra, Pío XII, 55; Ed. CIMA, 31008 Pamplona, Spain



ProteoRed-ISCIII, Centro Nacional de Biotecnología - CSIC, UAM Campus Cantoblanco, Darwin, 3, 28049 Madrid, Spain

§ ProteoRed-ISCIII, CSIC/UAB Proteomics Laboratory, Instituto de Investigaciones Biomédicas de Barcelona-CSIC/IDIBAPS, 08193 Bellaterra, Spain



ProteoRed-ISCIII, Department of Biochemistry and Molecular Biology, University of the Basque Country, UPV/EHU, 48940 Leioa, Spain



ProteoRed-ISCIII, Proteomics Platform, CIC bioGUNE, CIBERehd, ProteoRed, Bizkaia Technology Park, 48160 Derio, Spain

# ProteoRed-ISCIII, Osteoarticular and Aging Research Lab, Proteomics Unit, ProteoRed/ISCIII, Rheumatology Division, INIBIC−CHU A Coruña, As Xubias 84, 15006 A Coruña, Spain



ProteoRed-ISCIII, Proteomics Laboratory, Vall d’Hebron Institute of Oncology, Vall d’Hebron University Hospital, UAB, Passeig Vall d'Hebron 119, Edifici Maternoinfantil Planta 14, 08035 Barcelona, Spain



ProteoRed-ISCIII, Functional Proteomics, Department of Cellular and Molecular Medicine, Centro de Investigaciones Biológicas (CIB-CSIC), Ramiro de Maeztu 9, 28040 Madrid, Spain



ProteoRed-ISCIII, Departamento de Microbiología II, Facultad de Farmacia, Universidad Complutense de Madrid, Plaza de Ramón y Cajal, 28040 Madrid, Spain



ProteoRed-ISCIII, Centro de Investigación del Cáncer/IBMCC (USAL/CSIC), Departamento de Medicina and Servicio General de Citometría, University of Salamanca, IBSAL, Campus Miguel de Unamuno, 37007 Salamanca, Spain



ProteoRed-ISCIII, Plataforma de Proteomica, Parc Cientifıc de Barcelona, Universitat de Barcelona, Baldiri Reixac, 10, 08028 Barcelona, Spain



ProteoRed-ISCIII, Unidad de Proteómica, Instituto Aragonés de Ciencias de la Salud, Avenida San Juan Bosco, no. 13, 50009 Zaragoza, Spain



ProteoRed-ISCIII, Biochemistry Department, University of Valencia, C/ Doctor Moliner, 50, 46100 Valencia, Spain

Special Issue: Chromosome-centric Human Proteome Project Received: July 12, 2013 Published: October 21, 2013 © 2013 American Chemical Society


dx.doi.org/10.1021/pr400721r | J. Proteome Res. 2014, 13, 158−172

Journal of Proteome Research

Article

ABSTRACT: The Spanish team of the Human Proteome Project (SpHPP) marked the annotation of Chr16 and data analysis as one of its priorities. Precise annotation of Chromosome 16 proteins according to C-HPP criteria is presented. Moreover, Human Body Map 2.0 RNA-Seq and Encyclopedia of DNA Elements (ENCODE) data sets were used to obtain further information relative to cell/tissue specific chromosome 16 coding gene expression patterns and to infer the presence of missing proteins. Twenty-four shotgun 2D-LC−MS/MS and gel/LC−MS/MS MIAPE compliant experiments, representing 41% coverage of chromosome 16 proteins, were performed. Furthermore, mapping of large-scale multicenter mass spectrometry data sets from CCD18, MCF7, Jurkat, and Ramos cell lines into RNA-Seq data allowed further insights relative to correlation of chromosome 16 transcripts and proteins. Detection and quantification of chromosome 16 proteins in biological matrices by SRM procedures are also primary goals of the SpHPP. Two strategies were undertaken: one focused on known proteins, taking advantage of MS data already available, and the second, aimed at the detection of the missing proteins, is based on the expression of recombinant proteins to gather MS information and optimize SRM methods that will be used in real biological samples. SRM methods for 49 known proteins and for recombinant forms of 24 missing proteins are reported in this study.

KEYWORDS: Human Proteome Project, Chromosome 16, proteomics, transcriptomics, RNA-Seq, ENCODE, bioinformatics



INTRODUCTION

The Human Proteome Organization (HUPO) has coordinated the efforts of the international community promoting several initiatives1−4 to describe the human proteome in a systematic manner during the last 12 years (http://www.hupo.org). In September 2010, during the annual HUPO conference in Sydney, Australia, the Human Proteome Project (HPP) was officially presented.5 The HPP is designed to map the entire human proteome in a systematic effort using currently available and emerging techniques. With the aim of providing a comprehensive map of human proteins in their biological context, the HPP rests on three pillars: shotgun and targeted mass spectrometry (MS), polyclonal and monoclonal antibodies (Ab), and an integrated knowledge base. The project is organized according to a chromosome-centric strategy (C-HPP) where scientific groups from different nationalities agree to characterize the proteome of a selected chromosome following the guidelines of the international consortium and an open-access policy.6,7 All 24 chromosomes plus the mitochondrial genome-encoded proteome have already been adopted by as many teams from 21 different countries. Knowledge and technical resources generated within the C-HPP initiative are expected to contribute to progress in the understanding and treatment of diseases by the integration and coordination of specific research initiatives in the Biology and Disease (B/D) − HPP initiative.6

Chromosome 16 (Chr16) was embraced by the Spanish HPP (SpHPP) consortium that belongs to the Spanish Proteomics Institute, ProteoRed-ISCIII within the Spanish Institutes of Health (Instituto de Salud Carlos III, ISCIII). The general strategy and goals were established during the kick-off meeting held in Madrid, Spain, on April 2, 2012. Adopting the general rules established for HPP,6 the Spanish initiative was constructed on a multidisciplinary basis with 15 scientific groups organized into five working sections, namely, Protein/Antibody Microarrays, Protein Expression and Peptide Standard, S/MRM, Protein Sequencing and Characterization, Bioinformatics and Clinical Healthcare, and Biobanking. The C-HPP initiative is based on the ProteoRed-ISCIII platform, a proteomics consortium integrating 21 proteomics laboratories with more than 7 years of experience in the coordination of multicenter activities,8 sharing state-of-the-art technology, data standardization,9,10 bioinformatics,11 and research.12−16 Chromosome 16 spans about 89 million base pairs, representing almost 3% of the total DNA in human cells. According to information on Ensembl (V68), we reported that more than 2300 genes have been identified on Chr16 including 870 protein-coding genes. In light of MS data on GPMDB, the number of missing proteins was initially estimated at 305 as we decided arbitrarily to include all proteins with log(e) values above −15 assuming that their observation might have some constraints in complex matrices. Detailed molecular description of CCD18, MCF7, and Jurkat cell


Figure 1. Representation of the workflow for the RNA-Seq data analysis of the Human Body Map samples and the integration of proteomic and transcriptomic data.

lines was reported by means of transcriptomic and shotgun proteomic analyses and a preliminary version of a bioinformatics utility for data analysis and uploading into PRIDE and ProteomeXchange.17 Combination of gene expression and proteomic information was initiated to suggest preferential cell lines where Chr16 proteins might be detected and is currently an active working avenue to enhance the biological annotation of Chr16 genes. In fact, this line has been further developed by proposing a methodology to generate a transcriptomic map of chromosome 16 coding genes using RNA-Seq data from the ENCODE project,18 providing an effective mechanism for identifying cell lines with expression evidence of missing proteins. Understanding of human biology in health and disease is dependent on integration of all molecular building blocks that are being unravelled in genome-wide experiments, according to a systems biology strategy. Modern deep-sequencing techniques capture the transcriptome of any organism in unprecedented detail, and there have been substantial breakthroughs in the de novo assembly of transcriptomes. This huge sequencing capacity promoted several initiatives to achieve a complete annotation of the human genome and to further understand genetic contribution to disease, including ENCODE,18 the related annotation initiative GENCODE,19 the 1000 Genomes Project,20 and the Illumina Human Body Map. The data provided by these initiatives are a detailed inventory of the parts list of the human genome, whose structural and functional annotation will be greatly promoted by complementation with large-scale and accurate proteomics data,21 such as those being produced within the HPP. Proteogenomics has recently emerged as a field at the junction of genomics and proteomics, although pioneering studies have underlined the interest of such an integrative approach.22 The main goal is the matching of peptides identified in MS-based experiments against genome-wide gene/transcript sequence data sets for detailed gene annotation,23 a method that has been successfully used to circumvent the limited availability of reference protein databases of nonmodel species.24 The integration of large amounts of RNA-Seq and MS data poses a challenging problem, starting from the generation of efficient and nonredundant RNA-Seq databases to search MS spectra.25 However, these difficulties have been successfully circumvented to allow integration of high-throughput human proteome quantification with DNA variation and transcriptome information to reveal the multiple and diverse regulatory mechanisms of gene expression.26 Different critical points may benefit from crossing large-scale proteomics and transcriptomic/genomic data sets as proteogenomically identified peptides will provide unique information for gene annotation, such as confirmation of translation, determining frame, location of translation start and end sites, identification of exact splicing boundaries, and prediction of novel genes.27 Moreover, tissue/cell specific gene expression patterns will provide valuable information in the search of missing proteins that have not yet been identified in proteomic studies and in the identification of protein variants resulting from alternative splicing and amino acid polymorphisms that in combination with posttranslational


(1) sequencing quality was checked using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/); (2) raw sequencing files were preprocessed to remove adaptors and poor-quality portions of reads with FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), and in the case of paired-end sequencing, a Perl script was used to maintain pairing between files after quality filtering; (3) alignment of all reads was carried out with Tophat v2.0.9 (ref 28) and human genome hg19 version as reference; (4) an annotation file of intergenic regions was generated with BEDTools (https://code.google.com/p/bedtools/) by calculating the complementary regions not covered by genes (as described in ref 32, we only considered intergenic regions that were at least 10 kb away from annotated genes); and (5) gene quantification was carried out with the htseq-count script (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) for counting fragments, and a custom script was developed to calculate FPKM from these counts using GENCODE V15 as reference. A comparison of mapped reads between exon and intergenic regions was then used to estimate an FPKM threshold value that indicates, with high confidence, a level of gene expression above background level. Expression of all genes and intergenic regions (background) was binned across different RNA-Seq experiments from the same cell line in ENCODE, providing in this way an effective means to handle replicates. Cumulative amounts of genes and background levels were calculated (Supplementary Figure 1A in the Supporting Information). The false discovery rate (FDR) for each expression level was calculated. The true number of expressed genes was estimated using FDR as a correction factor, and the false negative rate (FNR) was calculated for each expression level according to ref 31. Intersection of the FDR and FNR provides an FPKM threshold that can be used to classify genes as expressed or not (Supplementary Figure 1B in the Supporting Information). This method was applied for all cell lines used by the Spanish Consortium with RNA-Seq data available in ENCODE (MCF-7, K-562, and HepG2). Five levels of expression were defined according to ref 31: no expression (FPKM < threshold), weakly expressed (threshold ≤ FPKM < 3), moderately expressed (3 ≤ FPKM < 30), highly expressed (30 ≤ FPKM < 100), and very highly expressed (FPKM ≥ 100).
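The conversion of fragment counts into FPKM and the five expression classes listed above can be illustrated with a short Python sketch. It is not the custom script used in this work: the function names are hypothetical, the FPKM formula is the standard definition (fragments per kilobase of transcript per million mapped fragments), and the background threshold is passed in as a parameter, assumed to lie below 3 FPKM, because its value depends on the FDR/FNR analysis of each cell line.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Standard definition; the gene length and library size must be positive.
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


def expression_level(value, threshold):
    """Assign one of the five expression classes described in the text.

    `threshold` is the cell-line-specific background cutoff (assumed < 3 FPKM).
    """
    if value < threshold:
        return "not expressed"
    if value < 3:
        return "weakly expressed"
    if value < 30:
        return "moderately expressed"
    if value < 100:
        return "highly expressed"
    return "very highly expressed"


if __name__ == "__main__":
    # Hypothetical gene: 500 fragments, 2 kb long, library of 25 million fragments.
    value = fpkm(500, 2000, 25_000_000)
    print(value, expression_level(value, threshold=0.3))
```

Keeping the threshold as an argument rather than a constant reflects the fact that it is estimated separately for each cell line.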

modifications will approach the complexity of protein isoforms in human cells.7 In the present study, detailed annotation of chromosome 16 protein coding genes using Human Body Map (HBM) RNA-Seq data is described. In-depth characterization of CCD18, MCF7, Jurkat, and Ramos cell lines by multicenter shotgun experiments is presented, and the use of PAnalyzer for protein grouping is discussed. Integration of shotgun proteomics data and HBM RNA-Seq data is also presented. Finally, SRM methods for detection and quantification of 56 chromosome 16 proteins are reported, including 15 missing proteins.



MATERIALS AND METHODS

Bioinformatic Analysis of RNA-Seq data from Human Body Map

Public data sets from the Illumina Human Body Map Project (HBM) were used. HBM integrates transcription profiles from high-throughput sequencing experiments of individual and mixtures of 16 human tissue RNAs (adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph node, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells). The accession numbers were GSE30611 in GEO database, E-MTAB-513 in ArrayExpress, and ERX011226 in SRA. All of the selected samples (Supplementary Table 1 in the Supporting Information) were processed using the same pipeline (Figure 1): (1) the downloaded sra files were converted into fastq files and the quality of the samples was verified using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/); (2) the preprocessing of reads included elimination of contaminant adapter substrings with Scythe (https://github.com/vsbuffalo/scythe) and quality-based trimming using Sickle (https://github.com/najoshi/sickle); (3) the alignment of reads to the human genome (hg19) was performed using Tophat2 mapper;28 (4) the transcript assembly and quantification using FPKM (fragments per kilobase of transcript per million fragments mapped) of genes and transcripts was carried out with Cufflinks2;29 and (5) the annotation of the gene locus obtained was performed using Cuffmerge with Ensembl v70 as reference. Further analysis and graphical representations were performed using the R/Bioconductor packages Biostrings (manipulation of biological sequences and files), doBy (data processing), and ggplot2 (graphical representation).30

Bioinformatics Analysis of RNA-Seq data from ENCODE Project

NGS technology has revolutionized the world of DNA sequencing due to its lower cost and genome-wide coverage, but despite this the technology still has an observable error rate that results in background noise that affects accurate read counts. In RNA-Seq assays, this yields weakly expressed genes for which it is unclear whether they are truly expressed or whether the signal results from sequencing or alignment errors. To cope with this and establish a more quantitative approach to gene expression quantification, we followed the method of Ramsköld et al.31 to define an appropriate threshold to classify a gene as expressed. We selected fastq files from RNA-Seq experiments of long PolyA+ fraction of whole cells. Raw reads were downloaded from the ENCODE ftp site using the information listed in their RNA dashboard index file. The ENCODE consortium provided a guideline to easily access all of their available data (http://genome.crg.es/encode_RNA_dashboard/HOWTO_batchdownload.html). Samples were processed following these steps:

Protein Sample Preparation

Cells were grown in the three laboratories of the SpHPP consortium, following standard growth conditions. At exponential growth, cells were collected and lysed in a CHAPS/urea lysis buffer (7 M urea, 2 M thiourea, 4% CHAPS, protease and phosphatase inhibitors); for the Jurkat T cell line, an alternative lysis buffer was adopted (8.4 M urea, 2.4 M thiourea, 5% CHAPS, 2 mM TCEP-HCl, supplemented with protease and phosphatase inhibitors). Typically, 100 μg of each cell line was used and separated by 1D-gel electrophoresis (1D-SDS-PAGE−LC−MS/MS workflow) or digested in-solution (2D-LC−MS/MS workflow). 1D-SDS-PAGE−LC−MS/MS workflow: Each participating laboratory adopted the same procedure for gel-based protein separation. In brief, total cell extracts from each cell line were loaded on a 12% SDS-PAGE gel. Proteins were visualized by Colloidal Coomassie Blue staining (CCB), and whole gel lanes were cut into 15 bands. In-gel protein digestion was carried out following the protocol described by Shevchenko et al., with minor modifications.33 Gel plugs were washed with 50 mM ammonium bicarbonate in 50% ACN prior to reduction with


ProteomeXchange repository (http://www.proteomexchange.org/) following the ProteomeXchange submission guidelines. Globally, 270 files were sent to the ProteomeXchange repository through the recently incorporated ASPERA protocol. The files were grouped according to the analyzed cell lines, generating four data sets, whose ProteomeXchange accession numbers are PXD000442, PXD000443, PXD000447, and PXD000449 for MCF7, Jurkat, Ramos, and CCD18 cell lines, respectively.
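The target-decoy strategy described under Data Analysis (searching the protein database together with its reversed sequences and accepting identifications at FDR ≤ 1%) can be sketched in Python as follows. This is only an illustration under stated assumptions: the FDR is estimated as the ratio of decoy to target hits above a score cutoff, which is one common convention and not necessarily the manual calculation of ref 35, and the function names and example scores are hypothetical.

```python
def reversed_decoy(sequence):
    """Build a decoy entry by reversing a protein sequence."""
    return sequence[::-1]


def score_cutoff_at_fdr(scored_hits, max_fdr=0.01):
    """Return the lowest target score at which decoys/targets <= max_fdr.

    scored_hits: iterable of (score, is_decoy) pairs, one per protein hit.
    Assumption: FDR is estimated as (#decoys) / (#targets) among hits
    scoring at or above the cutoff. Returns None if no cutoff qualifies.
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        if decoys / targets <= max_fdr:
            cutoff = score  # accepting down to this target keeps FDR in bounds
    return cutoff


if __name__ == "__main__":
    # Hypothetical scores; True marks a hit to a reversed (decoy) sequence.
    hits = [(90, False), (80, False), (70, True), (60, False)]
    print(reversed_decoy("MKTAYIAK"))
    print(score_cutoff_at_fdr(hits, max_fdr=0.25))
```

Ties between target and decoy scores are not treated specially here; a production pipeline would need an explicit rule for them.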

10 mM DTT in 25 mM ammonium bicarbonate solution and alkylation with 55 mM IAA in 25 mM ammonium bicarbonate. Gel pieces were then washed in 50 mM ammonium bicarbonate and then in 50% ACN, and dried in a speed-vac. Proteomics-grade trypsin was used at 1:50 w/w ratio (enzyme/protein) in 25 mM ammonium bicarbonate, and digestion was carried out at 37 °C for 8 to 16 h. The reaction was stopped by the addition of 0.5% TFA for peptide extraction, and extracted peptides were dried by speed-vac centrifugation and resuspended in 2% acetonitrile in 99.9% water/0.1% formic acid. 2D-LC−MS/MS workflow: For in-solution digestion, cell lysates were precipitated with methanol/chloroform, as described elsewhere,34 and precipitated proteins were resuspended in denaturing and reducing buffer (8 M urea, 25 mM ammonium bicarbonate, 10 mM DTT) for 1 h at 37 °C, and cysteines were alkylated with 50 mM iodoacetamide for 45 min in the dark. Samples were diluted with 25 mM ammonium bicarbonate to a final concentration of 2 M urea, and proteomics-grade trypsin (Sigma Aldrich) was added at a 1:50 ratio (enzyme/protein) for 18 h at 37 °C. Samples were dried in a speed-vac and kept until off-line peptide fractionation. Off-line chromatography was carried out as described elsewhere.17 In brief, tryptic peptides were fractionated off-line on a 2.1 × 100 mm C18, 5 μm XBridge column (BEH Technology, Waters), connected to a Smartline HPLC system (KNAUER). Equilibration conditions were 10 mM NH4OH, pH 9.4 (solvent A), and peptides were eluted in a 5−70% gradient of solvent B (10 mM NH4OH, pH 9.4, 80% methanol) in 55 min at a flow rate of 150 μL/min. In general, 30 fractions were collected and pooled to mimic orthogonal separation into 10 fractions that were dried in a speed-vac and kept until LC−MS/MS. Blanks were run between samples to avoid carryover.

Recombinant Protein Production

The clones used were selected from the pANT7_cGST clone collection distributed by Plasmid repository at Arizona State University Biodesign Institute. Full-length cDNA clones contain a T7 transcriptional start sequence as well as an internal ribosome entry site (IRES), which is compatible with in vitro transcriptional-translational reagents. In addition, each clone contains an in-frame fused C-terminal GST tag. Each bacterial clone was grown overnight in 5 mL of Luria broth with 100 μg/mL ampicillin. Plasmid DNA was extracted using the Mini-Prep Kit (Promega, Madison, WI) following the manufacturer’s instructions. All plasmids were sequenced using M13 and T7 primers to confirm the identity of the insert and to ensure that there was no contamination of the plasmid stocks. Proteins were synthesized from plasmid DNA using the human in vitro protein expression kit (IVTT) (ThermoLifeScience, WI) following the manufacturer’s protocol with a few minimal modifications to adapt for 1.5 mL Eppendorf tubes. Around 1 μg of plasmid DNA was incubated in 25 μL of coupled transcription/translation reaction mix at 30 °C for 90 min. To enrich the GST-fusion proteins, we washed 2 mL of glutathione-Sepharose 4B (GE Uppsala, Sweden) three times with DPBS buffer and resuspended it in 12.5 mL of 1× DPBS. An aliquot of 125 μL of this bead slurry was added to each tube containing the completed translation reaction; then, the bead−protein mixture was rocked end-over-end for 16 h at 4 °C. Bound protein was then washed three times with DPBS and two times with 50 mM ammonium bicarbonate.

Liquid Chromatography and Mass Spectrometry Analysis

For both workflows, the second dimension was performed using a nano liquid chromatography system coupled online to a mass spectrometer. Each laboratory used distinct chromatographic conditions, all of which are summarized in Supplementary Table 2 in the Supporting Information. Settings for the four MS/MS platforms used (Orbitrap, Q Exactive, MaXis Impact, and 5600 triple TOF) are shown in the corresponding MIAPE documents that are included as Supporting Information.

SRM Analysis

On the basis of the MS/MS spectra observed in the LC−MS analysis of the samples, predictions derived from the sequences, and data available from databases, several peptide precursor and fragment ion masses were selected for each protein and assayed for SRM analysis. Analyses were performed on AB Sciex 4000 and 5500 QTRAP instruments. After precolumn desalting, tryptic digests (1 to 2 μg) were separated on C18 nanocolumns (75 μm id, 15 cm, 3 μm particle size) (LC Packings, Netherlands; Eksigent, USA; Thermo Scientific, USA) at a flow rate of 300 nL/min, with a 90 min linear gradient from 5 to 40% ACN in 0.1% formic acid. The mass spectrometer was interfaced with nanospray sources equipped with uncoated fused silica emitter tips (20 μm inner diameter, 10 μm tip, NewObjective, Woburn, MA) and was operated in the positive ion mode. MS source parameters were as follows: capillary voltage 2800 V, source temperature 150 °C, declustering potential (DP) 135 V, curtain and ion source gas (nitrogen) 20 psi, and collision gas (nitrogen) medium. The dwell time for each transition was 20 ms. Collision energies for each peptide were automatically computed using the embedded rolling collision energy equations of the MRM Pilot software. Proteotypic peptides were selected from the information available in Peptide Atlas and GPMDB. To confirm the identity of the peptides, an MRM-initiated detection and sequencing (MIDAS) experiment was performed for each peptide. The mass spectrometer was instructed to switch

Data Analysis

Raw MS and MS/MS data were translated into mascot general file (mgf) format and searched against the UniProtKB/SwissProt human database (release 2013_06, June 13) that contains 36 852 proteins and their corresponding reversed sequences using an in-house Mascot Server v. 2.4 (Matrix Science, London, U.K.). Search parameters were set as follows: carbamidomethyl cysteine as fixed modification and oxidized methionines and acetylation of the peptide amino termini as variable modifications. Peptide mass tolerance was set to 50 ppm, in both MS and MS/MS modes, and two missed cleavages were allowed. Typically, an accuracy of ±10 ppm was found for both MS and MS/MS spectra. False discovery rates (FDR ≤ 1% at the protein level) for protein identification were manually calculated.35 Data reporting was conducted as cited in our previous work.17 MS mgf files and their corresponding mzIdentML results were submitted to the ProteoRed MIAPE web repository11 to create both MIAPE MS and MSI reports through the ProteoRed MIAPE web toolkit.9 Afterward, MIAPE MS and MSI reports were translated to PRIDE XML and, together with the raw MS file for each sample fraction, were submitted to the


corresponding to 819 protein coding genes (94.57% of all the protein coding genes annotated in this chromosome). The number of protein-coding transcripts found using GENCODE v15 was 3095, corresponding to 808 protein coding genes (93.31% of the total number of protein coding genes). The tissue-specific abundance of transcripts is a useful guide for deciding in which biological samples a given protein is most likely to be found. This is especially relevant when searching for missing proteins. Even though a quantitative correlation between transcript and protein levels should not necessarily be expected, a binary presence/absence transcript analysis across tissues might provide a first outline of the tissue specific protein expression map. To extend our previous analysis based on microarray transcriptomic data,17 the expression of Chr16 protein coding genes was studied across the selected 16 tissue specific RNA-Seq data of the HBM. The quantification of transcripts is reported using FPKM as a measure of abundance. The resulting bimodal distribution for each of the 16 tissues (Supplementary Figure 2 in the Supporting Information) is in good agreement with data previously reported.21,36 The expression of Chr16 protein coding genes is widely distributed across the tissues analyzed (Figure 2B). Interestingly, the expression levels of those genes coding for missing proteins (187 genes) are statistically (p < 0.05) lower than the expression of genes coding for known proteins (699). Some exceptions are, however, observed, in particular, colon, lung, lymph nodes, skeletal muscle, and white blood cells, where both Chr16-associated missing and known protein coding genes are expressed at similar levels. These data further support our previous analysis17 and point to additional tissues in which to perform new shotgun experiments in the search for Chr16 missing proteins.
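The binary presence/absence analysis across tissues suggested above can be illustrated with a minimal Python sketch. The gene names, FPKM values, and the cutoff of 1 FPKM are hypothetical placeholders chosen for the example; no detection cutoff for the HBM data is stated in the text.

```python
def presence_matrix(fpkm_table, tissues, threshold=1.0):
    """Convert per-gene FPKM values into a 0/1 presence row per gene.

    fpkm_table: dict mapping gene -> dict mapping tissue -> FPKM.
    Tissues missing for a gene are treated as 0 FPKM (an assumption).
    """
    return {
        gene: [int(values.get(tissue, 0.0) >= threshold) for tissue in tissues]
        for gene, values in fpkm_table.items()
    }


def candidate_tissues(fpkm_by_tissue, threshold=1.0):
    """List tissues where a gene is detected, highest FPKM first."""
    detected = [(t, v) for t, v in fpkm_by_tissue.items() if v >= threshold]
    detected.sort(key=lambda pair: pair[1], reverse=True)
    return [tissue for tissue, _ in detected]


if __name__ == "__main__":
    tissues = ["colon", "liver", "testes"]
    # Hypothetical FPKM values for two genes coding for missing proteins.
    table = {
        "GENE_A": {"colon": 0.2, "liver": 0.0, "testes": 12.5},
        "GENE_B": {"colon": 4.1, "liver": 1.3},
    }
    print(presence_matrix(table, tissues))
    print(candidate_tissues(table["GENE_A"]))
```

Ranking the detected tissues by FPKM gives a simple way to prioritize samples for a new shotgun or SRM experiment.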

from MRM to enhanced product ion (EPI) scanning mode when an individual MRM signal exceeded 500 counts. Each precursor was fragmented a maximum of two times before being excluded for 30 s. Data were analyzed by submitting the MS/MS data to the ProteinPilot software (ABSciex, version 4.0). All searches were performed against the HPP chromosome 16 database. SRM data were analyzed using the quantitation module of the Analyst 1.5.1 software (ABSciex) with the IntelliQuant algorithm for peak integration.
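The MIDAS acquisition rule described above (an EPI scan is triggered when an MRM signal exceeds 500 counts, and each precursor is fragmented at most twice before being excluded for 30 s) can be modeled with a small Python sketch. It is a simplified illustration of the decision logic only, not the instrument control software; the class and method names are hypothetical.

```python
class EpiTrigger:
    """Toy model of MRM-triggered EPI scanning with dynamic exclusion."""

    def __init__(self, intensity_threshold=500, max_repeats=2, exclusion_s=30.0):
        self.intensity_threshold = intensity_threshold
        self.max_repeats = max_repeats
        self.exclusion_s = exclusion_s
        self._repeat_count = {}    # precursor m/z -> EPI scans since last exclusion
        self._excluded_until = {}  # precursor m/z -> end of exclusion window (s)

    def should_trigger(self, precursor_mz, intensity, time_s):
        """Return True if an EPI scan would be acquired for this MRM signal."""
        if intensity <= self.intensity_threshold:
            return False  # the signal must exceed the threshold
        if time_s < self._excluded_until.get(precursor_mz, float("-inf")):
            return False  # precursor is currently excluded
        count = self._repeat_count.get(precursor_mz, 0) + 1
        if count >= self.max_repeats:
            # Maximum number of fragmentations reached: start the exclusion window.
            self._excluded_until[precursor_mz] = time_s + self.exclusion_s
            self._repeat_count[precursor_mz] = 0
        else:
            self._repeat_count[precursor_mz] = count
        return True


if __name__ == "__main__":
    trigger = EpiTrigger()
    # Hypothetical precursor observed at several retention times (seconds).
    for t, counts in [(0.0, 600), (1.0, 700), (2.0, 800), (31.0, 800)]:
        print(t, trigger.should_trigger(523.8, counts, t))
```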



RESULTS AND DISCUSSION

Annotation of Chromosome 16

Detailed annotation of chromosomes is a dynamic aspect of pivotal relevance in the elaboration of the strategies focusing on the definition of the human proteome parts list. Chr16 contains 886 protein coding genes (Ensembl v70) including 699 known and 187 missing proteins (Supplementary Table 3 in the Supporting Information). The number of missing proteins is calculated as the difference between the total protein coding genes in Chr16 and the number of protein coding genes with experimental evidence at the level of protein in at least one of the following databases, according to the criteria adopted within the C-HPP: (1) neXtProt (release Nov 2012), gold proteins mapping 642 protein coding genes in Ensembl; (2) Human Peptide Atlas (release Dec 2012), canonical proteins mapping 423 protein coding genes in Ensembl; and (3) GPMdb (release 26 Nov 2012), green proteins mapping 630 protein coding genes in Ensembl (Figure 2A). This figure closely correlates with the current figure provided by neXtProt (139 missing proteins) that discounts hypothetical protein coding genes. Moreover, Human Protein Atlas references (Supplementary Table 4 in the Supporting Information) are available for 109 missing proteins. (No filtering criterion was considered; all positive antibodies in any tissue or cell line were included.) To provide further insight into Chr16 annotation, we analyzed 16 tissue specific RNA-Seq data sets from the Illumina HBM initiative. The number of transcripts detected in the analysis of the HBM ranged from 39 970 isoforms recovered in the liver sample to 160 710 transcript assemblies obtained in lymph-node samples. The mean number of transcripts for the 16 tissues was 88 578, including known and novel transcripts. The comparison of these transcript structures with the annotation of the genome available in the Ensembl v70 database allowed assignment of 82 942 nonredundant transcripts to the HBM. The minimum number of known isoforms was detected in liver samples (20 429 transcripts), while the maximum number of isoforms was found in lymph-node samples (36 613 transcripts). The mean number of known isoforms in Ensembl v70 in the tissues sequenced in the HBM was 29 101 transcripts. Similar calculations were performed for novel isoforms. A minimum of 3218 transcripts in the liver samples and a maximum of 18 913 for lymph-node samples were found, with a mean number of redundant novel transcripts for the 16 tissues of 10 518 transcripts. Using GENCODE v15 instead of Ensembl v70 for comparison purposes, we mapped 83 704 known transcripts in the HBM data set ranging from 23 123 transcripts in liver samples to 38 979 transcripts in the testes sample. The mean number of known isoforms for the 16 tissues was 31 592 transcripts. The novel transcripts ranged from 1823 in liver samples to 14 412 in lymph nodes, with a mean value of 7737 redundant novel isoforms. The coverage of known isoforms detected in the HBM samples for Chr16 was 3230 protein coding transcripts

Integration of Chromosome 16 Proteomic and Transcriptomic Profiles

To characterize in detail the proteome of Chr16, we selected four cell lines, MCF7 breast cancer human epithelial cells, CCD18 human colon fibroblasts, Jurkat human T lymphocytes, and Ramos B lymphocytes. Shotgun proteomics were conducted in parallel by six independent laboratories resulting in 24 2D-LC−MS/MS and gel/LC−MS/MS experiments following the standardization principles established previously.17 Assuming a FDR below 1% at the protein level, 8766 protein groups (PGs) were identified, 5072, 4508, 3276, and 6744 in MCF7, CCD18, Ramos, and Jurkat cells (Supplementary Table 5 in the Supporting Information). In terms of accuracy, 78.3% of the identified protein groups were identified with two or more peptides and 69.4% with three peptides or more. Regarding Chromosome 16, 385 PGs were identified, corresponding to 360 genes, 83.4% identified with two or more peptides and 73.2% with three or more. Interestingly, two missing proteins, ZNF200 (P98182) and CCDC154 (A6NI56), were detected in these experiments. One of the most important variations regarding the previous data analysis pipeline is the inclusion of the PAnalyzer algorithm.37 This new feature reinforced the reliability of the results because it considered the peptide overlap between proteins and proposed a rearrangement of identified proteins into evidence categories according to their common peptides. The four differentiable categories according to Prieto’s work37 are: conclusive, nonconclusive, ambiguous group, and indistinguishable proteins. (In this study, proteins within the indistinguishable category were excluded to avoid overinterpretations.) The PAnalyzer algorithm was inserted in the

dx.doi.org/10.1021/pr400721r | J. Proteome Res. 2014, 13, 158−172

Journal of Proteome Research

Article

Figure 2. Annotation of chromosome 16 using HUPO guidelines. (A) The difference between the protein coding genes in Ensemblv70 and the proteins with experimental evidence in GPMdb, neXtProt, and Human Peptide Atlas results in the group of 187 missing proteins. (B) Distribution of FPKM in the 16 tissue samples of HBM comparing known and missing protein genes. The p values show the statistical significance of the decrease in the expression of genes coding for Chr16 missing proteins compared with expression levels of genes coding for Chr16 known proteins.

data analysis pipeline as a customized library built by PAnalyzer’s developers, and it is called once the data are stored in the database and prior to the final report within the ProteoRed MIAPE Extractor tool (http://www.proteored.org/miape) data workflow (Supplementary Figure 3 in the Supporting Information). To test the suitability of the new pipeline, we analyzed data from previous experiments,17 resulting in a 99.7% overlap in terms of protein coverage. Furthermore, the reported entries were PGs with a limited number containing more than one proteoform such as different isoforms. In terms of accuracy, 76% of the identified PGs were identified with two or more peptides and 64.7% with three or more, indicating a substantial increase in terms of protein identification confidence.


Figure 3. Venn diagram representing shotgun proteomic results. (A) Protein coding genes whose products were detected in the shotgun experiments on the CCD18, MCF7, Jurkat, and Ramos cell lines. (B) The subset of those genes located on Chr16. Globally, 360 Chr16 proteins were identified and only 27% were found to be common to the four cell types.

We next wanted to link the identified proteins with the transcriptomic landscape provided by the HBM data. Uniprot accession numbers from the MS/MS queries were mapped onto Ensemblv70 gene codes that include HBM RNA-Seq annotation. The biomart system provided by Ensembl was initially used, but only roughly one-half of the protein coding genes of Chr16 were mapped on Uniprot_Swissprot and 80% on Uniprot_KB. Therefore, to increase mapping efficiency, the application PICR, which uses the protein sequence as the basis for performing the mappings in addition to the accession code of the entries,38 was used. The shotgun experiments identified a total of 7979 protein coding genes: 4434 genes detected in the CCD18 cell line, 6612 genes in Jurkat, 4976 genes in MCF7, and 3246 genes in Ramos (Figure 3A). For Chr16, 360 protein coding genes (40.63% of the chromosome) were found, with 177, 311, 245, and 141 genes detected in CCD18, Jurkat, MCF7, and Ramos, respectively (Figure 3B). The number of protein coding genes detected in all cell lines was 100, and the number of proteins detected in only one of the four cell lines analyzed was 112, suggesting that a significant number of Chr16 proteins must be searched for in specific tissues/cell lines.

To further address this issue, we then analyzed data using the 16 tissues HBM RNA-Seq data set. To simplify the expression analysis, we first divided the detected transcripts concerning the proteins identified in our shotgun analysis into different categories based on the quartiles of the overall FPKM distribution:36 low-expression genes (>5.78 FPKM); genes with medium expression (>20.43 FPKM); and genes with high expression (>128.53 FPKM). The application of this thresholding method established that 1189 transcripts were highly expressed and 6659 genes were very low or low-expressed in all tissues, while the rest of the detected genes had tissue-specific expression levels. This result highlights the need to make a careful selection of the cell line or tissue sample on which to concentrate the experimental efforts for the detection of certain proteins. For Chr16 protein coding genes, 33 (3.72%) were low or very low expressed, and only 7 genes (0.79%) were highly expressed in all samples (Figure 4A and Supplementary Table 3 in the Supporting Information), while the expression indices for the remaining genes were tissue-dependent. Regardless of expression level, 425 genes (47.97%) were detected in all tissues, while 67 genes (13.88%) were not detected in any of the tissues of the HBM (Figure 4C and Supplementary Table 3 in the Supporting Information). This last group of undetected transcripts includes Chr16 genes coding for both known (34) and missing (43) proteins, suggesting that their expression is restricted to other tissues or to different biological/pathological conditions.

We then mapped the set of Chr16 proteins identified in the shotgun experiment (Figure 4B,D) with 354 successful annotations, finding that 6 protein coding genes lacked transcriptional evidence in the RNA-Seq experiment (BOLA2B, HBA1, NOMO3, HBQ1, SEPT1, and SULT1A4), although all are known protein-coding genes. Only 19 genes (5.27%) are very low or low expressed, while 275 genes (76.4%) are highly expressed, pointing out the bias of proteomic experiments toward detecting products of highly expressed genes (Supplementary Figure 4 in the Supporting Information).
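The quartile-based binning of FPKM values described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the three cut-offs are the ones quoted in the text, while the "very low" bin for values at or below the first cut-off, the function names, and the example FPKM values are assumptions introduced here.

```python
from bisect import bisect_left

# Quartile cut-offs (FPKM) quoted in the text; thresholds are strict (">").
FPKM_CUTOFFS = (5.78, 20.43, 128.53)
# "very low" for values at or below the first cut-off is an assumption
# inferred from the text's "very low or low expressed" wording.
LEVELS = ("very low", "low", "medium", "high")


def expression_level(fpkm):
    """Return the expression bin of a single FPKM value."""
    # bisect_left counts the cut-offs strictly below fpkm, so a value
    # exactly equal to a cut-off stays in the lower bin.
    return LEVELS[bisect_left(FPKM_CUTOFFS, fpkm)]


def profile_across_tissues(fpkm_by_tissue):
    """Summarize one gene's expression over all tissues."""
    if not fpkm_by_tissue:
        raise ValueError("at least one tissue is required")
    levels = {expression_level(v) for v in fpkm_by_tissue.values()}
    if levels == {"high"}:
        return "high in all tissues"
    if levels <= {"very low", "low"}:
        return "very low or low in all tissues"
    return "tissue-dependent"


# Hypothetical FPKM values for one gene in three tissues.
example = {"liver": 3.1, "testes": 45.0, "lymph node": 210.4}
print({tissue: expression_level(v) for tissue, v in example.items()})
print(profile_across_tissues(example))
```

A gene falls into the "tissue-dependent" group whenever its bins differ across tissues in a way that is neither uniformly high nor uniformly low or very low.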

Transcriptomic Map for Chromosome 16 using RNA-Seq Data from ENCODE

The previous bioinformatics approach can be complemented with a different way to extract valuable information from ENCODE, which provides a vast amount of data from experiments on different human cell lines, including RNA-Seq assays. A first logical approach to integrating data from ENCODE in the HPP is the identification of those cell lines with high expression values for protein-coding genes, especially those classified as “missing”, for which no strong proteomic evidence is available. This transcriptomic map can allow the selection of the best cell lines to conduct proteomic studies for this project. The Spanish consortium has previously followed a similar approach with microarray data.17 This is the first step in this effort, whose final goal would be to have a transcriptomics dashboard where each cell line can be explored to check for expression evidence and to supplement this information with protein expression data produced in the C-HPP. We started the analysis by checking the availability of RNA-Seq data for all cell lines used by the Spanish Consortium. Data were available for MCF-7, K-562, and HepG2. Table 1 shows the number of expressed genes in the whole genome and in Chr16 using this quantification approach. In line with previous results, the expression level of most of the missing proteins is generally very low, with some exceptions. Interestingly, the number of missing protein coding genes that are detected as expressed using this approach varies from cell line to cell line, as was expected. This provides valuable information for selection of the proper experimental setting in the context of HPP. Overall, 112 transcripts coding for missing proteins were identified, 96 in MCF7, 81 in HepG2, and 74 in K-562, and 59 were found to be common to the three cell lines. Despite the high degree of overlap, there are some unique expressed genes that encode


Figure 4. Integration of proteomic and transcriptomic landscape of Chr16. Expression abundance of all Chr16 protein coding genes in the 16 HBM tissues analyzed in this study (A) and for Chr16 protein coding genes whose encoded proteins were detected in our shotgun experiments (B). The distributions of detection as a function of the tissues in which the genes are quantified are also shown (C,D).

Table 1. Gene Expression Results Using the Quantification Method Described in This Paper^a

expression level  no expression  weak expression  moderate expression  high expression  extremely high expression  total expressed  total

MCF7 (threshold = 0.030)
# genes WG        34967          12716            7718                 984              295                        21713            56680
# genes Chr16     1320           624              359                  37               8                          1028             2348
# PC genes Chr16  195            314              337                  34               6                          691              886
# PC missing      91             86               9                    1                0                          96               187

K-562 (threshold = 0.035)
# genes WG        36418          12206            6902                 892              262                        20262            56680
# genes Chr16     1420           606              289                  30               3                          928              2348
# PC genes Chr16  270            317              273                  24               2                          616              886
# PC missing      113            70               4                    0                0                          74               187

HepG2 (threshold = 0.038)
# genes WG        38241          11433            6142                 637              227                        18439            56680
# genes Chr16     1423           609              287                  27               2                          925              2348
# PC genes Chr16  229            360              271                  25               1                          657              886
# PC missing      106            75               6                    0                0                          81               187

^a For each of the three cell lines, we report the number of genes detected at each expression level (columns). The first two rows for each cell line correspond to the number of genes in the whole genome (WG) and in Chromosome 16, respectively. The third row corresponds to the number of protein coding (PC) genes in chromosome 16, while the fourth row reflects the expression status of the genes encoding the missing proteins of chromosome 16. The expression threshold for each cell line is shown in parentheses.
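As a quick consistency check on Table 1, the "total expressed" column should equal the sum of the weak, moderate, high, and extremely high columns, and adding the "no expression" count should give the row total. The sketch below runs this check on the Chr16 missing-protein rows only; the counts are copied from the table, while the dictionary layout and function name are illustrative choices made here.

```python
# Chr16 missing-protein genes per expression level, copied from Table 1.
MISSING_PC_GENES = {
    "MCF7": {"no expression": 91, "weak": 86, "moderate": 9,
             "high": 1, "extremely high": 0},
    "K-562": {"no expression": 113, "weak": 70, "moderate": 4,
              "high": 0, "extremely high": 0},
    "HepG2": {"no expression": 106, "weak": 75, "moderate": 6,
              "high": 0, "extremely high": 0},
}


def total_expressed(counts):
    """Sum every expression level except 'no expression'."""
    return sum(n for level, n in counts.items() if level != "no expression")


for cell_line, counts in MISSING_PC_GENES.items():
    # Prints the cell line, the expressed total, and the overall total.
    print(cell_line, total_expressed(counts), sum(counts.values()))
```

The expressed totals obtained this way (96, 74, and 81) match the per-cell-line numbers of missing-protein transcripts quoted in the text, and every row sums to the 187 missing proteins.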

missing proteins differentially among the cell lines. Supplementary Table 6 in the Supporting Information contains a functional analysis of the missing proteins common to the three cell lines as well as those that are unique to each one. Little overlap between the groups is observed, reflecting the need to have a complete picture of the expressed genes in all cell lines.

Similar information is shown in the heatmap of Supplementary Figure 5 in the Supporting Information, where the nonoverlapping functions are more evident. Of the three cell lines previously studied by microarray technology (MCF7, CCD18, and Jurkat), only MCF-7 was available in ENCODE. There is a 78% agreement between


Table 2. Proteins Detected by Targeted SRM Proteomic Analyses

protein accession  gene_Id   log E    no. laboratories  no. peptides
A5YKK6             CNOT1     −1186.1  2                 1
B5ME19             EIF3CL    −339.5   3                 4
O14983             ATP2A1    −392.6   3                 1
O43809             NUDT21    −206.8   3                 4
O60884             DNAJA2    −173.1   3                 1
O75150             RNF40     −187.3   2                 2
P00505             GOT2      −564.2   3                 3
P00738             HP        −1400.2  3                 4
P04075             ALDOA     −1076.2  3                 3
P15170             GSPT1     −284.5   3                 3
P15559             NQO1      −384.3   3                 3
P15880             RPS2      −357.3   3                 4
P22695             UQCRC2    −361.5   3                 4
P31146             CORO1A    −598.1   3                 5
P35637             FUS       −365     3                 4
P49411             TUFM      −411.3   3                 9
P49588             AARS      −543     3                 3
P63279             UBE2I     −210.9   3                 1
P69849             NOMO3     −369.4   3                 2
P80404             ABAT      −230.7   2                 1
Q08AM6             VAC14     −187.1   3                 4
Q12789             GTF3C1    −1039.8  3                 1
Q12931             TRAP1     −392.6   3                 1
Q13509             TUBB3     −901.2   3                 7
Q14019             COTL1     −236.4   3                 2
Q14694             USP10     −177.7   3                 5
Q14807             KIF22     −387.7   2                 1
Q15393             SF3B3     −702     3                 4
Q16775             HAGH      −196.9   3                 2
Q49A26             GLYR1     −271.3   3                 2
Q53FZ2             ACSM3     −198.8   3                 1
Q68EM7             ARHGAP17  −1107.8  2                 1
Q6P2E9             EDC4      −435.1   3                 3
Q86W42             THOC6     −179.2   3                 4
Q8TBB5             KLHDC4    −285.4   3                 1
Q92793             CREBBP    −408.5   2                 1
Q93009             USP7      −312.2   3                 5
Q96DA0             ZG16B     −265     2                 1
Q96QK1             VPS35     −335.9   3                 3
Q9NUI1             DECR2     −185     2                 2
Q9NUU7             DDX19A    −203.5   3                 7
Q9UMR2             DDX19B    −219.6   2                 1
Q9UQ35             SRRM2     −1962.2  3                 3

Sample columns in the original table: Ramos, Mcf7, Ccd18, Tc28, Huh7, Jurkat, Huaecs. The per-sample detection marks (×) are not recoverable from this copy.

microarray and RNA-Seq analyses for this sample, which is considerably high. When we compared our results with protein expression, as described in ref 17, the RNA-Seq assay matched in 42% of cases, which is higher than the 35% agreement with microarray data. This higher level of agreement is expected due to the quality and coverage of the RNA sequencing data.

SRM/MRM Data

The development of quantitative methods for Chr16 proteins is one of our main focuses, with special attention to the missing proteins group and proteoforms that may be uncovered by the combined analysis of large-scale transcriptomic and proteomics data sets. Moreover, as has been previously mentioned, transcriptomic profiles provide valuable information relative to tissues and cell lines, where the missing proteins can be most likely detected by targeted approaches. To deliver reliable SRM methods, we proposed a stepwise strategy that involves data sharing and cross-validation across different laboratories, and this process must be optimized under standard conditions. With the aim of establishing workflows and SOPs within the consortium, a set of Chr16 proteins was selected and distributed to six different laboratories for SRM assay development. An initial set of 120 Chr16 proteins was selected on the basis of their GPMDB scores, log e in the range −175 to −6000, belonging to the group defined as “known” Chr16 proteins. These were sorted by their log e, ranked, and distributed sequentially to six different laboratories so that a group of 20 proteins spanning a similar range of log e values was assigned to each laboratory. Each laboratory explored the detectability by SRM of the assigned proteins in digests from at least three cell lines used in the Chr16 SpHPP consortium: MCF7, CCD18, and Ramos, which are expected to provide a large coverage of expressed Chr16 proteins according to our previous characterization.17 Additional cell lines


Table 3. SRM Results for 24 Missing Proteins Assayed^a

protein accession  gene_id   selected peptides  detected peptides  detected transitions  identified peptides (MS/MS)
O60359             CACNG3    12                 9                  43                    8
Q6UXU4             GSG1L     7                  5                  33                    7
Q8WV35             LRRC29    9                  9                  53                    9
Q96A59             MARVELD3  6                  6                  34                    6
Q9NXF8             ZDHC7     7                  2                  10                    2
Q6PL45             C16ORF79  7                  5                  22                    0
Q8TEW6             DOK4      11                 9                  44                    8
Q8TB05             FAM100A   1                  1                  8                     2
Q9NWW0             HCFC1R1   7                  4                  17                    2
Q9HBE5             IL21R     12                 8                  36                    5
O75324             SNN       3                  2                  28                    2
Q96B96             TMEM159   3                  3                  28                    2
Q9BTX3             TMEM208   3                  1                  5                     0
Q96H86             ZNF764    12                 6                  22                    7
Q8TAZ6             CKLFSF2   6                  2                  12                    1
Q8IZF4             GPR114    11                 3                  18                    1
Q6PII5             HAGHL     10                 7                  43                    6
Q8TDN1             KCNG4     11                 3                  12                    2
Q8N635             C16ORF73  14                 4                  26                    1
Q8WTQ4             C16ORF78  10                 8                  40                    2
Q8IUW3             SPATA2L   9                  9                  64                    6
Q8WVE7             TMEM170A  6                  1                  6                     0
P17023             ZNF19     15                 3                  11                    0
A8K8V0             ZNF785    10                 7                  58                    7

^a Peptide and transition data for each protein are shown in Supplementary Table S1 in the Supporting Information.

or plasma samples were also assayed in some laboratories (Table 2). Initial selection of proteotypic peptides and transitions to monitor for each protein was made from data available in databases, experimental results, or in silico prediction from the sequence. For 106 out of the 120 proteins selected (88.3%), a minimum of three coeluting SRM transition signals were observed for a number of peptides ranging from 1 to 10 per protein, with an average of 4.16 peptides/protein. While it is probable that many of these signals correspond to the detection of the targeted peptides, only those peptides for which SRM-signal-triggered MS2 spectra clearly confirmed the sequence were considered as bona fide SRM assayable peptides. These amounted to 172 peptides from 51 proteins (42.5% of those initially selected). It is worth mentioning that, as an additional validation, 47 peptides belonging to 12 of the assayed proteins were obtained by chemical synthesis and analyzed using the developed SRM assays. In all cases, the results confirmed the transitions selected, and the observed retention times closely matched those of the endogenous peptides. The use of synthetic peptides to corroborate the SRM methods developed will be restricted to those cases where results are not conclusive. Synthetic labeled peptides will be used for selected proteins of interest to be monitored in the associated B/D Chr16 projects.

After the initial round of SRM method development at the six laboratories, a second round of cross-validation was performed, assigning each of the final 51 proteins detected to two laboratories different from the laboratory that initially developed the SRM method. After this second round of analysis validation, a total of 149 peptides from 49 proteins (Table 2) met the criteria of having been successfully detected, with at least three transitions per peptide by at least two different laboratories. As indicated in Table 2, in most cases proteins were also detected in more than one sample type. We compared the distribution of log e values of the 49 proteins that were validated to the distribution of the 120 proteins initially selected, and no major differences were observed. Therefore, there is no bias in the validated proteins toward high log e values. Overall, we have been able to develop reliable SRM assays to monitor 49 Chr16 proteins (40.8% of those initially selected, representing 5% of all Chr16 proteins). The corresponding peptide and transition information is given in Supplementary Table 7 in the Supporting Information. All relevant experimental conditions and peptide and transitions data from all laboratories have been gathered in a MySQL relational database constructed to keep track of and query all Chr16 SRM information and from which the curated data will be transferred to public repositories over the course of the project.

So far we have attained a 41% fulfilment ratio with our current experimental strategy. Although this could be considered a reasonable ratio, taking into account that total cell lysate or plasma digests were used for the analysis, it is clear that other experimental options should be explored to complete the SRM goal. We envisage that, in addition to the selection of other tissue/cell lines according to the available transcriptomic data, fractionation of samples will be an alternative strategy to pursue in the future development of the Chr16 project, as this should increase the detectability of those proteins underrepresented in total tissue/cell extracts. Moreover, the use of our accurate MS/MS data set might significantly increase the efficiency of the targeted experiments. Accordingly, one of our next goals is to build a library including all MIAPE compliant MS/MS spectra generated within the consortium. To examine the suitability of this approach, we performed a simple test using Skyline software (MacCoss Lab, Department of Genome Sciences, University of Washington). All MS/MS spectra reported from one of the shotgun experiments (CNB_Jurkat_R1_HPLCRP) were used to create the library, and the SRM-detected proteins were added to map all peptides onto the available spectra. Matching was verified for all peptides pertaining to 81.3% of the SRM proteins, further supporting the interest of using this approach to design targeted SRM methods.

Finally, for those proteins that cannot be detected directly upon fractionation as well as for many proteins in the missing group we will rely, as discussed, on data from recombinant proteins to design suitable SRM assays. It is worth mentioning that we have decided to use this strategy for proteins with log e values above −15, as it was considered that below this threshold their detection would be challenging.17 Up to 28 Chr16 proteins from this group have been expressed in a cell-free system and characterized. Data were accepted when the proteins were detected with at least three precursor ions with three transitions each. According to this criterion, assays for 24 proteins were selected (Table 3 and Supplementary Table 8 in the Supporting Information), most of them verified by MIDAS MS/MS data (exceptions were TMEM208, TMEM170A, ZNF19, and c16orf79). Interestingly, 11 of these 24 proteins are among the 139 missing proteins reported in neXtProt for Chr16. In general, the agreement between the theoretical and observed retention times of peptides in the reversed-phase chromatography as well as the good correlation between the SRM transitions and the intensity of the corresponding fragment ions in the MIDAS experiment reinforce the reliability of the results. These methods are currently being used for the detection of these proteins in whole digests of the four cell lines adopted in the consortium.
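The cross-validation criterion used for the SRM assays (a peptide counts as validated when it is detected with at least three transitions by at least two different laboratories) can be expressed as a simple filter. The sketch below is only an illustration under stated assumptions: the record layout, the function name, and the example records are hypothetical and are not taken from the consortium's MySQL database.

```python
from collections import defaultdict
from typing import NamedTuple


class Detection(NamedTuple):
    """One laboratory's SRM result for one peptide (hypothetical layout)."""
    protein: str
    peptide: str
    laboratory: str
    n_transitions: int


def cross_validated(detections, min_transitions=3, min_labs=2):
    """Return the (protein, peptide) pairs meeting the validation criterion."""
    labs_by_peptide = defaultdict(set)
    for d in detections:
        # A detection only counts if enough transitions were observed.
        if d.n_transitions >= min_transitions:
            labs_by_peptide[(d.protein, d.peptide)].add(d.laboratory)
    # Keep peptides confirmed by enough distinct laboratories.
    return {key for key, labs in labs_by_peptide.items()
            if len(labs) >= min_labs}


# Hypothetical records; protein and peptide names are placeholders.
records = [
    Detection("PROT_A", "PEPTIDEK", "lab1", 4),
    Detection("PROT_A", "PEPTIDEK", "lab2", 3),
    Detection("PROT_A", "SEQUENCER", "lab1", 5),  # seen by one lab only
    Detection("PROT_B", "ANOTHERK", "lab1", 3),
    Detection("PROT_B", "ANOTHERK", "lab3", 2),   # too few transitions
]
validated = cross_validated(records)
print(sorted(validated))
print(sorted({protein for protein, _ in validated}))
```

With the defaults changed, the same filter could be adapted to other acceptance rules; the recombinant-protein criterion in the text (three precursor ions with three transitions each) is a per-protein rule and would need a separate grouping step.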

PERSPECTIVES AND CONCLUSIONS

The understanding of human biology relies on the precise definition of all of the molecular building blocks and their interactions to configure cellular, tissue, and organ functional maps. Detailed protein/gene chromosome annotation then arises as a pivotal issue that must be addressed by innovative approaches such as those focusing on the integration of data from large-scale genomic, transcriptomic, and proteomic experiments. We have developed bioinformatics workflows to combine data from UniprotKB, Ensembl, HBM, and our own shotgun proteomics data set from four different cell lines to obtain insight into the Chr16 proteome. The number of protein-coding genes is 886, including 187 that encode proteins without experimental MS evidence, according to the adopted C-HPP criteria. Tissue-specific expression patterns for Chr16 are estimated, providing valuable information to select tissues/cell lines to detect and quantitate the missing proteins. Interestingly, expression of 67 genes coding for Chr16 proteins was not detected in the collection of 16 tissue-specific RNA-Seq data sets analyzed, not specifically pertaining to the group of missing proteins (roughly 65%). Databases are currently being implemented to search MS/MS spectral data and to explore our capacity to assign novel transcripts and isoforms, taking advantage of the large data collection resulting from the multicenter shotgun proteomic analysis performed on four independent cell lines that allowed mapping of 41.4% of Chr16 proteins. Our bioinformatics analysis of RNA-Seq experiments from ENCODE projects has shown the potential to identify those cell lines where strong evidence of expression of missing protein-coding genes is detected. This approach would allow us to create a global dashboard for transcriptomic evidence of protein coding genes in all cell lines used in the ENCODE project. We are planning to extend this approach to all chromosomes in C-HPP, providing the community with a valuable exploratory tool. Furthermore, the RNA-Seq analysis can also be used to extend the analysis carried out with the HBM, using RNA-Seq information as a search space for MS data.

Targeted methods for detection of 49 proteins of Chr16 have already been developed. All of these methods have been validated in at least two independent laboratories and in three distinct cell lines. For Chr16 missing proteins, optimized SRM methods for 24 recombinant Chr16 proteins are described in this report, and tests on complex biological matrices are currently ongoing. The next phase has already started, involving five independent research units that will focus on delivering the quantitative methods for the above-mentioned missing proteins and on targeting 100 additional Chr16 proteins, based on the MS/MS data from our shotgun experiments. Bioinformatics for the management of proteomics data is also an active area within the Chr16 consortium. The MIAPE extractor pipeline has been improved, and we are working together with the ProteomeXchange initiative to interface both applications and to allow automatic MIAPE compliant data deposition. Moreover, the beta version of a database for storing and consulting SRM data is already in place and will be open access in the next few months. In summary, Chr16 annotation is being explored in depth following a proteogenomic approach. The resulting information, in combination with PTMs analysis and the development of targeted methods for detection and quantitation of protein isoforms, will surely provide valuable biological information that will inspire more ambitious biomedical initiatives for the benefit of diseased individuals.

ASSOCIATED CONTENT

Supporting Information

Binned expression of all genes and background regions across all samples for each cell line and false discovery rate (green) and false negative rates (blue) at each expression level represented for each cell line. Distribution of abundance (FPKM) for the different tissues of HBM. Data management flow-chart. Expression level distribution of protein coding genes coding for the proteins detected in CCD18, Jurkat, MCF7 and Ramos cell lines according to the RNASeq data from the group of 16 tissues analyzed in the HBM, from low expression to high expression. Heatmap that shows the GO Biological Processes terms with those missing proteins that are unique to each MCF-7, K-562 and HepG2. Samples of Human Body Map 2.0 considered in this study. NanoLC conditions used in the SG proteomics analysis. ANNOTATION CHR16 ENSEMBLv70. Validated SRM methods. SRM methods details for 24 recombinant missing proteins from chr 16. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*Tel: +34948194700. Fax: +34948194717. E-mail: fjcorrales@unav.es.

Present Address

⬠ Proteomics Unit, Cancer Epigenetics and Biology Program (PEBC), Bellvitge Biomedical Research Institute (IDIBELL), Barcelona, Spain.

Author Contributions

◇ J.P.A. and F.J.C. share senior authorship.


Notes

targets for the diagnosis of colorectal cancer by using phage microarrays. Mol. Cell. Proteomics 2011, 10, M110.001784. (13) Pitarch, A.; Nombela, C.; Gil, C. Prediction of the clinical outcome in invasive candidiasis patients based on molecular fingerprints of five anti-Candida antibodies in serum. Mol. Cell. Proteomics 2011, 10, M110.004010. (14) Calamia, V.; Fernández-Puente, P.; Mateos, J.; Lourido, L.; Rocha, B.; Montell, E.; Vergés , J.; Ruiz-Romero, C.; Blanco, F. J. Pharmacoproteomic study of three different chondroitin sulfate compounds on intracellular and extracellular human chondrocyte proteomes. Mol. Cell. Proteomics 2012, 11, M111.013417. (15) de la Cuesta, F.; Alvarez-Llamas, G.; Maroto, A. S.; Donado, A.; Zubiri, I.; Posada, M.; Padial, L. R.; Pinto, A. G.; Barderas, M. G.; Vivanco, F. A proteomic focus on the alterations occurring at the human atherosclerotic coronary intima. Mol. Cell. Proteomics 2011, 10, M110.003517. (16) Sánchez-Quiles, V.; Mora, M. I.; Segura, V.; Greco, A.; Epstein, A. L.; Foschini, M. G.; Dayon, L.; Sánchez, J.-C.; Prieto, J.; Corrales, F. J.; Santamaria, E. HSV-1 Cgal+ infection promotes quaking RNA binding protein production and induces nuclear-cytoplasmic shuttling of quaking I-5 isoform in human hepatoma cells. Mol. Cell. Proteomics 2011, 10, M111.009126. (17) Segura, V.; Medina-Aunon, J. A.; Guruceaga, E.; Gharbi, S. I.; González-Tejedo, C.; Sanchez del Pino, M. M.; Canals, F.; Fuentes, M.; Casal, J. I.; Martínez-Bartolomé, S.; Elortza, F.; Mato, J. M.; Arizmendi, J. M.; Abian, J.; Oliveira, E.; Gil, C.; Vivanco, F.; Blanco, F.; Albar, J. P.; Corrales, F. J. Spanish human proteome project: dissection of chromosome 16. J. Proteome Res. 2013, 12, 112−122. (18) ENCODE Project Consortium; Dunham, I.; Kundaje, A.; Aldred, S. F.; Collins, P. J.; Davis, C. A.; Doyle, F.; Epstein, C. B.; Frietze, S.; Harrow, J.; Kaul, R.; Khatun, J.; Lajoie, B. R.; Landt, S. G.; Lee, B.-K.; Pauli, F.; Rosenbloom, K. 
R.; Sabo, P.; Safi, A.; Sanyal, A.; Shoresh, N.; Simon, J. M.; Song, L.; Trinklein, N. D.; Altshuler, R. C.; Birney, E.; Brown, J. B.; Cheng, C.; Djebali, S.; Dong, X.; Dunham, I.; Ernst, J.; Furey, T. S.; Gerstein, M.; Giardine, B.; Greven, M.; Hardison, R. C.; Harris, R. S.; Herrero, J.; Hoffman, M. M.; Iyer, S.; Kelllis, M.; Khatun, J.; Kheradpour, P.; Kundaje, A.; Lassmann, T.; Li, Q.; Lin, X.; Marinov, G. K.; Merkel, A.; Mortazavi, A.; Parker, S. C. J.; Reddy, T. E.; Rozowsky, J.; Schlesinger, F.; Thurman, R. E.; Wang, J.; Ward, L. D.; Whitfield, T. W.; Wilder, S. P.; Wu, W.; Xi, H. S.; Yip, K. Y.; Zhuang, J.; Bernstein, B. E.; Birney, E.; Dunham, I.; Green, E. D.; Gunter, C.; Snyder, M.; Pazin, M. J.; Lowdon, R. F.; Dillon, L. A. L.; Adams, L. B.; Kelly, C. J.; Zhang, J.; Wexler, J. R.; Green, E. D.; Good, P. J.; Feingold, E. A.; Bernstein, B. E.; Birney, E.; Crawford, G. E.; Dekker, J.; Elinitski, L.; Farnham, P. J.; Gerstein, M.; Giddings, M. C.; Gingeras, T. R.; Green, E. D.; Guigó, R.; Hardison, R. C.; Hubbard, T. J.; Kellis, M.; Kent, W. J.; Lieb, J. D.; Margulies, E. H.; Myers, R. M.; Snyder, M.; Starnatoyannopoulos, J. A.; Tennebaum, S. A.; Weng, Z.; White, K. P.; Wold, B.; Khatun, J.; Yu, Y.; Wrobel, J.; Risk, B. A.; Gunawardena, H. P.; Kuiper, H. C.; Maier, C. W.; Xie, L.; Chen, X.; Giddings, M. C.; Bernstein, B. E.; Epstein, C. B.; Shoresh, N.; Ernst, J.; Kheradpour, P.; Mikkelsen, T. S.; Gillespie, S.; Goren, A.; Ram, O.; Zhang, X.; Wang, L.; Issner, R.; Coyne, M. J.; Durham, T.; Ku, M.; Truong, T.; Ward, L. D.; Altshuler, R. C.; Eaton, M. L.; Kellis, M.; Djebali, S.; Davis, C. A.; Merkel, A.; Dobin, A.; Lassmann, T.; Mortazavi, A.; Tanzer, A.; Lagarde, J.; Lin, W.; Schlesinger, F.; Xue, C.; Marinov, G. K.; Khatun, J.; Williams, B. A.; Zaleski, C.; Rozowsky, J.; Röder, M.; Kokocinski, F.; Abdelhamid, R. F.; Alioto, T.; Antoshechkin, I.; Baer, M. 
T.; Batut, P.; Bell, I.; Bell, K.; Chakrabortty, S.; Chen, X.; Chrast, J.; Curado, J.; Derrien, T.; Drenkow, J.; Dumais, E.; Dumais, J.; Duttagupta, R.; Fastuca, M.; Fejes-Toth, K.; Ferreira, P.; Foissac, S.; Fullwood, M. J.; Gao, H.; Gonzalez, D.; Gordon, A.; Gunawardena, H. P.; Howald, C.; Jha, S.; Johnson, R.; Kapranov, P.; King, B.; Kingswood, C.; Li, G.; Luo, O. J.; Park, E.; Preall, J. B.; Presaud, K.; Ribeca, P.; Risk, B. A.; Robyr, D.; Ruan, X.; Sammeth, M.; Sandu, K. S.; Schaeffer, L.; See, L.-H.; Shahab, A.; Skancke, J.; Suzuki, A. M.; Takahashi, H.; Tilgner, H.; Trout, D.; Walters, N.; Wang, H.; Wrobel, J.; Yu, Y.; Hayashizaki, Y.; Harrow, J.; Gerstein, M.; Hubbard, T. J.; Reymond, A.; Antonarakis, S. E.; Hannon, G. J.; Giddings, M. C.; Ruan, Y.; Wold, B.; Carninci, P.; Guigó, R.; Gingeras, T. R.; Rosenbloom, K. R.; Sloan, C. A.; Learned, K.;

The authors declare no competing financial interests.



ACKNOWLEDGMENTS

All participating laboratories are members of ProteoRed-ISCIII. This work was supported by ProteoRed and the Carlos III National Health Institute Agreement, ProteoRed-ISCIII; the agreement between FIMA and the “UTE project CIMA”; grants SAF2011-29312 from Ministerio de Ciencia e Innovación and ISCIII-RETIC RD06/0020 to FJC; and EU FP7 grant ProteomeXchange [grant number 260558]. APM and DTM have been funded by Spanish grants from Ministerio de Ciencia e Innovación (BIO2010-17527) and the Government of Madrid (P2010/BMD-2305). We thank the BBVA Foundation for its support of HUPO initiatives.



REFERENCES

(1) Orchard, S.; Hermjakob, H.; Taylor, C. F.; Potthast, F.; Jones, P.; Zhu, W.; Julian, R. K.; Apweiler, R. Second proteomics standards initiative spring workshop. Expert. Rev. Proteomics 2005, 2, 287−289. (2) Omenn, G. S. Exploring the human plasma proteome. Proteomics 2005, 5, 3223−3225. (3) Hamacher, M.; Marcus, K.; Stephan, C.; Klose, J.; Park, Y. M.; Meyer, H. E. HUPO Brain Proteome Project: toward a code of conduct. Mol. Cell. Proteomics 2008, 7, 457. (4) Yamamoto, T.; Langham, R. G.; Ronco, P.; Knepper, M. A.; Thongboonkerd, V. Towards standard protocols and guidelines for urine proteomics. Proteomics 2008, 8, 2156−2159. (5) Legrain, P.; Aebersold, R.; Archakov, A.; Bairoch, A.; Bala, K.; Beretta, L.; Bergeron, J.; Borchers, C. H.; Corthals, G. L.; Costello, C. E.; Deutsch, E. W.; Domon, B.; Hancock, W.; He, F.; Hochstrasser, D.; Marko-Varga, G.; Salekdeh, G. H.; Sechi, S.; Snyder, M.; Srivastava, S.; Uhlen, M.; Wu, C. H.; Yamamoto, T.; Paik, Y.-K.; Omenn, G. S. The human proteome project: current state and future direction. Mol. Cell. Proteomics 2011, 10, M111.009993. (6) Paik, Y.-K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; Aebersold, R.; Bairoch, A.; Yamamoto, T.; Legrain, P.; Lee, H.-J.; Na, K.; Jeong, S.-K.; He, F.; Binz, P.-A.; Nishimura, T.; Keown, P.; Baker, M. S.; Yoo, J. S.; Garin, J.; Archakov, A.; Bergeron, J.; Salekdeh, G. H.; Hancock, W. S. Standard Guidelines for the Chromosome-Centric Human Proteome Project. J. Proteome Res. 2012, 120326073003006. (7) Paik, Y.-K.; Jeong, S.-K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H.-J.; Na, K.; Choi, E.-Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J.-Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E.-Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30, 221−223. 
(8) Bech-Serra, J.-J.; Borthwick, A.; Colomé, N.; ProteoRed Consortium; Albar, J.-P.; Wells, M.; Sánchez del Pino, M.; Canals, F. A multi-laboratory study assessing reproducibility of a 2D-DIGE differential proteomic experiment. J. Biomol. Tech. 2009, 20, 293−296.
(9) Medina-Aunon, J. A.; Martínez-Bartolomé, S.; Lopez-Garcia, M. A.; Salazar, E.; Navajas, R.; Jones, A. R.; Paradela, A.; Albar, J. P. The ProteoRed MIAPE web toolkit: a user-friendly framework to connect and share proteomics standards. Mol. Cell. Proteomics 2011, 10, M111.008334.
(10) Martínez-Bartolomé, S.; Blanco, F.; Albar, J.-P. Relevance of proteomics standards for the ProteoRed Spanish organization. J. Proteomics 2010, 73, 1061−1066.
(11) Martínez-Bartolomé, S.; Medina-Aunon, J. A.; Jones, A. R.; Albar, J. P. Semi-automatic tool to describe, store and compare proteomics experiments based on MIAPE compliant reports. Proteomics 2010, 10, 1256−1260.
(12) Babel, I.; Barderas, R.; Diaz-Uriarte, R.; Moreno, V.; Suarez, A.; Fernandez-Aceñero, M. J.; Salazar, R.; Capellá, G.; Casal, J. I. Identification of MST1/STK4 and SULF1 proteins as autoantibody

dx.doi.org/10.1021/pr400721r | J. Proteome Res. 2014, 13, 158−172


Malladi, V. S.; Wong, M. C.; Barber, G. P.; Cline, M. S.; Dreszer, T. R.; Heitner, S. G.; Karolchik, D.; Kent, W. J.; Kirkup, V. M.; Meyer, L. R.; Long, J. C.; Maddren, M.; Raney, B. J.; Furey, T. S.; Song, L.; Grasfeder, L. L.; Giresi, P. G.; Lee, B.-K.; Battenhouse, A.; Sheffield, N. C.; Simon, J. M.; Showers, K. A.; Safi, A.; London, D.; Bhinge, A. A.; Shestak, C.; Schaner, M. R.; Kim, S. K.; Zhang, Z. Z.; Mieczkowski, P. A.; Mieczkowska, J. O.; Liu, Z.; McDaniell, R. M.; Ni, Y.; Rashid, N. U.; Kim, M. J.; Adar, S.; Zhang, Z.; Wang, T.; Winter, D.; Keefe, D.; Birney, E.; Iyer, V. R.; Lieb, J. D.; Crawford, G. E.; Li, G.; Sandhu, K. S.; Zheng, M.; Wang, P.; Luo, O. J.; Shahab, A.; Fullwood, M. J.; Ruan, X.; Ruan, Y.; Myers, R. M.; Pauli, F.; Williams, B. A.; Gertz, J.; Marinov, G. K.; Reddy, T. E.; Vielmetter, J.; Partridge, E. C.; Trout, D.; Varley, K. E.; Gasper, C.; Bansal, A.; Pepke, S.; Jain, P.; Amrhein, H.; Bowling, K. M.; Anaya, M.; Cross, M. K.; King, B.; Muratet, M. A.; Antoshechkin, I.; Newberry, K. M.; McCue, K.; Nesmith, A. S.; Fisher-Aylor, K. I.; Pusey, B.; DeSalvo, G.; Parker, S. L.; Balasubramanian, S.; Davis, N. S.; Meadows, S. K.; Eggleston, T.; Gunter, C.; Newberry, J. S.; Levy, S. E.; Absher, D. M.; Mortazavi, A.; Wong, W. H.; Wold, B.; Blow, M. J.; Visel, A.; Pennachio, L. A.; Elnitski, L.; Margulies, E. H.; Parker, S. C. J.; Petrykowska, H. M.; Abyzov, A.; Aken, B.; Barrell, D.; Barson, G.; Berry, A.; Bignell, A.; Boychenko, V.; Bussotti, G.; Chrast, J.; Davidson, C.; Derrien, T.; Despacio-Reyes, G.; Diekhans, M.; Ezkurdia, I.; Frankish, A.; Gilbert, J.; Gonzalez, J. M.; Griffiths, E.; Harte, R.; Hendrix, D. A.; Howald, C.; Hunt, T.; Jungreis, I.; Kay, M.; Khurana, E.; Kokocinski, F.; Leng, J.; Lin, M. F.; Loveland, J.; Lu, Z.; Manthravadi, D.; Mariotti, M.; Mudge, J.; Mukherjee, G.; Notredame, C.; Pei, B.; Rodriguez, J. M.; Saunders, G.; Sboner, A.; Searle, S.; Sisu, C.; Snow, C.; Steward, C.; Tanzer, A.; Tapanari, E.; Tress, M. 
L.; van Baren, M. J.; Walters, N.; Washieti, S.; Wilming, L.; Zadissa, A.; Zhengdong, Z.; Brent, M.; Haussler, D.; Kellis, M.; Valencia, A.; Gerstein, M.; Raymond, A.; Guigó, R.; Harrow, J.; Hubbard, T. J.; Landt, S. G.; Frietze, S.; Abyzov, A.; Addleman, N.; Alexander, R. P.; Auerbach, R. K.; Balasubramanian, S.; Bettinger, K.; Bhardwaj, N.; Boyle, A. P.; Cao, A. R.; Cayting, P.; Charos, A.; Cheng, Y.; Cheng, C.; Eastman, C.; Euskirchen, G.; Fleming, J. D.; Grubert, F.; Habegger, L.; Hariharan, M.; Harmanci, A.; Iyenger, S.; Jin, V. X.; Karczewski, K. J.; Kasowski, M.; Lacroute, P.; Lam, H.; LarnarreVincent, N.; Leng, J.; Lian, J.; Lindahl-Allen, M.; Min, R.; Miotto, B.; Monahan, H.; Moqtaderi, Z.; Mu, X. J.; O’Geen, H.; Ouyang, Z.; Patacsil, D.; Pei, B.; Raha, D.; Ramirez, L.; Reed, B.; Rozowsky, J.; Sboner, A.; Shi, M.; Sisu, C.; Slifer, T.; Witt, H.; Wu, L.; Xu, X.; Yan, K.K.; Yang, X.; Yip, K. Y.; Zhang, Z.; Struhl, K.; Weissman, S. M.; Gerstein, M.; Farnham, P. J.; Snyder, M.; Tenebaum, S. A.; Penalva, L. O.; Doyle, F.; Karmakar, S.; Landt, S. G.; Bhanvadia, R. R.; Choudhury, A.; Domanus, M.; Ma, L.; Moran, J.; Patacsil, D.; Slifer, T.; Victorsen, A.; Yang, X.; Snyder, M.; White, K. P.; Auer, T.; Centarin, L.; Eichenlaub, M.; Gruhl, F.; Heerman, S.; Hoeckendorf, B.; Inoue, D.; Kellner, T.; Kirchmaier, S.; Mueller, C.; Reinhardt, R.; Schertel, L.; Schneider, S.; Sinn, R.; Wittbrodt, B.; Wittbrodt, J.; Weng, Z.; Whitfield, T. W.; Wang, J.; Collins, P. J.; Aldred, S. F.; Trinklein, N. D.; Partridge, E. C.; Myers, R. M.; Dekker, J.; Jain, G.; Lajoie, B. R.; Sanyal, A.; Balasundaram, G.; Bates, D. L.; Byron, R.; Canfield, T. K.; Diegel, M. J.; Dunn, D.; Ebersol, A. K.; Ebersol, A. K.; Frum, T.; Garg, K.; Gist, E.; Hansen, R. S.; Boatman, L.; Haugen, E.; Humbert, R.; Jain, G.; Johnson, A. K.; Johnson, E. M.; Kutyavin, T. M.; Lajoie, B. R.; Lee, K.; Lotakis, D.; Maurano, M. T.; Neph, S. J.; Neri, F. V.; Nguyen, E. D.; Qu, H.; Reynolds, A. 
P.; Roach, V.; Rynes, E.; Sabo, P.; Sanchez, M. E.; Sandstrom, R. S.; Sanyal, A.; Shafer, A. O.; Stergachis, A. B.; Thomas, S.; Thurman, R. E.; Vernot, B.; Vierstra, J.; Vong, S.; Wang, H.; Weaver, M. A.; Yan, Y.; Zhang, M.; Akey, J. A.; Bender, M.; Dorschner, M. O.; Groudine, M.; MacCoss, M. J.; Navas, P.; Stamatoyannopoulos, G.; Kaul, R.; Dekker, J.; Stamatoyannopoulos, J. A.; Dunham, I.; Beal, K.; Brazma, A.; Flicek, P.; Herrero, J.; Johnson, N.; Keefe, D.; Lukk, M.; Luscombe, N. M.; Sobral, D.; Vaquerizas, J. M.; Wilder, S. P.; Batzoglou, S.; Sidow, A.; Hussami, N.; Kyriazopoulou-Panagiotopoulou, S.; Libbrecht, M. W.; Schaub, M. A.; Kundaje, A.; Hardison, R. C.; Miller, W.; Giardine, B.; Harris, R. S.; Wu, W.; Bickel, P. J.; Banfai, B.; Boley, N. P.; Brown, J. B.; Huang, H.; Li, Q.; Li, J. J.; Noble, W. S.; Bilmes, J. A.; Buske, O. J.; Hoffman, M. M.; Sahu, A. O.; Kharchenko, P. V.; Park, P. J.; Baker, D.; Taylor, J.; Weng, Z.; Iyer, S.; Dong, X.; Greven,

M.; Lin, X.; Wang, J.; Xi, H. S.; Zhuang, J.; Gerstein, M.; Alexander, R. P.; Balasubramanian, S.; Cheng, C.; Harmanci, A.; Lochovsky, L.; Min, R.; Mu, X. J.; Rozowsky, J.; Yan, K.-K.; Yip, K. Y.; Birney, E. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489, 57−74.
(19) Harrow, J.; Frankish, A.; Gonzalez, J. M.; Tapanari, E.; Diekhans, M.; Kokocinski, F.; Aken, B. L.; Barrell, D.; Zadissa, A.; Searle, S.; Barnes, I.; Bignell, A.; Boychenko, V.; Hunt, T.; Kay, M.; Mukherjee, G.; Rajan, J.; Despacio-Reyes, G.; Saunders, G.; Steward, C.; Harte, R.; Lin, M.; Howald, C.; Tanzer, A.; Derrien, T.; Chrast, J.; Walters, N.; Balasubramanian, S.; Pei, B.; Tress, M.; Rodriguez, J. M.; Ezkurdia, I.; van Baren, J.; Brent, M.; Haussler, D.; Kellis, M.; Valencia, A.; Reymond, A.; Gerstein, M.; Guigó, R.; Hubbard, T. J. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012, 22, 1760−1774.
(20) 1000 Genomes Project Consortium; Abecasis, G. R.; Auton, A.; Brooks, L. D.; DePristo, M. A.; Durbin, R. M.; Handsaker, R. E.; Kang, H. M.; Marth, G. T.; McVean, G. A. An integrated map of genetic variation from 1,092 human genomes. Nature 2012, 491, 56−65.
(21) Nagaraj, N.; Wisniewski, J. R.; Geiger, T.; Cox, J.; Kircher, M.; Kelso, J.; Pääbo, S.; Mann, M. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 2011, 7, 548.
(22) Yates, J. R.; Eng, J. K.; McCormack, A. L. Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases. Anal. Chem. 1995, 67, 3202−3210.
(23) Ansong, C.; Purvine, S. O.; Adkins, J. N.; Lipton, M. S.; Smith, R. D. Proteogenomics: needs and roles to be filled by proteomics in genome annotation. Briefings Funct. Genomics Proteomics 2008, 7, 50−62.
(24) Evans, V. C.; Barker, G.; Heesom, K. J.; Fan, J.; Bessant, C.; Matthews, D. A. De novo derivation of proteomes from transcriptomes for transcript and protein identification. Nat. Methods 2012, 9, 1207−1211.
(25) Woo, S.; Cha, S. W.; Merrihew, G.; He, Y.; Castellana, N.; Guest, C. C.; McCoss, M.; Bafna, V. Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res. 2013, DOI: 10.1021/pr400294c.
(26) Wu, L.; Candille, S. I.; Choi, Y.; Xie, D.; Jiang, L.; Li-Pook-Than, J.; Tang, H.; Snyder, M. Variation and genetic control of protein abundance in humans. Nature 2013, 499, 79−82.
(27) Castellana, N.; Bafna, V. Proteogenomics to discover the full coding content of genomes: a computational perspective. J. Proteomics 2010, 73, 2124−2135.
(28) Kim, D.; Pertea, G.; Trapnell, C.; Pimentel, H.; Kelley, R.; Salzberg, S. L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14, R36.
(29) Trapnell, C.; Williams, B. A.; Pertea, G.; Mortazavi, A.; Kwan, G.; van Baren, M. J.; Salzberg, S. L.; Wold, B. J.; Pachter, L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010, 28, 511−515.
(30) Gentleman, R. C.; Carey, V. J.; Bates, D. M.; Bolstad, B.; Dettling, M.; Dudoit, S.; Ellis, B.; Gautier, L.; Ge, Y.; Gentry, J.; Hornik, K.; Hothorn, T.; Huber, W.; Iacus, S.; Irizarry, R.; Leisch, F.; Li, C.; Maechler, M.; Rossini, A. J.; Sawitzki, G.; Smith, C.; Smyth, G.; Tierney, L.; Yang, J. Y.; Zhang, J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5, R80.
(31) Ramsköld, D.; Wang, E. T.; Burge, C. B.; Sandberg, R. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol. 2009, 5, e1000598.
(32) van Bakel, H.; Nislow, C.; Blencowe, B. J.; Hughes, T. R. Most “dark matter” transcripts are associated with known genes. PLoS Biol. 2010, 8, e1000371.
(33) Shevchenko, A.; Jensen, O. N.; Podtelejnikov, A. V.; Sagliocco, F.; Wilm, M.; Vorm, O.; Mortensen, P.; Shevchenko, A.; Boucherie, H.; Mann, M. Linking genome and proteome by mass spectrometry: large-scale identification of yeast proteins from two dimensional gels. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 14440−14445.


(34) Marcilla, M.; Alpizar, A.; Paradela, A.; Albar, J.-P. A systematic approach to assess amino acid conversions in SILAC experiments. Talanta 2011, 84, 430−436.
(35) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207−214.
(36) Klein, I. A.; Resch, W.; Jankovic, M.; Oliveira, T.; Yamane, A.; Nakahashi, H.; Di Virgilio, M.; Bothmer, A.; Nussenzweig, A.; Robbiani, D. F.; Casellas, R.; Nussenzweig, M. C. Translocation-capture sequencing reveals the extent and nature of chromosomal rearrangements in B lymphocytes. Cell 2011, 147, 95−106.
(37) Prieto, G.; Aloria, K.; Osinalde, N.; Fullaondo, A.; Arizmendi, J. M.; Matthiesen, R. PAnalyzer: a software tool for protein inference in shotgun proteomics. BMC Bioinf. 2012, 13, 288.
(38) Côté, R. G.; Jones, P.; Martens, L.; Kerrien, S.; Reisinger, F.; Lin, Q.; Leinonen, R.; Apweiler, R.; Hermjakob, H. The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinf. 2007, 8, 401.
