Article pubs.acs.org/jpr
Computational and Mass-Spectrometry-Based Workflow for the Discovery and Validation of Missing Human Proteins: Application to Chromosomes 2 and 14
J. Proteome Res. 2015.14:3621-3634. Downloaded from pubs.acs.org by UNIV OF NEBRASKA-LINCOLN on 09/10/15. For personal use only.
Christine Carapito,† Lydie Lane,‡,§ Mohamed Benama,∥,⊥,# Alisson Opsomer,† Emmanuelle Mouton-Barbosa,▽,○ Luc Garrigues,▽,○ Anne Gonzalez de Peredo,▽,○ Alexandre Burel,† Christophe Bruley,∥,⊥,# Alain Gateau,‡,§ David Bouyssié,▽,○ Michel Jaquinod,∥,⊥,# Sarah Cianferani,† Odile Burlet-Schiltz,▽,○ Alain Van Dorsselaer,† Jérôme Garin,∥,⊥,# and Yves Vandenbrouck*,∥,⊥,#

† Laboratoire de Spectrométrie de Masse BioOrganique (LSMBO), IPHC, Université de Strasbourg, CNRS, UMR7178, 25 Rue Becquerel, 67087 Strasbourg, France
‡ CALIPHO Group, SIB-Swiss Institute of Bioinformatics, CMU, rue Michel-Servet 1, CH-1211 Geneva 4, Switzerland
§ Department of Human Protein Sciences, Faculty of Medicine, rue Michel-Servet 1, CH-1211 Geneva 4, Switzerland
∥ CEA, DSV, iRTSV, Laboratoire de Biologie à Grande Echelle, 17 rue des martyrs, Grenoble, F-38054, France
⊥ INSERM U1038, 17, rue des Martyrs, Grenoble F-38054, France
# Université Grenoble, Grenoble F-38054, France
▽ CNRS UMR5089 Institut de Pharmacologie et de Biologie Structurale, 118 route de Narbonne, 31077 Toulouse, France
○ Université de Toulouse, 205, route de Narbonne, 31077 Toulouse, France

Supporting Information
ABSTRACT: In the framework of the C-HPP, our Franco-Swiss consortium has adopted chromosomes 2 and 14, coding for a total of 382 missing proteins (proteins for which evidence is lacking at the protein level). Over the last 4 years, the French proteomics infrastructure has collected high-quality data sets from 40 human samples, including a series of rarely studied cell lines, tissue types, and sample preparations. Here we describe a step-by-step strategy based on bioinformatics screening and subsequent mass spectrometry (MS)-based validation to identify proteins in these data sets that were until now missing. Screening database search results (85 326 dat files) identified 58 of the missing proteins (36 on chromosome 2 and 22 on chromosome 14) by 83 unique peptides, according to the latest release of neXtProt (2014-09-19). PSMs corresponding to these peptides were thoroughly examined by applying two different MS-based criteria: peptide-level false discovery rate calculation and expert PSM quality assessment. Synthetic peptides were then produced and used to generate reference MS/MS spectra. A spectral similarity score was then calculated for each pair of reference-endogenous spectra and used as a third criterion for missing protein validation. Finally, LC−SRM assays were developed to target proteotypic peptides from four of the missing proteins detected in tissue/cell samples that were still available and for which sample preparation could be reproduced. These LC−SRM assays unambiguously detected the endogenous unique peptide for three of the proteins. For two of these, identification was confirmed by additional proteotypic peptides. We conclude that, of the initial set of 58 proteins detected by the bioinformatics screen, the consecutive MS-based validation criteria support the identification of 13 proteins (8 on chromosome 2 and 5 on chromosome 14) that passed at least two of the three MS-based criteria.
Thus, a rigorous step-by-step approach combining bioinformatics screening and MS-based validation assays is particularly suitable to obtain protein-level evidence for proteins previously considered as missing. All MS/MS data have been deposited in ProteomeXchange under identifier PXD002131.

KEYWORDS: Human Proteome Project (C-HPP), missing proteins identification, mass spectrometry, LC−SRM assays, bioinformatics

Special Issue: The Chromosome-Centric Human Proteome Project 2015
Received: October 6, 2014
Published: July 1, 2015
© 2015 American Chemical Society
DOI: 10.1021/pr5010345 J. Proteome Res. 2015, 14, 3621−3634
■
INTRODUCTION

The Chromosome-Centric Human Proteome Project (C-HPP)1 is an international collaborative effort in which 25 teams currently participate to map and annotate the entire human protein set encoded by the genes on each human chromosome. An essential tool in this work is the neXtProt database2 developed and maintained by the Swiss team participating in the present study, which plays an important role as a reference knowledgebase for the C-HPP consortium. This database integrates extensive proteomics data sets, either contributed through PeptideAtlas or independently curated from selected publications. Information relating to human genes and proteins produced by other methods (genomics, transcriptomics, antibody-based, functional, etc.) is also included. On the basis of all the data integrated, neXtProt provides a “protein existence” (PE) score for each protein, which combines different validation criteria. In September 2013, neXtProt listed 20 128 protein products of the human genome. At the 2013 Yokohama HPP meeting, the C-HPP executive committee agreed on an explicit definition of missing proteins as proteins annotated in neXtProt as PE2−4.3 According to this definition, a protein entry is said to be “missing” if the protein has not been shown to exist by any experimental technique so far, and if the protein sequence does not result from a dubious translation of a genetic region. (This is the case for most of the PE5 entries.)
To guide the C-HPP community in the quest for missing proteins, a chromosome-by-chromosome metrics table based on five public resources (Ensembl, neXtProt, PeptideAtlas, GPMdb, and HPA) is updated annually and released by the C-HPP Executive Committee.3 They also propose guidelines and a roadmap for proteomics profiling studies appropriate for the detection of these missing proteins.4 A number of reasons as to why proteins may not be visible in proteomics profiling analysis have already been discussed.5 Thus, some missing proteins may be particularly difficult to detect by mass spectrometry (MS) due to their physicochemical properties (e.g., basicity, hydrophobicity) or atypical post-translational modifications or because they are expressed at very low levels, are particularly sensitive to degradation, or simply because they are specific to cell types or tissues that have not yet been analyzed in proteomics studies. Therefore, a combination of well-suited approaches in terms of sample preparation techniques, antibody-based assays, and MS and IT workflows will be required to extensively characterize these missing proteins.

Within the framework of the C-HPP, the French and Swiss proteomics groups have adopted chromosomes 14 and 2, respectively.1 When we initiated this study, 240 out of 1241 proteins potentially expressed from chromosome 2 and 142 out of 626 proteins from chromosome 14 were considered as missing (PE2−5) by neXtProt. Among these, 37 proteins from chromosomes 2 and 14 were annotated as PE5 (“dubious” or “uncertain”); the majority of these might result from the erroneous translation of DNA sequences that do not encode proteins. The Swiss team chose to use bioinformatics and biocuration strategies to extend the coverage of validated proteins, while the Proteomics French Infrastructure (ProFI; www.profiproteomics.fr) adopted an approach focused more on the analysis of novel samples. In 2013, the French and Swiss teams decided to combine their efforts by sharing resources and skills (samples, data sets, databases, MS instrumentation, IT) to increase the probability of finding missing proteins.

We describe a step-by-step strategy combining bioinformatics and MS-based experiments to identify and validate missing proteins based on database search results (85 326 dat files) from a compendium of MS/MS data sets generated using 40 human cell line/tissue type/body fluid samples. Data analysis and MS-based postvalidation of proteins detected in this set of cell lines and tissues and corresponding to genes present on chromosomes 2 and 14 are discussed with regard to the issue of validating missing proteins, as is recent progress toward completion of the human proteome.
■
MATERIALS AND METHODS
Generation of a Peptide Sequence Database Specific for Human Chromosomes 2 and 14
The peptide database was built by performing in silico trypsin digestion (with one missed cleavage allowed) of all unique sequences for the 382 protein entries (240 protein entries from Chr2 and 142 protein entries from Chr14) annotated as non-PE1 in neXtProt (Release July 2013). Of these entries, 345 were annotated PE2−4 and 37 were labeled PE5 (dubious status); all were retained for this study. The resulting database contained 30 952 unique peptide sequences (available upon request).

MS/MS Data Sets
Data sets used in this study correspond to LC−MS/MS runs acquired on human samples using various instruments over 4 years in three different laboratories. Data sets were originally processed using Mascot database searches with search parameters, constituting a compendium of 85 326 dat files and associated LC−MS/MS raw files. The different kinds of samples, tissues, cell type names, and types of fraction are described in Table 1. A detailed description of the MS/MS data sets compendium used in this study is provided in the Supporting Information (SI).

MS/MS Data Processing (Bioinformatics Screen and PSM Quality Assessment)
The 85 326 Mascot (dat) files were processed by systematically screening each file against the Chr2−Chr14 peptide database (see above) by applying an exact pattern matching algorithm. Exact matches against the first-rank query level were retained. Filters were then applied in the following order: Peptide Length >6; Mascot Ion Score ≥30; Mascot E value = 100), the fragmentation tables for each peptide and its synthetic reference spectrum were exported from Proline (in-house software) and fed into KNIME 2.9.0 for automated data manipulation and calculation. The SDPscores were individually manually verified at each ProFI site for a random subset of peptides. A minimum of four ions shared between the endogenous and the reference spectra were used to calculate the SDPscore by considering all singly charged b and y ion series. When required, doubly charged b and y ions or ions generated by neutral losses were taken into account. Spectral comparisons are presented in Supplementary Figure 2 in the SI, while SDPscores are listed in Table 3 and Supplementary Table 2 in the SI. Data (raw and dat files) related to synthetic peptides were added to the PX submission deposited to the ProteomeXchange Consortium6 via the PRIDE partner repository with the same data set identifier PXD002131.

Targeted LC−SRM Assay Development
Four candidate missing proteins for which aliquots of initial samples were available were selected: TEX261, TMEM169, B3GALT1, and LINC00116. All of these proteins were initially identified after 1D SDS-PAGE fractionation. Each individual sample (glioblastoma cells for TMEM169, LINC00116, and TEX261 and HepaRG cells for B3GALT1) was therefore fractionated on a 1D SDS-PAGE gel (in duplicate), and the bands around the area in which the protein was initially detected were excised. Gel bands were processed using a MassPrep Station (Waters, Milford, MA) for in-gel reduction and alkylation before overnight trypsin digestion at 37 °C using a 1:100 trypsin/protein ratio (Promega, Madison, WI). The tryptic peptides produced were extracted and analyzed by LC−SRM as follows: four samples in which TEX261, TMEM169, B3GALT1, and LINC00116 had been identified were analyzed on a microLC-triple quadrupole system (Dionex Ultimate 3000 RSLC system linked to a TSQ Vantage, Thermo Fisher Scientific, San Jose, CA), while a fifth hepatocarcinoma tissue sample in which TEX261 had been identified was analyzed on the microLC-QTrap system (Dionex Ultimate 3000 RSLC system, Thermo Fisher Scientific linked to a 6500 Q-Trap, ABSciex, Concord, Ontario, Canada). For each protein, the unique peptide initially identified and two to five additional predicted proteotypic peptides were synthesized (crude PEPotec, Thermo Fisher Scientific; see the SI). Initially, concentration-balanced mixtures of the crude peptides for each protein were prepared to provide homogeneous signal intensities for all peptides. These mixtures were injected into a nanoLC-Q-TOF system to acquire their CID fragmentation spectra.
For each peptide, the six transitions of highest abundance in the fragmentation spectra were chosen and followed on the LC−SRM systems to determine the peptide retention times, to verify the absence of light peptide forms in the labeled peptide mixtures, and to determine relative fragment ion intensities. Collision energies were individually optimized for each peptide using Skyline software.9 Finally, protein-specific peptide mixtures were spiked into the appropriate gel band extracts, and at least three light and corresponding heavy transitions were monitored under optimized conditions to check for the presence of endogenous peptides. Further details of the gradients and instrument parameters used are provided in the SI.

Figure 1. Strategy used to discover and validate missing proteins based on bioinformatics screening and MS-based experiments.

MS/MS Analysis of Synthetic Peptides and Comparison of Reference/Endogenous Fragmentation Spectra by Calculating Spectral Correlation Scores

Synthetic peptides were purchased for all candidates selected from the first screen, in either light or heavy form as detailed in the SI (crude PEPotec, Thermo Fisher Scientific). Collision-induced dissociation (CID) fragmentation spectra for these synthetic peptides were acquired after injections of peptide mixtures (pools of 10 to 30 peptides) on nanoLC−MS/MS systems (nanoLC-Q-TOF (Synapt G1 Waters), nanoLC-ion trap (amaZon Bruker Daltonics), nanoLC-LTQ-Orbitrap XL, nanoLC-LTQ-Orbitrap Velos, and nanoLC-Q-Exactive Plus (Thermo Fisher Scientific)). Because the concentrations of these crude peptides were unknown, various dilutions were tested until a signal of satisfactory intensity could be measured for each peptide. The system used to measure the different peptides was chosen based, as much as possible, on the instrument setup used to initially detect and fragment the endogenous peptides.
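As a rough illustration of the computational steps described in this section, the Python sketch below shows (i) an in silico tryptic digestion allowing one missed cleavage, (ii) an exact-match screen of first-rank PSMs against the resulting peptide set with the peptide-length and ion-score filters, and (iii) a normalized spectral dot product between an endogenous and a reference fragment-intensity table, computed only when at least four ions are shared. It is not the pipeline used in the study. The function names, the PSM dictionary layout, the "no cleavage before proline" rule, the optional E-value cutoff (left unset by default), and the cosine form of the similarity score are all assumptions made for illustration.

```python
import math


def cleave(sequence):
    """Split a protein sequence after every K or R that is not followed by P.

    The proline exception is an assumption; the text only says "trypsin".
    """
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal piece not ending in K/R
    return pieces


def tryptic_peptides(sequence, max_missed=1):
    """Return all peptides spanning at most `max_missed` missed cleavage sites."""
    pieces = cleave(sequence)
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def build_peptide_db(proteins, max_missed=1):
    """Map each distinct peptide to the set of accessions that can produce it."""
    db = {}
    for accession, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, max_missed):
            db.setdefault(peptide, set()).add(accession)
    return db


def screen_psms(psms, peptide_db, min_length=7, min_ion_score=30.0, max_evalue=None):
    """Keep first-rank PSMs whose sequence exactly matches a database peptide.

    Each PSM is a dict with hypothetical keys 'sequence', 'rank', 'ion_score'
    and 'evalue'. Filters run in the order length, ion score, E value.
    """
    hits = []
    for psm in psms:
        if psm["rank"] != 1 or psm["sequence"] not in peptide_db:
            continue
        if len(psm["sequence"]) < min_length:   # peptide length > 6
            continue
        if psm["ion_score"] < min_ion_score:    # ion score >= 30
            continue
        if max_evalue is not None and psm["evalue"] > max_evalue:
            continue                            # cutoff value is an assumption
        hits.append(psm)
    return hits


def spectral_dot_product(endogenous, reference, min_shared=4):
    """Cosine similarity over the fragment ions present in both spectra.

    Spectra are dicts mapping an ion label (e.g. 'y5') to its intensity.
    Returns None when fewer than `min_shared` ions are shared.
    """
    shared = set(endogenous) & set(reference)
    if len(shared) < min_shared:
        return None
    dot = sum(endogenous[ion] * reference[ion] for ion in shared)
    norm = math.sqrt(sum(endogenous[ion] ** 2 for ion in shared)
                     * sum(reference[ion] ** 2 for ion in shared))
    return dot / norm if norm > 0 else None
```

In this sketch, `build_peptide_db` would be called once on the non-PE1 entries, `screen_psms` once per parsed result file, and `spectral_dot_product` once per endogenous/reference spectrum pair; parsing of Mascot dat files is deliberately left out.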
■
RESULTS

1. Detection of Missing Proteins and Validation Workflow

The overall strategy for the detection and validation of missing proteins was designed in a stepwise manner, as illustrated in Figure 1. The first part of the workflow relied on a bioinformatics approach consisting of screening an existing database of search results composed of 85 326 dat files, obtained over 4 years from the LC−MS/MS analysis of a variety of human samples described in Table 1. This large compendium of dat files was first screened against an in-house sequence database composed of unique peptide sequences from proteins currently annotated as missing based on the genes predicted on chromosomes 2 and 14 (according to neXtProt, July 2013 release) to identify candidate LC−MS/MS runs potentially containing MS/MS data for missing proteins. This allowed a subset of dat files containing a PSM matching a unique peptide sequence from a missing protein to be extracted. To reduce the output and retain only files containing reasonably good quality PSMs, we applied a series of filters commonly used in MS-based proteomics (Materials and Methods, Table 2). We then updated the “protein existence” (PE) status for each protein by using the latest neXtProt release (2014-09-19) as a standard reference resource recommended by C-HPP.3 The second part of the workflow represents the various validation strategies applied to the data from the selected files and the order in which they were applied. One of the validation criteria consisted of a classical 1% FDR validation at the peptide level.7 Another validation criterion consisted of manual quality assessment of all PSMs by experts who ranked the candidates

Table 2. Results of the Initial Screen: Starting from a Compendium of 85 326 dat files^a
                       whole set
no. PSM                32911
no. unique peptides    5417
no. proteins           339
updated non-PE1 protein evidence Mascot Mascot PepLength IonScore E value status in neXtProt >6 > 30 =