Probing the Missing Human Proteome: A Computational Perspective

Communication

PROBING THE MISSING HUMAN PROTEOME: A COMPUTATIONAL PERSPECTIVE
Dhirendra Kumar, Aradhya Jain, and Debasis Dash
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00728 • Publication Date (Web): 25 Sep 2015

Journal of Proteome Research

Dhirendra Kumar‡, Aradhya Jain‡ and Debasis Dash*
G.N. Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Mathura Road, Delhi 110025, India

ACS Paragon Plus Environment

ABSTRACT

The missing human proteome comprises predicted protein-coding genes with no credible protein-level evidence detected so far and constitutes ≈18% of the human protein-coding genes (neXtProt release 19/9/2014). The missing proteins may be of pharmacological interest, as many of them are membrane receptors, and therefore require comprehensive characterization. In the present study, we explored various computational parameters that are crucial during protein searches from tandem mass spectrometry (MS) data for their impact on missing protein identification. The variables considered are differences in search database composition, shared peptides, semi-tryptic searches, post-translational modifications (PTMs) and transcriptome-guided proteogenomic searches. We used a multi-algorithmic approach for protein detection from publicly available mass spectra from recent studies covering diverse human tissues and cell types. Using these approaches, we detected 24 missing proteins (22 PE2, 1 PE4 and 1 PE5). Most of these identifications could be attributed to differences in reference proteome databases, underscoring the value of a single standard database for human protein detection from MS data. Our results suggest that search strategies with modified parameters can be rewarding alternatives for extensive profiling of missing proteins. We conclude that complementary spectral data searches incorporating different parameters, such as PTMs, against a comprehensive and compact search database might lead to the discovery of proteins currently assigned to the missing human proteome.

KEYWORDS Proteomics, neXtProt, missing proteins, Post-translational modifications, Semi-tryptic searches

INTRODUCTION

Recent advances and global initiatives in genome sequencing have led to the accumulation of tens of thousands of genomes from a variety of species across different taxonomic groups. The Human Genome Project has provided us with the nucleotide sequence of the entire human genome, comprising 19,814 protein-coding genes (GENCODE 22). However, even after two decades of effort, a complete and comprehensive catalog of the proteins encoded within this ≈3.2 billion base pair nucleotide stretch has yet to be established. Major efforts are underway to identify the proteins that these predicted protein-coding genes encode. These include the Human Proteome Project (HPP), the Human Protein Atlas (HPA) project, and the Chromosome-Centric Human Proteome Project (c-HPP) [1-3]. The ProteomeXchange consortium provides a platform for archiving and sharing the data obtained from such projects [4]. The c-HPP aims to identify and catalogue the proteome on a per-chromosome basis [1,2,5]. The principal aim of these projects is to identify at least one protein per protein-coding gene in the human genome. Even with the advent of such projects, a substantial number of predicted human protein-coding genes have yet to be characterized as expressed proteins. Such missing proteins can be defined as those that have genetic, transcript or homology-based evidence for their existence but lack protein-level experimental data [6] (such as immunohistochemistry, 3-D structure, Edman sequencing or mass spectrometry) to substantiate it [7,8]. neXtProt [6], a knowledgebase of curated human proteins, classifies all proteins on the basis of their current level of evidence into five categories: PE1 comprises proteins with credible protein-level evidence, PE2 those with transcript-level evidence, PE3 includes proteins inferred from homology
with related species, PE4 consists of proteins hypothesized to exist based on predictions from gene models, and PE5 encompasses all uncertain proteins. Proteins belonging to PE2-PE5 constitute the missing proteome, comprising 3,564 proteins (neXtProt release 19/9/14). An important class of the missing human proteome consists of membrane receptors, particularly the olfactory receptors [7]. Although humans possess a large number of olfactory receptor genes, it is believed that many of them have either been silenced or have acquired pseudogene status, which is probably why they are not detected in proteomic analyses. This may be attributed to the fact that olfaction is more essential in distant relatives of Homo sapiens, such as rodents and other mammals, than in humans [7]. Of late, these “missing proteins” have been of great interest to scientists from diverse fields such as computational biology, genomics, proteomics, bioinformatics, drug discovery and pharmaceuticals. Nearly one third of these are likely membrane receptors and potential drug targets, as inferred from various studies at the genomic and transcriptomic levels [7]. Due to the absence of any protein-level evidence, their potential as drug targets has not been fully exploited. Thus, identification of such proteins may contribute to clinical therapeutics. Further, knowledge of the complete proteome of a tissue will help in the identification of disease-associated and tissue-specific biomarkers, which can be useful as diagnostic and prognostic markers for complex diseases [8]. This knowledge can also impart a better understanding of the complex mechanisms in tissues such as the brain, which have a high probability of expressing these missing proteins [9]. Also, the intriguing fact that these proteins have not yet been reported by any of the current proteomic analytical techniques has left scientists baffled.
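
The neXtProt evidence levels described in this section can be summarized as a small lookup. The following is a minimal Python sketch for illustration only; the names `ProteinEvidence`, `is_missing` and `count_missing` are hypothetical and are not part of neXtProt or of the tools used in this study.

```python
from collections import Counter
from enum import IntEnum


class ProteinEvidence(IntEnum):
    """neXtProt protein existence (PE) levels as described in the text."""
    PE1 = 1  # credible protein-level evidence
    PE2 = 2  # transcript-level evidence
    PE3 = 3  # inferred from homology with related species
    PE4 = 4  # predicted from gene models
    PE5 = 5  # uncertain


def is_missing(level):
    """Return True for PE2-PE5, the levels that make up the missing proteome."""
    return ProteinEvidence(level) >= ProteinEvidence.PE2


def count_missing(levels):
    """Count missing proteins per PE level, ignoring PE1 entries."""
    return Counter(ProteinEvidence(lv).name for lv in levels if is_missing(lv))
```

Under this sketch, a list of evidence levels for a set of detected proteins can be tallied per category with `count_missing`, while PE1 entries are left out of the tally.
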
Many approaches have been put forward for the identification of these missing proteins. In 2014, two independent studies were published in the same issue of Nature (Nature 509, issue 7502),
claiming proteomic evidence for 2,535 of the 3,844 missing proteins and 18,097 of the 19,629 human genes annotated in SwissProt, respectively [10,11]. When the draft proteome maps provided by these studies were analyzed, the high number of reported missing proteins was debated. Olfactory receptors were used as a quality check on the data, and it was argued that these studies reported these receptors at high levels across multiple tissues even though nasal tissues were not analyzed in either study [12]. The Human Protein Atlas (HPA) is a major resource that provides a tissue-based map of the human proteome [13]. It was a major breakthrough in the field of proteomics, as it mapped the human proteome tissue by tissue using immunohistochemistry and RNA-seq based approaches. Taking Ensembl as a reference, it also discussed missing proteins. The recent article from the HPA group reported protein-level evidence for 17,132 of the 20,356 putative protein-coding genes (in Ensembl release 75), combining experimental evidence from HPA, UniProt and the two above-mentioned studies by Kim et al. [10] and Wilhelm et al. [11]. Further, a few studies have employed different approaches for characterization of the missing proteins. For example, in one study, the authors used sequential BLAST homology searches against non-human mammalian protein databases to functionally annotate missing proteins specifically on chromosomes 7 and 8, while others have adopted a multi-omics approach for their identification [14,15]. Various approaches and strategies have been suggested and demonstrated in a number of published studies for detection and functional annotation of this subset. However, focused analysis of the contribution of computational factors to improving the detection of missing proteins from MS data is limited to a few studies. In this regard, Farrah et al. reanalyzed publicly available human proteomic datasets to identify proteins not represented in PeptideAtlas
and also proposed an observability score for the remaining missing proteins [16]. They further increased the human proteome coverage in PeptideAtlas by comprehensively characterizing the kidney, urine and plasma proteomes [17] using the Trans-Proteomic Pipeline (TPP) [18]. In the present work, we extended these studies by reanalyzing recent, publicly available tandem mass spectral data from various human tissues, including lung, liver and colon, with modified search strategies against the neXtProt database to enable confident detection of the designated missing proteins. We explored the contribution of various computational protein identification parameters, such as database differences, protease specificity, post-translational modifications and uniqueness of peptides, to improving MS coverage of the missing human proteome.

MATERIALS AND METHODS

The strategy adopted for the present work is represented as a workflow in Figure 1.

Dataset selection

Recently submitted, publicly available mass spectral datasets were selected if they met at least one of the following criteria: (I) MS data searches were performed against a smaller database containing fewer entries than neXtProt; (II) the dataset is not covered by any of the major human proteome projects or reanalysis projects as of December 2014; (III) the dataset represents human samples or tissues, such as brain, that are promising for rare protein expression. Twenty-one such datasets were selected, and their spectral files were downloaded from the PRIDE repository [19]. These were obtained in raw format and converted to the Mascot generic format (.mgf) using the msconvert utility from the ProteoWizard [20] suite. Table 1 describes all the datasets used in this analysis, with their tissue information and total number of spectra.
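
The raw-to-MGF conversion step described above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the exact msconvert options used in the study are not given in the text, so only the standard `--mgf` and `-o` (output directory) flags are shown, and the helper names `msconvert_mgf_command` and `convert_raw_files` are hypothetical.

```python
import subprocess
from pathlib import Path


def msconvert_mgf_command(raw_file, out_dir):
    """Build the msconvert argument list that writes one MGF file into out_dir."""
    return ["msconvert", str(raw_file), "--mgf", "-o", str(out_dir)]


def convert_raw_files(raw_dir, out_dir, runner=subprocess.run):
    """Run msconvert on every .raw entry in raw_dir and return the commands issued.

    `runner` is injectable so the function can be exercised without
    ProteoWizard installed (an assumption made for this sketch).
    """
    commands = []
    for raw_file in sorted(Path(raw_dir).iterdir()):
        if raw_file.suffix.lower() != ".raw":
            continue  # skip anything that is not a raw spectral file
        command = msconvert_mgf_command(raw_file, out_dir)
        runner(command, check=True)  # subprocess.run raises CalledProcessError on failure
        commands.append(command)
    return commands
```

Building the argument list separately from running it keeps the command easy to log or inspect for each dataset before any conversion is started.
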

MS data analysis

MGF files were searched against a concatenated target-decoy database of 82,306 protein entries built from proteins across all categories in neXtProt (19/9/2014) and the cRAP common contaminant proteins (115 entries). A multi-step data analysis approach was implemented. First, the spectral datasets were searched against this database using our in-house integration of the OMSSA [21] and X!Tandem [22] algorithms as implemented in GenoSuite [23]. For all searches, parameters were set as described in the respective original studies and are listed in Supplementary File 1. Peptide spectrum matches (PSMs) with a false discovery rate (FDR) of 1% or lower were considered for further analysis. The FDR for a given score threshold was calculated using the following formula:

\[ \mathrm{FDR}(\%) = \frac{D}{T} \times 100 \]

where D is the count of decoy PSMs passing the score threshold and T is the count of target PSMs passing the score threshold.

Proteins were grouped using ProteinAssembler, an in-house Perl implementation of a parsimonious protein grouping algorithm [24]. Protein groups containing an exclusive set of peptides were considered further, and group representatives were reported. Protein groups were considered identified only if they were supported by two or more group-exclusive peptides or, in the case of a single peptide, by a minimum of five qualified PSMs (