
Article

Decoding the Effect of Isobaric Substitutions on Identifying Missing Proteins and Variant Peptides in Human Proteome

J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00342 • Publication Date (Web): 20 Sep 2017




Journal of Proteome Research

Wai-Kok Choong,1,2,‡ Tung-Shing Mamie Lih,1,2,‡ Yu-Ju Chen,2 Ting-Yi Sung1,*

1Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan

2Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan

KEYWORDS

Missing proteins, variant peptides, isobaric substitutions, single amino acid variants, mass spectrometry, HPP.

ABSTRACT

To confirm the existence of a missing protein, at least two unique peptides of 9-40 amino acids from that protein must be identified in bottom-up mass spectrometry-based proteomic experiments. However, an identified unique peptide of a missing protein, even one identified with a high level of confidence, could coincide with a peptide of a commonly observed protein because of isobaric substitutions, mass modifications, alternative splice isoforms, or single amino acid variants (SAAVs). Besides unique peptides of missing proteins, identified variant peptides (SAAV-containing peptides) could also map to peptides of other proteins because of the same issues. Therefore, we conducted a thorough comparative analysis on data sets in PeptideAtlas Tiered Human Integrated Search Proteome (THISP, 2017-03 release), including neXtProt (2017-01 release), to systematically investigate the possibility that unique peptides of missing proteins (PE2-4), unique peptides of dubious proteins, and variant peptides are affected by isobaric substitutions, leading to doubtful identification results. In this study, we considered eleven isobaric substitutions. From our analysis, we found 6% of variant peptides became shared with peptides of PE1 proteins after isobaric substitutions.

INTRODUCTION

In 2010, the Human Proteome Organization (HUPO) initiated the Human Proteome Project (HPP), a flagship international scientific effort with the goal of developing complete knowledge of the human proteome.1-3 The Chromosome-Centric Human Proteome Project (C-HPP) is a part of the HPP, and one of its main goals is to use mass spectrometry (MS) to confirm the existence of missing proteins.1 Missing proteins are human proteins that are predicted from encoded genes but lack sufficient evidence at the protein level.2,4-7 Each missing protein is assigned a protein existence (PE) status of PE2, PE3, or PE4 based on the type of evidence in neXtProt,8 whereas proteins with evidence at the protein level are assigned PE1 status. neXtProt is a knowledge platform with well-curated proteomic and genetic variation data that focuses specifically on human proteins. The neXtProt database (2017-01 release) contains 20159 human proteins, of which 17008 have evidence at the protein level (PE1), 2579 have evidence at the transcript level (PE2), homology level (PE3), or prediction level (PE4), and 572 are at the “uncertain” or “dubious” level (PE5). Excluding proteins at PE5, there are a total of 2579 missing proteins at PE2-4. MS is a well-adopted technology for proteomics studies because of its high throughput and high sensitivity,1 yet it still has detection limitations.9,10 For instance, false identification


can occur because of inaccurate mass measurement, especially in low-resolution MS,11,12 and amino acids with similar masses (e.g., leucine and isoleucine) are difficult to distinguish using current fragmentation methods in MS.1,13,14 To ensure data quality and increase confidence in identified missing proteins as well as novel translation products from long noncoding RNAs or pseudogenes, the HPP has updated the HPP MS Data Interpretation Guidelines to Version 2.1.1,15,16 There are a total of 15 items in the guidelines, of which the 14th primarily describes how to further confirm identification results, even for peptides identified with high confidence. According to the 14th guideline, alternative splice isoforms, single amino acid variants (SAAVs), isobaric substitutions, and mass modifications may cause a highly confident peptide of a missing protein to coincide with a peptide from a commonly observed human protein or human-related protein. For instance, it is difficult to differentiate between deamidated asparagine (N[Deamidated]) and aspartate (D) because the mass difference between them is very subtle.1,11 Nonetheless, a thorough analysis of the effect of these issues on the identification of missing proteins appears to be lacking. Therefore, we conducted a comprehensive comparative analysis to systematically investigate the possibility that unique peptides of missing proteins, as well as of PE5 proteins, map to peptides of PE1 proteins (e.g., unique peptides of PE1 canonical sequences) and other related sources, e.g., UniProtKB17 and the common Repository of Adventitious Proteins (cRAP),18 because of isobaric substitutions, mass modification variants, isoforms, or SAAVs. In addition, we extended our analysis on missing proteins to peptides containing single variant sites (i.e., SAAVs) in PE1 proteins with Gold and Silver data quality.
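To make the mass argument concrete, the following minimal sketch compares a few residue masses. It is an illustration, not part of the authors' analysis: the monoisotopic residue masses and the deamidation shift are standard reference values rounded to six decimals, the Q/K pair is a well-known near-isobaric pair that is not taken from the text above, and the names `RESIDUE_MASS`, `DEAMIDATION_SHIFT`, and `mass_gap_ppm` are hypothetical.

```python
# Monoisotopic residue masses in Da (standard reference values, rounded).
RESIDUE_MASS = {
    "L": 113.084064,  # leucine
    "I": 113.084064,  # isoleucine (same elemental composition as leucine)
    "N": 114.042927,  # asparagine
    "D": 115.026943,  # aspartate
    "Q": 128.058578,  # glutamine
    "K": 128.094963,  # lysine
}

# Mass shift of deamidation in Da (standard reference value).
DEAMIDATION_SHIFT = 0.984016


def mass_gap_ppm(mass_a: float, mass_b: float, peptide_mass: float) -> float:
    """Express the gap between two residue masses in ppm of a peptide mass."""
    return abs(mass_a - mass_b) / peptide_mass * 1e6


# Leucine versus isoleucine: no mass difference at all.
leu_vs_ile = RESIDUE_MASS["L"] - RESIDUE_MASS["I"]

# Deamidated asparagine versus aspartate: equal within rounding.
deamidated_n_vs_d = RESIDUE_MASS["N"] + DEAMIDATION_SHIFT - RESIDUE_MASS["D"]

# Glutamine versus lysine: a small but nonzero gap.
q_vs_k = RESIDUE_MASS["K"] - RESIDUE_MASS["Q"]

if __name__ == "__main__":
    print(f"L vs I:              {leu_vs_ile:.6f} Da")
    print(f"N[Deamidated] vs D:  {deamidated_n_vs_d:.6f} Da")
    print(f"Q vs K:              {q_vs_k:.6f} Da")
    # Hypothetical 1500 Da peptide, used only to put the Q/K gap in ppm terms.
    ppm = mass_gap_ppm(RESIDUE_MASS["K"], RESIDUE_MASS["Q"], 1500.0)
    print(f"Q vs K on a 1500 Da peptide: {ppm:.1f} ppm")
```

Under these assumed values, the first two pairs cannot be separated by precursor mass alone, whereas the third can be resolved only if the instrument's mass accuracy is better than the ppm gap for the peptide in question.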
In shotgun proteomics, sequence database searching is a common approach for protein identification, in which the pivotal step is to correctly assign peptides present in a given reference


protein sequence database to the corresponding tandem mass spectra (MS/MS spectra).19,20 However, identifying peptides that originate from proteins with sequence variants, alternatively spliced proteins, or proteins resulting from other genomic variation (e.g., RNA editing) is still challenging.20-22 In recent years, proteogenomic approaches have been applied to resolve this issue. Typically, in a proteogenomic study, the MS/MS spectra are searched against a customized database generated from genomic or transcriptomic nucleotide sequencing data to identify peptides of novel or unknown proteins.20,22-25 In addition to the aforementioned isobaric substitution analysis on missing proteins and PE5 proteins, we also analyzed variant peptides with SAAVs by performing pairwise comparisons before and after isobaric substitutions. There are different kinds of variants, but we considered only SAAVs here. When a SAAV is present, the function of a protein may be altered, leading to disease.26,27 For example, the L858R mutation in epidermal growth factor receptor (EGFR) can cause abnormal activation of the EGFR kinase, which has been observed in non-small cell lung cancer and responds to inhibition by gefitinib.28,29 Thus, investigating the impact of the aforementioned issues on variant peptides is important as well.

METHODS

Data Sets

The protein sequence data sets used in our analysis were downloaded from PeptideAtlas Tiered Human Integrated Search Proteome (THISP, http://www.peptideatlas.org/thisp/).30 THISP is built by combining proteome information from various external sources, such as neXtProt, UniProtKB, and cRAP, to create four tiered databases of differing complexity. Moreover, the sequence components in THISP are organized so that there are no redundant protein entries. The monthly release of the THISP database allows the download of


individual source components used for making the database. We downloaded all of the sources used for constructing the 2017-03 release of THISP, including neXtProt (2017-01 release). There are 16 sequence components combined to create the different tiers of THISP. Among them, we directly used seven sequence components, i.e., Nh-cRAP, UPCP, IMGT, Microbe, IPIorphan, Contribs, and RSDiffNP, as our protein sequence data sets. Next, we used neXtProt (2017-01 release), corresponding to nP20k and nPvarsplic in THISP, to generate four protein sequence data sets and two decoy protein data sets based on PE status and our selection criteria. Specifically, the first three protein data sets consist of proteins in canonical form at PE1, PE2-4, and PE5, denoted by PE1_Cano, PE234_Cano, and PE5_Cano, respectively, and the fourth protein data set consists of PE1 proteins including both canonical sequences and isoforms, denoted by PE1_All. Two decoy protein data sets were constructed from PE1 canonical proteins. The first decoy data set was built by reversing the target protein sequences, denoted by PE1_Cano_RevDec. The second decoy data set was obtained directly from THISP, denoted by PE1_Cano_ScramDec. All of the protein sequence data sets were digested in silico by trypsin (cleaving on the C-terminal side of K or R, except before P), following the cleavage specificity rules provided by ExPASy PeptideCutter and Keil’s rules.31 Each fully tryptic peptide, i.e., one without any missed cleavage, is regarded as an in silico peptide (sometimes simply called a peptide for convenience) and assigned an index that identifies its data set of origin. To find the unique in silico peptides in PE1_Cano, PE234_Cano, and PE5_Cano, we checked whether each peptide is unique with respect to the entire set of canonical protein sequences, i.e., the union of the above three data sets.
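The digestion and uniqueness check described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it applies only the basic trypsin rule (cleave after K or R unless the next residue is P) and omits the additional exceptions in Keil's rules, and the function names and toy sequences are hypothetical.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence: str) -> list[str]:
    """Return fully tryptic peptides (no missed cleavages) of one protein.

    Assumption: only the basic rule is applied, i.e. cleave C-terminal to
    K or R unless the next residue is P.
    """
    # Zero-width split points: just after K/R and not directly before P.
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def unique_peptides(proteins: dict[str, str]) -> dict[str, str]:
    """Map each peptide that occurs in exactly one protein to that protein."""
    owners: dict[str, set[str]] = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            owners[peptide].add(protein_id)
    return {
        peptide: next(iter(ids))
        for peptide, ids in owners.items()
        if len(ids) == 1
    }


if __name__ == "__main__":
    # Toy sequences with hypothetical identifiers, standing in for the union
    # of the canonical data sets.
    toy_proteins = {"P1": "MAKPLRGGKDE", "P2": "TTRGGKSS"}
    print(tryptic_digest(toy_proteins["P1"]))
    print(unique_peptides(toy_proteins))
```

In this toy input the peptide GGK occurs in both proteins and is therefore dropped, which mirrors the idea that uniqueness is judged against the union of all canonical sequences rather than within a single data set.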
By doing so, we constructed three unique peptide data sets for PE1, PE2-4, and PE5 proteins, denoted by PE1_Unique, PE234_Unique, and PE5_Unique, respectively. Note that PE5 proteins are dubious proteins; therefore, their unique peptides are


actually potential unique peptides. Here, we use the term “unique peptides of PE5 proteins” for simplicity. To construct variant peptide data sets, we used SAAV annotations (excluding INDEL variants) of PE1 canonical protein sequences with Gold data quality (error rate […]

[…] 0.2% of unique peptides of PE5 proteins became shared after Q/E substitution, where at most 0.327% became shared with UPCP. Similar results were obtained after S/T substitution: >0.2% of unique peptides of PE5 proteins overlapped with PE1_All (0.27%), UPCP (0.22%), and PE1Variant_Silver (0.327%). In addition, 0.25% and 0.30% of unique peptides of PE5 proteins alternatively mapped to PE1_All after N/D and D/E substitutions, respectively.

Figure 2. Pairwise comparison of unique peptides of PE5 proteins with the other fourteen data sets after introducing each of the eleven isobaric substitutions. The numbers and percentages of peptides in PE5_Unique that coincided with peptides in the other fourteen data sets before and after isobaric substitutions are listed in Table S3 (Supplementary File S2).
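The kind of pairwise comparison summarized here can be sketched as below. The sketch is an assumption-laden illustration, not the algorithm used in the paper (which is not reproduced in this text): it handles a single residue pair, treats the two residues as interchangeable at every position, and uses I/L only because leucine and isoleucine are named earlier as hard to distinguish. Substitutions that involve a modification would need additional bookkeeping that is not modeled, and all names and peptides are hypothetical.

```python
def canonical(peptide: str, pair: tuple[str, str]) -> str:
    """Collapse both residues of an interchangeable pair onto one symbol."""
    first, second = pair
    return peptide.replace(first, second)


def compare(query, reference, pair=("I", "L")):
    """Split query peptides into identical matches and newly shared ones.

    identical:    the peptide occurs verbatim in the reference data set.
    shared_after: the peptide matches a reference peptide only once the
                  residue pair is treated as interchangeable.
    """
    reference_set = set(reference)
    reference_canonical = {canonical(p, pair) for p in reference_set}
    identical = {p for p in query if p in reference_set}
    shared_after = {
        p
        for p in query
        if p not in reference_set and canonical(p, pair) in reference_canonical
    }
    return identical, shared_after


def percentage(subset, whole) -> float:
    """Percentage of the query data set that falls into a subset."""
    return 100.0 * len(subset) / len(whole)


if __name__ == "__main__":
    # Toy peptides standing in for a unique-peptide set and a reference set.
    query_peptides = ["AIDEK", "GGSLR", "TTPWK", "VVNQR"]
    reference_peptides = ["ALDEK", "GGSLR"]
    identical, shared_after = compare(query_peptides, reference_peptides)
    print("identical:", sorted(identical))
    print("shared after I/L:", sorted(shared_after))
    print(f"{percentage(shared_after, query_peptides):.1f}% became shared")
```

Keeping the identical matches separate from the newly shared ones reflects the before-and-after reporting used in the figure and table: a peptide counts as affected by a substitution only if it did not already coincide with a reference peptide.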


Analysis of Overall vs. Identical on Unique Peptides of Missing Proteins and PE5 Proteins

From the above results, only a very small portion of the unique peptides of missing proteins and PE5 proteins, i.e., PE234_Unique and PE5_Unique, matched peptides in the other fourteen data sets before and after each individual isobaric substitution. Therefore, we investigated the overall effect of multiple isobaric substitutions by applying the third algorithm described in Methods to the unique peptides of missing proteins and PE5 proteins, compared with the fourteen data sets mentioned above. The results of considering all eleven isobaric substitutions together with no substitution (i.e., Overall) are shown in Figure 3, in contrast to the results without any substitution (i.e., Identical only). When compared with all data sets except UPCP, RSDiffNP, and PE1Variant_Silver, none or a very small portion (