ASV-ID, a Proteogenomic Workflow to Predict Candidate Protein

Oct 5, 2018 - ASV-ID, a Proteogenomic Workflow to Predict Candidate Protein Isoforms based on Transcript Evidence. Seul-Ki Jeong , Chae-Yeon Kim , and...

Article

ASV-ID, a Proteogenomic Workflow to Predict Candidate Protein Isoforms based on Transcript Evidence Seul-Ki Jeong, Chae-Yeon Kim, and Young-Ki Paik J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00548 • Publication Date (Web): 05 Oct 2018 Downloaded from http://pubs.acs.org on October 6, 2018


Journal of Proteome Research

pr-2018-005483.R1 (Clean version)

ASV-ID, a Proteogenomic Workflow to Predict Candidate Protein Isoforms based on Transcript Evidence

Seul-Ki Jeong¹, Chae-Yeon Kim¹,² and Young-Ki Paik¹*

¹Yonsei Proteome Research Center and ²Department of Integrative Omics, Graduate School, Yonsei University, Sudaemoon-ku, Seoul, Korea

*Corresponding author: [email protected]

ABSTRACT
One of the goals of the Chromosome-centric Human Proteome Project (C-HPP) is to map and characterize the functions of protein isoforms produced by alternative splicing of genes. However, identifying alternative splice variants (ASVs) via mass spectrometry remains a major challenge, because ASVs usually contain highly homologous peptide sequences. A routine protein sequence analysis suggests that more than half of the investigated proteins do not generate two or more uniquely mapping peptides that would enable their isoforms to be distinguished. Here, we develop a new proteogenomics method, named “ASV-ID” (alternative splicing variants identification), which enables identification of ASVs by using a cell type–specific protein sequence database that is supported by RNA-Seq data. Using this workflow, we identify 1,935 distinct proteins under highly stringent conditions. In fact, transcript evidence for 841 of these proteins helps us distinguish them from other isoforms, even though these proteins are not predicted to yield two or more uniquely mapping peptides. We also demonstrate that ASV-ID enables detection of 19 differently expressed isoforms present in several cell lines. Thus, a new workflow using ASV-ID has the potential to map difficult, yet-to-be-identified protein isoforms in a simple and robust way.

KEYWORDS
alternative splicing variants, cell type–specific sequence database, proteogenomics, RNA-sequencing

INTRODUCTION
The major goals of the Chromosome-Centric Human Proteome Project (C-HPP) are to correlate the information passing between genome and proteome by detecting representative proteins encoded by genes present in human tissues.1-3 To this end, the C-HPP consortium aims to identify missing proteins that do not meet the criteria for mass spectrometry (MS) evidence set by the HPP guidelines.4 The scope of C-HPP also includes revealing various proteoforms and alternative splice variants (ASVs) encoded by representative open reading frames in the human genome, with evidence of the proteins’ existence supported by MS or antibody detection.1-5 Although the human genome contains ~20,230 predicted protein-coding genes, alternative-splicing events substantially increase transcript and protein diversity, contribute to many biological processes and sometimes cause disease, including cancer.6-8

Although two or more peptides are required to confidently identify a protein according to the HPP guidelines, it is not easy to verify proteins that belong to certain protein families, because they share highly homologous or almost identical sequences (e.g., ASVs). Verification by sequence database search is also often not supported by two or more uniquely mapping peptides.9-11 For example, cellular nucleic acid–binding protein (also known as zinc finger protein 9; gene name: CNBP; neXtProt Ac: NX_P62633) has eight known ASVs that share highly similar sequences. In theory, these ASVs could yield 10 peptides with ≥9 amino acids (a.a.), but none of them is a uniquely mapping peptide. However, nine of these peptides appear only in ASVs of CNBP, not in other proteins. ASVs usually present difficulties in fulfilling the requirements for confident protein identification because of their highly homologous sequences,10-12 even though alternative splicing plays an important role in biological processes.

There are several approaches to identifying more proteins using a single protease or a combination of proteases (e.g., LysargiNase, GluC, Chymotrypsin, Asp-N).12-14 Such approaches enable detection of diverse peptides with good sequence coverage. Detecting splice junction peptides is a very useful approach for identifying ASVs, although the production of splice junction peptides is limited by trypsin.12 We attempted to establish “ASV-ID” (alternative splicing variants identification) to identify ASVs more confidently without any additional protease treatment. Some studies have found that most protein-coding genes have a single dominant isoform depending on tissue or cell type,15 and others have tried to identify dominant, major, or principal isoforms and their functionality based on isoform-level networks.16-18 Therefore, to identify isoforms more efficiently, we used both the genomic dataset of RNA sequencing (RNA-Seq) and the proteomic dataset of MS analysis to reduce protein complexity by means of transcript expression information.19

In this study, we addressed the following questions and designed an integrated workflow to identify more ASVs. How many ASVs can be identified using current protein sequence databases? Can RNA-Seq information about transcript expression be useful for identifying ASVs? How many proteins and ASVs can be identified using ASV-ID? Finally, can any ASVs that are differentially expressed among cell types be detected? Here, we demonstrate that the ASV-ID method enables the identification of 1,935 distinct proteins present in different cell lines by analyzing publicly available datasets of five cell lines that contain both RNA-Seq and proteomics datasets.

MATERIALS AND METHODS
neXtProt sequence entries analysis: We used neXtProt (2018-1-17 release) as the major reference database of protein sequences and ASVs.20 Based on neXtProt annotation, we counted the number of protein-coding genes and the number of known ASVs; then we analyzed the number of confidently identifiable proteins using the peptide data processing methods. We use the term “protein group” to refer to all ASVs of one protein-coding gene, even if only one protein is known for that gene.

Peptide data processing: To annotate peptides that represent only one ASV (uniquely mapping peptide), several proteins (shared peptide), or ASVs of one protein-coding gene without being present in any other protein (group peptide), we created a Java (version 1.8.0, https://java.com) program with two modules: one constructs a tryptic peptide list from protein sequences, and the other counts the proteins that contain each peptide. These two modules were used to prepare a list of peptides that appear in more than one protein and to annotate the peptides’ uniqueness.21 We considered trypsin only as a protease and ignored peptides composed of [...] 1.0) in the five selected human cell lines, of which about 28% have two or more isoforms (2,932 of 10,539) that are known to be expressed. As expected, 91% of proteins with multiple isoforms express only one isoform, and they can be credibly identified with transcript support (Table 1).
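To illustrate the peptide annotation described under "Peptide data processing", the following Python sketch digests protein sequences in silico and labels each peptide as uniquely mapping (U), group (G) or shared (S). It is not the authors' Java program. The cleavage rule (after K or R, but not before P), the absence of missed cleavages, the 9-residue minimum length and the merging of I and L are assumptions made for this example, and the sequences and identifiers are invented toy data.

```python
import re
from collections import defaultdict

MIN_LEN = 9  # assumption: mirrors the ">=9 a.a." peptide length filter


def tryptic_peptides(sequence, min_len=MIN_LEN):
    """Fully tryptic peptides without missed cleavages (assumed rule:
    cut after K or R unless the next residue is P)."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_len}


def annotate(isoforms):
    """isoforms: {isoform_id: (gene, sequence)} -> {peptide: 'U' | 'G' | 'S'}."""
    owners = defaultdict(set)
    for iso_id, (gene, seq) in isoforms.items():
        for pep in tryptic_peptides(seq):
            # I and L are isobaric, so they are merged before comparison.
            owners[pep.replace("I", "L")].add((gene, iso_id))
    labels = {}
    for pep, hits in owners.items():
        genes = {gene for gene, _ in hits}
        if len(hits) == 1:
            labels[pep] = "U"   # found in exactly one isoform
        elif len(genes) == 1:
            labels[pep] = "G"   # found in several isoforms of a single gene
        else:
            labels[pep] = "S"   # found in isoforms of different genes
    return labels


# Hypothetical toy sequences, not real proteins.
toy = {
    "A-1": ("geneA", "MAAAAAAAAAKCCCCCCCCCCR"),
    "A-2": ("geneA", "MAAAAAAAAAKDDDDDDDDDDR"),
    "B-1": ("geneB", "CCCCCCCCCCRGGGGGGGGGGK"),
}
labels = annotate(toy)
for peptide, label in sorted(labels.items()):
    print(label, peptide)
```

In this toy set, the peptide shared by both isoforms of geneA is a group peptide: it proves that the gene product is present but cannot tell the two isoforms apart, which is the situation ASV-ID resolves with transcript evidence.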


Figure 1. (A) Number of known isoforms and related proteins. (B) Number of proteotypic peptides for identifying proteins and isoforms. (C) Schematic overview of the ASV-ID method, which was designed to confidently identify one protein isoform among several indistinguishable isoforms of one protein-coding gene supported by RNA-Seq information.


Figure 2. Workflow of ASV-ID method for identifying protein isoforms.

Table 1. Number of expressed proteins in each cell line.

| | neXtProt | A549 | HEK293 | HeLa | HepG2 | MCF7 |
|---|---|---|---|---|---|---|
| Human proteins (%)ᵃ | 20,230 | 5,263 (26) | 5,822 (29) | 4,349 (22) | 4,931 (24) | 4,905 (24) |
| Protein groups w/o isoforms (%)ᵇ | 9,691 | 2,187 (23) | 2,446 (25) | 1,875 (19) | 2,072 (21) | 2,028 (21) |
| Protein groups with isoforms (%)ᶜ | 10,539 | 3,076 (29) | 3,376 (32) | 2,474 (23) | 2,859 (27) | 2,877 (27) |
| Only one isoform is expressed (%)ᵈ | | 2,783 (91) | 3,063 (91) | 2,294 (93) | 2,589 (91) | 2,617 (91) |

ᵃ % of protein groups: percent of expressed protein groups in each cell line compared with all known 20,230 protein groups in neXtProt.

ᵇ % of protein groups without isoforms: percent of expressed protein groups in each cell line compared with all known 9,691 protein groups in neXtProt that have no known isoforms.

ᶜ % of protein groups with isoforms: percent of expressed protein groups in each cell line compared with all known 10,539 protein groups in neXtProt that have known isoforms.

ᵈ % of protein groups in which only one isoform is expressed: percent of expressed protein groups with isoforms in which only one of the several isoforms is expressed. For example, in the A549 cell line, 91% (2,783 of 3,076) of protein groups show this dominant isoform expression pattern.
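The relationships between the rows of Table 1 can be checked with a few lines of Python. The counts below are transcribed from the table; the variable and function names are ours. Because the published percentages are rounded, a recomputed value may differ from the printed one by a point in borderline cases.

```python
# (groups without isoforms, groups with isoforms,
#  groups with isoforms in which only one isoform is expressed)
TABLE1 = {
    "A549": (2187, 3076, 2783),
    "HEK293": (2446, 3376, 3063),
    "HeLa": (1875, 2474, 2294),
    "HepG2": (2072, 2859, 2589),
    "MCF7": (2028, 2877, 2617),
}
# First row of Table 1: all expressed protein groups per cell line.
REPORTED_TOTAL = {"A549": 5263, "HEK293": 5822, "HeLa": 4349,
                  "HepG2": 4931, "MCF7": 4905}
NEXTPROT_GROUPS = 20230  # all protein groups in neXtProt


def summarize(cell_line):
    """Recompute the total and two percentages for one cell line."""
    without_iso, with_iso, single = TABLE1[cell_line]
    total = without_iso + with_iso
    return {
        "total": total,
        "coverage_pct": round(100 * total / NEXTPROT_GROUPS, 1),
        "single_isoform_pct": round(100 * single / with_iso, 1),
    }


for name in TABLE1:
    print(name, summarize(name))
```

The first row of the table is the sum of the second and third rows for every cell line, and the share of multi-isoform groups expressing a single isoform stays above 90% throughout.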

ASV-ID improves identification of peptides and proteins
To determine the identified proteins and their isoform distribution among the five cell lines, we analyzed 30 raw MS files obtained from their peptide analysis. The raw files were converted to mzML using msconverter (TPP) and analyzed using TPP with the COMET search engine, with consideration of I/L substitution. From 462,185 peptide-spectrum matches (PSMs), 17,969 peptides were identified under stringent conditions (i.e., 0.1% false discovery rate (FDR) at the peptide level, length ≥9 a.a., with three or more supporting PSMs) (Table 2, Supplementary Table 2). We categorized these peptides into four groups (U, G, S and E). As described above, if only one isoform is expressed by the corresponding canonical protein, we do not count any other isoforms generated by the protein and therefore apply this information to that particular isoform to represent the canonical protein for quantification. Therefore, group U and E peptides can be used to support protein identification (Supplementary Figure 1).

Based on the number of peptides of each group that they produce, identified proteins were classified into six groups (I, Iue, Ie, O, Oe and S). The first group is named I (Identified), for proteins identified by two or more U group peptides. Proteins identified by only one uniquely mapping peptide are classified into group O (One-hit wonder). We also defined groups Iue (Identified by U and E) and Ie (Identified by several Es). Proteins are classified into Iue when O group proteins have one or more supporting E group peptides. Ie group proteins are identified by two or more E group peptides. In the same manner, group Oe (One E-peptide hit) contains proteins identified by only one E group peptide. Other proteins are considered S (Shared) group proteins. We consider proteins in the I, Iue, and Ie groups as credibly identified proteins. In most proteomic studies, only I-group proteins are usually considered as identified canonical proteins, but with transcript expression information we can identify more proteins that do not have enough supporting peptides. Without expression information, Iue-group proteins would be considered as group O and Ie-group proteins as group S, so none of them could be credibly identified.

Thus, we identified 1,935 proteins among the 5 cell lines, and the identification of 841 of them was improved by transcript expression information (Table 2 and Supplementary Tables 2–8). When we closely examined the identified proteins from each cell line, we found that 369 proteins are common to all 5 cell lines, whereas 747 proteins are present in only 1 particular cell line (Figure 3). The remaining 323, 231, and 265 proteins are common to 2, 3, and 4 cell lines, respectively.
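The six-group classification described above can be written as a small decision function. The Python sketch below is our reading of the rules in the text, not the authors' implementation. It takes the number of U peptides and the number of E peptides observed for a protein, where E peptides are understood here as peptides that become assignable to one isoform once transcript expression is taken into account (an interpretation, since the definition is not given in this section).

```python
CREDIBLE = {"I", "Iue", "Ie"}  # groups treated as credibly identified


def classify_protein(n_unique, n_expr):
    """Return the protein group for given counts of U and E peptides."""
    if n_unique >= 2:
        return "I"                      # two or more uniquely mapping peptides
    if n_unique == 1:
        return "Iue" if n_expr >= 1 else "O"
    if n_expr >= 2:
        return "Ie"                     # several E peptides, no U peptide
    if n_expr == 1:
        return "Oe"                     # a single E peptide
    return "S"                          # only shared evidence


def classify_without_expression(n_unique, n_expr):
    """Same rules with E peptides ignored, i.e. no transcript information."""
    return classify_protein(n_unique, 0)


for counts in [(2, 0), (1, 1), (0, 3), (1, 0), (0, 1), (0, 0)]:
    with_expr = classify_protein(*counts)
    without = classify_without_expression(*counts)
    print(counts, with_expr, "->", without, with_expr in CREDIBLE)
```

Dropping the E peptides reproduces the fallback stated in the text: Iue proteins revert to group O and Ie proteins to group S, so they leave the credibly identified set.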

Table 2. Number of identified peptides and proteins present in each cell line.

| | A549 | HEK293 | HeLa | HepG2 | MCF7 |
|---|---|---|---|---|---|
| All PSM entries | 83,217 | 102,753 | 100,806 | 89,415 | 85,994 |
| All distinct peptides | 36,150 | 40,204 | 40,208 | 36,479 | 42,910 |
| Peptides 0.1% FDR, with 3 or more PSM | 7,329 | 9,451 | 9,955 | 8,635 | 7,191 |
| Proteins 1% FDR | 2,129 | 2,885 | 2,561 | 2,364 | 2,062 |
| Group I | 572 | 739 | 742 | 687 | 559 |
| Group Iue (Improved by expression information) | 55 | 87 | 67 | 68 | 54 |
| Group Ie (Improved by expression information) | 233 | 326 | 286 | 278 | 238 |
| Group O | 407 | 577 | 434 | 419 | 389 |
| Group Oe (Improved by expression information) | 217 | 294 | 211 | 206 | 184 |
| Group S | 644 | 862 | 817 | 703 | 637 |
| Proteins 1% FDR, with 2 or more supporting peptides (I+Iue+Ie) | 860 | 1,152 | 1,095 | 1,033 | 851 |
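As a consistency check on Table 2, the sketch below recomputes the credibly identified proteins (I + Iue + Ie) per cell line, the number gained through expression information (Iue + Ie), and the PSM total quoted in the Results text. All counts are transcribed from the table and the text; the names are ours.

```python
# cell line: (PSMs, group I, group Iue, group Ie, reported I+Iue+Ie)
TABLE2 = {
    "A549": (83217, 572, 55, 233, 860),
    "HEK293": (102753, 739, 87, 326, 1152),
    "HeLa": (100806, 742, 67, 286, 1095),
    "HepG2": (89415, 687, 68, 278, 1033),
    "MCF7": (85994, 559, 54, 238, 851),
}

# Proteins grouped by the number of cell lines they were found in (Results text).
FOUND_IN_N_CELL_LINES = {1: 747, 2: 323, 3: 231, 4: 265, 5: 369}


def credible(cell_line):
    """Proteins supported by two or more peptides: I + Iue + Ie."""
    _, i, iue, ie, _ = TABLE2[cell_line]
    return i + iue + ie


def gained_by_expression(cell_line):
    """Proteins that are credible only thanks to E peptides: Iue + Ie."""
    _, _, iue, ie, _ = TABLE2[cell_line]
    return iue + ie


total_psms = sum(row[0] for row in TABLE2.values())
total_proteins = sum(FOUND_IN_N_CELL_LINES.values())

for name, row in TABLE2.items():
    print(name, credible(name), row[4], gained_by_expression(name))
print(total_psms, total_proteins)
```

The per-cell-line sums match the last row of the table, the PSM counts add up to the 462,185 PSMs reported, and the overlap categories add up to the 1,935 identified proteins.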

Figure 3. Number of the commonly identified proteins in each cell line and cell type-specific proteins.

IDH2, an example case of a different isoform identified from five different cell lines
We found 19 cases of differently detected isoforms from 5 different cell lines (Table 3). Isoforms of the same gene may have different functions, so it is crucial to annotate individual isoforms, and many methods have been suggested for doing so.34 Because isoforms are highly homologous, machine learning, structural matching,35 and network-based approaches16,18,36 have been more successful than methods based on sequence homology alone.36 Among the 19 cases of differently detected isoforms, we analyzed the function of the individual isoforms of IDH2, a PE1 protein in neXtProt, using protein–protein interaction-based methods.

IDH2 encodes a mitochondrial metabolic enzyme that is known to convert isocitrate into 2-oxoglutarate in the presence of NADP+ (to NADPH) and to play an important role in amino acid metabolism and energy production through the citric acid cycle.37 The IDH2 gene is located on Chr. 15q26.1 and has two known ASVs. Isoform 1 is a 452 a.a.-long protein (NX_P48735-1 as IDH2L), whereas isoform 2 is 400 a.a. long (NX_P48735-2 as IDH2S). IDH2S lacks the first 52 a.a. that are present in IDH2L. To credibly identify IDH2L, it is critical to detect the uniquely mapping peptides covering the first 52 a.a. IDH2S lacks non-nested uniquely mapping peptides when using trypsin as a protease, so it is difficult to distinguish from IDH2L (Supplementary Figure 2, Supplementary Table 9).

When we used RNA-Seq expression data, IDH2L was found to be expressed in HEK293 (Fragments Per Kilobase of transcript per Million mapped reads [FPKM] = 1.44) and HepG2 (FPKM = 85.62), but not in the A549 and HeLa cell lines. Similarly, IDH2S is expressed in A549 (FPKM = 19.76) and HeLa (FPKM = 7.17) but not in HEK293 and HepG2. This led us to consider that IDH2L can be identified in HEK293 and HepG2, whereas IDH2S can be identified in A549 and HeLa. The first 52 a.a., which differ between the two isoforms, contain the mitochondrial transit sequence (a.a. 1 to 39), an N-terminal presequence that directs the protein to the mitochondrion. The abnormal expression of IDH2 has been reported in several types of cancer,37 but the effect of N-terminal deletion of IDH2 has not yet been reported. Many studies have reported that changes in mitochondrial content act as important cell signals that contribute to global regulation of gene expression and alternative splicing,38,39 so this kind of isoform change could be valuable for further study. We examined the location of 20 proteins that are known to interact with IDH2 and found that 14 are localized in the cytoplasm or nucleus, whereas 5 are localized in the mitochondrion (Supplementary Figure 2, Supplementary Table 10). Thus, the presence or absence of the transit peptide, which distinguishes IDH2L from IDH2S, could affect the localization of the two IDH2 isoforms by restricting their interacting partners.
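The isoform selection in the IDH2 example can be expressed as a simple filter on transcript expression. The Python sketch below uses the FPKM values quoted above. The expression cutoff of 1.0 is an assumption of this sketch, and isoforms described as not expressed are simply left out rather than given invented values.

```python
FPKM_THRESHOLD = 1.0  # assumed cutoff for calling an isoform "expressed"

# Reported values only; MCF7 is not mentioned for IDH2 in the text.
IDH2_FPKM = {
    "HEK293": {"NX_P48735-1": 1.44},   # IDH2L
    "HepG2": {"NX_P48735-1": 85.62},   # IDH2L
    "A549": {"NX_P48735-2": 19.76},    # IDH2S
    "HeLa": {"NX_P48735-2": 7.17},     # IDH2S
}


def expressed_isoforms(fpkm_by_isoform, threshold=FPKM_THRESHOLD):
    """Isoforms whose transcript expression exceeds the cutoff."""
    return sorted(iso for iso, value in fpkm_by_isoform.items() if value > threshold)


def assignable_isoform(fpkm_by_isoform, threshold=FPKM_THRESHOLD):
    """Return the single expressed isoform of a protein group, or None.

    Only when exactly one isoform is expressed can peptides shared within
    the group be attributed to that isoform.
    """
    hits = expressed_isoforms(fpkm_by_isoform, threshold)
    return hits[0] if len(hits) == 1 else None


for cell_line, fpkm in IDH2_FPKM.items():
    print(cell_line, assignable_isoform(fpkm))
```

If two isoforms of a group pass the cutoff in the same cell line, the function returns None, and the peptides shared within the group stay ambiguous.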

Table 3. 19 cases of differently detected isoforms from 5 different cell lines.

| No. | Description (Gene Name) | neXtProt Ac. |
|---|---|---|
| 1 | Cytosolic acyl coenzyme A thioester hydrolase (ACOT7) | NX_O00154-4, NX_O00154-7 |
| 2 | N(G),N(G)-dimethylarginine dimethylaminohydrolase (DDAH1) | NX_O94760-1, NX_O94760-2 |
| 3 | Glucose-6-phosphate 1-dehydrogenase (G6PD) | NX_P11413-1, NX_P11413-3 |
| 4 | Sarcoplasmic/endoplasmic reticulum calcium ATPase 2 (ATP2A2) | NX_P16615-1, NX_P16615-3 |
| 5 | Heterogeneous nuclear ribonucleoproteins A2/B1 (HNRNPA2B1) | NX_P22626-1, NX_P22626-2 |
| 6 | DNA replication licensing factor MCM3 (MCM3) | NX_P25205-1, NX_P25205-2 |
| 7 | Large proline-rich protein BAG6 (BAG6) | NX_P46379-2, NX_P46379-5 |
| 8 | Isocitrate dehydrogenase [NADP], mitochondrial (IDH2) | NX_P48735-1, NX_P48735-2 |
| 9 | Heterogeneous nuclear ribonucleoprotein A3 (HNRNPA3) | NX_P51991-1, NX_P51991-2 |
| 10 | Adenylate kinase 2, mitochondrial (AK2) | NX_P54819-1, NX_P54819-2 |
| 11 | Caldesmon (CALD1) | NX_Q05682-4, NX_Q05682-5 |
| 12 | Heterogeneous nuclear ribonucleoprotein D0 (HNRNPD) | NX_Q14103-1, NX_Q14103-3 |
| 13 | Splicing factor 1 (SF1) | NX_Q15637-1, NX_Q15637-4 |
| 14 | Probable ATP-dependent RNA helicase DDX17 (DDX17) | NX_Q92841-1, NX_Q92841-3 |
| 15 | Translation initiation factor eIF-2B subunit gamma (EIF2B3) | NX_Q9NR50-1, NX_Q9NR50-2 |
| 16 | UDP-glucose:glycoprotein glucosyltransferase 1 (UGGT1) | NX_Q9NYU2-1, NX_Q9NYU2-2 |
| 17 | Poly(U)-binding-splicing factor PUF60 (PUF60) | NX_Q9UHX1-1, NX_Q9UHX1-2 |
| 18 | RuvB-like 2 (RUVBL2) | NX_Q9Y230-1, NX_Q9Y230-2 |
| 19 | Cofilin-2 (CFL2) | NX_Q9Y281-1, NX_Q9Y281-3 |

The original table also has one column per cell line (A549, HEK293, HeLa, HepG2, MCF7), in which an "O" marks each cell line where the isoform was detected; the positions of these marks are not recoverable here.

CONCLUSIONS
Here we demonstrate that ASV-ID, a proteogenomic analysis method, can be used to confidently identify protein isoforms by using a cell type–specific protein sequence database supported by RNA-Seq information on the total transcript expression in each cell line. From the results of proteotypic peptide analysis (via trypsin digestion) of the neXtProt human protein sequences, we calculated that 44% of proteins are canonically identifiable, but the others are still difficult to identify because they lack credible peptides. Thus, about half of ASVs are not easily distinguishable from the other isoforms in their protein group because of the highly similar sequences they share. The RNA-Seq data covered about 25% of the known proteome, and 91% of the protein families covered by the RNA-Seq dataset show a pattern in which only one dominant isoform is expressed. We identified 1,935 distinct proteins from five cell lines under highly stringent conditions. Of these, 841 proteins (about 43%) were additionally identified with proper supporting peptides and could not usually be identified without transcript expression information. Using the ASV-ID workflow, we were able to detect 19 cases of differently expressed isoforms present in the 5 different cell lines (e.g., the IDH2 case).

Note that the number of ASVs varies among databases. In the case of IDH2, neXtProt (https://www.nextprot.org/entry/NX_P48735) and UniProt (https://www.uniprot.org/uniprot/P48735) show only two isoforms, whereas NCBI (https://www.ncbi.nlm.nih.gov/gene/3418) and Ensembl list three and six isoforms, respectively. neXtProt also contains the smallest number of proteins (42,241) compared with UniProt (release 2018_07, 73,112 proteins) and Ensembl (in GRCh38 genome assembly, 79,803 protein-coding genes). Although the number of ASVs in neXtProt is smaller than in other databases, we decided to use neXtProt because its data are highly curated and of high quality, which makes it easy to maintain consistency when comparing data produced by C-HPP-related projects. Thus, as long as transcript expression information is provided, the ASV-ID method enables identification of protein isoforms, allowing biomedical researchers to gain better insight into the biological effects of the identified isoforms present in various tissues and cells.

ACKNOWLEDGMENTS
This work was supported by grants from the Korean Ministry of Health and Welfare: [HI13C2098]-International Consortium Project and [HI16C0257] (to Y.-K. Paik).

ABBREVIATIONS
C-HPP, Chromosome-Centric Human Proteome Project; FDR, False Discovery Rate; FPKM, Fragments Per Kilobase of transcript per Million mapped reads; HPP, Human Proteome Project; IDH2, Isocitrate dehydrogenase [NADP], mitochondrial; IDH2L, IDH2 Long form; IDH2S, IDH2 Short form; MS, Mass Spectrometry; PSMs, Peptide-Spectrum Matches; RNA-Seq, RNA-sequencing; TPP, Trans-Proteomic Pipeline.

NOTES
The authors declare no competing financial interest.

REFERENCES (1) Paik, YK.; Jeong, SK.; Omenn, GS.; Uhlen, M.; Hanash, S.; Cho, SY.; Lee, HJ.; Na, K.; Choi, EY.; Yan, F.; Zhang, F; Zhang, Y; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, EW.; Kim, H.; Kwon, JY.; Aebersold, R.; Bairoch, A.; Taylor, AD.; Kim, KY.; Lee, EY.; Hochstrasser, D.; Legrain, P.; Hancock, WS. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol. 2012, 30, 221-223. (2) Paik, YK.; Omenn, GS.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; Aebersold, R.; Bairoch, A.; Yamamoto, T.; Legrain, P.; Lee, HJ.; Na, K.; Jeong, SK.; He, F.; Binz, PA.; Nishimura, T.; Keown, P.; Baker, MS.; Yoo, JS.; Garin, J.; Archakov, A.; Bergeron, J.; Salekdeh, GH.; Hancock, WS. Standard guidelines for the chromosome-centric human proteome project. J

Proteome Res. 2012, 11, 2005-2013. 18

ACS Paragon Plus Environment

Page 18 of 26

Page 19 of 26 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(3) Hancock, W.; Omenn, GS.; Legrain, P.; Paik, YK. Proteomics, human proteome project, and chromosomes. J Proteome Res. 2011, 10, 210. (4) Deutsch, EW.; Overall, CM.; Van Eyk, JE.; Baker, MS.; Paik, YK.; Weintraub, ST.; Lane, L.; Martens, L.; Vandenbrouck, Y.; Kusebauch, U.; Hancock, WS.; Hermjakob, H.; Aebersold, R.; Moritz, RL.; Omenn, GS. Human Proteome Project Mass Spectrometry Data Interpretation Guidelines 2.1. J Proteome Res. 2016, 15, 3961-3970 (5) Aebersold, R.; Agar, JN.; Amster, IJ.; Baker, MS.; Bertozzi, CR.; Boja, ES.; Costello, CE.; Cravatt, BF.; Fenselau, C.; Garcia, BA.; Ge, Y.; Gunawardena, J.; Hendrickson, RC.; Hergenrother, PJ.; Huber, CG.; Ivanov, AR.; Jensen, ON.; Jewett, MC.; Kelleher, NL.; Kiessling, LL.; Krogan, NJ.; Larsen, MR.; Loo, JA.; Ogorzalek Loo, RR.; Lundberg, E.; MacCoss, MJ.; Mallick, P.; Mootha, VK.; Mrksich, M.; Muir, TW.; Patrie, SM.; Pesavento, JJ.; Pitteri, SJ.; Rodriguez, H.; Saghatelian, A.; Sandoval, W.; Schlüter, H.; Sechi, S.; Slavoff SA.; Smith, LM.; Snyder, MP.; Thomas, PM.; Uhlén, M.; Van Eyk, JE.; Vidal, M.; Walt, DR.; White, FM.; Williams, ER.; Wohlschlager, T.; Wysocki, VH.; Yates, NA.; Young, NL.; Zhang, B. How many human proteoforms are there?. Nat Chem Biol. 2018, 14, 206-214. (6) Klinck, R.; Bramard, A.; Inkel, L.; Dufresne-Martin, G.; Gervais-Bird, J.; Madden, R.; Paquet, ER.; Koh, C.; Venables, JP.; Prinos, P.; Jilaveanu-Pelmus, M.; Wellinger, R.; Rancourt, C.; Chabot, B. Multiple alternative splicing markers for ovarian cancer. Abou Elela

S. Cancer Res. 2008, 68, 657-663. (7) Thorsen, K.; Sørensen, KD.; Brems-Eskildsen, AS.; Modin, C.; Gaustadnes, M.; Hein, AM.; Kruhøffer, M.; Laurberg, S.; Borre, M.; Wang, K.; Brunak, S.; Krainer, AR.; Tørring, N.; Dyrskjøt, L.; Andersen, CL.; Orntoft, TF. Alternative splicing in colon, bladder, and 19

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

prostate cancer identified by exon array analysis. Mol Cell Proteomics. 2008, 7, 1214-1224. (8) Menon, R.; Zhang, Q.; Zhang, Y.; Fermin, D.; Bardeesy, N.; DePinho, RA.; Lu, C.; Hanash, SM.; Omenn, GS.; States, DJ. Identification of novel alternative splice isoforms of circulating proteins in a mouse model of human pancreatic cancer. Cancer Res. 2009, 69, 300-309. (9) Choong, WK.; Chang, HY.; Chen, CT.; Tsai, CF.; Hsu, WL.; Chen, YJ.; Sung, TY. Informatics View on the Challenges of Identifying Missing Proteins from Shotgun Proteomics. J Proteome Res. 2015, 14, 5396-5407. (10) Lane, L.; Bairoch, A.; Beavis, RC.; Deutsch, EW.; Gaudet, P.; Lundberg, E.; Omenn, GS. Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. J Proteome Res. 2014, 13, 15-20. (11) Omenn, GS. The strategy, organization, and progress of the HUPO Human Proteome Project. J Proteomics. 2014, 100, 3-7. (12) Wang, X.; Codreanu, SG.; Wen, B.; Li, K.; Chambers, MC.; Liebler, DC.; Zhang, B. Detection of Proteome Diversity Resulted from Alternative Splicing is Limited by Trypsin Cleavage Specificity. Mol Cell Proteomics. 2018, 17, 422-430. (13) Wang, Y.; Chen, Y.; Zhang, Y.; Wei, W.; Li, Y.; Zhang, T.; He, F.; Gao, Y.; Xu, P. MultiProtease Strategy Identifies Three PE2 Missing Proteins in Human Testis Tissue. J Proteome

Res. 2017, 16, 4352-4363. (14) Omenn, GS.; Lane, L.; Overall, CM.; Corrales, FJ.; Schwenk, JM.; Paik, YK.; Van Eyk, JE.; Liu, S.; Snyder, M.; Baker, MS.; Deutsch, EW. Progress on Identifying and 20

ACS Paragon Plus Environment

Page 20 of 26

Page 21 of 26 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Characterizing the Human Proteome: 2018 Metrics from the HUPO Human Proteome Project.

J Proteome Res. 2018, DOI: 10.1021/acs.jproteome.8b00441. (15) Ezkurdia, I.; Rodriguez, JM.; Carrillo-de Santa Pau, E.; Vázquez, J.; Valencia, A.; Tress, ML. Most highly expressed protein-coding genes have a single dominant isoform. J

Proteome Res. 2015, 14, 1880-1887. (16) Li, HD.; Menon, R.; Omenn, GS.; Guan, Y. Revisiting the identification of canonical splice isoforms through integration of functional genomics and proteomics evidence.

Proteomics. 2014, 14, 2709-18. (17) Li, HD.; Omenn, GS.; Guan, Y. MIsoMine: a genome-scale high-resolution data portal of expression, function and networks at the splice isoform level in the mouse. Database

(Oxford). 2015, 2015:bav045. (18) Li, HD.; Menon, R.; Govindarajoo, B.; Panwar, B.; Zhang, Y.; Omenn, GS.; Guan, Y. Functional Networks of Highest-Connected Splice Isoforms: From The Chromosome 17 Human Proteome Project. J Proteome Res. 2015, 14,3484-91. (19) Wang, X.; Slebos, RJ.; Wang, D.; Halvey, PJ.; Tabb, DL.; Liebler, DC.; Zhang, B. Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res. 2012, 11, 1009-1017. (20) Gaudet, P.; Michel, PA.; Zahn-Zabal, M.; Britan, A.; Cusin, I.; Domagalski, M.; Duek, PD.; Gateau, A.; Gleizes, A.; Hinard, V.; Rech de Laval, V.; Lin, J.; Nikitin, F.; Schaeffer, M.; Teixeira, D.; Lane, L.; Bairoch, A. The neXtProt knowledgebase on human proteins: 2017 update. Nucleic Acids Res. 2017, 45, D177-D182. 21


(21) Jeong, SK.; Hancock, WS.; Paik, YK. GenomewidePDB 2.0: A Newly Upgraded Versatile Proteogenomic Database for the Chromosome-Centric Human Proteome Project. J Proteome Res. 2015, 14, 3710-9.

(22) Uhlén, M.; Fagerberg, L.; Hallström, BM.; Lindskog, C.; Oksvold, P.; Mardinoglu, A.; Sivertsson, Å.; Kampf, C.; Sjöstedt, E.; Asplund, A.; Olsson, I.; Edlund, K.; Lundberg, E.; Navani, S.; Szigyarto, CA.; Odeberg, J.; Djureinovic, D.; Takanen, JO.; Hober, S.; Alm, T.; Edqvist, PH.; Berling, H.; Tegel, H.; Mulder, J.; Rockberg, J.; Nilsson, P.; Schwenk, JM.; Hamsten, M.; von Feilitzen, K.; Forsberg, M.; Persson, L.; Johansson, F.; Zwahlen, M.; von Heijne, G.; Nielsen, J.; Pontén, F. Proteomics. Tissue-based map of the human proteome. Science. 2015, 347, 1260419.

(23) Geiger, T.; Wehner, A.; Schaab, C.; Cox, J.; Mann, M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol Cell Proteomics. 2012, 11, M111.014050.

(24) Bolger, AM.; Lohse, M.; Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014, 30, 2114-2120.

(25) Trapnell, C.; Roberts, A.; Goff, L.; Pertea, G.; Kim, D.; Kelley, DR.; Pimentel, H.; Salzberg, SL.; Rinn, JL.; Pachter, L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012, 7, 562-578.

(26) Kim, D.; Pertea, G.; Trapnell, C.; Pimentel, H.; Kelley, R.; Salzberg, SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14, R36.


(27) Trapnell, C.; Williams, BA.; Pertea, G.; Mortazavi, A.; Kwan, G.; van Baren, MJ.; Salzberg, SL.; Wold, BJ.; Pachter, L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28, 511-515.

(28) Li, W.; Cowley, A.; Uludag, M.; Gur, T.; McWilliam, H.; Squizzato, S.; Park, YM.; Buso, N.; Lopez, R. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res. 2015, 43(W1), W580-584.

(29) Deutsch, EW.; Mendoza, L.; Shteynberg, D.; Farrah, T.; Lam, H.; Tasman, N.; Sun, Z.; Nilsson, E.; Pratt, B.; Prazen, B.; Eng, JK.; Martin, DB.; Nesvizhskii, AI.; Aebersold, R. A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010, 10, 1150-1159.

(30) Eng, JK.; Jahan, TA.; Hoopmann, MR. Comet: an open-source MS/MS sequence database search tool. Proteomics. 2013, 13, 22-4.

(31) Nesvizhskii, AI.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods. 2007, 4, 787-97.

(32) Shteynberg, D.; Deutsch, EW.; Lam, H.; Eng, JK.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, RL.; Aebersold, R.; Nesvizhskii, AI. iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol Cell Proteomics. 2011, 10, M111.007690.

(33) Nesvizhskii, AI.; Aebersold, R. Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discov Today. 2004, 9, 173-181.


(34) Li, HD.; Omenn, GS.; Guan, Y. A proteogenomic approach to understand splice isoform functions through sequence and expression-based computational modeling. Brief Bioinform. 2016, 17, 1024-1031.

(35) Roy, A.; Kucukural, A.; Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc. 2010, 5, 725-38.

(36) Zhang, C.; Freddolino, PL.; Zhang, Y. COFACTOR: improved protein function prediction by combining structure, sequence and protein-protein interaction information. Nucleic Acids Res. 2017, 45(W1), W291-W299.

(37) Lv, Q.; Xing, S.; Li, Z.; Li, J.; Gong, P.; Xu, X.; Chang, L.; Jin, X.; Gao, F.; Li, W.; Zhang, G.; Yang, J.; Zhang, X. Altered expression levels of IDH2 are involved in the development of colon cancer. Exp Ther Med. 2012, 4, 801-806.

(38) Guantes, R.; Rastrojo, A.; Neves, R.; Lima, A.; Aguado, B.; Iborra, FJ. Global variability in gene expression and alternative splicing is modulated by mitochondrial content. Genome Res. 2015, 25, 633-644.

(39) Maracchioni, A.; Totaro, A.; Angelini, DF.; Di Penta, A.; Bernardi, G.; Carrì, MT.; Achsel, T. Mitochondrial damage modulates alternative splicing in neuronal cells: implications for neurodegeneration. J Neurochem. 2007, 100, 142-153.


For Table of Contents/Graphic Abstract


SUPPORTING INFORMATION

Supplementary Table 1. Proteins having any tryptic peptide nine or more amino acids long.

Supplementary Table 2. Identified peptides present in 5 cell lines.

Supplementary Table 3. Identified proteins present in 5 cell lines.

Supplementary Table 4. Identified proteins present in the A549 cell line.

Supplementary Table 5. Identified proteins present in the HEK293 cell line.

Supplementary Table 6. Identified proteins present in the HeLa cell line.

Supplementary Table 7. Identified proteins present in the HepG2 cell line.

Supplementary Table 8. Identified proteins present in the MCF7 cell line.

Supplementary Table 9. All possible tryptic peptides of IDH2L and IDH2S.

Supplementary Table 10. Twenty known interaction partners of IDH2 and their cellular localization.

Supplementary Figure 1. Schematic description of peptide classification.

Supplementary Figure 2. IDH2 example.
