Article

Proteomic validation of transcript isoforms, including those assembled from RNA-Seq data
Aidan P. Tay, Chi Nam Ignatius Pang, Natalie A. Twine, Gene Hart-Smith, Linda Harkness, Moustapha Kassem, and Marc R. Wilkins
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/pr5011394 • Publication Date (Web): 11 May 2015

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Journal of Proteome Research

Proteomic validation of transcript isoforms, including those assembled from RNA-Seq data

Aidan P. Tay,1,2,‡ Chi Nam Ignatius Pang,1,2,‡ Natalie A. Twine,1,2 Gene Hart-Smith,1,2 Linda Harkness,3 Moustapha Kassem,3 Marc R. Wilkins,1,2,*

1. Systems Biology Initiative, The University of New South Wales, Sydney, New South Wales 2052, Australia
2. School of Biotechnology and Biomolecular Sciences, The University of New South Wales, Sydney, New South Wales 2052, Australia
3. Endocrine Research Laboratory (KMEB), Department of Endocrinology and Metabolism, Odense University Hospital & University of Southern Denmark, Odense 5230, Denmark

KEYWORDS RNA-seq, alternative splicing, splice-junction peptides, isoform-specific peptides, proteotypic peptides, mesenchymal stem cell

ABSTRACT

Human proteome analysis now requires an understanding of protein isoforms. We recently published the PG Nexus pipeline, which facilitates high confidence validation of exons and splice junctions by integrating genomics and proteomics data. Here we comprehensively explore how RNA-seq transcriptomics data, and proteomic analysis of the same sample, can identify protein isoforms. RNA-seq data from human mesenchymal stem cells (hMSC) were analysed with our new TranscriptCoder tool to generate a database of protein isoform sequences. MS/MS data from matching hMSC samples were then searched against the TranscriptCoder-derived database, along with the Ensembl and neXtProt databases. Querying the TranscriptCoder-derived or Ensembl database could unambiguously identify ~450 protein isoforms, with isoform-specific proteotypic peptides, including candidate hMSC-specific isoforms for the genes DPYSL2 and FXR1. Where isoform-specific peptides did not exist, groups of non-isoform-specific proteotypic peptides could specifically identify many isoforms. In both the above cases, isoforms will be detectable with targeted MS/MS assays. Unfortunately, our analysis also revealed that some isoforms will be difficult to identify unambiguously as they do not have peptides that are sufficiently distinguishing. We co-visualise mRNA isoforms and peptides in a genome browser to illustrate the above situations. Mass spectrometry data are available via ProteomeXchange (PXD001449).

INTRODUCTION

The majority of human genes are expressed as alternatively spliced transcripts. This serves to diversify the function of protein isoforms in different cells and tissues.1, 2 One approach to better
understand the function of alternatively spliced transcripts and their tissue-specific regulation is through the proteogenomic integration of proteomics, genomics and other data.3, 4 For example, Li et al. (2014)5 defined mammalian canonical isoforms and inferred their functions based on expression patterns across tissues and high connectivity in biological networks. Notwithstanding this, tissue- and cell-specific protein isoforms can be difficult to detect and validate as they are often present in low abundance and/or in a minority of cells or tissues.6 An emerging challenge of the Chromosome-Centric Human Proteome Project (C-HPP) is therefore to catalogue alternatively spliced isoforms for protein coding genes in the human genome.7 Comprehensive validation of alternatively spliced transcripts, as proteins, remains a challenge. It requires the detection of isoform-specific peptides that map unambiguously to specific protein isoforms.8 The difficulty of comprehensive protein isoform validation was demonstrated by Kim et al. (2014),6 in which isoform-specific peptides for only 2,861 protein isoforms out of a total of 20,026 alternatively spliced transcripts were identified from the analysis of 30 types of normal human tissues. 
The small fraction of protein isoforms validated was due, at least in part, to the fact that only 15% of peptides are concomitantly isoform-specific and proteotypic and thus reliably seen in LC-MS/MS analyses.8 Proteotypic and isoform-specific peptides are nevertheless crucial, and synthetic and isotope-labelled versions of the peptides can be used for selected reaction monitoring (SRM) or equivalent targeted assays.9-12 To efficiently validate alternatively spliced transcripts and protein isoforms, software pipelines are required for co-analysis of genomics, transcriptomics and proteomics data.12-15 We recently developed the Proteomic-Genomic Nexus (PG Nexus) pipeline to enable the co-visualization and co-analysis of peptides from Mascot MS/MS ion searches with RNA-seq transcriptomics and genomics data.13 The Samifier tool, in this pipeline, maps each peptide to its exact genomic
location and converts all mapping data into a .SAM format (hence the name) for visualization in the Integrative Genomics Viewer (IGV).16 This enables co-visualization of transcript sequences, RNA-seq reads and peptides with the genome and facilitates high-confidence visual validation of exons and exon-exon junctions of spliced transcripts. The SpliceVista pipeline developed by Zhu et al. (2014)14 also visualizes the location of peptides aligned to the transcript sequence, but unlike Samifier, SpliceVista does not visualize the peptides in the context of the genome. The Results Analyzer tool, part of the PG Nexus pipeline, outputs all match data as a table for analysis in relational databases or spreadsheets. The coverage and quality of human protein sequence databases are important for protein isoform discovery and validation. Databases vary in their coverage of spliced and alternatively spliced transcripts, which means that protein isoforms not included in the database cannot be identified with Mascot MS/MS searches or similar tools. Databases curated by large consortia, such as neXtProt,17 Ensembl,18 and RefSeq,19 are commonly used for MS/MS searches. The neXtProt database serves as a data integration platform for the C-HPP,20 by providing stringent curation of human protein sequences with experimental evidence of protein expression. In contrast, Ensembl contains protein isoforms predicted from transcripts that have no evidence of being translated into proteins. Due to these differences, Blakeley et al. (2010)8 suggested that the use of multiple protein sequence databases should increase the number of isoforms identified. Discovery of novel protein isoforms can also be helped using repositories, such as ECGene21, 22 and SpliceProt,23 which curate transcripts with known and predicted splice junctions. 
Custom protein sequence databases can be built de novo from RNA-seq data; this is especially useful when exactly the same tissue or cell type is then used for proteomics analysis.15, 24, 25 Prediction of protein isoforms from tissue-specific RNA-seq transcripts, using proteogenomics
or machine learning approaches, can improve coverage of tissue- or cell-specific splice junctions.25 Woo et al. (2014)24 and Sheynkman et al. (2014)25 focused on the discovery of novel exon-exon junctions using RNA-seq reads that span across splice junctions. They used 3-frame translation of the junction sequences to construct protein sequence databases and used these to identify peptide evidence for novel splice junctions. From the analysis of matching transcriptomics and proteomics data from human Jurkat cell lines, Sheynkman et al. (2014)25 identified 57 novel splice junction peptide sequences, a small proportion compared to the number of known splice junctions (~67,000).6 Other studies have focused on the analysis of alternatively spliced transcript isoforms using RNA-seq analysis of specific human tissues,1, 12 cell lines26 or undifferentiated stem cells,27, 28 but few studies have focused on the validation of transcript isoforms with proteomics data.25, 28, 29

Human mesenchymal stem cells have been extensively analysed for transcript and protein expression (e.g. 30, 31). A wide variety of alternatively spliced transcripts are present in stem cells, enabling different protein isoforms of a gene to be generated as cells differentiate into their final lineages.32 hMSC-TERT cells are a telomerised human mesenchymal stem cell line,32 which is used in the study of osteoblast differentiation. These cells are relatively well-characterised33, 34 and, compared to primary cells, allow reproducible time-course differentiation experiments and systems biology analyses. However, the protein isoforms present in hMSC-TERT cells are largely uncharacterised. Using hMSC-TERT cells as a model, here we use our PG Nexus pipeline to explore how to best predict and validate protein isoforms arising from mRNA splice variants. To enable the analysis of transcripts assembled from RNA-seq data with the PG Nexus pipeline, we developed a new tool called TranscriptCoder. We show that the TranscriptCoder-derived database, made de
novo from hMSC-derived RNA-seq data, could successfully identify a large number of isoforms when queried with hMSC-derived MS/MS data. However, use of multiple protein sequence databases, specifically Ensembl and neXtProt, was also useful to increase the number of isoforms identified. Our results provide a useful repository of protein isoforms, confirmed with high confidence, for the hMSC-TERT cell line.

MATERIALS AND METHODS

Cell culture of human mesenchymal stem cells

The methods used to culture cells from the undifferentiated and telomerized hMSC-TERT4 cell line have been described elsewhere.13, 34 Briefly, an immortalized bone marrow mesenchymal cell line was grown for 14 days in media with supplements. Cells were grown using osteogenic medium to induce differentiation. See Supporting Information 1 (Method S1) for more details on the methods used to culture cells.

RNA-seq library preparation, sequencing and data processing

Cells were collected at 8 time points post-osteoblast differentiation induction (0, 6, 12, 24, 72, 144, 216 and 288 hours). Details of the library preparation, sequencing and data processing of RNA-seq have been described elsewhere.13 Briefly, total RNA was isolated, purified and used as input for the Illumina TruSeq RNA Sample PrepKit (version 2). Samples were sequenced on a HiSeq 2000 with TruSeq v3 SBS reagents to generate 100 base paired-end reads. The quality of RNA-seq reads was assessed with FastQC (version 0.10.1).35 Reads were mapped to a reference human genome (UCSC hg19 build) using TopHat (version 2.04).36 Predicted alternatively spliced transcripts for each sample were assembled using Cufflinks (version 2.02), and merged
into a consensus GTF file using Cuffmerge.37 See Supporting Information 1 (Method S2) for more details on the RNA-seq methodology.

Protein extraction, digestion and mass spectrometry

Undifferentiated cells were harvested and protein extracts were separated by 1-D Nu-PAGE Bis-Tris gel. The resulting gel was cut into size-based slices for mass spectrometry analysis. The methods used for proteolytic digestion and the mass spectrometry have been described elsewhere.38 MS and MS/MS were performed using an LTQ Orbitrap Velos Pro (Thermo Electron, Bremen, Germany) hybrid linear ion trap and Orbitrap mass spectrometer. MS/MS was performed via collision induced dissociation (CID) and the linear ion trap was used to mass-analyse fragment ions. To increase the depth of proteome coverage, each proteolytic peptide sample was subjected to 5 rounds of LC-MS/MS using MS/MS exclusion lists. The mass spectrometry data have been deposited in the ProteomeXchange39 via the PRIDE partner repository, with dataset identifier PXD001449 and DOI 10.6019/PXD001449. See Supporting Information 1 (Method S3-5) for more details on the protein extraction, digestion and mass spectrometry methodology.

Sequence database construction

The corresponding GFF3 and accession files for all protein sequence databases can be downloaded via http://www.systemsbiology.org.au/downloads/database_and_input_files.zip.

Ensembl

Non-haplotype hg19 genomic sequences, the human protein sequence FASTA file and gene GTF file (version 75) were downloaded from the FTP server of Ensembl. Transcripts that were non-protein coding, subject to nonsense-mediated decay, encoded on haplotype chromosomes, used non-standard initiation codons or had incomplete exon-intron structure annotations were excluded from our analyses. Ensembl protein sequences from the remaining transcripts in the GTF file were used for peptide searches. Custom Perl scripts were used to create an accession file and convert the filtered gene GTF file into GFF3 format.

neXtProt

The protein information XML file and protein sequence FASTA file (September 19, 2014 release) were downloaded from the FTP server of neXtProt. Custom Perl scripts were used to extract isoform structure and neXtProt accession numbers, and format data into a GFF3 and accession file. All protein sequences in the FASTA file were used for peptide searches.

TranscriptCoder processing of RNA-seq data

The TranscriptCoder (v1.00) tool was constructed in this project. It infers the correct ORFs in mRNA transcripts assembled from RNA-seq reads, and is essential in situations where an assembled transcript is incomplete and missing its initiating methionine. This is done by finding the most frequently used exons from the filtered Ensembl GTF file to infer the translation frame of the transcript. To facilitate subsequent analyses and co-visualization in IGV, TranscriptCoder produces a gene annotation file in GFF3 format and an accession file containing the Cufflinks identifiers for each transcript; both files are compatible with other PG Nexus modules. TranscriptCoder was developed in Java as a command line tool. It is available for download via
the Bitbucket repository https://bitbucket.org/aidantay/transcriptcoder/src and is included as Supporting Information 2. TranscriptCoder was used to generate a database of de novo transcripts from RNA-seq data, including any alternatively spliced forms. Custom Perl scripts were used to remove low-quality transcripts from the assembled RNA-seq GTF file produced by Cufflinks. These included transcripts encoded on haplotype chromosomes or where the RNA strand for translation was unspecified. A minimal FPKM threshold was not used, as preliminary findings showed 1% of the isoforms had an FPKM value of less than 10 (data not shown). These transcripts may correspond to low abundance protein isoforms that can be translated into proteins. TranscriptCoder inferred protein sequences for approximately 80% of the transcripts; these were then used for peptide searches.

Three-frame translation of RNA-seq data

Filtered RNA-seq transcripts assembled by Cufflinks were translated into protein sequences in three frames with custom Java scripts and then used for peptide searches. Translation in six frames was not necessary as the translation strand for most transcripts was specified by Cufflinks. Java scripts were also used to create a gene annotation file in GFF3 format and an accession file containing identifications for the PG Nexus pipeline.
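
As an illustration of three-frame translation, the following minimal Python sketch translates a transcript in each of its three forward frames. It is not the custom Java code used in this study. The function names are hypothetical, the standard genetic code is assumed, and the input is assumed to be the spliced transcript sequence already oriented 5' to 3' on its coding strand.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame):
    """Translate seq starting at offset frame (0, 1 or 2).

    '*' marks a stop codon; 'X' marks a codon with ambiguous bases.
    """
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def three_frame(seq):
    """Return the translations of seq in frames 0, 1 and 2."""
    return [translate(seq, frame) for frame in range(3)]


print(three_frame("ATGGCCTAA"))
```

Each translation would typically be split at stop codons before being written to a FASTA database; that step is omitted here.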

MS/MS ion searches

MS/MS ion searches were performed using Mascot for all protein sequence databases. Single-nucleotide polymorphisms (SNPs) were not included in the analysis. As per all MS/MS peptide assignments, any SNPs which involve leucine to isoleucine substitutions, or vice versa, were not
resolved. Peptides with a false discovery rate (FDR) >1%, as identified using Mascot Percolator with default parameters, were removed, as were peptides that mapped to multiple genomic locations. To increase the stringency of peptide identification, we also removed peptides with a Mascot score of less than 45. A protein level FDR of 1% was applied and calculated according to Wright et al. (2012).40 See Supporting Information 1 (Method S6) for more details on the MS/MS ion search methodology.
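
The three peptide-level filters described above can be expressed compactly. The sketch below is illustrative only: the `PeptideHit` record and its field names are hypothetical, and it assumes that a Percolator-style q-value is used as the peptide FDR estimate. The protein-level FDR step is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    """Hypothetical record for one peptide identification."""
    sequence: str
    mascot_score: float
    q_value: float            # assumed Percolator-style FDR estimate
    n_genomic_locations: int  # number of genomic loci the peptide maps to


def passes_filters(hit, max_fdr=0.01, min_score=45.0):
    """Keep peptides with FDR <= 1%, Mascot score >= 45 and a single genomic location."""
    return (
        hit.q_value <= max_fdr
        and hit.mascot_score >= min_score
        and hit.n_genomic_locations == 1
    )


hits = [
    PeptideHit("AAK", 60.0, 0.005, 1),  # passes all three filters
    PeptideHit("CCK", 40.0, 0.005, 1),  # score below 45
    PeptideHit("DDK", 70.0, 0.020, 1),  # FDR above 1%
    PeptideHit("EEK", 80.0, 0.001, 2),  # maps to two genomic locations
]
kept = [h.sequence for h in hits if passes_filters(h)]
print(kept)
```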

Analysis with Samifier and Results Analyzer

Using the previously described gene annotation GFF3 and accession files, the Samifier and Results Analyzer were used to analyze the MS/MS ion search results produced by each protein sequence database search. The method used to analyse MS/MS ion search results with the PG Nexus pipeline has been described elsewhere.13 Briefly, the pipeline maps all peptides to their exact genomic co-ordinates, accounting for the positions of introns; this permits subsequent visualization and data analysis. The PG Nexus pipeline (v1.09) is currently available for download via https://github.com/IntersectAustralia/ap11_samifier. See Supporting Information 1 (Method S7 & Figure S1) for more details on the analysis workflow.
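
To make the idea of intron-aware peptide mapping concrete, the following Python sketch converts a peptide's position in a protein into genomic blocks. It is a simplified illustration and not the Samifier implementation. The function name and arguments are hypothetical, coordinates are assumed to be 1-based and inclusive, and the coding exon segments are assumed to begin at the first base of the start codon.

```python
def peptide_to_genome(cds_exons, aa_start, aa_len, strand="+"):
    """Map a peptide to genomic (start, end) blocks.

    cds_exons: coding exon segments as (start, end), 1-based inclusive.
    aa_start:  0-based index of the peptide's first residue in the protein.
    aa_len:    peptide length in residues.
    Returns blocks in ascending genomic order. More than one block means
    the peptide spans an exon-exon junction.
    """
    exons = sorted(cds_exons)
    if strand == "-":
        exons = exons[::-1]          # transcription order on the minus strand
    nt_start = aa_start * 3          # offsets along the spliced coding sequence
    nt_end = nt_start + aa_len * 3   # exclusive
    blocks, offset = [], 0
    for start, end in exons:
        length = end - start + 1
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + length)
        if lo < hi:                  # the peptide overlaps this exon
            if strand == "+":
                blocks.append((start + lo - offset, start + hi - offset - 1))
            else:
                blocks.append((end - (hi - offset) + 1, end - (lo - offset)))
        offset += length
    return sorted(blocks)


exons = [(101, 109), (201, 212)]
print(peptide_to_genome(exons, 2, 3))  # junction peptide: two blocks
print(peptide_to_genome(exons, 0, 2))  # exonic peptide: one block
```

A peptide returned as a single block corresponds to an exonic hit, whereas two or more blocks indicate that it crosses a splice junction.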

GO enrichment analysis and protein abundance

The relative protein abundance and biological processes corresponding to all identified genes in undifferentiated hMSC were investigated. All genes in the Ensembl databases were used as the background for both analyses. The whole human integrated dataset for protein abundance was downloaded from the PaxDB database41 and plotted using R statistical software.42 DAVID tools (version 6.7) was also used for performing GO enrichment analysis.43

Identification of isoform-specific and proteotypic peptides

To investigate the utility of isoform-specific and proteotypic peptides for validating hMSC-specific protein isoforms, we examined peptides from genes with 2 or more isoforms in the Ensembl and TranscriptCoder databases. Isoform-specific peptides were identified from both databases using theoretical peptide sequences generated by custom Perl scripts. Peptide sequences were based on a theoretical trypsin digest with up to 2 missed cleavages. Proteotypic peptides were identified with the PeptideSieve tool according to a 0.90 threshold and the PAGE-ESI option to indicate the use of Nu-PAGE gels and an ESI-Trap instrument.44 Mallick et al. (2007)45 have shown that using a confidence threshold of 0.90 would provide sufficient coverage of the human proteome (> 85%) with one or more proteotypic peptides.
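
The theoretical digest and the search for isoform-specific peptides can be sketched as follows. This is an illustrative Python sketch rather than the custom Perl scripts used here. The function names are hypothetical, cleavage is assumed to occur after K or R except when the next residue is P, and specificity is evaluated only within the set of sequences supplied.

```python
from collections import defaultdict


def tryptic_peptides(protein, max_missed=2):
    """Theoretical trypsin digest allowing up to max_missed missed cleavages."""
    # Cleavage sites: after K or R, unless the next residue is P (assumed rule).
    cuts = [0]
    cuts += [i + 1 for i, aa in enumerate(protein[:-1])
             if aa in "KR" and protein[i + 1] != "P"]
    cuts.append(len(protein))
    peptides = set()
    for i in range(len(cuts) - 1):
        for j in range(i + 1, min(i + 2 + max_missed, len(cuts))):
            peptides.add(protein[cuts[i]:cuts[j]])
    peptides.discard("")
    return peptides


def isoform_specific(isoforms, max_missed=2):
    """Return {isoform_id: peptides found in that isoform only}.

    isoforms: {isoform_id: protein sequence}.
    """
    owners = defaultdict(set)
    for iso_id, seq in isoforms.items():
        for pep in tryptic_peptides(seq, max_missed):
            owners[pep].add(iso_id)
    specific = defaultdict(set)
    for pep, ids in owners.items():
        if len(ids) == 1:
            specific[next(iter(ids))].add(pep)
    return dict(specific)


family = {"iso1": "MAKGGRLLK", "iso2": "MAKLLK"}
print(isoform_specific(family))
```

In this toy family, the peptides MAK and LLK occur in both isoforms and are therefore not isoform-specific, whereas MAKLLK, which arises from a missed cleavage across the exon skipped in iso2, is specific to iso2.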

Identification of subsets of isoform families with multiple proteotypic peptides

To investigate whether groups of peptides can be used collectively to identify subsets of isoform families, we looked at a subset of proteotypic peptides from genes with 2 or more isoforms in the Ensembl database (see above). For this experiment, proteotypic peptides that were present in all isoforms of a gene were removed because they cannot unambiguously identify any isoform. In doing so, only peptides that map to one or more isoforms, but not all isoforms of an isoform family, were used. For each isoform family, groups of 2 to 5 different proteotypic peptides were mapped to groups of up to 5 different isoforms. As a further constraint, the number of proteotypic peptides mapped must be equal to or greater than the number of isoforms. For each isoform family, the isoform with the highest proteotypic peptide count is most likely to be expressed.
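
One possible reading of these constraints is sketched below in Python. It is an interpretation for illustration, not the procedure used in this study: a group of peptides is taken to point to the isoforms that contain every peptide in the group, and a group is kept only if that isoform set is non-empty and no larger than the group itself. The function name and input layout are hypothetical.

```python
from itertools import combinations


def discriminating_groups(pep_isoforms, n_isoforms,
                          min_size=2, max_size=5, max_isoforms=5):
    """Find peptide groups that jointly narrow an isoform family.

    pep_isoforms: {peptide: set of isoform ids containing it} for one family.
    n_isoforms:   total number of isoforms in the family.
    Returns a list of (peptide group, isoforms containing all of them).
    """
    # Peptides present in every isoform cannot discriminate and are dropped.
    informative = {p: frozenset(ids) for p, ids in pep_isoforms.items()
                   if len(ids) < n_isoforms}
    groups = []
    for k in range(min_size, max_size + 1):
        for combo in combinations(sorted(informative), k):
            shared = frozenset.intersection(*(informative[p] for p in combo))
            # Require at least as many peptides as isoforms in the subset.
            if shared and len(shared) <= min(k, max_isoforms):
                groups.append((combo, shared))
    return groups


family = {
    "p1": {"A", "B"},
    "p2": {"A", "C"},
    "p3": {"A", "B", "C"},  # present in all isoforms, so ignored
    "p4": {"B"},
}
for group, isoforms in discriminating_groups(family, 3):
    print(group, sorted(isoforms))
```

Here neither p1 nor p2 is isoform-specific, yet observing both together is consistent only with isoform A.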

RESULTS

Open reading frame identification in assembled RNA-seq transcripts

Figure 1: Overview of TranscriptCoder algorithm. Many transcripts assembled from RNA-seq data, including splice variants, are not full length and thus lack canonical start and stop codons. Instead, they can show multiple apparent start and stop codons. TranscriptCoder uses the translation frame of a highly used exon, inferred by comparison with reference transcripts, to determine the correct open reading frame of one portion of the transcript. This is then propagated upstream and downstream, allowing the correct frame, and correct start and stop codons if present, to be confirmed.

Translating assembled transcripts from RNA-seq data in three or six frames unnecessarily increases database size, as only one sequence will encode the protein.25 Yet many transcripts assembled from RNA-seq are fragments that lack start and/or stop codons, making it a challenge to identify the correct reading frame and corresponding protein sequences.13 To address this, we developed the TranscriptCoder tool, which infers the correct ORFs in mRNA transcripts assembled from RNA-seq reads. An overview of the TranscriptCoder algorithm is shown in Figure 1. For each transcript, the tool identifies the translation frame of the most frequently used exons from a set of reference transcripts. Using the translation frame of the highly used exons, the correct ORF and, if present, the correct start and stop codons of novel transcripts can be inferred. In the event of a frame-shift, where the highly used exon has more than one translation frame, the transcript is translated in all possible frames. The TranscriptCoder only recognizes standard start codons, as non-standard initiation codons are difficult to resolve. However, only a
handful of isoforms have non-standard initiation codons in humans.46 To ensure only high confidence isoforms are generated for MS/MS ion searches, the TranscriptCoder removes transcripts for which the protein sequence cannot be inferred. This can arise if transcripts do not contain exons in the reference, or contain exons from 5’- or 3’-untranslated regions. To run the TranscriptCoder tool, the following input files are required: 1) a reference gene file containing the exons, start and stop codons for each transcript in gene transfer format (GTF), 2) transcripts assembled from RNA-seq using Cufflinks (or similar tools) containing exons for each transcript in gene transfer format (GTF), 3) a genome sequence in FASTA format, and 4) a translation table that maps codons to amino acids and stop codons. The TranscriptCoder outputs a custom protein sequence database in FASTA format, which can be used for proteomic searches with search engines such as Mascot. The tool also outputs a gene annotation file in GFF3 format and an accession file that is compatible with the PG Nexus pipeline, enabling co-visualization in IGV and downstream analysis.16 We used the TranscriptCoder tool to translate transcripts from undifferentiated hMSC-TERT4 stem cells, assembled from RNA-seq data. This resulted in the generation of a database with 77,107 protein entries, mapping back to 76,462 transcripts.
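
The frame-inference idea described above can be sketched in a few lines of Python. This is a simplified illustration and not the TranscriptCoder source, which is written in Java. It assumes a plus-strand transcript, exact coordinate matches between assembled and reference exons, and a hypothetical `reference_exons` lookup holding each reference exon's GTF phase and the number of reference transcripts that use it. The frame-shift case, in which an exon has more than one translation frame, is not handled.

```python
def infer_frame(transcript_exons, reference_exons):
    """Infer the reading-frame offset (0, 1 or 2) of an assembled transcript.

    transcript_exons: [(start, end), ...], 1-based inclusive, in 5' to 3' order.
    reference_exons:  {(start, end): (phase, usage)}, where phase is the number
                      of bases to skip to reach the first complete codon and
                      usage counts the reference transcripts using the exon.
    Returns None when no exon matches the reference.
    """
    best = None   # (usage, frame) of the most frequently used matching exon
    offset = 0    # position of the current exon in the spliced transcript
    for start, end in transcript_exons:
        ref = reference_exons.get((start, end))
        if ref is not None:
            phase, usage = ref
            if best is None or usage > best[0]:
                # Propagate the exon's phase back to the transcript start.
                best = (usage, (offset + phase) % 3)
        offset += end - start + 1
    return None if best is None else best[1]


reference = {(100, 109): (1, 3), (200, 210): (2, 5)}
print(infer_frame([(100, 109), (200, 210)], reference))
```

The returned offset identifies which of the three forward frames to translate, so only one protein sequence per transcript needs to be written to the database.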

Identification of protein isoform families from four databases

The proteome of undifferentiated hMSC-TERT4 stem cells was analysed by 1-D PAGE, followed by MS/MS analysis. Five rounds of exclusion list analysis were undertaken to increase depth of coverage. To understand the differences between databases, and their suitability for isoform discovery in our stem cells, MS/MS data was matched against the commonly used public databases Ensembl and neXtProt (See Supporting Information 1, Results S1 & Table S1).
These databases were assessed as high quality on the basis of their curation methods, annotation of exon-exon junctions and untranslated regions, experimental evidence and database completeness. MS/MS data was also matched against the TranscriptCoder-derived database (above), and a 3-frame translation database of the assembled RNA-seq hMSC-TERT4 transcripts. A summary of all databases used for our analyses, and a comparison of their merits, is provided in Supporting Information 1 (Results S2 & Table S2). An example of an identified protein, showing peptides and relevant RNA-seq reads is shown in Figure 2.

Figure 2: Visualization of peptides and RNA-seq reads for the ribosomal S2 protein (ENSG00000140988) in IGV. a) Gene architecture of ribosomal S2 protein transcripts. b) RNA-seq reads corresponding to alternatively spliced mRNA transcripts assembled using Cufflinks. c) Peptides identified from MS/MS ion searches mapping to 4 alternatively spliced transcripts (ENST00000343262, ENST00000526522, ENST00000529806 and ENST00000530225).

The Results Analyzer, from the PG Nexus,13 was used to analyse and compare the number of isoform families identified in all databases. For this, we defined an isoform as a protein with a unique set of exons. We defined ‘identification’ on the basis of 2 or more peptide hits anywhere in an isoform’s sequence; note however that a peptide can match against more than one isoform of one protein (since many isoforms contain the same exons). We defined ‘isoform families’ as spliced or alternatively spliced isoforms from the same gene. Stringent filters were applied to include only high confidence peptides from all databases (See Supporting Information 1, Results S3 & Table S3 and S4). The Ensembl database yielded the highest number of isoform members of families, with 5,619 isoforms from 2,324 isoform families (Figure 3). It was therefore used as
a reference for all comparisons. The TranscriptCoder-derived database yielded 52 fewer isoforms than the Ensembl database, with a total of 5,567. This was followed by the neXtProt database, which yielded 5,528 isoforms, and the 3-frame translation database that yielded the smallest number of isoforms (total 5,402). A similar analysis based on genes is shown in Supporting Information 1 (Figure S2).

Figure 3: Venn diagram showing the number of isoforms that were identified by MS/MS ion searches against four different databases. The numbers in brackets, below each database name, represent the number of isoforms identified. The majority of isoforms were identified from all four databases.

A total of 5,670 protein isoform members of families were identified from all sequence databases. Interestingly, almost all of these isoforms (5,239 or 93%) were identified in all four protein sequence databases, providing unequivocal evidence for their existence. However, no single database, including the TranscriptCoder or a 3-frame database generated from matching RNA-seq data, yielded any remarkable number of unique isoforms (ranging from 20 from neXtProt to 0 from the 3-frame database). By contrast, there was a difference in the degree to which each database missed isoforms present in others. Ensembl missed 51 isoforms and the TranscriptCoder database 103, whereas neXtProt and the 3-frame database missed 142 and 268 isoforms respectively. The MS/MS searches of the Ensembl database did identify peptides for all 51 isoforms missed, yet they were removed as per our filtering criteria (see Methods). Overall, these results highlight the advantages of using different approaches to identify isoform members
ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 55

of families and demonstrate the value of RNA-seq data when analysed de novo for splice variants and then processed with the TranscriptCoder. The isoforms identified in hMSC-TERT4 showed a bias towards high abundance proteins (See Supporting Information 1, Figure S3). This was expected as high abundance proteins are more likely to be identified than low abundance proteins with MS/MS.41 GO enrichment analysis on identified isoforms found that transcripts were associated with the biological processes including translation, protein transport, protein catabolism, RNA processing and transport (See Supporting Information 3).43 This underscores the challenge of identifying cell-specific isoforms, which are likely to be involved in stem-cell specific functions and thus of high biological interest.
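The database comparison reported in this section reduces to simple set arithmetic. The sketch below uses hypothetical function names and assumes one set of isoform identifiers per database. It shows how shared, unique and missed isoforms could be derived, and it checks that the ‘missed’ counts quoted above equal the combined total minus each database's own total.

```python
def compare_databases(isoforms_by_db):
    """Summarise overlap between databases (illustrative only).

    isoforms_by_db: dict mapping database name -> set of isoform identifiers.
    Returns (union, common, unique_per_db, missed_per_db).
    """
    all_sets = list(isoforms_by_db.values())
    union = set().union(*all_sets)           # identified in at least one database
    common = set.intersection(*all_sets)     # identified in every database
    unique, missed = {}, {}
    for db, ids in isoforms_by_db.items():
        others = set().union(
            *(s for name, s in isoforms_by_db.items() if name != db)
        )
        unique[db] = ids - others            # found only in this database
        missed[db] = union - ids             # found elsewhere but not here
    return union, common, unique, missed


def missed_from_totals(total, identified_per_db):
    """Missed isoforms per database, computed from counts alone."""
    return {db: total - n for db, n in identified_per_db.items()}


# Counts as reported in the text above.
REPORTED_TOTAL = 5670
REPORTED_IDENTIFIED = {
    "Ensembl": 5619,
    "TranscriptCoder": 5567,
    "neXtProt": 5528,
    "3-frame": 5402,
}
```

With the reported counts, `missed_from_totals(REPORTED_TOTAL, REPORTED_IDENTIFIED)` gives 51, 103, 142 and 268 for Ensembl, TranscriptCoder, neXtProt and the 3-frame database respectively, in agreement with the values stated in the text.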

Validation of exons and splice junctions in identified protein isoform families

Having established above that protein isoform families could be identified in the undifferentiated hMSC-TERT4 stem cells, we then investigated precisely where the high confidence peptides map with respect to exons and exon-exon boundaries in all isoform families. To do this, we used the Results Analyser in the PG Nexus,13 which calculates the exact genomic coordinates of peptides. A peptide hit to an exon (exonic peptide) was one where a peptide mapped to an exon but did not span a splice junction. A peptide hit to a splice junction (junction peptide) was one where a peptide was found to span an exon-exon junction. We analysed instances where there were 2 or more high confidence peptides per isoform. We calculated the proportion of exons and splice junctions validated per isoform, and compared their distributions among protein sequence databases, using Ensembl protein isoforms as a reference.

A total of 8,014 isoform family-specific exons and 3,877 isoform family-specific splice junctions were validated in the undifferentiated hMSC-TERT4 stem cells (Figure 4). Of these, there were 7,026 non-redundant exons and 3,361 non-redundant splice junctions common to all databases. The Ensembl database yielded the highest combined number of exons and splice junctions, with 7,842 and 3,761 respectively. The TranscriptCoder-derived database yielded comparable results, validating 28 fewer exons, but 9 more splice junctions. The neXtProt database had less coverage, with 7,600 exons and 3,632 splice junctions, and the 3-frame translation database yielded 7,482 exons and 3,623 splice junctions, the fewest overall. Combining the results from searching several protein sequence databases was useful, as it increased the total number of exons and splice junctions validated. Specifically, a further 172 exons and 116 splice junctions were validated by matching data against the non-Ensembl databases.

Frame-shifted exons occur when exons are incorporated into transcripts in more than one translation frame. This can result in transcripts losing protein-coding potential. From the isoforms identified in Ensembl, we found 33 frame-shifted exons among alternatively spliced transcripts of 30 genes (see Supporting Information 4). For all 33 frame-shifted exons, as expected, only one translation frame was validated with peptides. Interestingly, when cross-compared against the current Ensembl database (version 79), we found that 14 isoforms from 14 genes with frame-shifted exons spliced in the non-validated frame had been declared deprecated. Together, these results highlight the capacity of proteogenomics to ‘red flag’ isoforms in Ensembl that may require further revision.

Through the analysis of peptides that span alternatively spliced sites, we validated 19 alternative exon combinations for 9 genes (Supporting Information 4). Interestingly, 6 of these genes were associated with actin and myosin function, enabling us to identify the isoforms of actin and myosin present in hMSC-TERT4 cells.
The other 3 genes were involved in translation, protein transport and the regulation of apoptosis. Overall, spliced and alternatively spliced junctions were more difficult to validate than exons (see Supporting Information 1, Figure S4). The average proportion of spliced and alternatively spliced junctions validated per gene was 17%, compared to 30% of exons validated. Our data suggest that the relative scarcity of junction peptides will be a limiting factor for the validation of protein isoforms.
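The distinction between exonic and junction peptides used in this section can be sketched as follows. This is a simplified illustration, not the coordinate calculation performed by the Results Analyser. It assumes that each exon is described only by the number of coding nucleotides it contributes, that translation starts at the first coding nucleotide, and that peptide positions are given as residue indices in the protein. The function name is hypothetical.

```python
def classify_peptide(exon_cds_lengths, aa_start, aa_end):
    """Classify a peptide as 'exonic' or 'junction' (illustrative only).

    exon_cds_lengths: coding nucleotides contributed by each exon, in
        transcript order (assumed input layout).
    aa_start, aa_end: 0-based, end-exclusive residue positions of the
        peptide within the protein.
    Returns (label, list of exon indices overlapped by the peptide).
    """
    nt_start = aa_start * 3      # first coding nucleotide of the peptide
    nt_end = aa_end * 3          # one past its last coding nucleotide

    touched = []
    offset = 0
    for index, length in enumerate(exon_cds_lengths):
        exon_start, exon_end = offset, offset + length
        # Half-open interval overlap test between peptide and exon.
        if nt_start < exon_end and nt_end > exon_start:
            touched.append(index)
        offset = exon_end

    if not touched:
        raise ValueError("peptide lies outside the coding sequence")
    label = "exonic" if len(touched) == 1 else "junction"
    return label, touched
```

For example, with exons contributing 30, 45 and 60 coding nucleotides, a peptide covering residues 8 to 13 overlaps the first two exons and is therefore a junction peptide, whereas one covering residues 10 to 14 lies entirely within the second exon.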

Figure 4: The number of a) exons and b) splice junctions from isoforms that were validated with 2 or more peptides. Venn diagrams show the results of MS/MS ion searches against four different databases. The numbers in brackets represent the total exons or splice junctions validated for each database.

Protein isoforms can be identified with isoform-specific proteotypic peptides

In the two prior sections, we predominantly worked at the level of isoform families. Individual isoforms can be unambiguously identified in some cases, but this is much more challenging. It requires the existence of peptides that are ‘isoform-specific’ in that they can distinguish between alternatively spliced transcripts of the same gene.10 Ideally, these peptides should also be ‘proteotypic’, that is, seen repeatedly in replicate mass spectrometry analyses of a single protein, to facilitate the identification of the protein isoforms.45 Using the Results Analyser in the PG Nexus,13 we aimed to identify isoform-specific proteotypic peptides in the hMSC-TERT4 protein set. Proteotypic peptides in the MS/MS database matches were first identified using the PeptideSieve software.45 Isoform-specific peptides were next found by examination of theoretical peptide sequences in two databases, Ensembl and TranscriptCoder, which were used as they performed best in the above analyses. The differences in exon usage in Ensembl and TranscriptCoder-derived sequences dictated that the isoform-specific peptides had to be determined separately for each database. Isoform-specific proteotypic peptides in the hMSC-TERT4 cells were finally identified by comparing the peptides in the proteotypic and isoform-specific peptide lists. The resulting peptides are labelled in Supporting Information 5.
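The three steps described above (theoretical digestion, selection of peptides unique to one isoform, and intersection with the proteotypic list) can be sketched in Python. This is an illustration under stated assumptions rather than the procedure implemented in the Results Analyser: trypsin is modelled as cleaving after K or R unless followed by P, the minimum peptide length of 6 is an arbitrary choice, the proteotypic peptides are supplied as a ready-made set (PeptideSieve is not called), and all function names are hypothetical.

```python
def cleave(protein):
    """Split a protein after K or R, except when the next residue is P."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and following != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


def tryptic_peptides(protein, missed_cleavages=2, min_length=6):
    """Theoretical tryptic peptides with up to `missed_cleavages` missed sites."""
    fragments = cleave(protein)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages + 1, len(fragments))
        for j in range(i, last):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:   # min_length is an assumption
                peptides.add(peptide)
    return peptides


def isoform_specific_peptides(family):
    """family: dict isoform_id -> protein sequence, all from one gene."""
    digests = {iso: tryptic_peptides(seq) for iso, seq in family.items()}
    specific = {}
    for iso, peptides in digests.items():
        others = set().union(
            *(p for other, p in digests.items() if other != iso)
        )
        specific[iso] = peptides - others    # seen in this isoform only
    return specific


def isoform_specific_proteotypic(specific, proteotypic):
    """Keep only isoform-specific peptides that are also proteotypic."""
    return {iso: peptides & proteotypic for iso, peptides in specific.items()}
```

Since the result depends entirely on which isoform sequences are present in `family`, running the same sketch on the Ensembl and TranscriptCoder-derived sequences of one gene can give different isoform-specific peptides, which is why the two databases had to be treated separately.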

Figure 5: Venn diagrams showing the number of isoforms that could be identified with isoform-specific proteotypic peptides in the MS/MS data. Isoforms were identified by MS/MS ion searches against the Ensembl and TranscriptCoder databases. a) The number of isoforms present in both databases. b) The number of isoforms present only in the Ensembl or TranscriptCoder database, but not in both.

Using isoform-specific proteotypic peptides, we identified 445 isoforms in the Ensembl database and 420 in the TranscriptCoder-derived database (Figure 5a). Of these, only 151 isoforms (21%) were present in both databases, due to the presence of different sequences, exons and splice events in different databases. An example of this is given in Figure 6, which shows three genes with isoform-specific peptides that are database-dependent. Peptides that were isoform-specific in the Ensembl database were not necessarily isoform-specific in the TranscriptCoder database, and vice versa. Isoform-specific proteotypic peptides must therefore be checked in all relevant databases before they can be reliably used to validate a specific alternatively spliced transcript.

Figure 6: The specificity of an isoform-specific peptide can be database-dependent. To facilitate visualisation, peptides have been coloured in red. For all screenshots: i) Ensembl gene architecture, ii) peptides found with MS/MS ion searches against the Ensembl database, iii) TranscriptCoder gene architecture and iv) peptides found with MS/MS ion searches against the TranscriptCoder database. a) A peptide (RAGELTEDEVER) in ribosomal protein S18 (ENSG00000231500) is isoform-specific in the Ensembl database but not in the RNA-seq TranscriptCoder database. b) A peptide (FSNQETCVEIGESVR) in phosphoribosyl pyrophosphate synthetase 1 (ENSG00000147224) is isoform-specific in the RNA-seq TranscriptCoder database but not in the Ensembl database. c) A peptide (AAGVNVEPFWPGLFAK) in ribosomal protein, large, P1 (ENSG00000137818) is isoform-specific in both Ensembl and RNA-seq TranscriptCoder databases.

Using the isoform-specific proteotypic peptides, we found evidence for an additional 22 isoforms that were present in only 1 database (Figure 5b). This includes 12 isoforms in the Ensembl database and 10 isoforms in the TranscriptCoder-derived database. Since hMSC-specific isoforms are likely to be present in RNA-seq data, we examined the 10 isoforms in the TranscriptCoder-derived database. Interestingly, of the 10 isoforms, we identified 2 candidate hMSC-specific isoforms with isoform-specific peptides mapping to RNA-seq specific exons. These isoforms belong to the genes encoding the fragile X mental retardation syndrome-related protein 1 (FXR1) and the dihydropyrimidinase-related protein 2 (DPYSL2). For the FXR1 protein isoform, we identified a peptide of 16 amino acids that corresponds to the 3’ end of the intron between the 13th and 14th exons, and part of the 14th exon (Figure 7a). This suggests that the intronic region could be protein coding or is a misassigned exon, in contrast to the gene model in Ensembl. For the DPYSL2 protein isoform, we identified a peptide in the 5’-end of the transcript (Figure 7b). This is in contrast to Ensembl, which marks the same region as a 5’-UTR. Therefore, we have evidence that the previously annotated 5’-UTR region is in fact a translated exon consisting of 118 amino acids. The isoforms we have found vary in sequence compared to Ensembl, and therefore could be added into the database as novel hMSC-specific isoforms, or used to update incorrectly annotated isoforms. The protein sequences for the hMSC-specific isoforms of DPYSL2 and FXR1 are included in Supporting Information 6.

Figure 7: Visualisation of novel hMSC-specific isoforms identified in the TranscriptCoder-derived database with isoform-specific peptides in IGV. Other isoforms in the TranscriptCoder-derived database that were not validated with isoform-specific peptides were removed to facilitate visualisation. a) A novel isoform for the fragile X mental retardation syndrome-related protein 1 (ENSG00000114416) was identified with an isoform-specific peptide (highlighted in the red box) spanning the intron between the 13th and 14th exons and part of the 14th exon. The sequence of this peptide is GYATDESTVSSVQGSR. b) One novel isoform for the dihydropyrimidinase-like 2 protein (ENSG00000092964) was identified with an isoform-specific peptide (highlighted in the red box) in the 5’-end of the transcript. The sequence of this peptide is TIDFDSLSVGR.

Using groups of proteotypic peptides to identify subsets of isoform families

Many protein isoforms do not have peptides that allow for their unambiguous identification from databases. Indeed, a theoretical analysis showed that 25% (9,992/39,637) of alternatively spliced transcripts in the Ensembl database do not have any isoform-specific tryptic peptides (with up to two missed tryptic cleavages; data not shown). In such cases, it remains very useful to narrow the identification of a protein down to the smallest possible number of its isoforms. Accordingly, we investigated whether groups of our experimentally-determined proteotypic peptides from the undifferentiated hMSC-TERT4 cells could be used to identify proteins as members of a subset of an isoform family. As stated above, an isoform family contains spliced or alternatively spliced isoforms from the same gene.

An example of how groups of experimentally-derived proteotypic peptides can be used to narrow down the identity of a protein to a subset of an isoform family is shown in Figure 8. Three proteotypic peptides mapped to 4 transcript isoforms of the pyruvate kinase, muscle protein. No individual peptide could uniquely identify an isoform (and none was therefore, by definition, ‘isoform-specific’). The presence of peptides A and C would be evidence for the expression of at least 1 isoform, with the most likely isoform being ENST00000319622 as it carries both peptides. However, it is also possible that peptides A and C are not from isoform ENST00000319622 but are seen due to the presence of ENST00000568883 and ENST00000335181. Similarly, identification of peptides B and C would most likely be evidence for the expression of isoform ENST00000335181; however, these peptides could also be seen due to the presence of ENST00000319622 and ENST00000449901.
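The reasoning in the pyruvate kinase example can be made explicit with a short Python sketch. The peptide-to-isoform map below is an assumption: it is the simplest assignment consistent with the description in the text and was not read from Figure 8. The function names are hypothetical.

```python
from itertools import combinations

# Assumed mapping, inferred from the text (not taken from Figure 8).
PEPTIDE_TO_ISOFORMS = {
    "A": {"ENST00000319622", "ENST00000568883"},
    "B": {"ENST00000335181", "ENST00000449901"},
    "C": {"ENST00000319622", "ENST00000335181"},
}


def isoforms_carrying_all(observed, peptide_to_isoforms):
    """Isoforms that carry every observed peptide (most likely identity)."""
    return set.intersection(*(peptide_to_isoforms[p] for p in observed))


def peptides_by_isoform(peptide_to_isoforms):
    """Invert the map: isoform -> set of peptides it carries."""
    carried = {}
    for peptide, isoforms in peptide_to_isoforms.items():
        for isoform in isoforms:
            carried.setdefault(isoform, set()).add(peptide)
    return carried


def alternative_pairs(observed, peptide_to_isoforms):
    """Pairs of isoforms that jointly explain the observed peptides,
    where neither isoform explains them on its own."""
    observed = set(observed)
    carried = peptides_by_isoform(peptide_to_isoforms)
    pairs = set()
    for a, b in combinations(sorted(carried), 2):
        alone = observed <= carried[a] or observed <= carried[b]
        if not alone and observed <= (carried[a] | carried[b]):
            pairs.add(frozenset((a, b)))
    return pairs
```

Under the assumed mapping, peptides A and C are jointly carried only by ENST00000319622, while the pair ENST00000568883 and ENST00000335181 offers the alternative explanation described in the text. Peptides B and C behave in the same way for ENST00000335181.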

Figure 8: Co-visualization of the gene architecture of pyruvate kinase, muscle protein (ENSG00000067225) with peptides in IGV. Only experimentally-derived proteotypic peptides are shown in the visualization. Peptides A, B and C are shown to map to more than 1 isoform and therefore cannot be used alone to uniquely identify an isoform. However, the peptides can be used in combination to identify a protein as a member of a subset of an isoform family. For more details, see text.

Following the above, we developed a method to systematically find combinations of two or more proteotypic peptides that could identify subsets of isoform families. We analysed groups of up to 5 different proteotypic peptides, and counted the number of isoforms that these could identify. We excluded subsets of families where the number of transcripts was greater than the number of proteotypic peptides, as these were not sufficiently discriminating. Groups of more than 5 proteotypic peptides were not analysed, to avoid combinatorial explosion. In the hMSC-TERT4 data, a total of 806 alternatively spliced transcripts from 386 isoform families could be identified using combinations of up to 5 proteotypic peptides (Figure 9). There were 82 cases where one protein isoform had the highest number of proteotypic peptide hits and was the likely isoform identity. Interestingly, 62 of these cases did not have isoform-specific peptides, illustrating that the use of combinations of peptides can help accurately identify isoforms. There were 162 cases for which two or more protein isoforms in an isoform family shared an equal number of proteotypic peptides; a unique isoform identity could therefore not be assigned. Finally, there were 142 cases where two or more proteotypic peptides mapped to one isoform only (where these peptides were thus isoform-specific), allowing unambiguous assignment of an isoform. Overall, our analysis showed that groups of carefully selected peptides can be defined to help identify protein isoforms using targeted MS/MS approaches, such as inclusion lists or SRMs. A list of the Ensembl isoforms identified with multiple proteotypic peptides is included in Supporting Information 7.
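One possible reading of the combinatorial analysis above is sketched below in Python. Several points are assumptions rather than details given in the text: a group's ‘subset’ is taken to be the isoforms that carry every peptide in the group, the three outcome categories are tested in a fixed order so that they are mutually exclusive, and the function names and category labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def discriminating_groups(peptide_to_isoforms, max_group_size=5):
    """Enumerate groups of 2..max_group_size peptides and the isoforms
    that carry every peptide in the group (assumed definition)."""
    groups = []
    peptides = sorted(peptide_to_isoforms)
    for size in range(2, max_group_size + 1):
        for group in combinations(peptides, size):
            shared = set.intersection(
                *(peptide_to_isoforms[p] for p in group)
            )
            # Discard groups matching more transcripts than peptides.
            if shared and len(shared) <= size:
                groups.append((group, shared))
    return groups


def classify_family(peptide_to_isoforms):
    """Assign a non-empty isoform family to one of three outcome categories."""
    hits = Counter()       # proteotypic peptide hits per isoform
    specific = Counter()   # peptides mapping to that isoform only
    for isoforms in peptide_to_isoforms.values():
        for isoform in isoforms:
            hits[isoform] += 1
        if len(isoforms) == 1:
            specific[next(iter(isoforms))] += 1

    # Assumed precedence: isoform-specific evidence is checked first.
    if any(count >= 2 for count in specific.values()):
        return "unambiguous"
    best = max(hits.values())
    top = [isoform for isoform, count in hits.items() if count == best]
    return "most_likely" if len(top) == 1 else "unresolved"
```

Capping the group size at 5 keeps the enumeration tractable, because the number of groups grows combinatorially with the number of proteotypic peptides per gene. The three categories correspond to the 142, 82 and 162 families reported above, which together account for all 386 families.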

Figure 9: Groups of up to 5 proteotypic peptides were used to identify specific isoforms of alternatively spliced genes in hMSC-TERT4-derived proteins. This identified 386 protein isoform families with 2 or more members. Eighty-two isoform families had one protein isoform with the highest number of proteotypic peptide hits, thus defining the isoform most likely to be present. The exact isoforms present could not be resolved for 162 isoform families, since their transcript isoforms shared too many peptides. A total of 142 isoform families had proteotypic peptides that mapped to a unique isoform; these cases had two or more isoform-specific peptides.

DISCUSSION

In this study, using proteogenomic analysis and our PG Nexus pipeline, we generated unequivocal peptide evidence for protein isoforms, exons and splice junctions in hMSC-TERT4 cells. To assist in this, we developed the TranscriptCoder tool, which predicts ORFs in transcripts assembled from RNA-seq reads. The use of the TranscriptCoder-derived database gave comparable results to use of the Ensembl database in terms of the number of protein isoforms identified with peptide evidence. Finally, we introduced the concept of using combinations of proteotypic peptides to provide high-confidence identification of isoforms, and showed that this could specifically identify isoforms in cases where there were no isoform-specific peptides.

Protein isoforms in the hMSC proteome

The relatively small number of cell-specific isoforms identified in this study raises the question of how many alternatively spliced transcript isoforms there are per gene.8 By analysing protein matches from the Ensembl database that had isoform-specific proteotypic peptide(s), we confirmed 457 isoforms from 455 alternatively spliced genes, which is close to 1 isoform per gene in hMSC-TERT4 cells. In comparison, Kim et al. (2014)6 identified 1.17 isoforms per gene among 30 types of tissues, with 2,861 protein isoforms from 2,450 alternatively spliced genes identified using isoform-specific peptides. The above suggests that there is roughly one, easily detected, protein product per alternatively spliced gene in the human proteome. Our results are consistent with large-scale transcriptomics1, 47 and proteomics48 analyses that found most genes express one canonical isoform, while minor isoforms are likely to be alternatively spliced, tissue- or cell-specific and/or low in abundance.49

From the 10 isoforms identified only in the TranscriptCoder-derived database, we identified peptides that may correspond to hMSC-specific exons of DPYSL2 and FXR1. The DPYSL2/CRMP2 gene is known to be involved in the regulation of cytoskeletal structure,50 neuronal development51 and endocytosis.52 The alternatively spliced exon at the N-terminus of our hMSC-specific DPYSL2 isoform has been identified previously in a colon carcinoma cell line.53 This N-terminal exon binds to the active form of Rho-associated protein kinases (ROCK I and II) and thereby inhibits ROCK regulated carcinoma cell migration and invasion.53 The C-terminal segment of the DPYSL2 isoform contains several glycogen synthase kinase 3 (GSK3) phosphorylation sites, and phosphorylation mimetic mutants of these sites reduce ROCK II-CRMP-2 interaction.53 Consistent with the function of a mesenchymal stem cell, FXR1 is an mRNA-binding protein involved in embryonic development.54 We identified isoform-specific peptides for a possible hMSC-specific FXR1 protein isoform, which also contains a putative GSK3 phosphorylation motif within the hMSC-specific exon.55 Interestingly, the inhibition of GSK3 kinase activity has been shown to increase osteoblastic bone formation in hMSC.56 The above evidence suggests alternative splicing and phosphorylation by GSK3 may play a role in regulating the function of hMSC-specific DPYSL2 and FXR1 isoforms and in hMSC differentiation.

Analysis with the TranscriptCoder tool

This paper presents a new tool, the TranscriptCoder. This tool infers the ORFs of transcripts assembled from RNA-seq data, by reference to the most frequently used exons in the transcripts of those genes. The most highly used exon is likely to be the one that is most evolutionarily conserved;57 this has a high chance of conserving the translation frame for all other isoforms, including the dominant isoform of the gene.48 The tool allows protein isoforms to be co-validated with RNA-seq and proteomic data from the same sample, serving as two parallel lines of evidence for confident isoform validation.25 Our results showed that the TranscriptCoder-derived database performed as well as the Ensembl database, in that similar numbers of protein isoforms, exonic and junction peptides could be validated using either database. This is impressive as it shows that a de novo approach, based on RNA-seq, can rival that of a heavily curated resource that took many man-years to generate. Using TranscriptCoder, and proteomic data, we were able to validate alternatively spliced transcripts and the correct translation frame for exons found in ambiguous frame-shifted transcripts.

A major application of proteogenomics is the validation of spliced and alternatively spliced transcripts for transcriptomes from species other than human. The TranscriptCoder will be of great use in such projects as it can be used to analyse transcriptome data for any species. For this, however, it requires a draft or completed genome sequence and a file that contains exon annotations and their genomic locations. For validating draft transcriptomes of organisms with largely unannotated genomes, searching LC-MS/MS mass spectra against databases derived from 3-frame translation of RNA-seq transcripts remains a useful, alternative approach.58 However, 3-frame translations can lead to false positive matches, since only one translation frame codes for protein in the majority of transcripts59 and not all transcript sequences from RNA-seq are coding RNA.60 To resolve ambiguities, ribosome profiling can be used to identify coding mRNA sequences and their translation start sites and aid the validation of alternative translation start sites, alternatively spliced transcripts and small ORFs.61-63

Current large-scale studies of the human proteome6, 64 still lag behind transcriptomic studies (e.g. the ENCODE project) with respect to the number of spliced and alternatively spliced isoforms identified.1 Rather than aggregating data on the level of genes, human proteomics experiments will benefit from leveraging transcriptomics resources to discover tissue- or cell-specific protein isoforms and their functions.65-67 Djebali et al. (2012) suggested that the majority of novel transcripts actually lack protein-coding potential.1 This raises questions as to whether all RNA-seq derived isoforms are reliable. Transcripts assembled from short RNA-seq reads can lead to false-positives.68 High read-depth analysis of the transcriptome using long-read technology will therefore help determine whether distant splice sites are present in the same or different transcripts.69 In the same way, analysis of intact proteins using top-down mass spectrometry will also be useful for unambiguous identification of protein isoforms derived from alternatively spliced transcripts.70
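The 3-frame translation approach discussed in this section can be illustrated with a minimal Python sketch. It is not the procedure used to build the 3-frame database in this study: it assumes the standard genetic code, translates the forward strand only, and the minimum segment length of 7 residues and all function names are hypothetical choices.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered T, C, A, G at each position.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(sequence, frame):
    """Translate one forward frame (0, 1 or 2); '*' marks a stop codon."""
    sequence = sequence.upper().replace("U", "T")
    codons = (
        sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)
    )
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(sequence):
    """Return the translations of the three forward frames."""
    return [translate(sequence, frame) for frame in range(3)]


def open_segments(protein, min_length=7):
    """Stop-free segments long enough to search (min_length is assumed)."""
    return [
        segment for segment in protein.split("*")
        if len(segment) >= min_length
    ]
```

Every transcript contributes three candidate translations, of which at most one is usually the true coding frame. This is the source of the larger search space and the false positive matches noted above.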

More protein isoforms can be identified using multiple protein sequence databases

The quality of the reference protein sequence database used in proteomic searches has a major impact on the number and types of proteins that can be identified.25 The presence or absence of relevant SNPs can also affect isoform identification, especially those that are at or proximal to splice sites. Blakeley et al. (2010) suggested that combining the results from many database searches would increase the total number of isoforms identified.8 This was supported by our comparative analyses. In our study, the Ensembl database validated the most isoforms, exons, and splice junctions overall. Although Ensembl does not provide experimental evidence to support its entries, it is a comprehensive database with well annotated transcripts. In the absence of any matching RNA-seq data, we recommend Ensembl as a first choice for identifying spliced isoforms or validating exon-exon junctions with human proteomic datasets. While neXtProt lacks the same degree of protein isoform diversity as Ensembl, neXtProt will expand as more isoforms are confirmed as part of the Human Proteome Project. In-depth annotation from neXtProt such as localisation, expression and SNPs may also be useful for identifying tissue-specific isoforms.

Multiple proteotypic peptides can identify protein isoforms

An important issue for the analysis of the human proteome is how to accurately identify transcript isoforms of alternatively spliced genes. Identification can be trivial if all proteotypic peptide hits map to a single isoform (i.e. where proteotypic peptides are isoform-specific). In cases where there are many shared exons among a gene’s alternatively spliced transcripts, we showed that groups of two or more proteotypic peptides could identify the protein isoform that is most likely to be expressed. However, it must be kept in mind that all analyses of isoform-specific peptides are database-specific, in that the family of isoform sequences for each protein can vary from database to database and affect the uniqueness of a peptide. Some protein isoforms will be difficult to identify due to the lack of isoform-specific peptides from tryptic digests. Alternative proteases may generate isoform-specific peptides where tryptic digests cannot, and so increase the identification of isoforms.71, 72 Increased coverage of protein isoforms that are low abundance or specific to types of cell or tissue in LC-MS/MS analyses will also be important for comprehensively validating spliced isoforms.6, 25 An important application of the identification of alternatively spliced transcripts will be the identification of protein isoforms unique to different types of cancers.73 Cancer-specific protein isoforms, identified using transcriptomics and proteomics data from healthy and malignant cells from the same cancer patient, should be of value as biomarkers for stratifying cancer sub-types, tracking disease progression and developing personalized treatments.22, 74

In conclusion, this study comprehensively explored the proteomic validation of transcript isoforms. Using a new TranscriptCoder tool, and a combination of RNA-seq and proteomic data, we identified candidate hMSC-specific protein isoforms for the genes DPYSL2 and FXR1. We showed that Ensembl is an excellent resource for protein isoform identification, and recommend its use for human proteome analysis. We highlighted the difficulty of unambiguous isoform identification, since not all protein isoforms have unique peptides, and showed that some isoforms may not be easily resolved. Approaches such as those described here will ultimately allow protein isoforms to be identified on a large scale, from many different tissues, and aid the design of targeted assays to find development- or disease-associated isoforms.

ASSOCIATED CONTENT

Supporting Information Available:

Supporting Information 1: Supplementary methods, results, tables and figures. Method S1 details cell culture of human mesenchymal stem cells. Method S2 details RNA-seq library preparation, sequencing and data processing. Method S3 details protein extraction methodology. Method S4 details proteolytic digestion methodology. Method S5 details mass spectrometry methodology. Method S6 details MS/MS ion searches. Method S7 details the workflow with Results Analyser. Results S1 details a comparative analysis of commonly used databases for peptide searches. Results S2 details a qualitative analysis of each sequence database. Results S3 details all peptide filtering steps used for assessing peptide identification quality. Table S1 compares commonly used databases for peptide searches. Table S2 details the number of genes and proteins for each database. Table S3 details the number of peptides found in each database after each filtering step. Table S4 details peptide statistics for each database after filtering. Figure S1 shows an overview of rules used to identify protein isoforms, exons and splice junctions. Figure S2 shows a Venn diagram with the number of Ensembl genes that were identified for each database. Figure S3 shows the protein abundance of identified proteins compared with all proteins in the Ensembl database. Figure S4 shows the number of identified Ensembl isoforms compared to the proportion of exons and splice junctions identified for each database.

Supporting Information 2: An archived copy of the TranscriptCoder tool (v1.0), sample input and output files as a zip-compressed file.

Supporting Information 3: The list of GO annotations mapped to isoforms identified in Ensembl.

Supporting Information 4: The list of peptides from the Ensembl database that map to frame-shifted exons or span alternatively spliced junctions.

Supporting Information 5: The list of peptides identified after all filtering steps for each database. Proteotypic and isoform-specific peptides are also indicated for the Ensembl and TranscriptCoder-derived databases.

Supporting Information 6: Protein sequences of candidate hMSC-specific isoforms, FXR1 and DPYSL2, in FASTA format.

Supporting Information 7: The list of Ensembl isoforms identified with multiple proteotypic peptides. The total number of isoforms and total number of proteotypic peptides for each gene are indicated.

This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

Telephone: (+61 2) 9385 3633, Fax: (+61 2) 9385 1483, Email: [email protected]

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. ‡These authors contributed equally.

Funding Sources

MK acknowledges funding from the Odense University Hospital, Denmark and King Abdullah City for Science and Technology (KACST), KSA (10-BIO1308-02). NAT acknowledges funding from the University of New South Wales (UNSW) IPRS scheme. MRW and MK acknowledge funding from the UNSW Visiting Fellow Scheme. MRW acknowledges funding from the Australian Government EIF Super Science Scheme, the New South Wales State Government Science Leveraging Fund scheme, and the University of New South Wales. GH-S and MRW thank the Australian Research Council (ARC) for their financial support. GH-S also acknowledges funding from the UNSW Early Career Researcher Grants Program.

Notes

The author(s) declare that they have no competing interests.

ACKNOWLEDGEMENTS

We thank the WEHI Systems Biology Mascot Server for providing access to the Mascot server and Simon Michnowicz for technical support. We thank Schemek Pochopien from Intersect for help in maintaining the PG Nexus pipeline and Apurv Goel for providing feedback on the pipeline. We also thank Dr. Ling Zhong, Ms. Sydney Liu Lau and A/Prof. Mark Raftery for their maintenance of the orbitrap mass spectrometers housed at the UNSW Bioanalytical Mass Spectrometry Facility.

ABBREVIATIONS

C-HPP: Chromosome-Centric Human Proteome Project; CID: collision induced dissociation; ENCODE: The Encyclopedia of DNA Elements; ESI-TRAP: electrospray ion trap; FDR: false discovery rate; GTF: Gene Transfer Format; hMSC: human mesenchymal stem cell; IGV: Integrative Genomics Viewer; ORF: open reading frame; PG Nexus: Proteomic-Genomic Nexus; SRM: selected reaction monitoring; SQL: Structured Query Language

Table of Contents figure:

Figure 1: Overview of the TranscriptCoder algorithm. Many transcripts assembled from RNA-seq data, including splice variants, are not full length and thus lack canonical start and stop codons. Instead, they can show multiple apparent start and stop codons. TranscriptCoder uses the translation frame of a highly used exon, inferred by comparison with reference transcripts, to determine the correct open reading frame of one portion of the transcript. This frame is then propagated upstream and downstream, allowing the correct frame, and the correct start and stop codons if present, to be confirmed.
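The frame propagation described in Figure 1 can be illustrated with a short Python sketch. This is not the TranscriptCoder implementation; it is a minimal example under stated assumptions. The function name `orf_from_anchor`, the toy exon sequences, and the use of a GTF-style phase (the number of bases at the start of the anchor exon that precede its first complete codon) are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def orf_from_anchor(exons, anchor_index, anchor_phase):
    """Translate a spliced transcript in the frame fixed by one anchor exon.

    exons         -- exon sequences in 5'->3' transcript order (hypothetical input)
    anchor_index  -- index of the exon whose frame is known from a reference
    anchor_phase  -- bases at the start of that exon before its first full codon
    """
    seq = "".join(exons).upper()
    # Transcript position of the first complete codon inside the anchor exon.
    anchor_nt = sum(len(e) for e in exons[:anchor_index]) + anchor_phase
    # Propagating the frame upstream fixes the offset for the whole transcript.
    offset = anchor_nt % 3
    protein = "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(offset, len(seq) - 2, 3)
    )
    anchor_aa = (anchor_nt - offset) // 3

    # Upstream: with no in-frame stop, the 5' end may be incomplete, so keep it.
    up_stop = protein.rfind("*", 0, anchor_aa)
    start = 0
    if up_stop != -1:
        met = protein.find("M", up_stop + 1, anchor_aa + 1)
        start = met if met != -1 else up_stop + 1

    # Downstream: end at the first in-frame stop, if one is present.
    down_stop = protein.find("*", anchor_aa)
    end = down_stop if down_stop != -1 else len(protein)
    return protein[start:end]


# Toy transcript: 5' UTR "GG" followed by ATG GCT AAA GAT TGA, split into 3 exons.
toy_exons = ["GGATGGC", "TAAAGAT", "TGA"]
print(orf_from_anchor(toy_exons, anchor_index=1, anchor_phase=1))
```

In this toy case the anchor exon fixes the third reading frame of the spliced sequence, whereas translating naively from the first base would give a different, stop-free peptide. A real tool would also need to handle strand, genomic coordinates, and transcripts that do not share any exon with a reference.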

Figure 2: Visualization of peptides and RNA-seq reads for the ribosomal S2 protein (ENSG00000140988) in IGV. a) Gene architecture of ribosomal S2 protein transcripts. b) RNA-seq reads corresponding to alternatively spliced mRNA transcripts assembled using Cufflinks. c) Peptides identified from MS/MS ion searches mapping to 4 alternatively spliced transcripts (ENST00000343262, ENST00000526522, ENST00000529806 and ENST00000530225).

Figure 3: Venn diagram showing the number of isoforms that were identified by MS/MS ion searches against four different databases. The numbers in brackets, below each database name, represent the number of isoforms identified. The majority of isoforms were identified from all four databases.

Figure 4: The number of a) exons and b) splice junctions from isoforms that were validated with 2 or more peptides. The Venn diagrams show the results of MS/MS ion searches against four different databases. The numbers in brackets represent the total number of exons or splice junctions validated for each database.

Figure 5: Venn diagrams showing the number of isoforms that could be identified with isoform-specific proteotypic peptides in the MS/MS data. Isoforms were identified by MS/MS ion searches against the Ensembl and TranscriptCoder databases. a) The number of isoforms present in both databases. b) The number of isoforms present only in the Ensembl or TranscriptCoder database, but not in both.

Figure 6: The specificity of an isoform-specific peptide can be database-dependent. To facilitate visualisation, peptides have been coloured in red. For all screenshots: i) Ensembl gene architecture, ii) peptides found with MS/MS ion searches against the Ensembl database, iii) TranscriptCoder gene architecture and iv) peptides found with MS/MS ion searches against the TranscriptCoder database. a) A peptide (RAGELTEDEVER) in ribosomal protein S18 (ENSG00000231500) is isoform-specific in the Ensembl database but not in the RNA-seq TranscriptCoder database. b) A peptide (FSNQETCVEIGESVR) in phosphoribosyl pyrophosphate synthetase 1 (ENSG00000147224) is isoform-specific in the RNA-seq TranscriptCoder database but not in the Ensembl database. c) A peptide (AAGVNVEPFWPGLFAK) in ribosomal protein, large, P1 (ENSG00000137818) is isoform-specific in both the Ensembl and RNA-seq TranscriptCoder databases.
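The database dependence shown in Figure 6 follows from how isoform specificity is defined: a peptide is isoform-specific only relative to the set of isoforms listed for its gene. The sketch below illustrates that idea in Python. The peptide `PEPTIDEK`, the isoform names, and all sequences are invented toy data, and the simple substring test is an assumption; a real analysis would also consider enzyme cleavage context and isobaric residues.

```python
def isoforms_containing(peptide, isoforms):
    """Return the names of the isoforms whose sequence contains the peptide."""
    return {name for name, seq in isoforms.items() if peptide in seq}


def is_isoform_specific(peptide, isoforms):
    """A peptide is isoform-specific if exactly one isoform of the gene contains it."""
    return len(isoforms_containing(peptide, isoforms)) == 1


# Hypothetical gene as listed in a reference database: two isoforms.
reference_db = {
    "iso1": "MAAPEPTIDEKLL",
    "iso2": "MAAGGGKLL",
}

# The same gene in a hypothetical RNA-seq-derived database, which adds a
# third assembled isoform that shares the peptide.
rnaseq_db = {
    "iso1": "MAAPEPTIDEKLL",
    "iso2": "MAAGGGKLL",
    "novel3": "MPEPTIDEKQQ",
}

print(is_isoform_specific("PEPTIDEK", reference_db))  # specific in this database
print(is_isoform_specific("PEPTIDEK", rnaseq_db))     # shared by two isoforms here
```

The same peptide changes status purely because the second database lists one more isoform, which is the situation depicted in panels a) and b).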

Figure 7: Visualisation of novel hMSC-specific isoforms identified in the TranscriptCoder-derived database with isoform-specific peptides in IGV. Other isoforms in the TranscriptCoder-derived database that were not validated with isoform-specific peptides were removed to facilitate visualisation. a) A novel isoform for the fragile X mental retardation syndrome-related protein 1 (ENSG00000114416) was identified with an isoform-specific peptide (highlighted in the red box) that maps across the intron between the 13th and 14th exons and part of the 14th exon. The sequence of this peptide is GYATDESTVSSVQGSR. b) One novel isoform for the dihydropyrimidinase-like 2 protein (ENSG00000092964) was identified with an isoform-specific peptide (highlighted in the red box) at the 5’-end of the transcript. The sequence of this peptide is TIDFDSLSVGR.

Figure 8: Co-visualization of the gene architecture of pyruvate kinase, muscle protein (ENSG00000067225) with peptides in IGV. Only experimentally-derived proteotypic peptides are shown in the visualization. Peptides A, B and C each map to more than 1 isoform and therefore cannot be used alone to uniquely identify an isoform. However, the peptides can be used in combination to identify a protein as a member of a subset of an isoform family. For more details, see text.
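One way to read the combination of peptides A, B and C in Figure 8 is as a set intersection: each peptide narrows the candidates to the isoforms that contain it. The Python sketch below shows this with invented isoform names and an invented peptide-to-isoform mapping; it does not reproduce the actual mappings in the figure. It also assumes that all observed peptides derive from a single isoform, which need not hold when several isoforms are co-expressed.

```python
def candidate_isoforms(peptide_to_isoforms, observed_peptides):
    """Intersect the isoform sets of all observed peptides.

    peptide_to_isoforms -- dict mapping each peptide label to the set of
                           isoforms containing it (hypothetical input)
    observed_peptides   -- peptide labels identified in the MS/MS data
    """
    sets = [peptide_to_isoforms[p] for p in observed_peptides]
    return set.intersection(*sets) if sets else set()


# Toy mapping: no single peptide identifies a unique isoform.
toy_mapping = {
    "A": {"iso1", "iso2", "iso3"},
    "B": {"iso2", "iso3"},
    "C": {"iso2", "iso3", "iso4"},
}

subset = candidate_isoforms(toy_mapping, ["A", "B", "C"])
print(sorted(subset))
```

Here the three peptides together narrow four isoforms to a subset of two, even though none of them is isoform-specific on its own.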

Figure 9: Groups of up to 5 proteotypic peptides were used to identify specific isoforms of alternatively spliced genes in hMSC-TERT4-derived proteins. This identified 386 protein isoform families with 2 or more members. Eighty-two isoform families had one protein isoform with the highest number of proteotypic peptide hits, thus defining the isoform most likely to be present. The exact isoforms present could not be resolved for 162 isoform families, since their transcript isoforms shared too many peptides. A total of 142 isoform families had proteotypic peptides that mapped to a unique isoform; these cases had two or more isoform-specific peptides.
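The three outcomes in Figure 9 (82 + 162 + 142 = 386 families) can be mimicked with a simple decision rule, sketched below in Python. The rule, the function name `classify_family`, the category labels, and the toy peptide sets are assumptions for illustration only; the caption does not specify the exact procedure that was used.

```python
from collections import Counter


def classify_family(hits):
    """Classify one isoform family from its identified proteotypic peptides.

    hits -- dict mapping isoform name to the set of peptides identified for it
    Returns a (category, isoforms) tuple.
    """
    # Count how many isoforms in the family each peptide maps to.
    counts = Counter(p for peptides in hits.values() for p in peptides)

    # Isoforms supported by two or more peptides found in no other isoform.
    unique = sorted(
        iso for iso, peptides in hits.items()
        if sum(1 for p in peptides if counts[p] == 1) >= 2
    )
    if unique:
        return ("unique", unique)

    # Otherwise look for a single isoform with the most peptide hits.
    best = max(len(peptides) for peptides in hits.values())
    leaders = sorted(iso for iso, peptides in hits.items() if len(peptides) == best)
    if len(leaders) == 1:
        return ("most_likely", leaders)

    # Ties: the isoforms share too many peptides to be told apart.
    return ("unresolved", [])


print(classify_family({"A": {"p1", "p2", "p3"}, "B": {"p3"}}))
print(classify_family({"A": {"p1", "p2", "p3"}, "B": {"p2", "p3"}}))
print(classify_family({"A": {"p1", "p2"}, "B": {"p1", "p2"}}))
```

The three toy families fall into the "unique", "most likely" and "unresolved" categories respectively, mirroring the three groups described in the caption.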
