NextSearch: A Search Engine for Mass Spectrometry Data against a

May 25, 2015 - E-mail: [email protected]., *E.P.: Tel: +82-2-2220-2377. ... a protein sequence in FASTA format, achieving an order of magnitude red...
0 downloads 0 Views 2MB Size
Subscriber access provided by NEW YORK UNIV

Article

NextSearch: A search engine for mass spectrometry data against a compact nucleotide exon graph Hyunwoo Kim, Heejin Park, and Eunok Paek J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00047 • Publication Date (Web): 25 May 2015 Downloaded from http://pubs.acs.org on June 4, 2015

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

NextSearch: A search engine for mass spectrometry data against a compact nucleotide exon graph Hyunwoo Kim1, Heejin Park2,*, and Eunok Paek2,* 1

Department of Electronics and Computer Engineering, 2Department of Computer Science and Engineering, Hanyang University, Seoul

KEYWORDS Proteogenomics, Exon graph, Alternative splicing, Junction variation

ACS Paragon Plus Environment

1

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 28

ABSTRACT

Proteogenomics research has been using six-frame translation of the whole genome or amino acid exon graphs in order to overcome the limitations of reference protein sequence database. However, six-frame translation is not suitable for annotating genes that span over multiple exons, and amino acid exon graphs are not convenient to represent novel splice variants and exon skipping events between exons of incompatible reading frames. We propose a proteogenomic pipeline NextSearch (Nucleotide EXon-graph Transcriptome Search) that is based on a nucleotide exon graph. This pipeline consists of constructing a compact nucleotide exon graph that systematically incorporates novel splice variations, and a search tool that identifies peptides by directly searching the nucleotide exon graph against tandem mass spectra. Since our exon graph stores nucleotide sequences, it can easily represent novel splice variations and exon skipping events between incompatible reading frame exons. Searching for peptide identification is performed against this nucleotide exon graph, without converting it into a protein sequence in FASTA format, achieving an order of magnitude reduction in the size of the sequence database storage. NextSearch outputs the proteome-genome/transcriptome mapping results in a general feature format (GFF) file, which can be visualized by public tools such as the UCSC Genome Browser.

ACS Paragon Plus Environment

2

Page 3 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

INTRODUCTION A correct gene annotation requires evidence from all levels of genomics, transcriptomics and proteomics. Typically, genes have been mainly predicted from genomic and transcriptomic evidence, using ab initio gene prediction, cross species sequence homology, and EST and RNASeq data alignment1. However, it is still unclear whether the predicted genes are actually translated into proteins without the existence of proteomic-level evidence. Moreover, finding accurate exon and intron boundaries based only on transcriptomic evidence can be difficult due to the limit of the current sequencing technology. Recently, proteogenomics approaches have been applied to improve gene annotations, owing to the rapid development of mass spectrometry and peptide/protein identification techniques2. Excellent surveys are available by Castellana and Bafna3 and Wang and Zhang4. These proteogenomic methods can be categorized into three approaches. The first approach is to construct a putative protein sequence database translated from the predicted genes and confirm the putative proteins and their corresponding genes by mass spectrometry. Some putative protein sequence databases include known SNPs5. Such a method is useful to confirm predicted genes, but unpredicted genes cannot be identified. Alternatively, for a gene or genes of interest, all possible single SNPs can be generated per gene, translated to protein sequences, and searched6. The second approach is to use six-frame translation of the whole genome rather than to use putative proteins as a protein sequence database7,

8, 9, 10

. This approach can find unpredicted

genes, but cannot identify peptides spanning multiple exons and thus rarely finds accurate exon and intron boundaries. The third approach is to use an exon graph instead of a protein sequence database. Tanner and colleagues presented an amino acid exon graph that inferred putative introns and exons from dbEST and translated the putative exons into amino acid sequences11.

ACS Paragon Plus Environment

3

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 28

The nodes of the graph represent amino acid sequences. The edges represent splice events between nodes of compatible reading frame exons. They also incorporated SNPs from dbSNP to the exon graph. This exon graph has been used to discover novel genes of Arabidopsis12. The amino acid exon graph, however, cannot be used to find novel splice variants or exon skipping between incompatible reading frame exons. To overcome this, a nucleotide exon graph and a proper graph search engine are required. Woo and colleagues considered a nucleotide splice graph13. However, instead of developing a nucleotide graph search engine, they converted the graph to a FASTA file to use a conventional amino acid based search tool. Since the linear FASTA file, enumerating all possible peptide translations from a graph, is much larger than the nucleotide graph, RNA-Seq results were used to select highly expressed portion of the whole FASTA file. This reduction approach based on RNA-Seq was also used by Sheynkman and colleagues14 to find novel splice junctions and translate them into amino acid sequences, and also by Wang and others15 to construct a customized protein sequence database including only highly expressed proteins. However, since RNA-Seq data is also incomplete, novel splice variants or exon skipping events can be overlooked. This paper proposes a proteogenomic data analysis pipeline that (1) builds a nucleotide exon graph, (2) identifies peptides by directly searching a protein database represented as a nucleotide exon graph, and (3) outputs the proteome-genome/transcriptome mapping results in a general feature format (GFF) file to facilitate visualization. Since our exon graph stores nucleotide sequences, it can easily represent novel splice variations and exon skipping events between incompatible reading frame exons. Searching for peptide identification from tandem mass spectra is performed against this nucleotide exon graph, without converting it into a protein

ACS Paragon Plus Environment

4

Page 5 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

sequence in FASTA format, resulting in an order of magnitude reduction in the size of the sequence database storage.

METHODS 11 Cell Line Data Set The NextSearch algorithm was run on cell cultures of A549, GAMG, HEK293, HeLa, HepG2, K562, MCF7, RKO, U2OS, LnCap, and Jurkat cells from a previous study16. These data sets were acquired on an LTQ-Orbitrap Velos mass spectrometer (Thermo Fisher Scientific) coupled with high performance liquid chromatography (HPLC). Peptide fragmentation was performed with HCD method and MS and MS/MS spectra were acquired in the Orbitrap analyzer. Triplicate analyses of each cell line were performed and in each replicate the peptides were separated into six fractions. It consists of 6,269,523 MS/MS spectra, altogether.

Database Search and FDR Estimation Peptide identification was conducted using a two-stage search method17. A multi-stage data analysis strategy has been reported to improve the sensitivity of peptide identification in proteogenomics research, especially when it aims to discover novel peptides18. First, MS-GF+19 was run on each cell line data to identify all the known peptides. Then, the identified spectra obtained at 1% FDR from MS-GF+ search results were excluded from the input MS/MS data and NextSearch was run on the remainder, the number of which is 2,914,088. MS-GF+ search was done against UniProt-human-reference20 (released in May 2013; 90,191 entries) appended with 179 common contaminants, and the following parameters were adopted: 10 ppm precursor ion tolerance, number of tryptic termini (NTT) = 1, fixed modification of

ACS Paragon Plus Environment

5

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 28

Carbamidomethyl on Cys, and variable modification of Oxidation on Met. The false discovery rate (FDR) of the peptide identification results was estimated using the target/decoy method. Peptide identifications were obtained at 1% FDR, where FDR was calculated as D/T: D is the number of decoy hits and T is the number of target hits above a threshold. 3,355,435 PSMs were identified at 1% FDR. Next Search was run with the following parameters: number of tryptic termini (NTT) = 1, the maximum number of missed cleavages = 2, fixed modification of Carbamidomethyl (C). We used 0.02 Da for fragment ion tolerance and 10 ppm precursor ion tolerance and allowed no variable modifications. Novel peptide identifications from NextSearch results were also obtained at 1% FDR.

Overview of NextSearch NextSearch is the first proteogenomic search tool to identify peptides from tandem mass spectra by searching a nucleotide exon graph, where a node and an edge represent an exon and a splice event, respectively. The overall workflow is shown in Figure 1. First, we construct a nucleotide exon graph given an input transcriptome model in a general transfer format (GTF) and a whole genome sequence in the FASTA file format. In this work, we used Ensembl21 transcriptome database (version 71) and human reference genome (hg19). The exon graph is constructed by adding novel transcript-level events, namely single exon skipping, junction variations, frame shifts, and translated UTRs (untranslated regions) to the existing reference transcriptome database. Peptides are then identified from mass spectrometry-based proteomics by searching the nucleotide exon graph as a target sequence database. A modified version of MODa22 is applied to search the nucleotide exon graph for peptide identification. Finally,

ACS Paragon Plus Environment

6

Page 7 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

identified peptides are mapped to the nucleotide exon graph using an amino acid-based search against the exon graph. Such information is converted into GFF files so that it can be manually validated using public tools such as the UCSC Genome Browser23. NextSearch will be made available at http://prix.hanyang.ac.kr, upon acceptance of the manuscript.

Compact Nucleotide Exon Graph Construction First, we construct a directed acyclic graph (DAG) for each Ensembl transcript model of each gene, where each exon and splice event of a transcript is respectively represented as a node and an edge in a DAG. (Figure 2b). We add hypothetical novel transcriptomic events to these DAGs. One type of hypothetical genomic events we included is single exon skipping shown in Figure 2c. The second type is a junction variation event, which allows ±3 position shifts from the original splice sites, but such a hypothetical junction variation is included only when there is a consensus sequence of ‘GT’ at a splice donor site or ‘AG’ at a splice acceptor site (Figure 2d). We only consider junction variations with ±3 site displacement so that downstream frame shifts are not included. We also limit junction variations such that they cannot occur simultaneously at both donor and acceptor sites of a splice position, nor simultaneously with exon skipping. It is our conjecture that junction variation does not occur frequently and thus there is little chance that it co-occurs with another novel splicing event. This representation, however, includes a lot of redundancy because different transcripts of a single gene may have many exons and splice events in common. Adding hypothetical splice events exacerbates the situation, by repeatedly adding nodes and edges of the same transcriptomic event to multiple transcripts of a single gene. To reduce the redundancy, we construct a compact exon graph by splitting and merging overlapping exons for each gene as

ACS Paragon Plus Environment

7

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 28

shown in Figure 2e. If multiple transcripts of a single gene include exons with the same start/end positions, those exons are merged. If an exon of one transcript either includes or overlaps with an exon of another transcript within the same gene, the exon is split into smaller non-overlapping subexons and edges are added between those subexons that came from the same original exon. While the nucleotide-based representation allows a node split operation at any genomic position, amino acid-based representation will allow node split only for a certain ORF and thus will not be amenable to the graph compaction described here. Constructing an exon graph based on nucleotide sequences not only makes it simple to add hypothetical splice events but also allows addition of hypothetical frame shifts in a straightforward manner. Table 1 shows the statistics of a compact exon graph built from transcript models of Ensembl version 71. When compared with the numbers in parentheses, showing the statistics of a simple exon graph before compaction, the compact exon graph displays about a twofold reduction in its size. It must also be noted that a compact exon graph is a lot smaller than the equivalent enumerated sequence database in FASTA format. We created an enumerated sequence database from the compact exon graph in the same way as Woo and colleagues did13. The resulting FASTA file is about 850 MB in size, while the compact exon graph can be stored in a text file of about 210 MB, showing a 4-fold reduction. In addition, the resulting FASTA file size enables one to estimate the size of search space. Since UniProt database is 43.5 MB when downloaded as a FASTA file, the search space of our compact exon graph is about 20 times bigger than that of the UniProt.

Nucleotide Exon Graph Search

ACS Paragon Plus Environment

8

Page 9 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

To identify peptides using a nucleotide exon graph database, MODa (a peptide identification algorithm that searches tandem mass spectra against an amino acid sequence database) is modified to analyze tandem mass spectra against a nucleotide exon graph. A short description of the original MODa is as follows: MODa is an efficient algorithm that identifies peptides using dynamic programming after filtering candidate peptides based on sequence tags. MODa first performs de novo sequencing given a tandem mass spectrum to identify all the sequence tags consisting of three amino acids. It then selects candidate peptides that can be aligned with one of these tags together with the corresponding N-term and C-term flanking masses. It finally scores the match between each candidate peptide and spectrum to select the best matching peptide sequence. MODa has been modified to search against a nucleotide exon graph as follows. First, amino acid sequence tags, identified by MODa’s de novo sequencing module, are converted into nucleotide sequence tags. Since there are about 3 codons per amino acid, this conversion makes about 27 times more nucleotide sequence tags than amino acid tags. To reduce the tag search time, the nucleotide sequence tags from each spectrum are stored in a keyword tree24. This keyword tree is searched against the exon graph using depth-first search (DFS) as shown in Algorithm 1 in Supplementary Table 1. While traversing the exon graph using DFS, each nucleotide sequence of the current node is compared with the keyword tree using Aho-Corasick algorithm. Even when the nucleotide sequence of the current node is exhausted, the tag search must be continued by visiting the adjacent nodes because tags may span over multiple nodes. It should be noted that DFS used here is a little different from the conventional DFS. Typically, DFS visits each node of a graph only once during its traversal. Nodes are called ‘visited’ when the algorithm reaches the node during execution and the node’s data content is consumed. In a

ACS Paragon Plus Environment

9

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 28

nucleotide exon graph, however, DFS must be refined such that a node is considered visited when all of its adjacent nodes are visited as well as when all of the node’s nucleotide sequence is exhausted by comparing with the keyword tree. Every time a nucleotide sequence tag is found in the graph, we determine candidate peptides by extending the nucleotide sequence tags. This time, we translate the nucleotide sequences in nodes into amino acid sequences to check whether Nterm and C-term flanking masses are matched (Algorithm 2, Supplementary Table 1). Finally, peptides are identified by aligning candidate peptides with MS/MS spectra using the original MODa’s dynamic programming.

Genomic Locations of Peptides Results of mass spectrometry data analysis not only provide an amino acid sequence that best matches a given tandem mass spectrum, also provides a list of proteins that include the matched peptide. With a regular search against protein sequence databases, such output is simple to produce, because a peptide position within a protein can easily be maintained as an integer value. In the case of an exon graph, a peptide position within a gene must be represented as a path within the graph, composed of sequences of nodes together with each node’s start/end positions, because tags may span over multiple exons and partially match subsequences of start/end exons. Keeping a record of such information during a search for peptide-spectrum match requires a lot of memory, especially if the number of candidate peptides is big. Actually, an average number of candidate peptides when using UniProt is about 10,000 per spectra while the nucleotide exon graph is expected to give a list that is an order of magnitude bigger than UniProt. In order to reduce the memory footprint of NextSearch algorithm, we chose to record only peptide

ACS Paragon Plus Environment

10

Page 11 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

sequences without their positions during the search, and then map the identified peptides back to the exon graph when we need the transcripts/genes associated with the identified peptides. For such mapping, we developed a search algorithm against a nucleotide exon graph for amino acid sequences. It does not convert peptide sequences into nucleotide sequences because the number of converted nucleotide sequences can be too many. For example, if we convert a peptide of an average length 10 into the corresponding nucleotide sequences, it results in 69,662 different sequences because one amino acid corresponds to about 3.05 codons. Instead, a nucleotide sequence of an exon graph is converted into amino acids on the fly. It must be noted that frame shifts should be considered when translating nucleotide sequences, since we do not know to which ORF the identified peptide belongs. Therefore, a node can be visited up to three times and each node keeps a record of ORFs for which it was traversed. Again, we used DFS for graph traversal and Aho-Corasick algorithm for matching the keyword tree of amino acid sequences against the nucleotide exon graph.

FDR estimation for NextSearch The false discovery rate (FDR) of the peptide identification results was estimated using the same target/decoy method as with MS-GF+ search. As for the decoy database search for NextSearch, one can think of constructing a decoy exon graph, which is equivalent to a reverse sequence amino acid database, and searching it. Instead of explicitly constructing a decoy graph database, however, a decoy search in NextSearch can be achieved by traversing the target exon graph in reverse direction using the same exon graph and the same DFS algorithm. This allows us to conduct the decoy search without doubling the database size. Actually, a simple reverse direction traversal is not equivalent to a reversed protein sequence database, because it searches

ACS Paragon Plus Environment

11

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 28

the reversed nucleotide sequences, and not the reversed target amino acid sequences. For a decoy graph search that is equivalent to using a reversed protein sequence DB, we used a ‘reverse codon table’. In a reverse codon table, a nucleotide sequence “TTA”, a codon for Leucine for example, is a codon for Isoleucine because its reverse sequence “ATT” is a codon for Isoleucine. If we apply the same graph traversal algorithm to the exon graph in reverse direction and use the reverse codon table, it will be equivalent to searching a decoy exon graph whose amino acid sequences are the reverse of the original target exon graph’s translated amino acid sequences. This method allows NextSearch to effectively search the reverse decoy database in FASTA format, without actually constructing a decoy database.

NextSearch on Hadoop/MapReduce Framework Recent high-throughput proteomics experiments often produce millions of spectra from a single sample. Obviously, we need a fast computing environment to search such large-scale proteomics data against a database whose size is supposedly an order of magnitude bigger than the usual reference sequence database. We developed a distributed version of NextSearch running on the Hadoop/MapReduce framework so that the search can be efficiently conducted on a cluster of computers25. When NextSearch is executed using a single core of 2.0GHz Intel Xeon processor, it takes about 16 seconds per spectrum. When executed in a distributed computing mode, it takes about 0.45 seconds per spectrum using 36 cores of 2.0GHz Intel Xeon processor, showing a speed up on a linear scale.

ACS Paragon Plus Environment

12

Page 13 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

When it comes to the memory requirement, NextSearch requires about 2.5 GB memory for its exon graph construction when applied to the Ensembl human database and an additional 1 GB memory when searching for peptide identification.

RESULTS Peptide Identification using NextSearch The resulting number of novel PSMs (peptides) from NextSearch at 1 % FDR was 2,738 (571). 2,738 novel PSM results are available as Supplementary Data 1. We looked into 571 novel peptides identified only from NextSearch. Out of 571, we selected 495 novel peptides after removing 76 peptides whose sequences can be mapped to multiple genome positions and 16 peptides that are matched to Ensemble CDS in the same frame, although they are not part of Uniprot sequences. In Table 2, we displayed the number of unique novel peptides for each event, identified only from NextSearch. There are 17 junction variations, 134 translated UTR peptides, 237 translated pseudo genes, 58 alternative splicing events, and 33 frame shift events. Figure 3(a) shows a PSM that includes a novel alternative splicing event. Figures 3(b) and 3(c) display PSMs including junction variation and translated UTR events, respectively. We also reviewed novel peptides to see if they were commonly observed among the triplicates and across different cell lines. On average, 3 peptides were shared among the triplicates, 14.5 peptides were shared between any two replicates (Supplementary Table 3). Moreover, 8 peptides were shared among 11 cell lines (Table 3).

Generic Feature Format using Amino Acid Search

ACS Paragon Plus Environment

13

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 28

NextSearch provides peptide identification results as a GFF file as well, so that users can easily verify to which genomic region a novel peptide is mapped and visually confirm which novel transcription/translation events turned up as the novel peptide, within the UCSC Genome Browser. Novel peptides from Figure 3 are displayed within the UCSC Genome Browser using our GFF output. Each of Figure 4 (a), (b) and (c) represents the peptide from the PSM in Figure 3 (a), (b) and (c), respectively.

DISCUSSION NextSearch is the first tool that identifies peptides against a nucleotide database in a graph form: we constructed a compact nucleotide exon graph and developed a search algorithm to traverse the graph database for identifying peptides. Our nucleotide exon graph facilitates the representation of novel splice variants and exon skipping events between exons of incompatible reading frames, which were not supported by previous amino acid exon graphs. Furthermore, NextSearch provides peptide identification results as a GFF file so that the mapping between identified peptides and gene structure can easily be visualized. We also developed a Hadoop/MapReduce version of NextSearch for its parallel execution on a cluster of computers. We classified novel peptides identified from NextSearch into five different novel transcription/translation events: junction variation, translated UTR, translated pseudo gene, alternative splicing, and frame shift. For classification purposes, we used both Ensembl and RefSeq databases, although we constructed the exon graph based on Ensembl transcript model. NextSearch aims to identify novel peptides from hypothetical transcription/translation events and the fact that more variants of transcript models were included in Ensembl than RefSeq made us

ACS Paragon Plus Environment

14

Page 15 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

start with Ensembl for novel peptide discovery. When interpreting peptide identifications as novel events, however, we took more precaution by referring to both databases. If the two databases do not agree on their transcript model, we selected a transcript model according to the following rules: 1) if a peptide is mapped only to a single transcript model, we chose this transcript, 2) if a peptide is mapped to multiple transcript models and thus can be interpreted as multiple novel events, we selected more reliable transcript model based on their curation annotations. As the result of searching 6,269,523 spectra, we found more peptides mapped to UTR or pseudo genes than peptides including alternative splicing or junction variation events. Thus, it is worthwhile to mainly focus on novel transcription/translation events by constructing a sequence database mainly from UTR and pseudo genes and identifying peptides against such sequence database to confirm gene expression at proteomic level.

ACS Paragon Plus Environment

15

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 28

Figure 1. The overall workflow of NextSearch.

ACS Paragon Plus Environment

16

Page 17 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 2. Compact exon graph construction: (a) An illustrative gene consisting of three Ensembl transcripts. (b) Each exon and splice event is represented as a node and an edge, respectively, of a graph. (c) Exon skipping: 3 edges are added (marked by dotted lines). (d) Junction variation: 5 exons and 5 edges are added. The zoom box shows the existence of consensus sequences (shown as “GT” and “AG” motifs in dotted rectangles). (e) Graph compaction: 14 nodes (exons) and 19 edges are reduced into 8 nodes (exons) and 13 edges after graph compaction.

ACS Paragon Plus Environment

17

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 28

Figure 3. MS/MS spectra of novel peptides identified by NextSearch. (a) KKPCSETSQIEGSPTEFLEEK includes an alternative splicing event between E and G. (b) DESAN-EEPEAR includes a junction variation event between N and E. Fragmentation between these novel splice

ACS Paragon Plus Environment

18

Page 19 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

sites

can

be

confirmed

by

fragment

ions

associated

with

boxed

residues.

(c)

TQTAAAAAAGGVGGGGGAMGGLASGGDVEPGLPVEVR is a translated UTR peptide.

ACS Paragon Plus Environment

19

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 28

Figure 4. Visualization of novel peptides within the UCSC Genome Browser using GFF files. Red blocks represent identified peptides, black blocks represent Ensembl database’s transcript models, green blocks represent CDS regions. (a) KKPCSETSQIE-GSPTEFLEEK is mapped as an alternative splicing form skipping one exon in ENST00000395734 and ENST00000360240 (blue box). (b) DESAN-EEPEAR is identified as a junction variation form of both ENST00000245932, ENST00000586014, and ENST00000588482. There is a consensus ‘AG’ motif

at

the

splice

acceptor

site

(red

circle).

(c)

TQTAAAAAAGGVGGGGGAMGGLASGGDVEPGLPVEVR is mapped to upstream UTR and CDS regions of ENST00000250113. A part of the peptide, MGGLASGGDVEPGLPVEVR, corresponds to CDS. Note that this is a reverse transcript.

ACS Paragon Plus Environment

20

Page 21 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Table 1. Statistics of Compact Exon Graph. Number of components(G)

56,526 (193,997)

Number of nodes

770,022 (1,821,813)

Number of edges

1,497,521 (2,867,638)

Average node length

159.10 bp (155.73 bp)

Average number of edges per node

1.94 (1.57)

Note: The values in parentheses are statistics of exon graph before compaction. Number of components(G) gives the number of gene models (the number of transcripts), which includes protein coding (20,738), pseudogene (13,419), long noncoding (13,220), and short noncoding (9,149).

ACS Paragon Plus Environment

21

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 28

Table 2. The number of novel peptides for each event type identified only from the NextSearch.

Event type

Number of novel peptides

Junction variation

17

Translated UTR peptide

134

Translated Pseudo gene

237

Alternative splicing

58

Frame shift

33

ACS Paragon Plus Environment

22

Page 23 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Table 3. The number of novel peptides shared among different cell lines. More than half of the novel peptides were observed in multiple cell lines. Eight peptides were observed in all 11 cell lines.

Number of cell lines

Number of novel peptides

11

8

10

2

9

4

8

6

7

9

6

9

5

11

4

263

3

28

2

77

1

318

ACS Paragon Plus Environment

23

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 28

ASSOCIATED CONTENT Supporting Information Supplementary Table 1. Pseudo code for nucleotide exon graph search algorithm. Supplementary Table 2. The number of novel peptides shared among replicates for each cell line. Supplementary Data 1. 2,738 novel PSMs identified by NextSearch. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION Corresponding Author *E.P.: tel, +82-2-2220-2377; fax, +82-2-2220-1723; e-mail, [email protected]. *H.P.: tel, +82-2-2220-1986; fax, +82-2-2220-1986; e-mail, [email protected].

Notes The authors declare no competing financial interest.

ACKNOWLEDGMENT

ACS Paragon Plus Environment

24

Page 25 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

This research was supported by the National Research Foundation of Korea [NRF2012M3A9B9036676, NRF-2014R1A2A1A11054147, NRF-2012M3A9D1054452].

REFERENCES 1. Yandell, M.; Ence, D., A beginner's guide to eukaryotic genome annotation. Nat Rev Genet 2012, 13 (5), 329-342. 2. Gupta, N.; Tanner, S.; Jaitly, N.; Adkins, J. N.; Lipton, M.; Edwards, R.; Romine, M.; Osterman, A.; Bafna, V.; Smith, R. D.; Pevzner, P. A., Whole proteome analysis of posttranslational modifications: Applications of mass-spectrometry for proteogenomic annotation. Genome Res 2007, 17 (9), 1362-1377. 3. Castellana, N.; Bafna, V., Proteogenomics to discover the full coding content of genomes: A computational perspective. J Proteomics 2010, 73 (11), 2124-2135. 4. Wang, X. J.; Zhang, B., Integrating Genomic, Transcriptomic, and Interactome Data to Improve Peptide and Protein Identification in Shotgun Proteomics. J Proteome Res 2014, 13 (6), 2715-2723. 5. Li, J.; Su, Z. L.; Ma, Z. Q.; Slebos, R. J. C.; Halvey, P.; Tabb, D. L.; Liebler, D. C.; Pao, W.; Zhang, B., A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics. Mol Cell Proteomics 2011, 10 (5). 6. Gatlin, C. L.; Eng, J. K.; Cross, S. T.; Detter, J. C.; Yates, J. R., Automated identification of amino acid sequence variations in proteins by HPLC/microspray tandem mass spectrometry. Anal Chem 2000, 72 (4), 757-763. 7. Yates, J. R.; Eng, J. K.; Mccormack, A. L., Mining Genomes - Correlating Tandem Mass-Spectra of Modified and Unmodified Peptides to Sequences in Nucleotide Databases. Anal Chem 1995, 67 (18), 3202-3210. 8. Choudhary, J. S.; Blackstock, W. P.; Creasy, D. M.; Cottrell, J. S., Interrogating the human genome using uninterpreted mass spectrometry data. Proteomics 2001, 1 (5), 651-667. 9. Fermin, D.; Allen, B. B.; Blackwell, T. W.; Menon, R.; Adamski, M.; Xu, Y.; Ulintz, P.; Omenn, G. S.; States, D. J., Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol 2006, 7 (4). 10. Sevinsky, J. R.; Cargile, B. J.; Bunger, M. K.; Meng, F.; Yates, N. A.; Hendrickson, R. C.; Stephenson, J. L., Whole genome searching with shotgun proteomic data: Applications for genome annotation. J Proteome Res 2008, 7 (1), 80-88. 11. Tanner, S.; Shen, Z. X.; Ng, J.; Florea, L.; Guigo, R.; Briggs, S. P.; Bafna, V., Improving gene annotation using peptide mass spectrometry. Genome Res 2007, 17 (2), 231-239. 12. Castellana, N. E.; Payne, S. H.; Shen, Z. X.; Stanke, M.; Bafna, V.; Briggs, S. P., Discovery and revision of Arabidopsis genes by proteogenomics. P Natl Acad Sci USA 2008, 105 (52), 21034-21038. 13. Woo, S.; Cha, S. W.; Merrihew, G.; He, Y. P.; Castellana, N.; Guest, C.; MacCoss, M.; Bafna, V., Proteogenomic Database Construction Driven from Large Scale RNA-seq Data. J Proteome Res 2014, 13 (1), 21-28.

ACS Paragon Plus Environment

25

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 28

14. Sheynkman, G. M.; Shortreed, M. R.; Frey, B. L.; Smith, L. M., Discovery and Mass Spectrometric Analysis of Novel Splice-junction Peptides Using RNA-Seq. Mol Cell Proteomics 2013, 12 (8), 2341-2353. 15. Wang, X. J.; Slebos, R. J. C.; Wang, D.; Halvey, P. J.; Tabb, D. L.; Liebler, D. C.; Zhang, B., Protein Identification Using Customized Protein Sequence Databases Derived from RNA-Seq Data. J Proteome Res 2012, 11 (2), 1009-1017. 16. Geiger, T.; Wehner, A.; Schaab, C.; Cox, J.; Mann, M., Comparative Proteomic Analysis of Eleven Common Cell Lines Reveals Ubiquitous but Varying Expression of Most Proteins. Mol Cell Proteomics 2012, 11 (3). 17. Elias, J. E.; Gygi, S. P., Target-decoy search strategy for increased confidence in largescale protein identifications by mass spectrometry. Nat Methods 2007, 4 (3), 207-214. 18. Nesvizhskii, A. I., Proteogenomics: concepts, applications and computational strategies. Nat Methods 2014, 11 (11), 1114-1125. 19. Kim, S.; Pevzner, P. A., MS-GF plus makes progress towards a universal database search tool for proteomics. Nat Commun 2014, 5. 20. Apweiler, R.; Martin, M. J.; O'Donovan, C.; Magrane, M.; Alam-Faruque, Y.; Alpi, E.; Antunes, R.; Arganiska, J.; Casanova, E. B.; Bely, B.; Bingley, M.; Bonilla, C.; Britto, R.; Bursteinas, B.; Chan, W. M.; Chavali, G.; Cibrian-Uhalte, E.; Da Silva, A.; De Giorgi, M.; Dimmer, E.; Fazzini, F.; Gane, P.; Fedotov, A.; Castro, L. G.; Garmiri, P.; Hatton-Ellis, E.; Hieta, R.; Huntley, R.; Jacobsen, J.; Jones, R.; Legge, D.; Liu, W. D.; Luo, J.; MacDougall, A.; Mutowo, P.; Nightingale, A.; Orchard, S.; Patient, S.; Pichler, K.; Poggioli, D.; Pundir, S.; Pureza, L.; Qi, G. Y.; Rosanoff, S.; Sawford, T.; Sehra, H.; Turner, E.; Volynkin, V.; Wardell, T.; Watkins, X.; Zellner, H.; Corbett, M.; Donnelly, M.; van Rensburg, P.; Goujon, M.; McWilliam, H.; Lopez, R.; Xenarios, I.; Bougueleret, L.; Bridge, A.; Poux, S.; Redaschi, N.; Auchincloss, A.; Axelsen, K.; Bansal, P.; Baratin, D.; Binz, P. A.; Blatter, M. C.; Boeckmann, B.; Bolleman, J.; Boutet, E.; Breuza, L.; de Castro, E.; Cerutti, L.; Coudert, E.; Cuche, B.; Doche, M.; Dornevil, D.; Duvaud, S.; Estreicher, A.; Famiglietti, L.; Feuermann, M.; Gasteiger, E.; Gehant, S.; Gerritsen, V.; Gos, A.; Gruaz-Gumowski, N.; Hinz, U.; Hulo, C.; James, J.; Jungo, F.; Keller, G.; Lara, V.; Lemercier, P.; Lew, J.; Lieberherr, D.; Martin, X.; Masson, P.; Morgat, A.; Neto, T.; Paesano, S.; Pedruzzi, I.; Pilbout, S.; Pozzato, M.; Pruess, M.; Rivoire, C.; Roechert, B.; Schneider, M.; Sigrist, C.; Sonesson, K.; Staehli, S.; Stutz, A.; Sundaram, S.; Tognolli, M.; Verbregue, L.; Veuthey, A. L.; Zerara, M.; Wu, C. H.; Arighi, C. N.; Arminski, L.; Chen, C. M.; Chen, Y. X.; Huang, H. Z.; Kukreja, A.; Laiho, K.; McGarvey, P.; Natale, D. A.; Natarajan, T. G.; Roberts, N. V.; Suzek, B. E.; Vinayaka, C. R.; Wang, Q. H.; Wang, Y. Q.; Yeh, L. S.; Yerramalla, M. S.; Zhang, J.; Consortium, U., Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Research 2013, 41 (D1), D43-D47. 21. Flicek, P.; Ahmed, I.; Amode, M. R.; Barrell, D.; Beal, K.; Brent, S.; Carvalho-Silva, D.; Clapham, P.; Coates, G.; Fairley, S.; Fitzgerald, S.; Gil, L.; Garcia-Giron, C.; Gordon, L.; Hourlier, T.; Hunt, S.; Juettemann, T.; Kahari, A. K.; Keenan, S.; Komorowska, M.; Kulesha, E.; Longden, I.; Maurel, T.; McLaren, W. M.; Muffato, M.; Nag, R.; Overduin, B.; Pignatelli, M.; Pritchard, B.; Pritchard, E.; Riat, H. S.; Ritchie, G. R. S.; Ruffier, M.; Schuster, M.; Sheppard, D.; Sobral, D.; Taylor, K.; Thormann, A.; Trevanion, S.; White, S.; Wilder, S. P.; Aken, B. L.; Birney, E.; Cunningham, F.; Dunham, I.; Harrow, J.; Herrero, J.; Hubbard, T. J. P.; Johnson, N.; Kinsella, R.; Parker, A.; Spudich, G.; Yates, A.; Zadissa, A.; Searle, S. M. J., Ensembl 2013. Nucleic Acids Res 2013, 41 (D1), D48-D55.

ACS Paragon Plus Environment

26

Page 27 of 28

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

22. Na, S.; Bandeira, N.; Paek, E., Fast Multi-blind Modification Search through Tandem Mass Spectrometry. Mol Cell Proteomics 2012, 11 (4). 23. Karolchik, D.; Baertsch, R.; Diekhans, M.; Furey, T. S.; Hinrichs, A.; Lu, Y. T.; Roskin, K. M.; Schwartz, M.; Sugnet, C. W.; Thomas, D. J.; Weber, R. J.; Haussler, D.; Kent, W. J., The UCSC Genome Browser Database. Nucleic Acids Res 2003, 31 (1), 51-54. 24. Aho, A. V.; Corasick, M. J., Efficient String Matching - Aid to Bibliographic Search. Commun Acm 1975, 18 (6), 333-340. 25. Dean, J.; Ghemawat, S., Mapreduce: Simplified data processing on large clusters. Commun Acm 2008, 51 (1), 107-113.

ACS Paragon Plus Environment

27

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 28

Abstract Graphic

ACS Paragon Plus Environment

28