Efficient Application of De Novo RNA Assemblers for Proteomics

Aug 15, 2016 - Toni Luge†∥, Cornelius Fischer†‡∥, and Sascha Sauer†‡§⊥. † Otto-Warburg-Laboratory, Max Planck Institute for Molecul...

Letter

Efficient application of de novo RNA assemblers for proteomics informed by transcriptomics Toni Luge, Cornelius Fischer, and Sascha Sauer J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00301 • Publication Date (Web): 15 Aug 2016 Downloaded from http://pubs.acs.org on August 15, 2016


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Efficient application of de novo RNA assemblers for proteomics informed by transcriptomics

Toni Luge1†, Cornelius Fischer1,2†, Sascha Sauer1,2,3*

Author Affiliations

1 Otto-Warburg-Laboratory, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany

2 BIMSB and BIH Genomics Platforms, Laboratory of Functional Genomics, Nutrigenomics and Systems Biology, Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Straße 10, 13125 Berlin, Germany

3 CU Systems Medicine, University of Würzburg, Josef-Schneider-Straße 2, 97080 Würzburg, Germany

* Author to whom correspondence should be addressed.

† The first two authors should be regarded as joint first authors.

KEYWORDS Sequence Databases, Proteomics Informed by Transcriptomics, RNA-seq, Mass Spectrometry

1 ACS Paragon Plus Environment


ABSTRACT

RNA sequencing is a powerful method to build reference transcriptome assemblies and, eventually, sample-specific protein databases for mass spectrometry-based analyses. This novel proteomics informed by transcriptomics (PIT) workflow improves sample-specific proteome characterization of dynamic and, especially, non-model organism proteomes, and moreover helps to identify novel gene products. With the increasing popularity of such proteogenomics applications, a growing number of researchers demand high-quality but resource-friendly and easy-to-use analysis strategies. Most PIT applications so far rely on the initially introduced Trinity de novo assembly tool. To help potential users get started with PIT, we compared the main performance criteria of Trinity and alternative RNA assembly tools known from the transcriptomics field, including Oases, SOAPdenovo-Trans, and Trans-ABySS. Using exemplary data sets and software-specific default parameters, Trinity and the alternative assemblers produced comparable, high-quality reference data for vertebrate transcriptomes/proteomes of varying complexity. However, Trinity required large computational resources and time. We found that alternative de novo assemblers, in particular SOAPdenovo-Trans but also Oases and Trans-ABySS, rapidly produced protein databases with far lower computational requirements. By choosing among various RNA assembly tools, proteomics researchers who are new to transcriptome assembly or who plan projects with high sample numbers can benefit from alternative approaches to apply PIT efficiently.

Introduction

Sequence data resources are major components for the identification of proteins by mass spectrometry, in which experimentally derived spectra are matched with theoretical spectra inferred from in silico digests of protein sequences listed in databases 1. Thus, the quality of the protein database has a strong impact on the outcome of a proteomics study. Notably, many existing protein databases that rely on genome sequences or on computational prediction of genes/proteins do not exactly reflect the expressed proteins. Moreover, for many organisms no reference genomes have been generated, as this is still a non-trivial process 2. For application in proteomics, a reasonable compromise between completeness and effort might be to restrict the reference to the actually active (i.e. protein-coding) part of the genome 3,4. This seems particularly advantageous for analyzing dynamic proteomes (as in developmental biology or in host-pathogen studies) 5 and for complex proteomes such as (microbial) metaproteomes 6,7.

Recently, proteomics informed by transcriptomics (PIT) was introduced as a powerful proteogenomics approach that uses RNA sequencing data and the Trinity RNA assembly tool to generate sample-specific protein databases without a reference genome 8. Shortly after, user-friendly integrated computational PIT pipelines based on the Galaxy platform, such as GIO (Galaxy Integrated Omics), were developed 9,10. However, de novo reconstruction of a set of transcripts from short to medium-length sequences is computationally challenging (common next-generation sequencing platforms produce reads of ~150 nucleotides or fewer). For example, uneven coverage, distinct splice forms and overlapping genes from opposite strands can be difficult to analyze 11. Different solutions to assemble transcriptomes de novo became popular during the last few years, and versatile RNA assembly tools such as Trinity 12, Oases 13, SOAPdenovo-Trans 14 and Trans-ABySS 15 were developed and validated for various applications in genome research. These tools allow for building a reference transcriptome assembly to generate sample-specific protein databases for mass spectrometry-based analyses.

As the protein sequence database has a direct impact on peptide and protein identification by PIT, a high-quality transcriptome assembly is required. To make potential users of the emerging PIT approach aware of useful and proven alternatives for RNA assembly, we here aimed to show that the influence of varying assemblies on peptide spectrum matching and false discovery rates (FDR) seems negligible, whereas practical issues such as computational workload (required random access memory (RAM) and central processing units (CPUs)) can vary significantly from tool to tool.

For this analysis, we used publicly available reference sequences and published combined data sets including RNA sequencing and mass spectrometry-based protein data. These still rare data sets were derived from higher vertebrate organisms of different biological origin and featured varying sequence complexity. Specifically, for benchmarking we included data from previous key PIT studies, including Chinese hamster ovary (CHO) cells (Chinese hamster, Cricetulus griseus) 8, Jurkat cells (immortalized human T-lymphocytes, Homo sapiens) 16 and spleen tissue of zebrafish (Danio rerio) 17.

This letter primarily aims to make the (new) user of the PIT workflow aware of efficient existing alternatives for the critical RNA assembly step. Our rough comparative analysis should allow for a quick assessment of the potential pros and cons of different approaches for generating transcriptome-based reference sequences for proteomics applications. Moreover, this letter may motivate users of PIT to extend the evaluation and to test RNA assembly tools more deeply, so as to cater to the specific needs and resources of their own studies.

Experimental Procedures

Transcriptome assembly, protein sequence database construction and evaluation

The reads obtained from three publicly available mRNA sequencing datasets (Table S1) were subjected to protein sequence database construction as described by Evans et al. (8). Transcriptome de novo assemblies from raw sequencing reads were built using the above-introduced RNA assemblers Trinity (version r20140717), Oases (Velvet version 1.2.10, Oases version 0.2.09), SOAPdenovo-Trans (version 1.03) and Trans-ABySS (version 1.5.3), applying software-specific default parameters. All assemblers were run three times per dataset for performance assessments. To enable a comparative analysis, a fixed k-mer value of 25 was specified for all assemblers. The number of threads was set to 8 for assemblers with multithreading support. In cases in which more than a single FASTQ file of raw sequencing reads was obtained due to replicates and multiple screened conditions, reads were concatenated into a single file (single-read sequencing) or into two files (paired-end sequencing, one file for forward reads and one for reverse reads) prior to assembly. From the generated transcripts, open reading frames (ORFs) between start and stop codons that were longer than 200 nucleotides were extracted from all six frames and translated into amino acid sequences using the EMBOSS tool "getorf" 18.

To assess the potential influence of the protein sequence databases derived from different assemblers on peptide matching, the transcripts and predicted ORFs were compared to the reference transcriptomes 19 and proteomes 20 of the respective datasets, namely Chinese hamster (Cricetulus griseus), human (Homo sapiens) and zebrafish (Danio rerio; see Table S1). Additionally, for further benchmarking we included reference sets from related species such as mouse, chimp and Mexican tetra in the analysis (Figure 1C, Table S1). To infer the identity of transcripts and ORFs, blastn and blastp of NCBI's BLAST service were used against the reference sequences (http://blast.ncbi.nlm.nih.gov/Blast.cgi). For each query sequence only the top hit was considered in further analyses. Mass spectrometry data were processed and protein sequence database searching was performed as described below.

The database search space for each flat file database was extrapolated after database searching as described recently 21. The databases were digested in silico using the EMBOSS tool "pepdigest" 18, allowing only for fully tryptic peptides and considering the proline-rule. The resulting peptide sequences were filtered, and only those were retained that contained a minimum of 6 amino acids and fell into the mass detection window of 600-4,000 Da. Either all masses of the measured and identified distinct peptide sequences with no missed cleavages were taken into account, or a randomly sampled subset of 10,000 was considered (when more than 10,000 peptide sequences could be identified). For each mass, the number of theoretical peptide candidates in the in silico digested protein sequence database was determined with a mass tolerance of +/- 10 ppm.
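The search-space estimate described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not call EMBOSS "pepdigest"; it only mirrors the stated filters (fully tryptic peptides, at least 6 amino acids, 600-4,000 Da, +/- 10 ppm). Several points are assumptions made for illustration: the function names are hypothetical, the proline-rule is read in its usual sense (no cleavage when K/R is followed by P, switchable through `cleave_before_proline`), peptides are treated as unmodified, and identical peptide sequences are counted once.

```python
import re
from bisect import bisect_left, bisect_right

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added once per peptide


def tryptic_peptides(protein, cleave_before_proline=False):
    """Fully tryptic peptides without missed cleavages (cut after K or R).

    Assumption: by default no cut is made when the next residue is P.
    """
    pattern = r"(?<=[KR])" if cleave_before_proline else r"(?<=[KR])(?!P)"
    return [pep for pep in re.split(pattern, protein) if pep]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide; None for non-standard residues."""
    try:
        return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    except KeyError:
        return None


def digest_database(proteins, min_len=6, min_mass=600.0, max_mass=4000.0):
    """Sorted masses of distinct peptides that pass the length and mass filters."""
    peptides = set()
    for protein in proteins:
        for pep in tryptic_peptides(protein):
            if len(pep) >= min_len:
                peptides.add(pep)
    masses = []
    for pep in peptides:
        mass = peptide_mass(pep)
        if mass is not None and min_mass <= mass <= max_mass:
            masses.append(mass)
    masses.sort()
    return masses


def candidates_within_ppm(sorted_masses, query_mass, ppm=10.0):
    """Number of database peptides within +/- ppm of a measured peptide mass."""
    delta = query_mass * ppm * 1e-6
    lo = bisect_left(sorted_masses, query_mass - delta)
    hi = bisect_right(sorted_masses, query_mass + delta)
    return hi - lo
```

Sorting the digested masses once and using binary search keeps each lookup logarithmic, which matters when up to 10,000 measured masses are compared against a database of several hundred thousand ORFs.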


As was shown recently 8, comparison on the level of protein ambiguity groups (PAGs, comprising the minimal list of proteins described by the identified peptides) is difficult to achieve with the grouping approaches that are applied to solve the so-called protein inference problem 22. However, given the length of most ORFs (Figure 1D), lists obtained by protein inference were only slightly disturbed by the fragmented nature of RNA-seq de novo assemblies. Thus, interpretation in terms of a minimal set of ORFs rather than proteins seems advisable. In all cases, searching homologous protein sequences from a related species performed worse, whereas the transcriptome-derived data provided a valuable approach when no reference sequences were available. Notably, in cases where the reference protein sequence databases were of lower quality, as for the Chinese hamster, the PIT workflow outperformed the conventional analyses in terms of peptide spectrum matches and identified peptides. Moreover, there was no indication of an increased FDR, in contrast to searches using homologous reference sequences.

Evaluation of evidence for global protein expression from discovery proteomics data

Raw files from public repositories (Table S1) were processed and analyzed using the MaxQuant software 23, version 1.3.0.5, and the search engine Andromeda 24. Cysteine carbamidomethylation was selected as a fixed modification, and methionine oxidation and N-terminal acetylation as variable modifications. Trypsin digest including cleavage N-terminal to proline (proline-rule) and a maximum of two missed cleavages were selected as enzyme-specific parameters. A minimum peptide length of 6 amino acids was set in the analysis of public datasets, and no second peptide was included. Generated peak lists were searched against each protein sequence database individually with a precursor mass tolerance of 10 ppm and a fragment mass tolerance of 0.5 Da. In all searches, the peptide and protein FDR was controlled and limited to 1% by applying the conventional target-decoy search strategy.
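For readers unfamiliar with the target-decoy strategy, the following Python sketch shows the generic idea of choosing a score cutoff at 1% FDR. It is a simplified illustration, not MaxQuant's implementation (which ranks matches by posterior error probability); the function name and the input format of `(score, is_decoy)` pairs are assumptions, and the FDR is estimated as the simple ratio of decoy to target hits above the cutoff.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Lowest score cutoff at which the estimated FDR stays within max_fdr.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    Returns None if no cutoff satisfies the FDR limit.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all matches scoring at least `score`.
        if targets > 0 and decoys / targets <= max_fdr:
            threshold = score
    return threshold
```

All target matches scoring at or above the returned cutoff would be accepted. Tied scores and protein-level FDR control are deliberately left out of this sketch.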


Results and Discussion

The datasets were chosen to represent i) different complexities (47,798 RefSeq mRNA sequences from 25,865 annotated protein-coding genes for zebrafish; 61,224 sequences from 21,893 genes for Chinese hamster; and 72,284 sequences from 20,276 genes for human; see Table S2) and ii) varying sequencing modes (5.7 gigabases of 36-nucleotide single reads for Chinese hamster; 7 gigabases of 50-nucleotide paired-end reads for zebrafish; and 22.9 gigabases of 100-nucleotide paired-end reads for human; see Table S1). For each of the three datasets (Chinese hamster, human and zebrafish), raw reads from multiple sequencing runs were merged. A single transcriptome was then reconstructed with each of the assemblers Trinity, Oases, SOAPdenovo-Trans and Trans-ABySS. Transcripts were translated into protein sequences by searching for open reading frames (ORFs) in all six frames according to the standard genetic code.

Notably, in all cases more ORFs were produced than protein sequences were available in the corresponding UniProtKB reference proteomes (Table 1). A similar trend was observed on the nucleotide level (Table S2). For example, from the Trinity assembly of the human dataset we could infer 354,591 transcripts and 314,192 ORFs. These numbers are 4.9 and 4.6 times higher, respectively, than the numbers of sequences in the corresponding references. The majority (up to 95.1%) of the de novo derived sequences aligned to the reference sequences. Notably, 260,119 of 314,192 ORFs (82.8%) could be assigned to 44,974 of 68,049 protein sequences of the human reference proteome, indicating that multiple ORFs aligned to the same protein sequence (in this example 5.8 on average). With all assemblers, two major populations of ORFs were observed in all datasets: ORFs that were close to full length and ORFs that were fragmented (Figure 1A).

Using the chosen program-specific default parameters, Trinity provided consistent and robust results across all datasets. The other assemblers, Oases, SOAPdenovo-Trans and Trans-ABySS, produced broadly similar outputs with slight variation. Oases tended to produce the highest number of ORFs, but these were rather fragmented, whereas the SOAPdenovo-Trans assemblies tended to contain fewer but more complete ORFs (Table 1).
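The six-frame ORF extraction that produces these ORF sets can be sketched in plain Python as follows. This is an approximation for illustration only and not a reimplementation of EMBOSS "getorf": the function names are hypothetical, an ORF is taken to run from the first ATG to the next in-frame stop codon, the 200-nucleotide minimum is counted without the stop codon, and ORFs that run off the end of a transcript are dropped. All of these details are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def orfs_in_frame(seq, min_nt=200):
    """Yield proteins of ATG...stop ORFs longer than min_nt nucleotides in frame 0."""
    start = None
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon == "ATG":
                start = i
        elif CODON_TABLE.get(codon) == "*":
            if i - start > min_nt:
                yield translate(seq[start:i])
            start = None


def six_frame_orfs(transcript, min_nt=200):
    """Protein sequences of all qualifying ORFs on both strands of a transcript."""
    seq = transcript.upper().replace("U", "T")
    proteins = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            proteins.extend(orfs_in_frame(strand[offset:], min_nt))
    return proteins
```

Because every transcript is scanned in six frames and fragmented transcripts yield several partial ORFs per gene, a procedure of this kind naturally produces more ORFs than there are reference protein sequences.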


Chinese hamster dataset          Reference        Related organism Trinity  Oases    SOAPdenovo  TransABySS
                                 Chinese hamster  Mouse
Sequences [counts]               23884            44455            51461    74482    48550       67192
Mean length [aa]                 368.3            447.9            157.6    157.5    152.8       159.2
Aligned [counts]                 23884            43796            48759    70564    46160       63687
Aligned [%]                      100              98.5             94.7     94.7     95.1        94.8
Aligned to [counts]              23884            15218            16951    16854    16474       17172
ORFs aligned per seq. [counts]   1                2.9              2.9      4.2      2.8         3.7
ORFs evidenced [count]           5479             8987             5145     9143     5290        7548
Novel ORFs evidenced [%]         -                3                4.3      3.3      4.3         3.6

Human dataset                    Reference        Related organism Trinity  Oases    SOAPdenovo  TransABySS
                                 Human            Chimp
Sequences [counts]               68049            20127            314192   486488   244593      263510
Mean length [aa]                 333.7            528              136.4    129.5    123.3       116.9
Aligned [counts]                 68049            20061            260119   405999   196968      222706
Aligned [%]                      100              99.7             82.8     83.5     80.5        84.5
Aligned to [counts]              68049            18545            44974    44146    43778       46295
ORFs aligned per seq. [counts]   1                1.1              5.8      9.2      4.5         4.8
ORFs evidenced [count]           26858            8077             23510    45095    17972       21364
Novel ORFs evidenced [%]         -                3.8              3.1      1.5      2.7         2.6

Zebrafish dataset                Reference        Related organism Trinity  Oases    SOAPdenovo  TransABySS
                                 Zebrafish        Mexican tetra
Sequences [counts]               41064            23667            57814    70689    113124      72752
Mean length [aa]                 501.3            531.9            142.4    148.3    136.7       149.3
Aligned [counts]                 41064            23550            52953    64015    62739       66341
Aligned [%]                      100              99.5             91.6     90.6     55.5        91.2
Aligned to [counts]              41064            18900            24350    24315    25543       25715
ORFs aligned per seq. [counts]   1                1.2              2.2      2.6      2.5         2.6
ORFs evidenced [count]           1846             757              1134     1777     1345        1518
Novel ORFs evidenced [%]         -                10.8             9.2      5.6      8.3         6.9

Table 1: Characteristics of UniProtKB reference proteomes and open reading frames (ORFs) predicted from RNA sequencing data. Transcripts from three different short read RNA sequencing datasets (Chinese hamster, Human and Zebrafish) were reconstructed using Trinity, Oases, SOAPdenovo-Trans or Trans-ABySS de novo transcriptome assemblers. ORFs were extracted and translated into amino acid sequences. By aligning predicted ORFs against the respective reference proteome using blastp the identity of the sequences could be determined.
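As a small worked check, the two derived columns of Table 1 ("Aligned [%]" and "ORFs aligned per seq.") follow directly from the three count columns. The Python sketch below recomputes them for the human dataset; the dictionary layout and the function name are illustrative choices, while the counts are copied from Table 1.

```python
# Counts from Table 1, human dataset: (sequences, aligned, aligned to).
HUMAN_COUNTS = {
    "Trinity": (314192, 260119, 44974),
    "Oases": (486488, 405999, 44146),
    "SOAPdenovo-Trans": (244593, 196968, 43778),
    "Trans-ABySS": (263510, 222706, 46295),
}


def derived_columns(sequences, aligned, aligned_to):
    """Return (aligned percentage, ORFs aligned per reference sequence)."""
    aligned_percent = round(100 * aligned / sequences, 1)
    orfs_per_reference = round(aligned / aligned_to, 1)
    return aligned_percent, orfs_per_reference


for assembler, counts in HUMAN_COUNTS.items():
    percent, per_reference = derived_columns(*counts)
    print(f"{assembler}: {percent}% aligned, {per_reference} ORFs per reference sequence")
```

For Trinity this gives 82.8% and 5.8, the values quoted in the text; the Oases assembly reaches 9.2 ORFs per reference sequence, which reflects its higher fragmentation.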


We further analyzed high-resolution mass spectral data sets (Table S1), for which searches were performed in separate runs against the reference protein sequence databases or those constructed de novo in this study (Table 1). Posterior error probabilities (PEP) were calculated based on peptide length-dependent score distributions of hits in the forward and reverse sections of the database. As each analysis was based on a background with unique properties (Table 1), we checked for potential inflation of false identifications caused by violation of the basic target-decoy strategy (potentially increased numbers of erroneous protein sequences in the databases). For instance, the curated UniProtKB reference proteome of Chinese hamster included 23,884 entries, whereas the transcriptome assemblies of CHO cells comprised up to 74,482 ORFs (Table 1). However, a larger protein sequence database did not necessarily result in a larger search space in peptide spectrum matching (Figure 1B). Here, the search space was defined as the number of theoretical peptides of the in silico tryptic digested database with the same mass (+/- 10 ppm) as an actually measured peptide. The fraction of identified spectra ranged from 14.8% up to 38.45% (Table S1 and Figure 1C). Compared to the respective reference proteomes, the use of de novo constructed ORFs led to an increase of identifications for the Chinese hamster dataset, a minor decrease for the human dataset, and a moderate decrease for the zebrafish dataset (Figure 1C). The largest drops in sensitivity occurred when using protein sequences of a related species (from mouse instead of Chinese hamster, from chimp instead of human, and from Mexican tetra instead of zebrafish) or in cases of extraordinarily high search space (i.e. the Oases assembly of the human dataset).
These results indicate that all de novo assembly tools provided useful data for PIT and, not surprisingly, outperformed the use of homologous reference sequences from related species. We used the results derived from querying UniProtKB reference proteomes to define a standard in which identifications are subject to a 1% FDR. Analyses of the same dataset using the different (assembly) approaches were compared against this standard on the level of individual tandem mass spectra and the highest-scoring spectrum match for each peptide (Figure 1C, see red fraction of the bars and numbers above). The proportion of conflicting results was lower than or equal to 1% for the transcriptome-based analyses and (slightly) increased when using reference proteomes of evolutionarily related species. Notably, a number of novel peptide matches could be observed (Figure 1C, grey fraction of the bars), in particular for the Chinese hamster dataset, which is in line with recent findings 8. However, in general a high level of congruence in the annotation of spectra with peptide sequences was observed (Figure 1C, black fraction of the bars). The sequences from which the identified peptides were derived aligned in many cases over nearly the full length of the respective reference sequence (Figure 1D). The population of predicted ORFs that were fragmented in length was small, and no evidence for their expression on the protein level could be found (Figure 1A). Thus, omitting fragmented ORFs from the analysis may represent a reasonable and resource-friendly strategy with minor influence on the final outcome.
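The spectrum-level comparison against the reference standard can be sketched as follows. This is a hypothetical illustration of the shared/novel/conflicting classification shown in Figure 1C, not the authors' code. It assumes that each search result is available as a mapping from a spectrum identifier to the peptide sequence identified at 1% FDR, and that peptides are compared by exact string equality (so isobaric I/L differences would count as conflicts).

```python
def compare_identifications(reference, test):
    """Classify identifications of a test database against the reference search.

    reference, test: dict mapping spectrum id -> identified peptide sequence.
    shared: same peptide for the same spectrum; conflicting: different peptide
    for the same spectrum; novel: spectrum identified only with the test database.
    """
    counts = {"shared": 0, "novel": 0, "conflicting": 0}
    for spectrum, peptide in test.items():
        reference_peptide = reference.get(spectrum)
        if reference_peptide is None:
            counts["novel"] += 1
        elif reference_peptide == peptide:
            counts["shared"] += 1
        else:
            counts["conflicting"] += 1
    total = counts["shared"] + counts["novel"] + counts["conflicting"]
    # Percentage of conflicting versus total identifications.
    counts["conflicting_percent"] = 100 * counts["conflicting"] / total if total else 0.0
    return counts
```

A conflicting percentage at or below the 1% FDR of the reference search, as reported for the transcriptome-based databases, suggests that the de novo databases do not inflate false identifications beyond the expected level.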

Figure 1: Comparative evaluation of the performance of RNA assemblers Trinity, Oases, SOAPdenovo-Trans (here short SOAPdenovo) and Trans-ABySS for proteomics informed by transcriptomics (PIT). A. Density distribution of relative length of open reading frames (ORF) predicted from RNA sequencing data using the four different transcriptome assemblers. ORF sequences were aligned to the UniProtKB reference proteome and relative length was calculated as ratio of query (ORF) and target (UniProtKB reference sequence) length. B. Box plot of peptide candidates per precursor ion within +/-10 ppm mass tolerance as estimation for the search space in peptide spectrum matching. C. Bar plot shows tandem mass spectra (left bar) and distinct peptide sequences (right bar) as identified by MaxQuant analysis at 1% FDR using differently generated protein sequence databases in target-decoy searching of publicly available discovery proteomics datasets. Identifications were compared to those obtained from searching the UniProtKB reference proteome and proportions indicate shared (black, coinciding), novel (grey) and differential (red, conflicting) results. Additionally, red numbers indicate percentage of conflicting versus total identifications. D. Density distribution of relative length of the subset of ORFs as depicted in A of which evidence on protein level (C) was found.

Our analyses focused on the performance of mass spectrometry-based protein detection, which all four tested assemblers seemed to support sufficiently. Estimating the search space can help the user to detect poor-quality assemblies and guide the decision to test software-specific parameters (e.g. k-mers) and to identify the most suitable assembly tools.

In summary, all four de novo assembly approaches (using Trinity, Oases, SOAPdenovo-Trans and Trans-ABySS) produced useful results of similar quality for peptide identification within the PIT workflow. Trinity appeared to provide robust results across different datasets and is an established, validated tool for PIT 25. On the other hand, this assembler seems to be the most computationally expensive in terms of memory (RAM) usage, CPU usage and runtime. This problem may be negligible for organisms with genomes/proteomes of low complexity (e.g. yeast). However, for medically and biologically more relevant complex genomes/proteomes (e.g. human), the Trinity assembly required more than 6 hours for the human dataset applied here (Figure 2A-C and Figure S1).

In contrast, SOAPdenovo-Trans, Trans-ABySS and Oases were much more resource-friendly in terms of CPU time (Figure 2B) and run time (Figure 2C), and produced solid results for complex human samples in a few minutes (SOAPdenovo-Trans) to up to 2 hours (Trans-ABySS). In terms of RAM usage, SOAPdenovo-Trans and Trans-ABySS were least demanding and could be run on desktop computers instead of server systems, a big advantage for laboratories with limited IT resources. For larger RNA-seq data sets, we even found that Trinity and Oases, but not SOAPdenovo-Trans and Trans-ABySS, were unable to produce assemblies because server RAM capacities were exceeded.
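Readers who want to collect comparable figures on their own hardware could wrap an assembler call as in the Python sketch below. How the measurements in Figure 2 were actually obtained is not stated here, so this is only one possible approach under stated assumptions: it uses the standard-library `resource` module (available on Unix-like systems only), the function name `measure` is hypothetical, and the command to run is supplied by the user.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run a command and report wall time, user CPU time and peak memory.

    command: list of program and arguments, e.g. an assembler invocation.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    started = time.perf_counter()
    completed = subprocess.run(command, check=False)
    wall_seconds = time.perf_counter() - started
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall_seconds,
        # Time the child spent executing code in user space, summed over threads.
        "user_cpu_seconds": after.ru_utime - before.ru_utime,
        # High-water mark over all child processes waited for so far;
        # reported in kilobytes on Linux and in bytes on macOS.
        "peak_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Placeholder workload; replace with the assembler command of interest.
    print(measure([sys.executable, "-c", "sum(range(10**6))"]))
```

With multithreaded assemblers, user CPU time can exceed wall time, which is why both CPU time and run time are worth reporting separately.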

Figure 2: Evaluation of different hardware requirements and usability for Trinity, Oases, SOAPdenovo-Trans and Trans-ABySS de novo transcriptome assemblers. Assemblers were run three times (error bars indicate standard deviations) on the complete datasets from Chinese hamster ovary (CHO) cells (Chinese hamster, Cricetulus griseus, Hamster) and from spleen tissue of zebrafish (Danio rerio, Zebrafish) and on a subset of the dataset from Jurkat cells (immortalized line of human T-lymphocytes, Homo sapiens, Human). A. Maximum random-access memory (RAM) usage in gigabytes. Average RAM usage is indicated by +. B. Central processing unit (CPU) time, measured as the amount of time (hours) the CPUs were busy executing code in user space. C. Run time (hours) that was required to complete the full assembly task. D. Comparison of online documentation, installability, usability, ability of parallel computing and general computational requirements. This evaluation is in part subjective but may give users of RNA de novo assembly tools an idea of how to get started.

Moreover, we found remarkable differences in the general ease of use (documentation, installability and usability) of the tested de novo assemblers (Figure 2D). This may be especially important for non-experts in RNA de novo assembly and for wide dissemination of the PIT approach. Among the compared assemblers, SOAPdenovo-Trans stood out in terms of a) simple and clear documentation, b) easy installability due to available precompiled binaries, c) intuitive usability, d) good implementation of multithreading (parallel processing with many CPUs) and e) low computational requirements. Clearly, users of PIT and proteogenomics applications are advised to follow the highly dynamic field of functional genomics and are invited to test further assembly tools.

In summary, Trinity has so far been the main RNA assembly tool for application in the PIT workflow, as this assembler produces reliable results. Nevertheless, computationally less demanding, simpler and faster RNA assemblers such as SOAPdenovo-Trans and others are available, which we recommend considering as valuable alternatives for PIT protocols and for integration in web interface tools such as GIO based on Galaxy (see e.g. http://gio.sbcs.qmul.ac.uk/). Thus, further progress in resource-friendly and easy-to-implement assembly strategies will reduce the need for extensive computational infrastructure and thereby probably extend the use of PIT and proteogenomics applications to a growing number of scientists.


ASSOCIATED CONTENT

Supporting Information
Supplementary Information: Additional information on computational performance of assemblers and data sets. Supplementary Figure 1: Assessment of random-access memory (RAM) usage in gigabytes over time (minutes) for Trinity, Oases, SOAPdenovo-Trans and Trans-ABySS de novo transcriptome assemblers; Supplementary Table 1: Public datasets used for in silico analyses; Supplementary Table 2: Characteristics of RefSeq transcriptomes and transcripts predicted from RNA sequencing data. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author
Author to whom correspondence should be addressed. E-mail: [email protected]

Present Address
Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Straße 10, 13125 Berlin, Germany

Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Funding Sources
German Ministry for Education and Research [BMBF, 0315082 and 01EA1303] and the European Commission [FP7/2007–2013 under grant agreement 262055 ESGI].

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

We thank Sven Klages for helping with software setup and maintenance.

ABBREVIATIONS

The abbreviations used are: FDR, false discovery rate; UniProtKB, UniProt Knowledgebase; RAM, random-access memory; CPU, central processing unit.

