Tools to Covisualize and Coanalyze Proteomic Data with Genomes

Article pubs.acs.org/jpr

Tools to Covisualize and Coanalyze Proteomic Data with Genomes and Transcriptomes: Validation of Genes and Alternative mRNA Splicing

Chi Nam Ignatius Pang,†,‡ Aidan P. Tay,†,‡ Carlos Aya,§ Natalie A. Twine,†,‡ Linda Harkness,∥ Gene Hart-Smith,†,‡ Samantha Z. Chia,† Zhiliang Chen,† Nandan P. Deshpande,†,‡ Nadeem O. Kaakoush,‡ Hazel M. Mitchell,‡ Moustapha Kassem,∥ and Marc R. Wilkins*,†,‡

† Systems Biology Initiative, The University of New South Wales, Sydney, New South Wales 2052, Australia
‡ School of Biotechnology and Biomolecular Sciences, The University of New South Wales, Sydney, New South Wales 2052, Australia
§ Intersect Australia Limited, Sydney, New South Wales 2000, Australia
∥ Endocrine Research Laboratory (KMEB), Department of Endocrinology and Metabolism, Odense University Hospital & University of Southern Denmark, Odense 5230, Denmark

Supporting Information

ABSTRACT: Direct links between proteomic and genomic/transcriptomic data are not frequently made, partly because of a lack of appropriate bioinformatics tools. To help address this, we have developed the PG Nexus pipeline. The PG Nexus allows users to covisualize peptides in the context of genomes or genomic contigs, along with RNA-seq reads. This is done in the Integrative Genomics Viewer (IGV). A Results Analyzer reports the precise base position where LC−MS/MS-derived peptides cover genes or gene isoforms, on the chromosomes or contigs where this occurs. In prokaryotes, the PG Nexus pipeline facilitates the validation of genes, where annotation or gene prediction is available, or the discovery of genes using a “virtual protein”-based unbiased approach. We illustrate this with a comprehensive proteogenomics analysis of two strains of Campylobacter concisus. For higher eukaryotes, the PG Nexus facilitates gene validation and supports the identification of mRNA splice junction boundaries and splice variants that are protein-coding. This is illustrated with an analysis of splice junctions covered by human phosphopeptides, and other examples of relevance to the Chromosome-Centric Human Proteome Project. The PG Nexus is open-source and available from https://github.com/IntersectAustralia/ap11_Samifier. It has been integrated into Galaxy and made available in the Galaxy tool shed.

KEYWORDS: visualization, data integration, RNA-seq, alternative splicing, splice-junction peptides



Special Issue: Chromosome-centric Human Proteome Project
Received: August 9, 2013
Published: October 23, 2013
© 2013 American Chemical Society

Journal of Proteome Research
dx.doi.org/10.1021/pr400820p | J. Proteome Res. 2014, 13, 84−98

INTRODUCTION

Proteomic, transcriptomic, and genomic analyses are complementary. Proteomic analyses are underpinned by access to complete and annotated genomes, to facilitate protein identification. Yet proteomic data can also be used for genome annotation,¹ identification of novel genes,²,³ and validation of alternatively spliced mRNA isoforms.⁴ The attainment of high sequence coverage of a genome is becoming routine with next-generation sequencing; nevertheless, it can be difficult to assess the quality of de novo sequence assemblies.⁵ Proteomic analyses are now of sufficient scale to offer a large-scale means to help validate new genomes, even though it remains difficult to get complete coverage of the proteome.⁶

The annotation of newly assembled bacterial genomes involves the use of gene prediction algorithms, but these methods cannot reliably predict all genes.⁷ A common issue for gene prediction algorithms is that they have difficulty predicting short genes of less than 400 nucleotides.⁸ Proteogenomic analysis, which involves comparing proteomics data to six-frame translation of the bacterial genome, can be used to validate existing genes and discover novel genes that are not identified by machine learning-based approaches or homology searches.⁹⁻¹³ Identification of peptide hits through a proteogenomics approach has been shown to accurately predict start and stop codons of genes, validate overlapping genes, find evidence to locate short proteins, and annotate proteins with programmed ribosomal frame-shifting.³,⁹,¹⁴,¹⁵ Consequently, proteogenomics is an excellent method for validating putative ORFs in bacterial genomes. While existing tools such as VESPA¹⁰ and Peppy¹⁶ can perform proteogenomic analyses, there is a lack of tools that “plug in” to widely used genomic viewers such as the Integrative Genomics Viewer.¹⁷ Such tools could facilitate systematic and exploratory analysis of proteomics data in the context of genome assemblies and, where appropriate, any underlying next-generation sequencing reads.

In eukaryotes, alternative splicing is important for diversifying the sequence of mRNA, the function of proteins, and the fine-tuning of their roles in different types of tissues.¹⁸⁻²¹ Next-generation sequencing of vertebrate transcriptomes has shown that the majority of multiexon genes are transcribed into alternatively spliced isoforms.²²,²³ In humans, most genes are expressed as 10−12 isoforms per cell type, with at least two major isoforms for three-quarters of the protein-coding genes.²² The question raised by these observations is whether these are all protein-coding. Proteomic analysis has shown that a surprisingly small percentage of the isoforms are translated into proteins and that most transcripts seem to lack protein-coding potential.²²,²⁴,²⁵ This highlights a need to confirm the translation of spliced transcripts using proteomics data (e.g., refs 4, 24, 26, 27). In fact, a key deliverable in the Chromosome-Centric Human Proteome Project (C-HPP) is to catalogue at least one alternative splice isoform for each gene. There are several tools that allow users to browse C-HPP data (e.g., refs 28 and 29), but there is a lack of tools for users to analyze their mass spectrometry data for evidence of proteins arising from alternatively spliced mRNA. Any such tools will also be applicable to eukaryotic genomes in which the annotation of alternative splicing is not extensive, for example, in the pathogenic fungus Aspergillus flavus³⁰,³¹ and the model plant Arabidopsis thaliana.³²

To allow the coanalysis of proteomic, genomic, and transcriptomic data, we have developed the Proteomic−Genomic Nexus (PG Nexus) pipeline. Its software components, the Samifier and a Results Analyzer, together enable the user to ask a number of different integrative questions. Its other components, the Virtual Protein Generator and Virtual Protein Merger, facilitate the analysis of unannotated bacterial genomes or contigs. We illustrate the use of the PG Nexus with case studies from the bacterium Campylobacter concisus and examples with data from human cells. We also highlight how the proteomic and nucleic acid data types can be coexplored in the Integrative Genomics Viewer (IGV).¹⁷

MATERIALS AND METHODS

Genomic and Proteomic Data Sets for Campylobacter concisus

The genomics data for C. concisus (strain BAA-1457) was downloaded from NCBI³³ (accession number NC_009802.1). Similarly, the contigs for the C. concisus UNSWCD strain were downloaded from NCBI (accession numbers AENQ01000001 to AENQ01000086). The genomics data include the .GFF file containing the genome coordinates of all the genes and their corresponding protein sequences. Additional information on the locus name and NCBI accession numbers for C. concisus strain BAA-1457 and strain UNSWCD was downloaded from Uniprot³⁴ (version 2013_01 and version 2013_4, respectively). The Virtual Protein Generator requires the user to supply the bacterial genetic code, which was obtained from NCBI.³⁵ The mass spectrometry data from Deshpande et al. (2011)¹¹ was used for the proteomics data analysis of both C. concisus strain BAA-1457 and UNSWCD. The data has been deposited to the ProteomeXchange with identifiers PXD000504 and PXD000505 for strains BAA-1457 and UNSWCD, respectively. The MS/MS ions searches were performed using Mascot with the following parameters: trypsin enzyme, variable modifications of Carbamidomethyl and Oxidation, peptide mass tolerance of ±4 ppm, fragment mass tolerance of ±0.4 Da, maximum of 1 missed cleavage, and the electrospray ion trap (ESI-TRAP) instrument type.

The proteomics data for each strain of C. concisus was analyzed using the MS/MS ions search tool against three different protein sequence databases. These included (1) the genes annotated in the NCBI RefSeq database,³³ (2) the genes predicted by Glimmer, and (3) the six-frame translation of the genome.

The first MS/MS ions search used the protein sequences from the NCBI RefSeq database³³ for the corresponding strain of C. concisus. The MS/MS ions search results were analyzed with the Samifier and Results Analyzer using the gene annotations from the .GFF file obtained from the NCBI RefSeq database.

The second protein sequence database was based on genes predicted by Glimmer (version 3.02)³⁶ from the genome sequence of each strain of C. concisus. On the basis of the predicted genes, the Virtual Protein Generator was used to create a gene annotation file in GFF3 format, a list of protein accession numbers, and a list of protein sequences in FASTA format. The FASTA file was used as a custom Mascot protein sequence database for the MS/MS ions search. The Samifier and Results Analyzer were subsequently used to validate genes using the Mascot search results, the .GFF file and accession files generated by Virtual Protein Merger, and the genome for the corresponding strain of C. concisus.

The third protein sequence database consisted of “virtual proteins”, which were generated by slicing the genome into segments of 900 nucleotides (300 amino acids) and six-frame translation of each segment. The length of 300 amino acids was chosen, as this was the average length of protein in the C. concisus BAA-1457 genome calculated from the Uniprot database (version 2013_1). This length was applied to both strains of the bacteria to facilitate cross-comparisons. The resulting sequence database was used for the MS/MS ions searches. The Virtual Protein Merger was used to reconstruct the genome coordinates of the putative open reading frames based on peptide matches to the virtual proteins, and the results were saved into a .GFF file. The Samifier and Results Analyzer were used to analyze the Mascot search results, using the accession file generated by Virtual Protein Generator, the .GFF file generated by Virtual Protein Merger, and the genome sequence.

Genomic, Transcriptomic, and Proteomic Data Sets for Saccharomyces cerevisiae

The Saccharomyces cerevisiae strain S288C genomics data (version R64-1-1) was obtained from the Saccharomyces Genome Database.³⁷ This includes information on the genome sequence and the genomic coordinates for each gene and the respective location of their exons and introns. Additional information such as the protein sequences, protein description, and sequence identifiers such as Ordered Locus Name, Uniprot accession number, and identifier were obtained from Uniprot (version 2012_12). The proteomics data were obtained from de Godoy et al. (2008),³⁸ which reported 4399 proteins in haploid and diploid yeasts. The data set for this publication was downloaded from Peptide Atlas³⁹ (accession numbers PAe000793 and PAe001277). The peptides that were included in the analysis were within a 1% false discovery rate and were not part of dubious open reading frames. These data were converted into Mascot results file (Mascot DAT) format using custom Perl scripts. The genome sequence, genomic coordinates for each gene, accession file containing the Ordered Locus Name for each protein, and the custom .DAT file were used as input files to run the Samifier and Results Analyzer (see Supporting Information 1).

Three biological replicates of Saccharomyces cerevisiae BY4741 wild-type cells were grown to midlog (OD600 ~0.8) in YEPD (2% w/v dextrose, 2% w/v peptone, 1% w/v yeast extract) at 30 °C. RNA was immediately extracted from the cells using TRIzol (Invitrogen). RNA integrity was confirmed with a 2100 Bioanalyzer (Agilent Technologies). For mRNA-Seq sample preparation, the Illumina TruSeq RNA Sample Prep Kit (version 2) was used according to the manufacturer’s instructions, with 2 μg of total RNA for each of the three biological replicate samples as input. The libraries were enriched using 15 cycles of PCR, and the insert size ranged from 80 to 330 bp. The libraries were sequenced using HiSeq 2000 with the TruSeq v3 SBS reagents to generate 100 base paired-end reads. Raw paired-end reads were trimmed using SolexaQA version 1.11⁴⁰ with the BWA trimming mode at a threshold of Q13 (P = 0.05). Low-quality 3′ ends of each read were removed. The SAM files were generated by aligning the reads against the S. cerevisiae S288C reference genome (version R64-1-1) with TopHat 2.0.4⁴¹ and Bowtie 2-2.0.0-beta7⁴² using default parameters.

Cell Culture and Protein Extraction of Human Mesenchymal Stem Cell Proteome

A telomerized bone marrow stromal (mesenchymal) cell line (hMSC-TERT, Simonsen et al. (2002))⁵⁰⁻⁵² was grown for 14 days in MEM medium (Invitrogen, Taastrup, Denmark) supplemented with 10% FBS (Sigma-Aldrich). Protein was extracted using RIPA buffer (Sigma-Aldrich) for 30 min on rotation at 4 °C. The lysate was then centrifuged at 12 000 rpm for 20 min, and the supernatant was quantified using the Bradford method. 25 μg of protein lysate was added onto a Novex protein Nu-Page 4−12% BIS-TRIS gel (Life Technologies, Taastrup, Denmark) and run at 200 V for 35 min. Subsequently, the gel was stained for 1 h in Biosafe Coomassie G-250 stain (Bio-Rad, Copenhagen, Denmark) and washed in distilled water for 30 min. Five lanes were run for the sample, and each lane was cut into 15 sections for analysis.

Proteolytic Digestions

Polyacrylamide gel slices were destained, reduced, and alkylated following the procedure described by Shevchenko et al. (1996).⁵³ For protein digestion, 40 ng of trypsin (Promega) in 120 μL of 0.1 M NH4HCO3 was used for each gel slice, and incubation was for 16 h at 37 °C. The digest solutions were removed to new microfuge tubes and the gel slices treated with the following solutions sequentially for 30 min each: 80 μL of 0.1% (v/v) formic acid/67% (v/v) acetonitrile then 80 μL of 100% acetonitrile. The pooled digest and peptide extraction solutions were then dried (Savant SPD1010, Thermofisher Scientific) before resuspending in 20 μL of 0.1% (v/v) formic acid.

Mass Spectrometry

Proteolytic peptide samples were separated by nano-LC using an UltiMate 3000 HPLC and autosampler system (Dionex, Amsterdam, Netherlands), and ionized using positive ion mode electrospray following experimental procedures described previously.⁵⁴ MS and MS/MS were performed using an LTQ Orbitrap Velos Pro (Thermo Electron, Bremen, Germany) hybrid linear ion trap and Orbitrap mass spectrometer. Survey scans m/z 350−2000 were acquired in the Orbitrap (resolution = 30 000 at m/z 400, with an initial accumulation target value of 1 000 000 ions in the linear ion trap; lock mass applied to polycyclodimethylsiloxane background ions of exact m/z 445.1200 and 429.0887). Up to the 15 most abundant ions (>5000 counts) with charge states of > +2 were sequentially isolated and fragmented via collision induced dissociation (CID) with an activation q = 0.25, an activation time of 30 ms, normalized collision energy of 30%, and at a target value of 10 000 ions. Fragment ions were mass analyzed in the linear ion trap. The raw mass spectra were analyzed with Mascot using the following parameters: trypsin enzyme, variable modifications of Carbamidomethyl and Oxidation, peptide mass tolerance of ±4 ppm, fragment mass tolerance of ±0.4 Da, maximum of 1 missed cleavage, and the electrospray ion trap (ESI-TRAP) instrument type. The ENSEMBL protein sequences (version 71) were used for the Mascot protein sequence database, and a peptide threshold score of greater than 60 was used to define high-confidence peptides.

Analysis of Human Phosphopeptides

The hg19 genomic sequence (version 71) was downloaded from the FTP server of ENSEMBL.⁴³ Only nonhaplotype chromosomes were used as input to the Samifier and Results Analyzer. Additionally, a GTF file of the same release version was downloaded from ENSEMBL and converted into a .GFF file through custom Perl scripts.⁴⁴ In relation to proteomics data, four human phosphoproteomics data sets⁴⁵⁻⁴⁸ were downloaded from the PHOSIDA database.⁴⁹ The phosphopeptides were of high confidence as they were identified at less than 1% false discovery rate. The data were converted into a Mascot results file (Mascot DAT format) using custom Perl scripts. The start and stop positions of each phosphopeptide and the phosphopeptide sequence were verified against protein-coding sequences in the ENSEMBL database, and phosphopeptides with incorrect sequences were removed from the data set. These files, along with an accession file containing the ENSEMBL identifications for each transcript, were used as input files to run Samifier and Results Analyzer.

Custom Perl scripts were used to generate in silico peptide sequences based on the phosphoproteins from PHOSIDA. These peptides were created based on a trypsin digest allowing up to 3 missed cleavages and ranged between 614 and 8064 Da (including phosphorylation). To map peptides to their corresponding locations in the genome, the in silico peptides were formatted into Mascot result files (Mascot DAT format) through custom Perl scripts and used, along with other previously described input files, in the Results Analyzer. In silico peptides and phosphopeptides that mapped to multiple genomic locations were removed from subsequent analyses.
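The in silico digestion described above was performed with custom Perl scripts that are not reproduced here. As a rough illustration only, the following Python sketch generates tryptic peptides with up to 3 missed cleavages and applies the 614−8064 Da mass window. The cleavage rule (after K or R, but not before P), the function names, and the monoisotopic mass table are assumptions introduced for this example and are not taken from the original scripts.

```python
import re

# Monoisotopic residue masses in Da (standard values, not from the article).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # added once per peptide
PHOSPHO = 79.96633  # HPO3, added once per phosphorylated residue


def digest(protein, max_missed=3):
    """Return (start, stop, sequence) tuples with 1-based inclusive positions."""
    # Assumed rule: cleave after K or R unless the next residue is P.
    cuts = [0] + [m.end() for m in re.finditer(r"[KR](?!P)", protein)]
    if cuts[-1] != len(protein):
        cuts.append(len(protein))
    peptides = []
    for i in range(len(cuts) - 1):
        # Joining k + 1 consecutive fragments corresponds to k missed cleavages.
        last = min(i + 1 + max_missed, len(cuts) - 1)
        for j in range(i + 1, last + 1):
            peptides.append((cuts[i] + 1, cuts[j], protein[cuts[i]:cuts[j]]))
    return peptides


def peptide_mass(sequence, n_phospho=0):
    """Monoisotopic mass of a peptide carrying n_phospho phosphate groups."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + n_phospho * PHOSPHO


def in_mass_window(sequence, n_phospho=0, low=614.0, high=8064.0):
    """Apply the 614-8064 Da window quoted in the text."""
    return low <= peptide_mass(sequence, n_phospho) <= high
```

Reporting start and stop positions alongside each sequence mirrors the way peptide positions are later checked against protein-coding sequences before genome mapping.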

Analysis with Samifier

The start and stop positions of the matched peptide sequences were verified against protein-coding sequences in the ENSEMBL database. These files, along with an accession file containing the ENSEMBL identifications for each transcript, were used as input files to run the Samifier.

RNA-seq Library Preparation and Sequencing

Total RNA was isolated using TRIzol (Invitrogen) as previously reported.⁵⁵ Following initial extraction, samples were additionally eluted using a GenElute mammalian total RNA miniprep kit (Sigma-Aldrich; according to manufacturer’s instructions) to achieve the highest purity RNA. 0.5 μg RNA for each sample was validated for its purity using a 1% agarose gel (Sigma-Aldrich) in Tris-acetate-EDTA (TAE; Invitrogen). For mRNA-Seq sample preparation, the Illumina TruSeq RNA Sample Prep Kit (version 2) was used according to the manufacturer’s instructions, with 2 μg of total RNA for each of the three biological replicate samples as input. The libraries were enriched using 15 cycles of PCR, and the insert size ranged from 80 to 330 bp. The libraries were sequenced in separate lanes using HiSeq 2000 with the TruSeq v3 SBS reagents to generate 100 base paired-end reads.

Processing of RNA-seq Data

RNA-seq data were analyzed as per Twine et al. (2013).⁵⁶ Briefly, sequencing reads for all samples were assessed for quality using FastQC (version 0.10.1).⁵⁷ All samples had an average Phred score of 30 or greater. A median of 99.92% (range: 83.31−99.97%) of reads passed quality filtering, across all samples. Filtered reads were then mapped against the human genome reference (UCSC hg19 build) using TopHat (version 2.04)⁵⁸ and options “--b2-sensitive, -G, --transcriptome-index, and -M”. A median of 87.14% (range: 84.23−89.56%) of reads were mapped, across all samples. Cufflinks (version 2.02)⁵⁹ was then used to build a transcript assembly for each sample, and Cuffmerge was used to combine all assemblies into a consensus assembly in GTF file format. Cufflinks was run with the following options “-L, -b, -u, -g, and -M”.⁶⁰ The “-L” option instructs Cufflinks to report novel alternatively spliced transcripts based on known exon information. The “-b” option enables bias detection and correction to improve accuracy of transcript abundance estimates. The “-u” option enables accurate weighting of reads that map to multiple locations in the genome. The “-g” option instructs Cufflinks to use the supplied .GTF annotation file to perform the reference-based transcriptome assembly. The “-M” option instructs Cufflinks to ignore all reads that are included in the mask file in .GTF format. This allows removal of highly abundant reads that can interfere with the assembly (e.g., rRNA and mitochondrial transcripts).

Integration of the PG Nexus into Galaxy

A graphical user interface of the PG Nexus pipeline was built and then made available online in Galaxy under the Galaxy test tool shed.⁶¹ The tool can be located using the name “Samifier” in the tool search field. The layout of the user interface for the Samifier tool is shown in Figure 1. Running the Samifier tool involves the following steps: (a) Input files can be uploaded directly onto Galaxy by clicking on the “Upload File” link and following the steps provided. The one exception is the genome sequence files, which must be packaged into a TAR archive before being uploaded onto Galaxy. (b) Users run the Samifier tool by selecting their files from the drop-down menus, choosing a peptide confidence threshold score, and choosing whether to output an error log file or the regions of interest in BED file format. (c) Once executed, the .SAM output file for Samifier will appear on the side panel of Galaxy for users to download and visualize in IGV. The interface and execution of the Results Analyzer inside Galaxy is similar (see Supporting Information 2).

Figure 1. Graphical user interface of Samifier in Galaxy. (a) A hyperlink to an interface to upload the input files. (b) Users can then run the Samifier tool by selecting their files from the drop-down menus, defining a peptide confidence threshold score, and choosing whether to output a log file or a list of regions of interest in .BED format. (c) Output files that are ready for download are listed in the right-most panel.

RESULTS

Integration and Visualization of Proteomics, Genomics, and Transcriptomics Data

The purpose of the PG Nexus pipeline is to integrate proteomic data obtained from mass spectrometry with genomic and transcriptomic data, including but not limited to that from next-generation sequencing. This is achieved through two tools in the PG Nexus: the Samifier (which converts Mascot results to a .SAM file) and Results Analyzer. An outline of the pipeline is shown in Figure 2, whereby results from Mascot MS/MS ions searches⁶² can be processed with the Samifier to facilitate data visualization in the Integrative Genomics Viewer (IGV), and the Results Analyzer can be used to generate reports on relevant protein and peptide statistics. As the PG Nexus pipeline can be applied in different ways, how these reports are used depends on the precise application.

Figure 2. The PG Nexus Pipeline. The analytical pipeline allows genomics and transcriptomics data, including that generated from next-generation sequencing platforms, to be used as custom sequences in Mascot searches. The new Samifier tool then converts results from MS/MS ions searches into a .SAM file format that can be visualized in the Integrative Genomics Viewer.¹⁷ The new Results Analyzer tool reports the number of proteins and peptides found in the experiment, with a particular emphasis on the validation of genes, and the discovery of novel genes or alternatively spliced transcripts. The pipeline, when used in different ways, can support many different applications. Examples are given as case studies in this manuscript.

Figure 3. Visualizing peptides with the Integrative Genomics Viewer (IGV) in the context of the genome, known gene architecture and transcripts. IGV was used to covisualize experimental peptides and yeast RNA-seq data for the 40S ribosomal protein S7-B (YNL096C) for Saccharomyces cerevisiae. RNA reads that do not span across splice junctions were manually removed to facilitate this visualization in IGV. Peptides that span an mRNA splice junction, and an intron, are highlighted in the red box. The sequence of the protein and the peptides can be seen by zooming into the protein sequence track. This type of covisualization can be done on a large scale to comprehensively integrate the proteome with the genome.

The Samifier tool converts peptides identified from tandem mass spectrometry analysis of complex proteomics samples, in the context of a particular set of FASTA sequence files, into a .SAM file.⁶³ In the .SAM file, each peptide is linked to a gene


Table 1. Information Provided by the Results Analyzer, Illustrated with Two Example Peptides

| column name | description | example: yeast | example: human |
|---|---|---|---|
| protein ID | Name of the protein. | COQ10_YEAST | ENSP00000357642 |
| peptide score | MS/MS ions search confidence score for the peptide. | 70.32 | 45.67 |
| start position | Start position of the peptide with respect to the protein sequence. | 31 | 123 |
| stop position | End position of the peptide with respect to the protein sequence. | 42 | 142 |
| peptide length | Length of the peptide in amino acids. | 12 | 20 |
| peptide sequence | The amino acid sequence of the peptide. | FFGLSGTNHTIR | RVSRSSFSSDPDESEGIPLK |
| chromosome ID | Chromosome where the peptide is found. | chrXV | chr10 |
| gene start | Start position of the gene with respect to the chromosome. | 310312 | 129894923 |
| gene end | End position of the gene with respect to the chromosome. | 310935 | 129924649 |
| strand | The strand of DNA that encodes the protein. Positive (+) and negative (−) indicate forward and reverse strands, respectively. | + | − |
| frame | Translation reading frame of the peptide, with frames 0, 1, or 2. | 0 | 1 |
| exons | Number of exons the peptide covers. Peptides that span splice junctions have values >1. | 1 | 2 |
| exon string | Coordinates of the peptide with respect to the chromosome. Peptides that span across splice junctions are separated by “:”. | 310402−310437 | 129911841−129911866:129914755−129914788 |
| query ID | Query ID of the peptide identified in the MS/MS ions search. | q67171_p1 | q45172_p1 |
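The text notes that the Results Analyzer output can be queried with SQL. As a hedged illustration of what such a query might look like, the sketch below loads the two example rows of Table 1 into an in-memory SQLite database using Python. The table name `peptides`, the SQL column identifiers, and the use of an ASCII hyphen in the exon string are assumptions made for this example; the actual file layout written by the Results Analyzer may differ.

```python
import sqlite3

# The two example peptides from Table 1 (column order follows the table).
ROWS = [
    ("COQ10_YEAST", 70.32, 31, 42, 12, "FFGLSGTNHTIR", "chrXV",
     310312, 310935, "+", 0, 1, "310402-310437", "q67171_p1"),
    ("ENSP00000357642", 45.67, 123, 142, 20, "RVSRSSFSSDPDESEGIPLK", "chr10",
     129894923, 129924649, "-", 1, 2,
     "129911841-129911866:129914755-129914788", "q45172_p1"),
]


def load(rows):
    """Create an in-memory table with hypothetical column names and fill it."""
    con = sqlite3.connect(":memory:")
    con.execute(
        """CREATE TABLE peptides (
               protein_id TEXT, peptide_score REAL,
               start_position INTEGER, stop_position INTEGER,
               peptide_length INTEGER, peptide_sequence TEXT,
               chromosome_id TEXT, gene_start INTEGER, gene_end INTEGER,
               strand TEXT, frame INTEGER, exons INTEGER,
               exon_string TEXT, query_id TEXT)"""
    )
    con.executemany(
        "INSERT INTO peptides VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)", rows
    )
    return con


def exon_string_nt(exon_string):
    """Total number of nucleotides covered by a 'start-end[:start-end]' string."""
    total = 0
    for block in exon_string.split(":"):
        start, end = (int(x) for x in block.split("-"))
        total += end - start + 1
    return total


con = load(ROWS)

# Peptides that span at least one splice junction.
junction_peptides = [
    row[0]
    for row in con.execute(
        "SELECT protein_id FROM peptides WHERE exons > 1 ORDER BY protein_id"
    )
]

# Number of reported peptides per chromosome.
per_chromosome = dict(
    con.execute("SELECT chromosome_id, COUNT(*) FROM peptides GROUP BY chromosome_id")
)
```

A useful consistency check on any row is that the exon string covers three nucleotides per residue: 36 nt for the 12-residue yeast peptide, and 26 + 34 = 60 nt for the 20-residue human peptide.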

Figure 4. Overview of algorithm for the Samifier and Results Analyzer. The algorithm maps the location of a peptide to a particular locus in the genome. (a) The program is provided with the protein accession number and the start and end positions of the peptide from the MS/MS ions input. (b) The protein accession number is converted into the transcript accession number. (c) The genomic coordinates of the mRNA transcript are retrieved from the .GFF file using the transcript accession number. (d) The peptide is mapped to the location in the transcript sequence, which in turn is mapped to the genomic coordinates of the corresponding gene. The latter step takes into account the presence of introns. (e) The nucleotide sequence corresponding to the peptide is deduced from the transcript; the splice site is denoted by a “*” in this example.
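The coordinate conversion outlined in Figure 4, steps (d) and (e), can be sketched in a few lines of Python. This is an illustrative reimplementation under simplifying assumptions, not the Java code of the PG Nexus: the function names are hypothetical, exons are taken to be coding-only blocks with 1-based inclusive genomic coordinates, and translation is assumed to begin at the first base of the first block in the direction of transcription.

```python
def peptide_to_genome(exons, strand, pep_start, pep_stop):
    """Map a peptide (1-based inclusive protein positions) to genomic blocks.

    exons  : list of (start, end) coding blocks, 1-based inclusive, any order
    strand : "+" or "-"
    Returns a list of (start, end) genomic blocks sorted by coordinate.
    """
    # Walk the exons in the direction of transcription.
    ordered = sorted(exons, reverse=(strand == "-"))
    # Peptide span on the spliced coding sequence, 0-based half-open.
    t_start = (pep_start - 1) * 3
    t_stop = pep_stop * 3
    blocks = []
    consumed = 0  # coding nucleotides covered by the exons seen so far
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start + 1
        lo = max(t_start, consumed)
        hi = min(t_stop, consumed + ex_len)
        if lo < hi:  # the peptide overlaps this exon
            if strand == "+":
                blocks.append((ex_start + lo - consumed,
                               ex_start + hi - consumed - 1))
            else:
                blocks.append((ex_end - (hi - consumed) + 1,
                               ex_end - (lo - consumed)))
        consumed += ex_len
    return sorted(blocks)


def exon_string(blocks):
    """Format blocks in the 'start-end:start-end' style used in Table 1."""
    return ":".join(f"{start}-{end}" for start, end in blocks)
```

Under these assumptions, treating the yeast gene in Table 1 as a single coding block (310312 to 310935, forward strand) places peptide positions 31 to 42 at 310402 to 310437, in agreement with the exon string reported in the table. A peptide that returns more than one block spans a splice junction.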

defined in the genome annotation (.GFF) file from the relevant chromosome for the species, or from a numbered genomic contig. When the .SAM file is opened in the IGV viewer, peptides that match to a protein are represented as tiles along the corresponding genomic location, similar to the way RNAseq reads are visualized17 (Figure 3). Peptides that span across introns, and thus splice junctions, can be visualized, validating spliced and alternatively spliced genes with proteomics data. In the Samifier, a threshold peptide identity score can be specified by the user to select for high-confidence peptides. All peptides that have a score less than the defined threshold are removed,

therefore reducing noise. The Samifer also outputs a list of regions of interest in a .BED file format, which can be uploaded to IGV using IGV’s region navigator to quickly find peptide matches for each protein. The Results Analyzer is a reporting tool. It returns the number of peptides detected from Mascot MS/MS ions searches and their confidence scores, aligned with the genomic coordinates of the nucleotide sequence that encodes each peptide (Figure 2). A list of all information provided by the Results Analyzer is shown in Table 1. The output is in a table format, where each row in the table corresponds to a peptide 89

dx.doi.org/10.1021/pr400820p | J. Proteome Res. 2014, 13, 84−98

Journal of Proteome Research

Article

Figure 5. The analytical pipeline for proteomic validation of novel bacterial genes in a genome. (a) The Predicted Protein Generator is used to generate the protein sequence database for MS/MS ions searches in Mascot. Additionally, the Virtual Protein Generator can generate a protein sequence database by slicing the genome into fixed length overlapping fragments (virtual proteins), followed by six-frame translation of each fragment. (b) The Virtual Protein Merger can be used to analyze peptides that match six-frame translation of virtual proteins. The output genome annotation data from the Virtual Protein Merger (a .GFF file) can be used with the Results Analyzer and Samifier to perform visualization and further analysis.

following steps. (a) The Samifier retrieves the protein accession number and the start and end positions of the peptide, with respect to the protein sequence, from the Mascot MS/MS ion search results. (b) The Samifier uses the accession number mapping file to convert the protein accession number to the accession number of the corresponding mRNA transcript. (c) The mRNA accession number allows the Samifier to find an entry in the .GFF file which describes the genomic coordinates of the mRNA transcript. (d) The peptide position is converted to a position on the mRNA transcript, followed by another conversion to locate the precise position of the peptide within the genome. This step involves adjusting the positions of the peptide to account for the intron−exon structure of the gene. (e) The nucleotide sequence that encodes the peptide can be deduced from the genome sequence and the genome coordinates. The genomic coordinates of the peptide, positions of any introns, and the corresponding nucleotide sequence are recorded in the output .SAM file.63 The PG Nexus pipeline was developed in Java. The software is open-source and available under the GPL v3 license via the GitHub source code repository https://github.com/ IntersectAustralia/ap11_Samifier. An archived copy of the software (v1.0.6) can also be found in Supporting Information 3. Documentation for the software package can be found on the project wiki page under the GitHub repository page. The entire pipeline is available as command line tools; the Samifier and Results Analyzer have also been integrated into the Galaxy platform61 and Tool Shed to provide a graphical user interface (see Materials and Methods for details).

that matched against the genome coordinates of known or predicted genes in the .GFF file. The Results Analyzer output can be queried by use of Structured Query Language (SQL)64 to filter and generate summary statistics of interest on the data. Sample results (Table 1) show how peptide FFGLSGTNHTIR from the yeast protein Coenzyme Q-binding protein COQ10 (COQ10_YEAST) maps against its chromosome XV and how human peptide RVSRSSFSSDPDESEGIPLK from the human protein of unknown function ENSP00000357642 spans 2 exons and 1 splice junction on human chromosome 10. The latter example illustrates the proteomic validation of intron/exon boundaries and can also be used to confirm alternatively spliced isoforms of mRNA. The Samifier and Results Analyzer both require the following input files: (1) a genomic sequence in FASTA format, which can be a fully sequenced genome or a numbered list of contigs from genome assembly, (2) the genome coordinates for all genes, including their exons and introns, in general feature format (GFF3),65 (3) the results from the MS/MS ions search results in Mascot MIME format (.DAT file)62 or mzIdentML v1.1 format66 where the proteomic data has been matched against the sequences defined from 1 and 2 above, and (4) an accession number mapping file for converting the protein accession numbers used in the protein sequence database to the mRNA transcript accession numbers used in the .GFF file. Both the Samifier and Results Analyzer tools can accept multiple MS/MS search results files simultaneously to merge results from multiple experiments. A comprehensive list of input and output files for the Samifier and Results Analyzer is shown in Supporting Information 2.

Strategy for Validation of Predicted ORFs in a Complete or Fragmented Bacterial Genome

Algorithm and Software for the Samifier and Results Analyzer

Data from mass spectrometric analysis of a proteome can be used to validate genes that are predicted to be present in a novel bacterial genome (Figure 5a). The Predicted Protein

The Samifier and Results Analyzer both use the same underlying algorithm (see Figure 4), which involves the

dx.doi.org/10.1021/pr400820p | J. Proteome Res. 2014, 13, 84−98

Journal of Proteome Research

Article

Figure 6. The Virtual Protein Generator and Virtual Protein Merger. The Virtual Protein Generator slices the bacterial genome, available as a complete genome or as a series of contigs, into overlapping fixed-sized virtual proteins (e.g., 300 amino acids). Six-frame translation of each segment is undertaken to create a database of virtual proteins. LC−MS/MS data is matched against this database. Using the Virtual Protein Merger, one or more peptides that have matched a virtual protein are used to predict putative open reading frames (ORFs), on the basis of flanking start and end codons that are of the same translation frame. Note that the Virtual Protein Merger can traverse multiple virtual proteins to do this.
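The slicing and six-frame translation summarized in the Figure 6 caption can be sketched in Python as follows. This is a simplified illustration and not the Virtual Protein Generator itself: the step between segments (`step_aa`) is an assumed parameter, the function names are hypothetical, the input is assumed to be uppercase, and the standard codon assignments are used (bacterial translation table 11 shares them and differs mainly in its start codons).

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon table built from the conventional TCAG ordering; '*' marks a stop codon.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons (e.g., containing N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def virtual_proteins(genome: str, size_aa: int = 300, step_aa: int = 150):
    """Yield (segment_start, strand, frame, protein) for overlapping genome segments."""
    seg_nt, step_nt = size_aa * 3, step_aa * 3
    for start in range(0, len(genome), step_nt):
        segment = genome[start:start + seg_nt]
        for strand, seq in (("+", segment), ("-", reverse_complement(segment))):
            for frame in range(3):
                protein = translate(seq[frame:])
                if protein:
                    yield start, strand, frame, protein


if __name__ == "__main__":
    for record in virtual_proteins("ATGGCCAAATAA", size_aa=4, step_aa=2):
        print(record)
```

Each segment contributes up to six virtual proteins, which is why a six-frame database is several times larger than a database of annotated proteins for the same genome.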

Figure 7. Validation of genes in bacterial genomes using proteomics data. Number of genes validated with two or more high-confidence peptides for (a) C. concisus strain BAA-1457, and (b) C. concisus strain UNSWCD. The Venn diagram shows the MS/MS ion search results from matching against the three protein sequence databases (NCBI RefSeq,33 Glimmer36 and the Virtual Protein Generator/Merger). The numbers in brackets beside the name of each database represent the total number of proteins validated and the total number of sequences in the respective database.

spectra are then searched against the virtual protein database (Figure 5a). The Virtual Protein Generator tool generates the overlapping virtual proteins to a user-specified size (in amino acids). Although any size may be specified, we recommend using the estimated average protein length in the relevant proteome (e.g., 300 amino acids), since this provides a balance between the ability to find larger proteins and the sensitivity required to find a peptide in the resulting database. The Virtual Protein Merger is a tool that identifies putative open reading frames based on MS/MS ion search results of a virtual protein database (Figure 5b). It takes peptides that matched to a virtual protein in the same translation frame and calculates the likely position of the corresponding open reading frame by searching for flanking start and end codons in that frame of translation (Figure 6). To increase the specificity of identifying putative open reading frames, the tool allows the user to define a minimum threshold score for accepting a peptide match. Peptides of fewer than five amino acids, containing stop codons, or containing ambiguous amino acid codes are excluded by the Virtual Protein Merger, since these matches are likely to be spurious. The Virtual Protein Merger command line tool (protein_generator.jar) requires the following input files and command line parameters shown in brackets: (1) the bacterial genome or contig sequences in FASTA format, (2) the genome coordinates of the virtual proteins in GFF3 file format,65 (3) the results from MS/MS ions searches (Mascot DAT format or mzIdentML format), and (4) the relevant codon to amino acid translation table. The output file in GFF3 format describes the coordinates of the start and end codons for the putative open

Generator uses the coordinates of the start and stop codons for predicted genes to generate a custom protein sequence database for MS/MS ions searches. This tool requires the following input files: (1) the bacterial genome in FASTA format, which can be either a fully sequenced genome or a numbered list of contigs from the genome assembly, (2) the output file from the Glimmer gene prediction tool,36 and (3) a relevant translation table that maps each codon to an amino acid or stop codon. The Predicted Protein Generator creates the custom protein sequence database in FASTA format. The query mass spectra, from proteomic analysis of the bacteria by GeLC−MS/MS or shotgun proteomics LC−MS/MS, are searched against this custom database using the MS/MS ion search engine of Mascot. The search results are then analyzed and visualized using the Samifier or Results Analyzer to validate the existence of predicted genes as proteins. This requires two output files from the Predicted Protein Generator: the coordinates of the predicted genes in GFF3 file format, and a file that contains accession numbers for genes and proteins to allow the Samifier and Results Analyzer to map peptides to coordinates on the genome.

Strategy for Proteomic Data-Driven Discovery of ORFs in Bacterial Genomes

Prediction programs cannot reliably predict all the genes in bacterial genomes, especially those that are short in length.8 To identify novel genes, a database of virtual protein sequences can first be generated. This involves cleaving the bacterial genome into fixed-size overlapping segments, followed by six-frame translation of these segments into virtual proteins;67 this is done by the Virtual Protein Generator tool. The query mass


Figure 8. Visualization of peptides and correctly reconstructed gene for TRAP transporter solute receptor 2C. The Virtual Protein Merger was used to validate the expression of a gene from the Campylobacter concisus genome (strain UNSWCD). The Integrative Genomics Viewer was used to visualize the results. The gene of interest is the TRAP transporter solute receptor 2C (Genbank: EIF07222.1). (a) The amino acid sequence of the protein is shown in IGV. Start codons are represented as green bars, and stop codons are represented as red bars. There are no intervening stop codons in the correct frame of translation (highlighted in black box), but intervening stop codons were found in the incorrect translation frames. (b) The RefSeq genomic coordinates of the protein. (c) The genomic coordinates of the putative open reading frame (ORF) identified by the Virtual Protein Merger, which matched exactly to the RefSeq coordinates. (d) The peptides that validated the expression of this protein are tiled below.

reading frames for which protein evidence was available. This file serves as input file for the Samifier and Results Analyzer, which enables visualization in the IGV and analysis of the peptides with respect to the putative open reading frames found by the Virtual Protein Merger (Figure 5b).
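The search for flanking start and end codons performed by the Virtual Protein Merger can be sketched as below. This is a simplified Python illustration under stated assumptions rather than the tool's actual logic: the sequence is assumed to be the strand on which the peptide is encoded, the set of start codons (ATG, GTG, TTG) is an assumption, and choosing the most upstream in-frame start codon (the longest ORF) is one possible policy. The name `find_orf` is hypothetical.

```python
from typing import Optional, Tuple

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons


def find_orf(seq: str, pep_start: int, pep_end: int) -> Tuple[Optional[int], Optional[int]]:
    """Extend a peptide hit to a putative ORF in the peptide's own reading frame.

    pep_start: 0-based index of the first base of the peptide's first codon.
    pep_end:   0-based index just past the peptide's last codon.
    Returns (orf_start, orf_end) as a half-open interval; either value is None
    when no flanking codon is found, e.g., when the gene runs off a contig end.
    """
    # Scan upstream codon by codon until an in-frame stop codon is reached,
    # remembering the most upstream start codon seen on the way.
    orf_start = None
    i = pep_start
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOPS:
            break
        if codon in STARTS:
            orf_start = i
        i -= 3
    # Scan downstream for the first in-frame stop codon.
    orf_end = None
    j = pep_end
    while j + 3 <= len(seq):
        if seq[j:j + 3] in STOPS:
            orf_end = j + 3
            break
        j += 3
    return orf_start, orf_end


if __name__ == "__main__":
    demo = "CC" + "TAA" + "GCC" + "ATG" + "AAA" + "GGG" + "CCC" + "TGA" + "AC"
    start, end = find_orf(demo, 14, 17)  # the peptide hit covers the GGG codon
    print(start, end, demo[start:end])
```

A missing start or stop codon is reported as None rather than guessed, which mirrors the case of incomplete genes at the start or end of contigs.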

Glimmer predicted a slightly higher number of genes for each strain than that documented in NCBI RefSeq. When these were queried with the GeLC−MS/MS data, a larger number of genes were validated in each strain than were validated from the NCBI RefSeq data alone (1358 genes for BAA-1457 and 1366 for UNSWCD). Notable was the validation of 311 Glimmer-predicted genes in UNSWCD that were not present in the NCBI RefSeq gene set for this strain; this is consistent with the draft status of the UNSWCD genome. The GeLC−MS/MS data were also matched against the generated virtual proteins, and the Virtual Protein Merger was used to reconstruct open reading frames with available peptide hits. Figure 8 shows an example open reading frame that was correctly reconstructed from peptide hits and the Virtual Protein Merger: TRAP transporter solute receptor 2C. Through this approach, we identified 902 virtual proteins in BAA-1457, 699 (78%) of which were identical to the NCBI RefSeq or Glimmer databases (Figure 7a). A total of 917 proteins were identified in strain UNSWCD, 725 (79%) of which were identical to the NCBI RefSeq or Glimmer databases (Figure 7b). We examined the remaining 203 genes predicted only by GeLC−MS/MS and the Virtual Protein Merger in the strain BAA-1457. All predicted proteins were identified to be in the correct translation frame, but had start and stop positions that were different from those from Glimmer or NCBI RefSeq. This was probably due to factors such as the lack of peptide coverage near the protein N-terminus, genes that use unconventional start codons, N-terminal extensions, and not accounting for programmed frame-shifting or the reading through of some stop codons.68 Machine learning-based gene prediction algorithms, such as Glimmer, can have difficulty identifying genes in situations where gaps are present in the sequence assembly and/or a draft genome exists in many contigs. GeLC−MS/MS data, matched against virtual proteins, should identify incomplete gene sequences at the start or end of contigs. We examined this in the 192 proteins that were predicted in strain UNSWCD by GeLC−MS/MS and the Virtual Protein Merger that were not

Case Study: Validation and Discovery of Genes in Bacterial Genomes

We demonstrate the validation and discovery of genes in bacterial genomes through the analysis of two strains of Campylobacter concisus, an emerging pathogen of the human gastrointestinal tract.11 The C. concisus reference strain (strain BAA-1457) is a completely sequenced and closed genome, while the UNSWCD strain is a draft genome consisting of 86 contigs assembled from next-generation sequencing.11 GeLC−MS/MS data obtained for each strain of C. concisus were analyzed using the Mascot MS/MS ion search against three different protein sequence databases for each strain: (1) the genes/proteins annotated in the NCBI RefSeq database,33 (2) the genes/proteins predicted by Glimmer, and (3) the virtual proteins predicted by the Virtual Protein Generator. The Virtual Protein Merger was used to reconstruct the putative open reading frames based on peptide matches to the virtual proteins. The numbers of proteins identified in the three Mascot searches are shown in Venn diagrams for both strains of C. concisus (Figure 7). The counts in the diagram represent proteins verified by two or more peptide matches, whose peptide scores were above the Mascot identity thresholds (listed in Supporting Information 2). For proteins to be classified as common to NCBI, Glimmer, and the Virtual Protein Merger approaches, they had to be of identical length and sequence. The input and output files from the analyses of the C. concisus BAA-1457 and UNSWCD strains using the PG Nexus pipeline are available as Supporting Information 4 and 5, respectively. When matching against the relevant genes described in the NCBI RefSeq database, GeLC−MS/MS validated 1287 of the 1933 genes (66%) for C. concisus strain BAA-1457 and 1321 of the 1812 genes (73%) for strain UNSWCD (Figure 7a).


Figure 9. Visualization of peptides for an aldehyde dehydrogenase (NAD) family protein that was split across two contigs. (a) Peptides are shown to match within the region where the two ORFs, q662 and q1289, overlap each other. (b) The reconstructed ORFs, q662 and q1289, were located at the start of contig 17 and the end of contig 59 respectively. The tracks in blue show the 15 amino acids overlapping region between the two contigs. (c) The two half ORFs are stitched together to form the complete gene, which spans across contig 17 and 59.

Figure 10. The PG Nexus identifies human peptides that validate differential mRNA splicing. Peptides are shown that match against different splice junctions of isoforms of the human oxysterol binding protein-like 3 protein (ENSG00000070882, chromosomal location: 7p15.3, 24 836 158−25 021 253 bp, minus strand). (a) A set of peptides that spans over a splice junction in one of the protein isoforms (ENSP00000315331), the peptide sequence is TYSAPAINAIQVPKPFSGPVR. (b) Another set of peptides is shown spanning over an alternative splice junction in a second protein isoform (ENSP00000315277), the peptide sequence is TYSAPAINAIQGGSFESPK.

Figure 11. Splicing of human tropomyosin α-1, investigated by RNA-seq and GeLC−MS/MS. (a) Two alternatively spliced mRNAs of human tropomyosin α-1 (ENSG00000140416, chromosomal location: 15q22.1, 63 334 831−63 364 114 bp, plus strand) were found using the RNA-seq data and Cufflinks. RNA reads that do not span across splice junctions were manually removed to facilitate this visualization in IGV. (b) The two alternatively spliced isoforms were confirmed by junction peptides. A peptide of sequence SKQLEDELVSLQK was found to span the splice junction of one isoform of this protein (TCONS_00030294). This peptide had a Mascot peptide score of 61. (c) A different peptide, of sequence SKQLEEDIAAKEK, was found to span across an alternative splice junction for another isoform (TCONS_00030296). This peptide had a score of 47. (d) We also checked against known mRNA isoforms from the Ensembl database, showing that we have rediscovered known mRNA isoforms and validated that these were made into proteins.

present in the gene sets from NCBI RefSeq or Glimmer. Interestingly, 12 genes were located at the start or end of contigs; 10 genes had incomplete sequences, and 2 genes were split across two or more contigs (see Supporting Information 6). We further examined these to determine whether they

provided clues for the stitching together of contigs. An aldehyde dehydrogenase (NAD) family protein spanned across contig 17 and contig 59, and had 15 amino acids overlapping the end of each contig (Figure 9). A hypothetical protein spanned across contig 17 and contig 35, which contained a


been visualized with the genome in the UCSC genome browser70 (e.g., refs 25 and 71), there has been no general tool to visualize peptides in the context of the genome and transcriptome. Several browsers have been developed for the visualization of data supporting the Chromosome-Centric Human Proteome Project. The Proteome Browser Web Portal (TPB)28 allows users to browse C-HPP data for evidence of gene expression at the transcript or protein level using a traffic light based information matrix. However, it lacks detailed views of the peptides derived from mass spectrometry data. The Chromosome-Assembled Human Proteome Browser (CAPER)29 allows users to view predefined genomics annotation and protein evidence tracks, including peptides mapped onto the genome, on a genome browser. By clicking on a specific feature on the track, users can access annotation details or mass spectra. It also provides a heat-map view for browsing quantitative data, such as transcript and protein abundances, number of post-translational modifications, and interaction partners. Nevertheless, there is a lack of tools for users to analyze their own mass spectrometry data for evidence of alternatively spliced proteins. The Samifier tool converts peptides from MS/MS ions searches into the .SAM file format, to allow their visualization in the Integrative Genomics Viewer (IGV). Covisualization of peptides with genomics and transcriptomics data in IGV facilitates the manual validation of genes and the confirmation of splice sites, and provides evidence that alternatively spliced forms of mRNA are translated to proteins. Since it is impractical to visually validate thousands of genes, the Results Analyzer maps peptides to their respective genomic locations and creates relevant information in tabular format (Table 1); this can then be queried with SQL. The Results Analyzer thus facilitates large-scale and systematic analysis of peptides in the context of the genome.
It is becoming routine to sequence bacterial genomes using next-generation platforms. However, complete and accurate genome annotation remains an ongoing challenge. When proteomics and genomics data are integrated using proteogenomic approaches,3,9,14 accurate peptide hits from MS/MS ions searches provide evidence for gene annotation and for the expression of proteins. It is important to note that since six-frame translation is used in proteogenomic searches, the sequence database tends to be overly inflated, as there are five incorrect sequences for each correct sequence.72 This can lead to underestimation of the confidence score assigned to the peptide hits.72 Blakeley et al. (2012)72 described a number of strategies to limit database size and improve the sensitivity of identifying peptide hits. Our strategy has been to generate virtual proteins, which are then queried in Mascot, and undertake a posthoc analysis with the Virtual Protein Merger to identify open reading frames. We have shown this to be effective for bacterial genomes. While not a specific focus of the PG Nexus, proteomics data can also be used to identify new N-terminal peptides that result from the cleavage of signal peptides or N-terminal methionine cleavage.9,68 The addition of the N-terminal peptide sequences to sequence databases, with the signal peptides or start methionine removed, will allow these sequences to be detected using MS/MS ions searches.9,14 The PG Nexus pipeline can be used to validate eukaryotic genomes. However, one limitation is that the Samifier and Results Analyzer must be provided with the genome annotation (GFF3 format file). This file includes information on the start and end position of genes, exons and introns relative to the

lysine-rich repetitive motif bridging the two contigs. Proteomic evidence for the above genes could thus be used to join contigs 35, 17, and 59 in that order. In summary, the above results demonstrate that proteomics data, analyzed in the PG Nexus pipeline, can be used to validate and discover genes in complete or fragmented bacterial genomes.67

Case Study: Large Scale Validation of Human Intron−Exon Splice Sites with Proteomic Data

To demonstrate the validation of complex alternative splicing in humans,18,21 phosphoproteomics data from the PHOSIDA database49 were analyzed with the PG Nexus pipeline. This data set was chosen because alternatively spliced exons and their splice junctions often code for intrinsically disordered regions,18,21 which can have a high density of phosphorylation sites.32,69 The data set included 49 196 high-confidence phosphopeptides (FDR < 1%) for 8767 proteins. The Results Analyzer identified 2413 phosphopeptides that spanned splice junctions from 1514 genes, which we refer to as “junction peptides”. Peptides that spanned across spliced or alternatively spliced junctions, but could be mapped to multiple genomic locations, were excluded from the analysis. The peptides verified 2236 splice junctions present in the human proteome. Importantly, 47 unique phosphopeptides that span across 42 unique alternatively spliced junctions were identified, which are shared among 81 different protein isoforms from 21 genes. An example of this is shown in Figure 10. The list of all alternatively spliced junction peptides is shown in Supporting Information 7, and the corresponding input and output files for the Samifier and Results Analyzer are available as Supporting Information 8.

Example: Visual Coanalysis of Human Proteome and RNA-seq Transcriptome to Validate Alternate Splicing of mRNA

RNA-seq data, when matched against well-annotated genomes, can highlight the presence of alternate splicing of mRNA. What has been more difficult to determine is whether the alternatively spliced mRNAs become protein products; this can be investigated with the PG Nexus pipeline. Paired proteomic and RNA-seq data from human mesenchymal stem cells were generated. GeLC−MS/MS data were searched against the human ENSEMBL database,43 and Mascot results were processed with the Samifier. RNA-seq data generated by an Illumina HiSeq 2000 platform were processed to allow identification of splice junctions (see Materials and Methods). The paired proteomic and transcriptomic data were then codisplayed in IGV and explored visually. Figure 11 shows an example where RNA-seq reads highlighted alternative splicing of mRNA of tropomyosin α-1; the figure also clearly shows the existence of GeLC−MS/MS peptides that covered these two different splicing events, verifying that both mRNA isoforms are made into proteins. The two junction peptides had Mascot scores up to 63 and 88, providing high confidence for this verification. This example demonstrates that the Samifier and Results Analyzer are capable of integrating proteomic data with genomic and RNA-seq data, facilitating their covisualization in IGV and the verification that alternative splicing is generating different protein products.



DISCUSSION

In the PG Nexus pipeline, the core tools are the Samifier and Results Analyzer. They facilitate the coanalysis of LC−MS/MS proteomics data with genome sequence, gene structure and transcriptomics data. Although proteomics data has previously


chromosome. The annotation file also needs to include information on the combination of exons for each alternatively spliced protein isoform and the corresponding start and end codons for the protein. For most model organisms, including human, these genome annotation files can be obtained from public databases such as Ensembl,43 NCBI,33 and the UCSC genome browser.70 For poorly annotated or unannotated novel genomes, software such as Peppy16 can be used to perform proteogenomics analysis of eukaryotic genomes when genome annotation is not available. The Peppy pipeline can perform all analysis steps required for proteogenomics analysis. Unlike PG Nexus, which relies on Mascot (or similar tools) for MS/MS ions searches, decoy database searching and FDR correction, Peppy can perform these and additional tasks (e.g., spectral cleaning) as a self-contained pipeline. Without genome annotation, Peppy can perform proteogenomic analysis of complex eukaryotic organisms, while the annotation-free analysis in PG Nexus was developed only for viral and prokaryotic genomes. Peppy currently lacks a means to visualize peptides using a genome browser, but the outputs can easily be adapted for visualization and SQL analysis using our Samifier and Results Analyzer tools. We have shown that high-confidence peptides from LC−MS/MS can be used to cross-validate splice isoforms of mRNA, and have presented tools to facilitate that process. One translational application of the tools involves mining proteomics and alternative splicing databases for proteins arising from alternatively spliced mRNAs that are differentially expressed in human diseases, which may serve as novel candidate biomarkers (e.g., Menon et al. (2009)73). Menon et al. (2009)73 used the ECgene database,74 which is a database of alternatively spliced transcripts that are predicted using gene prediction tools and known exon−exon junctions from EST transcripts.
Sequence databases that collect information on alternatively spliced transcripts from EST transcripts (e.g., ECgene74) and RNA-seq data sets (e.g., DBATE75) could be useful for these projects. RNA-seq transcript analysis software varies in its sensitivity and accuracy in detecting alternatively spliced transcripts.58,76,77 Peptides that span across splice junctions can unequivocally validate exon−exon splice boundaries, and differential splicing events, even in cases where RNA-seq isoform detection tools may fail. For example, although high-confidence peptides and RNA-seq reads validated two alternatively spliced isoforms of the human microtubule-associated protein 4 (Supporting Information 2), the Cufflinks41 sequence assembly program did not detect both isoforms. This highlights the value of using proteomics analysis as a complementary approach to cross-validate high-throughput transcriptomics analysis. The PG Nexus pipeline allows users to incorporate their own annotation of alternatively spliced transcripts as a custom genome annotation file, which will facilitate the validation of novel alternatively spliced transcripts.4,26 However, this will require the user to infer the start and end codons and translation frame of the transcripts assembled from RNA-seq analysis software, such as Cufflinks.41 A challenge in the use of RNA-seq data with the PG Nexus pipeline is to correctly infer the translation frame of any novel alternatively spliced junctions. Sheynkman et al. (2013)4 suggested inferring the translation frame of novel splice junctions using preceding N-terminal exons, in which the start codon and translation frame are already known.4 On the other hand, if the novel exon is the first exon at the N-terminus, the translation frame cannot be inferred. One solution is to perform proteogenomics analysis, with three-frame translation if the coding strand is known,4,71

or six-frame translation otherwise.78 In the future, the PG Nexus pipeline will be extended to perform the above steps to make the validation of novel alternatively spliced exons more efficient. An advantage of using RNA-seq and proteomics data from matching biological samples is that a protein sequence database, which contains mRNA variants, can be filtered based on the expression level of alternatively spliced transcripts.4 By excluding mRNA isoforms that are not expressed or have a low level of expression, the size of the sequence database becomes smaller. In turn, a more restricted sequence database increases the sensitivity of identifying novel alternatively spliced proteins.26 For example, Wang et al. (2012)26 showed that filtering the RNA-seq derived protein database based on an expression level threshold of >20 RPKM (reads per kilo base per million mapped reads) increased the number of peptides identified by 6%. Sheynkman et al. (2013)4 only analyzed splice junctions with six or more supporting RNA-seq reads, which reduced the false positive rate as compared with a lower threshold of one RNA-seq read. Consequently, this approach has been shown to perform better than previous studies that included all possible theoretical exon−exon junctions in the sequence database.30,79 In conclusion, we have developed the PG Nexus pipeline to enable cross-validation and coanalysis of proteomics data with nucleic acid sequence data. For prokaryotes, the PG Nexus pipeline can covisualize and coanalyze genomic sequences or contigs with LC−MS/MS peptides to confirm and discover genes, and validate the quality of assembled bacterial genomes. In eukaryotes, the PG Nexus can map peptides to genomes and integrate this with RNA-seq data, to validate genes and confirm that spliced or alternatively spliced mRNAs are translated to become protein isoforms. This new pipeline thus forms a unique and important link between proteins and the genomes and transcriptomes that encode them.



ASSOCIATED CONTENT

Supporting Information

Supporting Information 1: The input and output files for the proteome-scale analysis of S. cerevisiae38 using Samifier and Results Analyzer. Supporting Information 2: Table S1 lists the input files used by each of the tools in the Proteomic−Genomic Nexus software package. Table S2 lists the output files produced by each of the tools in the Proteomic−Genomic Nexus software package. Table S3 lists the Mascot identity threshold scores for analysis of the BAA-1457 and UNSWCD strains of Campylobacter concisus. Figure S1 shows the graphical user interface of Results Analyzer in the Galaxy environment. Figure S2 provides an example where high-confidence peptides and RNA-seq reads validated two alternatively spliced isoforms of the human microtubule-associated protein 4; the Cufflinks sequence assembly program did not detect both isoforms. Figures S3 and S4 show the CID-MS/MS spectra of the peptides whose sequences are shown in Figure 11. Supporting Information 3: An archived copy of the PG Nexus pipeline (v1.0.6) as a zip compressed file. Supporting Information 4: The input and output files from the analyses of C. concisus BAA-1457 strain using the PG Nexus pipeline. Supporting Information 5: The input and output files from the analyses of C. concisus UNSWCD strain using the PG Nexus pipeline. Supporting Information 6: A list of proteins found at the start or end of contigs from C. concisus UNSWCD strain. Supporting


(2) Brosch, M.; Saunders, G. I.; Frankish, A.; Collins, M. O.; Yu, L.; Wright, J.; Verstraten, R.; Adams, D. J.; Harrow, J.; Choudhary, J. S.; Hubbard, T. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and “resurrected” pseudogenes in the mouse genome. Genome Res. 2011, 21 (5), 756−67.
(3) Gupta, N.; Benhamida, J.; Bhargava, V.; Goodman, D.; Kain, E.; Kerman, I.; Nguyen, N.; Ollikainen, N.; Rodriguez, J.; Wang, J.; Lipton, M. S.; Romine, M.; Bafna, V.; Smith, R. D.; Pevzner, P. A. Comparative proteogenomics: combining mass spectrometry and comparative genomics to analyze multiple genomes. Genome Res. 2008, 18 (7), 1133−42.
(4) Sheynkman, G. M.; Shortreed, M. R.; Frey, B. L.; Smith, L. M. Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Mol. Cell. Proteomics 2013, 12 (8), 2341−53.
(5) Howald, C.; Tanzer, A.; Chrast, J.; Kokocinski, F.; Derrien, T.; Walters, N.; Gonzalez, J. M.; Frankish, A.; Aken, B. L.; Hourlier, T.; Vogel, J. H.; White, S.; Searle, S.; Harrow, J.; Hubbard, T. J.; Guigo, R.; Reymond, A. Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome. Genome Res. 2012, 22 (9), 1698−710.
(6) Nagaraj, N.; Wisniewski, J. R.; Geiger, T.; Cox, J.; Kircher, M.; Kelso, J.; Paabo, S.; Mann, M. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 2011, 7, 548.
(7) Ederveen, T. H.; Overmars, L.; van Hijum, S. A. Reduce manual curation by combining gene predictions from multiple annotation engines, a case study of start codon prediction. PLoS One 2013, 8 (5), e63523.
(8) Goli, B.; Nair, A. S. The elusive short gene: an ensemble method for recognition for prokaryotic genome. Biochem. Biophys. Res. Commun. 2012, 422 (1), 36−41.
(9) Christie-Oleza, J. A.; Miotello, G.; Armengaud, J. High-throughput proteogenomics of Ruegeria pomeroyi: seeding a better genomic annotation for the whole marine Roseobacter clade. BMC Genomics 2012, 13, 73.
(10) Peterson, E. S.; McCue, L. A.; Schrimpe-Rutledge, A. C.; Jensen, J. L.; Walker, H.; Kobold, M. A.; Webb, S. R.; Payne, S. H.; Ansong, C.; Adkins, J. N.; Cannon, W. R.; Webb-Robertson, B. J. VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data. BMC Genomics 2012, 13, 131.
(11) Deshpande, N. P.; Kaakoush, N. O.; Mitchell, H.; Janitz, K.; Raftery, M. J.; Li, S. S.; Wilkins, M. R. Sequencing and validation of the genome of a Campylobacter concisus reveals intra-species diversity. PLoS One 2011, 6 (7), e22170.
(12) Yates, J. R., 3rd; Eng, J. K.; McCormack, A. L. Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases. Anal. Chem. 1995, 67 (18), 3202−10.
(13) Berghoff, B. A.; Konzer, A.; Mank, N. N.; Looso, M.; Rische, T.; Förstner, K. U.; Krüger, M.; Klug, G. Integrative “omics”-approach discovers dynamic and regulatory features of bacterial stress responses. PLoS Genet. 2013, 9 (6), e1003576.
(14) Payne, S. H.; Huang, S. T.; Pieper, R. A proteogenomic update to Yersinia: enhancing genome annotation. BMC Genomics 2010, 11, 460.
(15) Feng, Y.; Chien, K. Y.; Chen, H. L.; Chiu, C. H. Pseudogene recoding revealed from proteomic analysis of salmonella serovars. J. Proteome Res. 2012, 11 (3), 1715−9.
(16) Risk, B. A.; Spitzer, W. J.; Giddings, M. C. Peppy: proteogenomic search software. J. Proteome Res. 2013, 12 (6), 3019−25.
(17) Thorvaldsdottir, H.; Robinson, J. T.; Mesirov, J. P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings Bioinf. 2013, 14 (2), 178−92.
(18) Buljan, M.; Chalancon, G.; Eustermann, S.; Wagner, G. P.; Fuxreiter, M.; Bateman, A.; Babu, M. M. Tissue-specific splicing of disordered segments that embed binding motifs rewires protein interaction networks. Mol. Cell 2012, 46 (6), 871−83.

Information 7: The list of human phosphopeptides from the PHOSIDA database (49) that span alternatively spliced junctions. This file also indicates the genomic locations of all junction peptides and the Ensembl accession for the associated protein isoform. Supporting Information 8: The input and output files for the analysis of human phosphopeptides from the PHOSIDA database (49) using the Samifier and Results Analyzer. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*Tel: (+61 2) 9385 3633. Fax: (+61 2) 9385 1483. E-mail: m. [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The software was developed in conjunction with Intersect Australia Limited, a not-for-profit eResearch company. We would especially like to thank Georgina Edwards, Schemek Pochopien, Jeff Christiansen, and Kali Waterford from Intersect and Mingfang Wu from ANDS for their contributions to this project. We thank the Australian Proteomics Computational Facility (APCF) for providing access to the Mascot server and Simon Michnowicz for technical support. This project was supported by the Australian National Data Service (ANDS), in turn supported by the Australian Government through the National Collaborative Research Infrastructure Strategy (NCRIS) Program and the Education Investment Fund (EIF) Super Science Initiative. M.R.W. acknowledges financial support from the New South Wales State Government Science Leveraging Fund, the EIF Super Science scheme, and the University of New South Wales. M.R.W. and G.H.S. acknowledge financial support from the Australian Research Council.



ABBREVIATIONS

CAPER, Chromosome-Assembled Human Proteome Browser; C-HPP, Chromosome-Centric Human Proteome Project; ESITRAP, electrospray ion trap; EST, expressed sequence tag; FDR, false discovery rate; GFF, general feature format; GTF, general transfer format; HPLC, high-performance liquid chromatography; IGV, Integrative Genomics Viewer; nanoLC, nanoflow liquid chromatography; MS/MS, tandem mass spectrometry; NCBI, National Center for Biotechnology Information; ORFs, open reading frames; PG Nexus, Proteomic−Genomic Nexus Pipeline; RNA-seq, next-generation RNA sequencing; RPKM, reads per kilobase per million mapped reads; SAM, sequence alignment/map format; SQL, structured query language; TPB, Proteome Browser Web Portal



REFERENCES

(1) Harrow, J.; Frankish, A.; Gonzalez, J. M.; Tapanari, E.; Diekhans, M.; Kokocinski, F.; Aken, B. L.; Barrell, D.; Zadissa, A.; Searle, S.; Barnes, I.; Bignell, A.; Boychenko, V.; Hunt, T.; Kay, M.; Mukherjee, G.; Rajan, J.; Despacio-Reyes, G.; Saunders, G.; Steward, C.; Harte, R.; Lin, M.; Howald, C.; Tanzer, A.; Derrien, T.; Chrast, J.; Walters, N.; Balasubramanian, S.; Pei, B.; Tress, M.; Rodriguez, J. M.; Ezkurdia, I.; van Baren, J.; Brent, M.; Haussler, D.; Kellis, M.; Valencia, A.; Reymond, A.; Gerstein, M.; Guigo, R.; Hubbard, T. J. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012, 22 (9), 1760−74.

dx.doi.org/10.1021/pr400820p | J. Proteome Res. 2014, 13, 84−98


genome annotation policy. Nucleic Acids Res. 2012, 40 (Database issue), D130−5.
(34) Consortium, U. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40 (Database issue), D71−5.
(35) Elzanowski, A.; Ostell, J. The Genetic Codes. http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (accessed May 10, 2013).
(36) Delcher, A. L.; Bratke, K. A.; Powers, E. C.; Salzberg, S. L. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007, 23 (6), 673−9.
(37) Cherry, J. M.; Hong, E. L.; Amundsen, C.; Balakrishnan, R.; Binkley, G.; Chan, E. T.; Christie, K. R.; Costanzo, M. C.; Dwight, S. S.; Engel, S. R.; Fisk, D. G.; Hirschman, J. E.; Hitz, B. C.; Karra, K.; Krieger, C. J.; Miyasato, S. R.; Nash, R. S.; Park, J.; Skrzypek, M. S.; Simison, M.; Weng, S.; Wong, E. D. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 2012, 40 (Database issue), D700−5.
(38) de Godoy, L. M.; Olsen, J. V.; Cox, J.; Nielsen, M. L.; Hubner, N. C.; Frohlich, F.; Walther, T. C.; Mann, M. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 2008, 455 (7217), 1251−4.
(39) Deutsch, E. W. The PeptideAtlas Project. Methods Mol. Biol. 2010, 604, 285−96.
(40) Cox, M. P.; Peterson, D. A.; Biggs, P. J. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinf. 2010, 11, 485.
(41) Trapnell, C.; Roberts, A.; Goff, L.; Pertea, G.; Kim, D.; Kelley, D. R.; Pimentel, H.; Salzberg, S. L.; Rinn, J. L.; Pachter, L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 2012, 7 (3), 562−78.
(42) Langmead, B.; Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 2012, 9 (4), 357−9.
(43) Flicek, P.; Amode, M. R.; Barrell, D.; Beal, K.; Brent, S.; Carvalho-Silva, D.; Clapham, P.; Coates, G.; Fairley, S.; Fitzgerald, S.; Gil, L.; Gordon, L.; Hendrix, M.; Hourlier, T.; Johnson, N.; Kahari, A. K.; Keefe, D.; Keenan, S.; Kinsella, R.; Komorowska, M.; Koscielny, G.; Kulesha, E.; Larsson, P.; Longden, I.; McLaren, W.; Muffato, M.; Overduin, B.; Pignatelli, M.; Pritchard, B.; Riat, H. S.; Ritchie, G. R.; Ruffier, M.; Schuster, M.; Sobral, D.; Tang, Y. A.; Taylor, K.; Trevanion, S.; Vandrovcova, J.; White, S.; Wilson, M.; Wilder, S. P.; Aken, B. L.; Birney, E.; Cunningham, F.; Dunham, I.; Durbin, R.; Fernandez-Suarez, X. M.; Harrow, J.; Herrero, J.; Hubbard, T. J.; Parker, A.; Proctor, G.; Spudich, G.; Vogel, J.; Yates, A.; Zadissa, A.; Searle, S. M. Ensembl 2012. Nucleic Acids Res. 2012, 40 (Database issue), D84−90.
(44) Wall, L.; Christiansen, T.; Orwant, J. Programming Perl, 3rd ed.; O’Reilly Media: Cambridge, 2000.
(45) Olsen, J. V.; Blagoev, B.; Gnad, F.; Macek, B.; Kumar, C.; Mortensen, P.; Mann, M. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 2006, 127 (3), 635−48.
(46) Daub, H.; Olsen, J. V.; Bairlein, M.; Gnad, F.; Oppermann, F. S.; Korner, R.; Greff, Z.; Keri, G.; Stemmann, O.; Mann, M. Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. Mol. Cell 2008, 31 (3), 438−48.
(47) Oppermann, F. S.; Gnad, F.; Olsen, J. V.; Hornberger, R.; Greff, Z.; Keri, G.; Mann, M.; Daub, H. Large-scale proteomics analysis of the human kinome. Mol. Cell. Proteomics 2009, 8 (7), 1751−64.
(48) Olsen, J. V.; Vermeulen, M.; Santamaria, A.; Kumar, C.; Miller, M. L.; Jensen, L. J.; Gnad, F.; Cox, J.; Jensen, T. S.; Nigg, E. A.; Brunak, S.; Mann, M. Quantitative phosphoproteomics reveals widespread full phosphorylation site occupancy during mitosis. Sci. Signaling 2010, 3 (104), ra3.
(49) Gnad, F.; Gunawardena, J.; Mann, M. PHOSIDA 2011: the posttranslational modification database. Nucleic Acids Res. 2011, 39 (Database issue), D253−60.
(50) Simonsen, J. L.; Rosada, C.; Serakinci, N.; Justesen, J.; Stenderup, K.; Rattan, S. I.; Jensen, T. G.; Kassem, M. Telomerase expression extends the proliferative life-span and maintains the

(19) Salomonis, N.; Schlieve, C. R.; Pereira, L.; Wahlquist, C.; Colas, A.; Zambon, A. C.; Vranizan, K.; Spindler, M. J.; Pico, A. R.; Cline, M. S.; Clark, T. A.; Williams, A.; Blume, J. E.; Samal, E.; Mercola, M.; Merrill, B. J.; Conklin, B. R. Alternative splicing regulates mouse embryonic stem cell pluripotency and differentiation. Proc. Natl. Acad. Sci. U. S. A. 2010, 107 (23), 10514−9.
(20) Davis, M. J.; Shin, C. J.; Jing, N.; Ragan, M. A. Rewiring the dynamic interactome. Mol. BioSyst. 2012, 8 (8), 2054−66.
(21) Colak, R.; Kim, T.; Michaut, M.; Sun, M.; Irimia, M.; Bellay, J.; Myers, C. L.; Blencowe, B. J.; Kim, P. M. Distinct types of disorder in the human proteome: functional implications for alternative splicing. PLoS Comput. Biol. 2013, 9 (4), e1003030.
(22) Djebali, S.; Davis, C. A.; Merkel, A.; Dobin, A.; Lassmann, T.; Mortazavi, A.; Tanzer, A.; Lagarde, J.; Lin, W.; Schlesinger, F.; Xue, C.; Marinov, G. K.; Khatun, J.; Williams, B. A.; Zaleski, C.; Rozowsky, J.; Roder, M.; Kokocinski, F.; Abdelhamid, R. F.; Alioto, T.; Antoshechkin, I.; Baer, M. T.; Bar, N. S.; Batut, P.; Bell, K.; Bell, I.; Chakrabortty, S.; Chen, X.; Chrast, J.; Curado, J.; Derrien, T.; Drenkow, J.; Dumais, E.; Dumais, J.; Duttagupta, R.; Falconnet, E.; Fastuca, M.; Fejes-Toth, K.; Ferreira, P.; Foissac, S.; Fullwood, M. J.; Gao, H.; Gonzalez, D.; Gordon, A.; Gunawardena, H.; Howald, C.; Jha, S.; Johnson, R.; Kapranov, P.; King, B.; Kingswood, C.; Luo, O. J.; Park, E.; Persaud, K.; Preall, J. B.; Ribeca, P.; Risk, B.; Robyr, D.; Sammeth, M.; Schaffer, L.; See, L. H.; Shahab, A.; Skancke, J.; Suzuki, A. M.; Takahashi, H.; Tilgner, H.; Trout, D.; Walters, N.; Wang, H.; Wrobel, J.; Yu, Y.; Ruan, X.; Hayashizaki, Y.; Harrow, J.; Gerstein, M.; Hubbard, T.; Reymond, A.; Antonarakis, S. E.; Hannon, G.; Giddings, M. C.; Ruan, Y.; Wold, B.; Carninci, P.; Guigo, R.; Gingeras, T. R. Landscape of transcription in human cells. Nature 2012, 489 (7414), 101−8.
(23) Merkin, J.; Russell, C.; Chen, P.; Burge, C. B. Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science 2012, 338 (6114), 1593−9.
(24) Ezkurdia, I.; del Pozo, A.; Frankish, A.; Rodriguez, J. M.; Harrow, J.; Ashman, K.; Valencia, A.; Tress, M. L. Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function. Mol. Biol. Evol. 2012, 29 (9), 2265−83.
(25) Khatun, J.; Yu, Y.; Wrobel, J. A.; Risk, B. A.; Gunawardena, H. P.; Secrest, A.; Spitzer, W. J.; Xie, L.; Wang, L.; Chen, X.; Giddings, M. C. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics 2013, 14, 141.
(26) Wang, X.; Slebos, R. J.; Wang, D.; Halvey, P. J.; Tabb, D. L.; Liebler, D. C.; Zhang, B. Protein identification using customized protein sequence databases derived from RNA-Seq data. J. Proteome Res. 2012, 11 (2), 1009−17.
(27) Blakeley, P.; Siepen, J. A.; Lawless, C.; Hubbard, S. J. Investigating protein isoforms via proteomics: a feasibility study. Proteomics 2010, 10 (6), 1127−40.
(28) Goode, R. J.; Yu, S.; Kannan, A.; Christiansen, J. H.; Beitz, A.; Hancock, W. S.; Nice, E.; Smith, A. I. The proteome browser web portal. J. Proteome Res. 2013, 12 (1), 172−8.
(29) Guo, F.; Wang, D.; Liu, Z.; Lu, L.; Zhang, W.; Sun, H.; Zhang, H.; Ma, J.; Wu, S.; Li, N.; Jiang, Y.; Zhu, W.; Qin, J.; Xu, P.; Li, D.; He, F. CAPER: a chromosome-assembled human proteome browser. J. Proteome Res. 2013, 12 (1), 179−86.
(30) Chang, K. Y.; Georgianna, D. R.; Heber, S.; Payne, G. A.; Muddiman, D. C. Detection of alternative splice variants at the proteome level in Aspergillus flavus. J. Proteome Res. 2010, 9 (3), 1209−17.
(31) Chang, K. Y.; Muddiman, D. C. Identification of alternative splice variants in Aspergillus flavus through comparison of multiple tandem MS search algorithms. BMC Genomics 2011, 12, 358.
(32) Severing, E. I.; van Dijk, A. D.; van Ham, R. C. Assessing the contribution of alternative splicing to proteome diversity in Arabidopsis thaliana using proteomics data. BMC Plant Biol. 2011, 11 (1), 82.
(33) Pruitt, K. D.; Tatusova, T.; Brown, G. R.; Maglott, D. R. NCBI Reference Sequences (RefSeq): current status, new features and


osteogenic potential of human bone marrow stromal cells. Nat. Biotechnol. 2002, 20 (6), 592−6.
(51) Abdallah, B. M.; Haack-Sorensen, M.; Burns, J. S.; Elsnab, B.; Jakob, F.; Hokland, P.; Kassem, M. Maintenance of differentiation potential of human bone marrow mesenchymal stem cells immortalized by human telomerase reverse transcriptase gene despite extensive proliferation. Biochem. Biophys. Res. Commun. 2005, 326 (3), 527−38.
(52) Al-Nbaheen, M.; Vishnubalaji, R.; Ali, D.; Bouslimi, A.; Al-Jassir, F.; Megges, M.; Prigione, A.; Adjaye, J.; Kassem, M.; Aldahmash, A. Human stromal (mesenchymal) stem cells from bone marrow, adipose tissue and skin exhibit differences in molecular phenotype and differentiation potential. Stem Cell Rev. 2013, 9 (1), 32−43.
(53) Shevchenko, A.; Wilm, M.; Vorm, O.; Mann, M. Mass spectrometric sequencing of proteins from silver-stained polyacrylamide gels. Anal. Chem. 1996, 68 (5), 850−8.
(54) Hart-Smith, G.; Raftery, M. J. Detection and characterization of low abundance glycopeptides via higher-energy C-trap dissociation and orbitrap mass analysis. J. Am. Soc. Mass Spectrom. 2012, 23 (1), 124−40.
(55) Zou, L.; Zou, X.; Chen, L.; Li, H.; Mygind, T.; Kassem, M.; Bunger, C. Multilineage differentiation of porcine bone marrow stromal cells associated with specific gene expression pattern. J. Orthop. Res. 2008, 26 (1), 56−64.
(56) Twine, N. A.; Janitz, C.; Wilkins, M. R.; Janitz, M. Sequencing of hippocampal and cerebellar transcriptomes provides new insights into the complexity of gene regulation in the human brain. Neurosci. Lett. 2013, 541, 263−8.
(57) FastQC. http://www.bioinformatics.babraham.ac.uk/projects/fastqc. Accessed October 14, 2013.
(58) Kim, D.; Pertea, G.; Trapnell, C.; Pimentel, H.; Kelley, R.; Salzberg, S. L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14 (4), R36.
(59) Trapnell, C.; Hendrickson, D. G.; Sauvageau, M.; Goff, L.; Rinn, J. L.; Pachter, L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 2013, 31 (1), 46−53.
(60) Cufflinks Manual. http://cufflinks.cbcb.umd.edu/manual.html#cufflinks. Accessed October 14, 2013.
(61) Goecks, J.; Nekrutenko, A.; Taylor, J.; Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11 (8), R86.
(62) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551−67.
(63) The SAM Format Specification (v1.4-r985). http://samtools.sourceforge.net/SAM1.pdf (accessed May 9, 2013).
(64) HyperSQL: HSQLDB - 100% Java Database. http://hsqldb.org/ (accessed May 9, 2013).
(65) Stein, L. Generic Feature Format Version 3 (GFF3). http://www.sequenceontology.org/gff3.shtml (accessed May 9, 2013).
(66) Jones, A. R.; Eisenacher, M.; Mayer, G.; Kohlbacher, O.; Siepen, J.; Hubbard, S. J.; Selley, J. N.; Searle, B. C.; Shofstahl, J.; Seymour, S. L.; Julian, R.; Binz, P. A.; Deutsch, E. W.; Hermjakob, H.; Reisinger, F.; Griss, J.; Vizcaino, J. A.; Chambers, M.; Pizarro, A.; Creasy, D. The mzIdentML data standard for mass spectrometry-based proteomics results. Mol. Cell. Proteomics 2012, 11 (7), M111.014381.
(67) Arthur, J. W.; Wilkins, M. R. Using proteomics to mine genome sequences. J. Proteome Res. 2004, 3 (3), 393−402.
(68) Gupta, N.; Tanner, S.; Jaitly, N.; Adkins, J. N.; Lipton, M.; Edwards, R.; Romine, M.; Osterman, A.; Bafna, V.; Smith, R. D.; Pevzner, P. A. Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Genome Res. 2007, 17 (9), 1362−77.
(69) Tress, M. L.; Bodenmiller, B.; Aebersold, R.; Valencia, A. Proteomics studies confirm the presence of alternative protein isoforms on a large scale. Genome Biol. 2008, 9 (11), R162.

(70) Meyer, L. R.; Zweig, A. S.; Hinrichs, A. S.; Karolchik, D.; Kuhn, R. M.; Wong, M.; Sloan, C. A.; Rosenbloom, K. R.; Roe, G.; Rhead, B.; Raney, B. J.; Pohl, A.; Malladi, V. S.; Li, C. H.; Lee, B. T.; Learned, K.; Kirkup, V.; Hsu, F.; Heitner, S.; Harte, R. A.; Haeussler, M.; Guruvadoo, L.; Goldman, M.; Giardine, B. M.; Fujita, P. A.; Dreszer, T. R.; Diekhans, M.; Cline, M. S.; Clawson, H.; Barber, G. P.; Haussler, D.; Kent, W. J. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 2013, 41 (Database issue), D64−9.
(71) Woo, S.; Cha, S. W.; Merrihew, G.; He, Y.; Castellana, N.; Guest, C.; Maccoss, M.; Bafna, V. Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res. 2013, DOI: 10.1021/pr400294c.
(72) Blakeley, P.; Overton, I. M.; Hubbard, S. J. Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies. J. Proteome Res. 2012, 11 (11), 5221−34.
(73) Menon, R.; Zhang, Q.; Zhang, Y.; Fermin, D.; Bardeesy, N.; DePinho, R. A.; Lu, C.; Hanash, S. M.; Omenn, G. S.; States, D. J. Identification of novel alternative splice isoforms of circulating proteins in a mouse model of human pancreatic cancer. Cancer Res. 2009, 69 (1), 300−9.
(74) Lee, Y.; Lee, Y.; Kim, B.; Shin, Y.; Nam, S.; Kim, P.; Kim, N.; Chung, W. H.; Kim, J.; Lee, S. ECgene: an alternative splicing database update. Nucleic Acids Res. 2007, 35 (Database issue), D99−103.
(75) Bianchi, V.; Colantoni, A.; Calderone, A.; Ausiello, G.; Ferre, F.; Helmer-Citterich, M. DBATE: database of alternative transcripts expression. Database (Oxford) 2013, 2013, bat050.
(76) Schulz, M. H.; Zerbino, D. R.; Vingron, M.; Birney, E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012, 28 (8), 1086−92.
(77) Chu, H. T.; Hsiao, W. W.; Chen, J. C.; Yeh, T. J.; Tsai, M. H.; Lin, H.; Liu, Y. W.; Lee, S. A.; Chen, C. C.; Tsao, T. T.; Kao, C. Y. EBARDenovo: highly accurate de novo assembly of RNA-Seq with efficient chimera-detection. Bioinformatics 2013, 29 (8), 1004−10.
(78) Ning, K.; Nesvizhskii, A. I. The utility of mass spectrometry-based proteomic data for validation of novel alternative splice forms reconstructed from RNA-Seq data: a preliminary assessment. BMC Bioinf. 2010, 11 (Suppl 11), S14.
(79) Mo, F.; Hong, X.; Gao, F.; Du, L.; Wang, J.; Omenn, G. S.; Lin, B. A compatible exon-exon junction database for the identification of exon skipping events using tandem mass spectrum data. BMC Bioinf. 2008, 9, 537.
