
Technical Note

proBAMconvert: a conversion tool for proBAM/proBed
Volodimir Olexiouk and Gerben Menschaert
J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 02 Jun 2017


Journal of Proteome Research is published by the American Chemical Society, 1155 Sixteenth Street N.W., Washington, DC 20036. Copyright © American Chemical Society.


Journal of Proteome Research

proBAMconvert: a conversion tool for proBAM/proBed

Volodimir Olexiouk1,*, Gerben Menschaert1,*

1 Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium

* To whom correspondence should be addressed. Tel: +32 9 264 99 22; Email: [email protected], [email protected]

Present Address: BioBiX - Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, Building A, Gent, 9000, Belgium

ABSTRACT

The introduction of new standard formats, proBAM and proBed, improves the integration of genomics and proteomics information, thus aiding proteogenomics applications. These novel formats make it possible to store, inspect and analyze peptide spectrum matches (PSMs) within the context of the genome. However, an easy-to-use and transparent tool to convert mass spectrometry identification files to these new formats is indispensable. proBAMconvert enables the conversion of common identification file formats (mzIdentML, mzTab and pepXML) to proBAM/proBed using an intuitive interface. Furthermore, proBAMconvert can output information at both the PSM and the peptide level and offers a command line interface alongside the graphical user interface. Detailed documentation and a completely worked-out tutorial are available at http://probam.biobix.be.

KEYWORDS

proBAM, proBed, conversion tool, proteogenomics, bio-informatics

BACKGROUND

Interplay between multiple omics fields has gained significant momentum in recent years. Different omics fields such as transcriptomics and proteomics, each with their specific technological and computational challenges, are stand-alone, mature research fields and have been thoroughly investigated. However, advances in bioinformatics tools bridge the gap between these different omics, resulting in multi-omics integration towards better explanation of biological questions. The field of proteogenomics, i.e. the combination of genomics, transcriptomics and proteomics, is steadily growing (for recent advances in proteogenomics see 1). Advances in proteogenomics created the need for a common template to map translation information obtained from mass spectrometry to genomic or expression information obtained from sequencing-based technologies.
For this purpose, the HUPO Proteomics Standards Initiative devised two new file formats, proBAM (http://www.psidev.info/proBAM) and proBed (http://www.psidev.info/proBED), both in internal review at the time of writing. proBAM is based on the binary form of the SAM format2, whereas proBed is an extension of the BED format3. Both BED and BAM are frequently used to store sequencing data, aiding downstream analysis and visualization in popular genome browsers such as UCSC4,5 and IGV6. proBAM and proBed serve a similar purpose and are comparable, just as SAM/BAM and BED are in genomics. However, the SAM/BAM format is much more feature rich, and depending on the follow-up analysis one format may be preferred. These novel formats extend their genomic counterparts rather than reinventing them, while maintaining compatibility with existing BAM/BED tools. However, due to the novelty of these formats, conversion tools are indispensable to properly generate them. Here we introduce proBAMconvert, a tool to convert common proteomic identification files (mzIdentML (http://www.psidev.info/mzidentml), mzTab (http://www.psidev.info/mztab) or pepXML (http://tools.proteomecenter.org/wiki/index.php?title=Formats:pepXML)) to proBAM/proBed.

proBAMconvert is designed with great consideration for user-friendliness, aiming to reach a wide audience ranging from wet-lab to in silico researchers. With this in mind, an intuitive Graphical User Interface (GUI) was created that does not require any preinstalled software. Furthermore, a Command Line Interface (CLI) and a separate Python implementation are available on GitHub (https://github.com). Additionally, proBAMconvert is implemented in Galaxy7 (http://galaxy.ugent.be), can be downloaded from the proteomics Galaxy toolshed (https://github.com/galaxyproteomics/tools-galaxyp/tree/probamconvert) and is available as a bioconda package (https://bioconda.github.io/).

Figure 1: proBAMconvert graphical user interface (GUI)


MATERIALS AND METHODS

1. data import

proBAMconvert is written in Python 2.7.11 and is available for UNIX-based systems, OSX and Windows. proBAMconvert accepts common MS identification file standards, mzIdentML8, pepXML and mzTab9, as input and converts these files to proBAM or proBed. Files are read using pyteomics10 (for pepXML and partly for mzIdentML) or a self-built parser (for mzTab and partly for mzIdentML). Besides the peptide and PSM level information, extra comment lines and other MS-protocol specific attributes, such as enzyme (specificity), missed cleavages, peptide charge, mass difference, modifications and PSM score, are copied from the original files in compliance with the corresponding format guidelines (see also the specification documents on proBAM (http://www.psidev.info/proBAM) and proBed (http://www.psidev.info/proBED)).

2. id conversion

proBAMconvert automatically recognizes the protein identifiers used in the identification file. These identifiers must comply with guidelines adapted from commonly used software. The identifier may be delimited by pipes “|” or underscores “_”. For a decoy entry, the extra annotation should precede or follow the protein identifier, joined by an underscore. Below, the accepted protein identifier (protein_ID) formats are listed:

[DECOY]_[protein_ID]
..|..|..|[protein_ID]_[DECOY]|..|..|..
.._.._.._[protein_ID]_[DECOY]_.._.._..
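The identifier formats above can be illustrated with a minimal sketch. It is not proBAMconvert's actual parser: the function names `strip_decoy` and `parse_protein_id`, the simplified regular expressions and the decision to split on pipes only are assumptions made for illustration. The decoy tags are the default annotations that proBAMconvert accepts (REV_, DECOY_, _REVERSED, REVERSED_, _DECOY).

```python
import re

# Default decoy annotations accepted by proBAMconvert.
DEFAULT_DECOYS = ("REV_", "DECOY_", "_REVERSED", "REVERSED_", "_DECOY")

# Hypothetical, simplified identifier patterns (assumptions for this sketch).
ID_PATTERNS = [
    ("Ensembl_tr", re.compile(r"ENS[A-Z]*T\d{11}(\.\d+)?")),
    ("Ensembl_pr", re.compile(r"ENS[A-Z]*P\d{11}(\.\d+)?")),
    ("RefSeq", re.compile(r"[NXY]P_\d+(\.\d+)?")),
    ("UniProt_ACC", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("UniProt_Entry", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def strip_decoy(token, decoys=DEFAULT_DECOYS):
    """Return (token without decoy tag, True if a decoy tag was found)."""
    for tag in decoys:
        # Tags ending in "_" are prefixes, tags starting with "_" are suffixes.
        if tag.endswith("_") and token.startswith(tag):
            return token[len(tag):], True
        if tag.startswith("_") and token.endswith(tag):
            return token[:-len(tag)], True
    return token, False


def parse_protein_id(raw, decoys=DEFAULT_DECOYS):
    """Return the first recognized identifier in a pipe-delimited string."""
    is_decoy = False
    for token in raw.split("|"):
        token, tagged = strip_decoy(token.strip(), decoys)
        is_decoy = is_decoy or tagged
        for id_type, pattern in ID_PATTERNS:
            if pattern.fullmatch(token):
                return {"id": token, "type": id_type, "decoy": is_decoy}
    return None  # no recognizable identifier


print(parse_protein_id("sp|P12345|ALBU_HUMAN"))
print(parse_protein_id("ENSP00000334393_REVERSED"))
```

Returning the first match mirrors the default behaviour described in the text (the first recognized protein_ID is used); supporting a user-selected identifier type would only require filtering `ID_PATTERNS`.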

The accepted decoy annotations are “REV_, DECOY_, _REVERSED, REVERSED_, _DECOY”, which cover most use cases. However, the user can add additional decoy annotations (see section “5. options”). Currently, Ensembl transcript11, Ensembl protein11 and RefSeq identifiers12 as well as UniProt/TrEMBL13 accessions and entry names are recognized. By default, proBAMconvert uses the first protein_ID recognized, but the user can specify which identifier proBAMconvert needs to use, or can specify that all protein identifiers encountered in the file are converted (see section “5. options”).

3. genomic data retrieval

Next, the protein identifiers are converted to Ensembl11 identifiers from the target annotation database using the BioMart14 interface from the BioServices15 Python package. When multiple identifiers are available from the target database (e.g. UniProt accessions can have multiple corresponding Ensembl transcript IDs), all are processed. Next, the corresponding genomic information is retrieved from Ensembl; this includes transcript/protein sequences, exon information, etc. These data are retrieved using the BioMart interface from BioServices15 and the Ensembl MySQL database11.

4. mapping


Using exon information, the transcript sequence is spliced and translated. Next, the identified peptides are mapped onto the corresponding transcript (allowing mismatches if specified), providing the genomic coordinates of the mapped peptide. Furthermore, proBAMconvert allows 3-frame translation of transcripts (see section “5. options”); in this case the (un-spliced) transcript sequence is translated in three frames. This option can be particularly useful when, for instance, scanning 5’UTR/3’UTR regions or mapping novel protein isoforms. The remaining proBAM/proBed attributes are then reconstructed using the genomic coordinates and exon/transcript/gene information, or copied directly from the original peptide identification file.

5. Options

The table below provides a summary of all the input variables to proBAMconvert.

option: description

choose file: Choose the file to be converted (mzIdentML, pepXML or mzTab file format).

working directory: Specify the working directory (where the generated files are stored).

project name: Specify the project name (the generated files will carry this name).

select species: Select the species used to perform the mapping.

select database version: Select the Ensembl database version.

remove duplicate PSM mappings: When protein isoforms are included in the search space, a peptide can map to multiple isoforms. This can lead to confusion when visualizing or analyzing the data. Turning this option on prevents a PSM entry from mapping to the same location multiple times because of this isoform duplication.

conversion mode: Specifies which output will be generated. The options are:
- proBAM_psm (default, proBAM format)
- proBAM_peptide (peptide-based proBAM, as groups of PSMs)
- proBAM_peptide_mod (peptide-based proBAM in which peptides with different modifications are considered different peptides)
- proBed (proBed PSM format)

allow mismatches: While mapping peptides to the genome, mismatches can be allowed. Here one can specify how many mismatches are allowed.

Advanced options

sorting order: Specify the BAM sorting order. The BAM file can be sorted based on various attributes, which might play a role in third-party software. Options:
- unknown
- unsorted
- query_name
- coordinate

decoy annotation(s): Specify how decoy annotations should be recognized. The accepted decoy annotations should be comma-separated; the default values are "REV_,DECOY_,_REVERSED,REVERSED_,_DECOY".

3-frame translation: Specifies whether a three-reading-frame translation needs to be performed on the transcript sequences. In this case, all three reading frames are used for mapping.

annotation identifiers: Common proteomics identification files specify the identifier of the protein to which the corresponding peptide was assigned. proBAMconvert is able to identify and use different protein identifiers. This attribute specifies how proBAMconvert should process these identifiers. The options are:
- First: the first identifier recognized will be used throughout the whole conversion (default)
- Ensembl_tr: Ensembl transcript identifiers will be used
- Ensembl_pr: Ensembl protein identifiers will be used
- UniProt_ACC: UniProt accession identifiers will be used
- UniProt_Entry: UniProt entry names will be used
- RefSeq: RefSeq identifiers will be used
- all: all recognizable identifiers will be used

include_unmapped: Specifies whether unmapped PSMs should be included in the final output (only for proBAM).

add comment(s): Add additional comments to the proBAM/proBed file.

only validated (mzIdentML): mzIdentML files can contain both validated and non-validated PSMs. Here it can be specified whether only the validated PSMs should be converted.

Table 1: proBAMconvert v1.0.0 input options
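To make the "3-frame translation" and "allow mismatches" options more concrete, the following is a minimal sketch of the idea, assuming the standard genetic code and a simple Hamming-distance search. The function names (`translate`, `find_with_mismatches`, `three_frame_map`) and the toy transcript are hypothetical and do not reflect proBAMconvert's internal code.

```python
from itertools import product

# Standard genetic code in the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # "X" for codons with unknown bases
        for i in range(frame, len(seq) - 2, 3)
    )


def find_with_mismatches(peptide, protein, max_mismatches=0):
    """Return (start, mismatches) for every window within the mismatch limit."""
    hits = []
    for start in range(len(protein) - len(peptide) + 1):
        window = protein[start:start + len(peptide)]
        mismatches = sum(a != b for a, b in zip(peptide, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def three_frame_map(transcript, peptide, max_mismatches=0):
    """Map a peptide onto all three reading frames of a transcript.

    Returns (frame, nucleotide offset in the transcript, mismatches) tuples.
    """
    hits = []
    for frame in range(3):
        protein = translate(transcript, frame)
        for aa_start, mismatches in find_with_mismatches(
                peptide, protein, max_mismatches):
            hits.append((frame, frame + 3 * aa_start, mismatches))
    return hits


# Toy transcript: frame 1 reads ATG GCC AAG CTG TAA, i.e. "MAKL*".
toy = "GATGGCCAAGCTGTAA"
print(three_frame_map(toy, "MAKL"))                    # [(1, 1, 0)]
print(three_frame_map(toy, "MGKL", max_mismatches=1))  # [(1, 1, 1)]
```

The exhaustive window scan is chosen for clarity; for large search spaces an indexed search would be more appropriate.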

6. proBAM/proBed

proBAMconvert is able to generate proBAM (both at PSM and peptide level) and proBed files. The peptide-based proBAM files are generated according to the guidelines from the proBAM specification file (http://www.psidev.info/proBAM). Furthermore, peptide-based proBAM can be generated with or without considering peptide modifications.
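The difference between the two peptide-level modes can be sketched as a choice of grouping key. The PSM dictionaries, their field names and the function `group_psms` below are hypothetical; how peptide-level entries are actually encoded is defined by the proBAM specification, not by this example.

```python
from collections import defaultdict

# Hypothetical PSM records; modifications are (position, name) pairs.
EXAMPLE_PSMS = [
    {"id": "psm1", "sequence": "AMKLR", "modifications": []},
    {"id": "psm2", "sequence": "AMKLR", "modifications": [(2, "Oxidation")]},
    {"id": "psm3", "sequence": "AMKLR", "modifications": []},
    {"id": "psm4", "sequence": "VTEGK", "modifications": []},
]


def group_psms(psms, mode="proBAM_peptide"):
    """Group PSM ids into peptide-level entries.

    proBAM_peptide      groups by peptide sequence only.
    proBAM_peptide_mod  groups by sequence plus modifications, so differently
                        modified forms of a peptide stay separate.
    """
    groups = defaultdict(list)
    for psm in psms:
        if mode == "proBAM_peptide":
            key = (psm["sequence"],)
        elif mode == "proBAM_peptide_mod":
            key = (psm["sequence"], tuple(sorted(psm["modifications"])))
        else:
            raise ValueError("unsupported mode: %s" % mode)
        groups[key].append(psm["id"])
    return dict(groups)


print(len(group_psms(EXAMPLE_PSMS, "proBAM_peptide")))      # 2
print(len(group_psms(EXAMPLE_PSMS, "proBAM_peptide_mod")))  # 3
```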

The Python pysam package (https://github.com/pysam-developers/pysam) is used to convert the SAM to its binary form, BAM. The proBAMconvert website, http://probam.biobix.be, contains an extensive manual covering the methodology and usage of proBAMconvert. Additionally, a tutorial is provided demonstrating a real use case. Figure 2 provides an overview of the proBAMconvert workflow.


Figure 2: A visual overview of the proBAMconvert workflow. 1) Peptide identification files (mzIdentML, mzTab or pepXML) are read by proBAMconvert and subsequently parsed to retrieve all necessary information. 2) The protein identifiers retrieved in step 1 are then converted to Ensembl identifiers. 3) Next, genomic information is retrieved for all identifiers; this includes genomic coordinates, exon information, DNA sequence, etc. 4) Using the genomic information retrieved in the previous step, the peptides are mapped against their corresponding translated coding sequence. proBAMconvert assembles all remaining attributes, either extracted from the original peptide identification file or calculated using the acquired information. Finally, proBAMconvert generates output in the form of regular PSM-based proBAM, peptide-based proBAM or proBed.
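Step 4 of the workflow, going from a peptide's position in the translated sequence to genomic coordinates, can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: exons are given as 0-based, half-open intervals in transcript order, the protein sequence and the transcript offset of its first codon (`cds_offset`) are taken as known, and all function names are hypothetical. The block sizes and block starts follow the standard BED12 convention.

```python
def transcript_to_genome(exons, t_start, t_end, strand="+"):
    """Convert a transcript interval [t_start, t_end) to genomic blocks.

    exons: list of (start, end) genomic intervals, 0-based half-open,
           listed in transcript order (descending on the minus strand).
    """
    blocks = []
    consumed = 0  # transcript bases covered by the exons seen so far
    for ex_start, ex_end in exons:
        ex_len = ex_end - ex_start
        lo = max(t_start, consumed)
        hi = min(t_end, consumed + ex_len)
        if lo < hi:  # the interval overlaps this exon
            if strand == "+":
                blocks.append((ex_start + lo - consumed, ex_start + hi - consumed))
            else:
                blocks.append((ex_end - (hi - consumed), ex_end - (lo - consumed)))
        consumed += ex_len
    return sorted(blocks)


def map_peptide(peptide, protein, cds_offset, exons, strand="+"):
    """Return genomic blocks for the first exact occurrence of a peptide."""
    aa_start = protein.find(peptide)
    if aa_start < 0:
        return None  # peptide not found in this translation
    t_start = cds_offset + 3 * aa_start
    return transcript_to_genome(exons, t_start, t_start + 3 * len(peptide), strand)


def bed_blocks(blocks):
    """Summarize blocks as (chromStart, chromEnd, blockSizes, blockStarts)."""
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]
    sizes = [end - start for start, end in blocks]
    starts = [start - chrom_start for start, _ in blocks]
    return chrom_start, chrom_end, sizes, starts


# Two-exon toy transcript on the forward strand; the CDS starts at offset 3.
exons = [(100, 110), (200, 220)]
blocks = map_peptide("KLV", "MAKLVTR", 3, exons)
print(blocks)              # [(109, 110), (200, 208)]
print(bed_blocks(blocks))  # (109, 208, [1, 8], [0, 91])
```

The peptide in this toy example spans the exon junction, which is why it is reported as two blocks; a peptide contained in a single exon yields one block.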

RESULTS AND DISCUSSION

proBAMconvert was thoroughly tested on more than 50 peptide identification files (mzIdentML, mzTab and pepXML formats). This large test set was carefully selected to include files from different resources, generated by various software tools (e.g. PeptideShaker16, PRIDE-utils, Scaffold (http://www.proteomesoftware.com/products/scaffold), Mascot (http://www.matrixscience.com), MSGF+17, PEAKS18, XTandem19, MyriMatch20, …). Subsequently, the proBAM/proBed files generated by proBAMconvert were tested for compatibility with different tools, such as SAMTools2, BEDTools21, IGV22, UCSC3 and JBrowse23.


To demonstrate the utility of proBAMconvert, a complete proteogenomics study was reprocessed based on different publicly available datasets. This test case consists of multi-omics data from mouse embryonic stem cells. More specifically, RIBO-seq (translatomics) datasets from a study performed by Ingolia et al.24 were used (dataset GSE30839 from NCBI’s Gene Expression Omnibus) and processed using the PROTEOFORMER pipeline as described by Crappé et al.25. Next, this RIBO-seq derived translation product database was used to reanalyze a matching shotgun mass spectrometry dataset (PXD000124 accession within the PRIDE repository) as described by Menschaert et al.26 using SearchGUI27 and PeptideShaker16. Identifications were exported from PeptideShaker16 as mzIdentML and used as input for our conversion tool proBAMconvert. Bedgraphs generated by PROTEOFORMER from the RNA-seq and RIBO-seq data, alongside the generated proBAM files from both types of proteomic experiments, were then visualized in the IGV browser, enabling detailed inspection of this multi-sourced information in one view. Figure 3 contains an example where evidence from the various omics layers is visualized in the IGV browser, focused on the start region of the IQGAP1 gene. As can be seen in Figure 3, accumulating evidence indicates that translation also starts at a near-cognate start site located in the 5’UTR region. A tutorial is available at http://probam.biobix.be/manual/tutorial where different genes and scenarios can be inspected.

Figure 3: Multi-omics visualization of the IQGAP1 translation start region. Ribosome profiling data of translating ribosomes (anti-sense strand) are visualized in the first track (red). The second track shows ribosome profiling data from ribosomes stalled at the translation initiation site (blue). The third track contains the average proBAM coverage, whereas the fourth track contains the full proBAM mappings (grey). The fifth track contains the reference genome sequence (anti-sense) translated in its three reading frames, with the Ensembl gene annotation track loaded below it. All tracks are shown in the IGV genome browser, focusing on the start-site region of the IQGAP1 gene. This particular gene was chosen because it provides evidence, throughout all omics layers, that translation also occurs at a near-cognate start site located in the 5’UTR of the transcript (red square).


CONCLUSION & FUTURE PERSPECTIVES

The newly defined formats, proBAM and proBed, facilitate the integration of genomics and proteomics. They provide a platform where proteomic data can easily be integrated into genomics and facilitate the detection/verification of novel events such as protein extensions or near-cognate start sites. Maturation of both novel data formats, together with increased tool usage, will demand new features from proBAMconvert, which will consequently be implemented. In the future, proBAMconvert will be expanded by adding additional species (Rattus norvegicus, Arabidopsis thaliana, Caenorhabditis elegans, …) and allowing the use of additional protein identifiers (UCSC ID, EntrezGene ID). Also, new features are being developed for proBAMconvert, such as whole genome mapping, allowing input of coordinates instead of protein identifiers. proBAMconvert will be gradually updated and expanded, acting upon the needs of the community. We aim to create and maintain a community standard for the conversion of common peptide identification files to proBAM/proBed.

ACCESSIBILITY & OTHER TOOLS

proBAMconvert is available as executables for UNIX, OSX and Windows, both as a GUI and a CLI (downloads are available at http://probam.biobix.be/download_page). Moreover, proBAMconvert is implemented in Galaxy7 (http://galaxy.ugent.be), can be downloaded from the proteomics Galaxy toolshed (https://github.com/galaxyproteomics/tools-galaxyp/tree/probamconvert) and is available as a bioconda package (https://bioconda.github.io). The website http://probam.biobix.be provides detailed information on the tool, alongside an extensive manual and a completely worked-out tutorial. The standalone application was created using PyInstaller (http://www.pyinstaller.org), providing a one-file executable with no need to pre-install any software.
The source code (Python 2.7.11) is available from GitHub (https://github.com/Biobix/proBAMconvert), and an issue tracking system is also set up (https://github.com/Biobix/proBAMconvert/issues). The novel proBAM/proBed formats facilitate the combined representation and inspection of multi-omics data, and likewise support downstream analysis of the proteomics data within a proteogenomics setting. Even though proBAM/proBed are novel formats, tools such as proBAMsuite28 already enable downstream analysis, and more tools handling these formats are expected to emerge. proBAMsuite28 is also able to generate the proBAM format, and the PRIDE-Utilities29 package also generates the proBed format. However, the usage and purpose of these tools differ significantly from proBAMconvert in the following ways:

(1) proBAMconvert provides a user-friendly GUI, requiring neither (bio-)informatics knowledge nor any preinstalled software.
(2) proBAMconvert converts all routinely used peptide identification file formats to proBAM/proBed, requiring neither programming experience nor the download of additional data (annotation information is gathered on the fly by the application).
(3) proBAMconvert provides various customization options to tailor the output to specific needs (e.g. allowing mismatches, 3-frame translation mapping, removing duplicate mappings; see Materials and Methods).

Output from proBAMconvert is compatible with conventional genome browsers supporting BAM/BED, such as IGV22, UCSC3 and JBrowse23. The output should also be compatible with all tools using BAM/BED.

ACKNOWLEDGMENTS

We would like to thank Jim Johnson and Björn Grüning for creating a bioconda package and a Galaxy7 implementation for proBAMconvert.

FUNDING

Postdoctoral Fellow of the Research Foundation – Flanders (FWO-Vlaanderen) [G.M., 12A7813N]. Research grant from the Research Foundation – Flanders (FWO-Vlaanderen) [V.O., G0D3114N]. Funding for open access charge: Ghent University.

Conflict of interest statement. None declared.

REFERENCES

(1) Végvári, Á. Proteogenomics. Advances in Experimental Medicine and Biology 2016, 926.

(2) Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25 (16), 2078–2079.

(3) Rosenbloom, K. R.; Armstrong, J.; Barber, G. P.; Casper, J.; Clawson, H.; Diekhans, M.; Dreszer, T. R.; Fujita, P. A.; Guruvadoo, L.; Haeussler, M.; et al. The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 2015, 43 (D1), D670–D681.

(4) Karolchik, D.; Hinrichs, A. S.; Kent, W. J. The UCSC genome browser. Curr. Protoc. Hum. Genet. 2011, 71, 18.6.1–18.6.33.

(5) Kent, W. J.; Sugnet, C. W.; Furey, T. S.; Roskin, K. M.; Pringle, T. H.; Zahler, A. M.; Haussler, D. The Human Genome Browser at UCSC. Genome Research. 2002, pp 996–1006.

(6) Robinson, J. T.; Thorvaldsdóttir, H.; Winckler, W.; Guttman, M.; Lander, E. S.; Getz, G.; Mesirov, J. P. Integrative genomics viewer. Nat. Biotechnol. 2011, 29 (1), 24–26.

(7) Blankenberg, D.; Kuster, G. Von; Coraor, N.; Ananda, G.; Lazarus, R.; Mangan, M.; Nekrutenko, A.; Taylor, J. Galaxy: A web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology. 2010.

(8) Apweiler, R.; Bateman, A.; Martin, M. J.; O’Donovan, C.; Magrane, M.; Alam-Faruque, Y.; Alpi, E.; Antunes, R.; Arganiska, J.; Casanova, E. B.; et al. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014, 42 (D1), D191–D198.

(9) Orchard, S. Data Standardization and Sharing - The work of the HUPO-PSI. Biochimica et Biophysica Acta - Proteins and Proteomics. 2014, pp 82–87.

(10) Goloborodko, A. A.; Levitsky, L. I.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics - A python framework for exploratory data analysis and rapid software prototyping in proteomics. J. Am. Soc. Mass Spectrom. 2013, 24 (2), 301–304.

(11) Cunningham, F.; Amode, M. R.; Barrell, D.; Beal, K.; Billis, K.; Brent, S.; Carvalho-Silva, D.; Clapham, P.; Coates, G.; Fitzgerald, S.; et al. Ensembl 2015. Nucleic Acids Res. 2014, 43 (D1), D662–D669.

(12) Pruitt, K. D.; Tatusova, T.; Brown, G. R.; Maglott, D. R. NCBI Reference Sequences (RefSeq): Current status, new features and genome annotation policy. Nucleic Acids Res. 2012, 40 (D1).

(13) Boutet, E.; Lieberherr, D.; Tognolli, M.; Schneider, M.; Bairoch, A. UniProtKB/Swiss-Prot. Methods Mol. Biol. 2007, 406, 89–112.

(14) Smedley, D.; Haider, S.; Durinck, S.; Pandini, L.; Provero, P.; Allen, J.; Arnaiz, O.; Awedh, M. H.; Baldock, R.; Barbiera, G.; et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 2015, 43 (April), 589–598.

(15) Cokelaer, T.; Pultz, D.; Harder, L. M.; Serra-Musach, J.; Saez-Rodriguez, J.; Valencia, A. BioServices: A common Python package to access biological Web Services programmatically. Bioinformatics 2013, 29 (24), 3241–3242.

(16) Vaudel, M.; Burkhart, J. M.; Zahedi, R. P.; Oveland, E.; Berven, F. S.; Sickmann, A.; Martens, L.; Barsnes, H. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat. Biotechnol. 2015, 33 (1), 22–24.

(17) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277.

(18) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003, 17 (20), 2337–2342.

(19) Muth, T.; Vaudel, M.; Barsnes, H.; Martens, L.; Sickmann, A. XTandem Parser: an open-source library to parse and analyse X!Tandem MS/MS search results. Proteomics 2010, 10 (7), 1522–1524.

(20) Tabb, D. L.; Fernando, C. G.; Chambers, M. C. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J. Proteome Res. 2007, 6 (2), 654–661.

(21) Quinlan, A. R.; Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26 (6), 841–842.

(22) Thorvaldsdóttir, H.; Robinson, J. T.; Mesirov, J. P. Integrative Genomics Viewer (IGV): High-performance genomics data visualization and exploration. Brief. Bioinform. 2013, 14 (2), 178–192.

(23) Skinner, M. E.; Uzilov, A. V.; Stein, L. D.; Mungall, C. J.; Holmes, I. H. JBrowse: A next-generation genome browser. Genome Res. 2009, 19 (9), 1630–1638.

(24) Ingolia, N. T.; Lareau, L. F.; Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 2011, 147 (4), 789–802.

(25) Crappé, J.; Ndah, E.; Koch, A.; Steyaert, S.; Gawron, D.; De Keulenaer, S.; De Meester, E.; De Meyer, T.; Van Criekinge, W.; Van Damme, P.; et al. PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res. 2014, No. 10, 1–10.

(26) Menschaert, G.; Van Criekinge, W.; Notelaers, T.; Koch, A.; Crappé, J.; Gevaert, K.; Van Damme, P. Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. Mol. Cell. Proteomics 2013, 12 (7), 1780–1790.

(27) Vaudel, M.; Barsnes, H.; Berven, F. S.; Sickmann, A.; Martens, L. SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 2011, 11 (5), 996–999.

(28) Wang, X.; Slebos, R. J. C.; Chambers, M. C.; Tabb, D. L.; Liebler, D. C.; Zhang, B. proBAMsuite, a Bioinformatics Framework for Genome-Based Representation and Analysis of Proteomics Data. Mol. Cell. Proteomics 2016, 15, 1164–1175.

(29) Vizcaíno, J. A.; Côté, R. G.; Csordas, A.; Dianes, J. A.; Fabregat, A.; Foster, J. M.; Griss, J.; Alpi, E.; Birim, M.; Contell, J.; et al. The Proteomics Identifications (PRIDE) database and associated tools: Status in 2013. Nucleic Acids Res. 2013, 41 (D1), D1063–D1069.
