Article pubs.acs.org/jpr

PPLine: An Automated Pipeline for SNP, SAP, and Splice Variant Detection in the Context of Proteogenomics

George Sergeevich Krasnov,*,†,‡,§ Alexey Alexandrovich Dmitriev,† Anna Viktorovna Kudryavtseva,†,∥ Alexander Valerievich Shargunov,‡,§ Dmitry Sergeevich Karpov,†,‡ Leonid Andreevich Uroshlev,† Natalya Vladimirovna Melnikova,† Vladimir Mikhailovich Blinov,‡,§ Ekaterina Vladimirovna Poverennaya,‡ Alexander Ivanovich Archakov,‡ Andrey Valerievich Lisitsa,‡ and Elena Alexandrovna Ponomarenko‡

† Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 111991 Russia
‡ Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119121 Russia
§ Mechnikov Research Institute of Vaccines and Sera, Moscow, 105064 Russia
∥ Herzen Moscow Cancer Research Institute, Ministry of Healthcare of the Russian Federation, Moscow, 125284 Russia


ABSTRACT: The fundamental mission of the Chromosome-Centric Human Proteome Project (C-HPP) is the study of human proteome diversity, including rare variants. Liver tissues, HepG2 cells, and plasma were selected as the major objects for C-HPP studies. The proteogenomic approach, a recently introduced technique, is a powerful method for predicting and validating proteoforms arising from alternative splicing, mutations, and transcript editing. We developed PPLine, a Python-based proteogenomic pipeline that provides automated discovery of single-amino-acid polymorphisms (SAP), indels, and alternatively spliced variants from raw transcriptome and exome sequence data, single-nucleotide polymorphism (SNP) annotation and filtration, and the prediction of proteotypic peptides (available at https://sourceforge.net/projects/ppline). In this work, we performed deep transcriptome sequencing of HepG2 cells and liver tissues using two platforms: Illumina HiSeq and Applied Biosystems SOLiD. Using PPLine, we revealed 7756 SAP and indels for HepG2 cells and liver (including 659 variants not annotated in dbSNP). We found 17 indels in transcripts associated with the translation of alternate reading frames (ARF) longer than 300 bp. The ARF products of two genes, SLMO1 and TMEM8A, demonstrate signatures of a caspase-binding domain and a Gcn5-related N-acetyltransferase. Alternative splicing analysis predicted novel proteoforms encoded by 203 (liver) and 475 (HepG2) genes according to both Illumina and SOLiD data. The results of the present work represent a basis for subsequent proteomic studies by the C-HPP consortium.

KEYWORDS: C-HPP, RNA-seq, SNP, SAP, indel, alternative reading frames, alternative splicing, proteotypic peptides



Special Issue: The Chromosome-Centric Human Proteome Project 2015
Received: May 28, 2015
Published: July 6, 2015
© 2015 American Chemical Society
DOI: 10.1021/acs.jproteome.5b00490 J. Proteome Res. 2015, 14, 3729−3737

INTRODUCTION

The Chromosome-Centric Human Proteome Project (C-HPP) is one of the largest international projects, driven by a consortium of scientific groups from 25 countries. The final goal of C-HPP is the comprehensive annotation of human proteome diversity arising from alternative splicing, gene allele variants, mRNA editing, and post-translational modifications.1,2 The Russian part of C-HPP is devoted to the study of the proteins encoded by the genes of chromosome 18. Three types of biomaterial (HepG2 cells, liver tissue, and depleted human plasma) were selected for the pilot series of C-HPP.3

The proteogenomic approach is a powerful two-stage technique enabling the identification of modified protein species (proteoforms) containing single amino acid polymorphisms (SAP) or translated from alternatively spliced transcripts. At the first stage, exome or transcriptome sequencing is performed to reveal mutations and (in the case of RNA-seq) splicing alterations, mRNA editing, and fusion transcripts. Protein prediction and mass spectrometric database construction are then performed based on the transcriptomic data. The second stage involves mass spectrometric analysis aimed at the detection or quantification of the predicted proteoforms. This approach has been successfully used for the analysis of cancer cell lines and clinical tumor samples.4−7

The pilot phase of C-HPP (2010−2015) was devoted to the identification and quantification of canonical proteins, regardless of the presence of SAP, mRNA editing, transcript splice variants, or protein post-translational modifications.1 The introduction of these characteristics greatly expands the field of study; although the total number of protein-coding genes (and, therefore, canonical proteins) only slightly exceeds 20 000, the number of various proteoforms can reach several million.

Within the framework of C-HPP, the proteogenomic approach will be applied for the deep analysis of proteome diversity of HepG2 cells and liver tissues. In this work, we present the results of the first phase of this study: the transcriptomic characterization of the biomaterial and the prediction of the proteotypic peptides that are specific for splice variants and proteoforms translated from SNP-containing alleles or edited transcripts. These variations include frameshift mutations or transcript editing events that can lead to the translation of long alternate reading frames (more than 300 bp). The phenomenon of alternate reading frame translation is essential for many viruses but is also known to play an important role in mammals, both under normal conditions and during the development of cancer.8−10 Together with the results of the alternative-splicing profiling of eight human tissues reported earlier11 and the results obtained by other members of C-HPP,12 these data provide the basis for the proteogenomic analysis that will be performed in the next phase of C-HPP. The results of these studies will be included in the neXtProt knowledge database aimed at complementing UniProtKB.13

The development of exome- and transcriptome-sequencing pipelines is one of the most rapidly growing areas of bioinformatics.14−16 The explosive growth of experimental sequencing data requires robust, accurate, and effective analysis.17 Several pipelines, some sharing functionality with PPLine, have been developed. Recently, Pirooznia et al. presented a unified pipeline for processing NGS data that encompasses four modules (mapping, filtering, realignment and recalibration, and variant calling) and combines a set of popular tools (BWA aligner, Picard, SAMtools, and GATK).18 PRADA, a Python-based pipeline, offers gene expression profiling, quality metrics, detection of unsupervised and supervised fusion transcripts, and the detection of intragenic fusion variants.19 One should also mention OTG-snpcaller, a SNP-caller pipeline based on TMAP and GATK and designed especially for Ion Torrent and Ion Proton sequencing.20 In addition, several software suites have been developed to analyze proteomic data, including PGTools,21 HTAPP,22 and Galaxy-based approaches.23 In this study we present PPLine, a Python-based proteogenomic application providing a complete pipeline from read trimming to the prediction of unique proteotypic peptides for the detection of SAP-containing proteins and the quantification of alternatively spliced proteoforms. This is a bridge between raw sequence data and further proteomic analysis.

Figure 1. PPLine workflow. PPLine combines several RNA-seq and exome-sequencing tools to perform raw data processing, read alignment, deduplication, BQS recalibration, SNP calling and indel realignment, annotation of revealed variants, and alternative splicing profiling and integrates the derived data into single databases and tables.

MATERIALS AND METHODS

Tissue Samples and Cell Lines

We analyzed pooled liver tissue samples and HepG2 cultures previously used in other studies of the C-HPP consortium.3,24 Liver samples were purchased from the ILSBioBiobank; HepG2 cells were grown, harvested, and processed by the standard protocol (see details in the article of Zgoda et al.3).

RNA Purification, Reverse Transcription, Library Preparation, and Sequencing

RNA purification, reverse transcription, library preparation, and sequencing were performed using techniques described in the transcriptomic articles of the other members of the C-HPP team.25,26 Briefly, the total RNA was extracted using an RNeasy Mini Kit (Qiagen, Germany) and tested for integrity using an Agilent 2100 Bioanalyzer (Agilent Technologies). Enrichment for poly-A transcripts and library preparation were performed using a TruSeq RNA Sample Prep Kit (Illumina) and SureSelect RNA Capture Enrichment System (Applied Biosystems). Sequencing was performed using Illumina HiSeq 2000 and Applied Biosystems SOLiD 4 platforms with single-end 50 bp reads (except for Illumina liver transcriptome sequencing, which was done with paired-end 50 bp reads).

Bioinformatics

Here we present our novel tool PPLine, which performs all steps of the bioinformatics analysis: read filtration and trimming, read alignment, base quality score (BQS) recalibration, SNP calling, indel realignment, SAP prediction and annotation, alternative splicing profiling, integration of the data, proteomic FASTA database generation, and prediction of proteotypic peptides (Figure 1). PPLine integrates a set of popular tools into a single pipeline: Trimmomatic, Tophat2, SAMtools, GATK, Cufflinks, and Annovar. Initially, adapters are removed, and reads are 5′-trimmed using Trimmomatic.27 The colorspace SOLiD reads are trimmed using our in-house Python script. The reads are then mapped to the transcriptome and genome (GRCh37.p10) with the splice-aware aligner Tophat2.28 Run parameters are adjusted automatically by PPLine (2 mismatches or gaps for reads shorter than 60 bp and forced read remapping if at least one mismatch or gap is found). The discovery of novel splice junctions using the segment mapping method can be enabled. Then, the novel splice variants for HepG2 cells and liver tissues were predicted with Cufflinks.29 To improve the accuracy of subsequent SNP calling and read deduplication, we performed BQS recalibration using Picard tools, GATK, and dbSNP release 142 (January 2, 2015).30 SNP calling is performed using SAMtools and bcftools with a min BQS value >20 and mapping quality >20.31 Additionally, insertion and deletion (indel) realignment is done using GATK. Using Annovar, the derived SNP were scanned for annotations in dbSNP, 1000 Genomes, Exome Sequencing Project (ESP), ClinVar, COSMIC, and NCI60.32 SIFT, Polyphen, LRT, and MetaLRT scores indicating the functional impact of nonsynonymous SNP (nsSNP) were also evaluated using Annovar. The integration of the results and the translation of the predicted mRNA splice variants and SNP-containing transcripts are performed using PPLine. To avoid excessive false positive prediction rates, we used only reference start codons annotated in Ensembl to translate the predicted alternatively spliced transcripts. The generation of target and decoy protein FASTA databases and the prediction of unique proteotypic peptides were also done using PPLine. The reference protein database was derived by merging UniprotKB proteins (both Swiss-Prot and TrEMBL, October 2014) and translated Ensembl proteins (GRCh37 release 74). Tables containing SNP and SAP information, proteotypic peptide sequences, dbSNP, ESP, and COSMIC annotations, and other information were generated by PPLine. All bioinformatics steps were performed with the computational resources of the “Genome” Center (Engelhardt Institute of Molecular Biology, Moscow).

The crucial step of SNP analysis is the selection of the appropriate threshold values for read coverage or the Phred quality score of SNP detection. The Phred quality score Q is calculated with the formula Q = −10 log10(P), where P is the error probability. PPLine allows us to assess these values on the basis of how the proportions of known SNP and nonsynonymous SNP depend on read coverage or Phred quality values. The details are described in the Results and Discussion section.

RESULTS AND DISCUSSION

SNP Calling

HepG2 and normal liver transcriptomes were sequenced using two platforms: Illumina HiSeq 2000 and Applied Biosystems SOLiD 4. Initially, we found 168 000 SNP candidates (all chromosomes; 12 675 nonsynonymous SNP, or SAP) in the transcribed fragments for both the liver tissues and the HepG2 cells with read coverage of at least 1. After we applied BQS recalibration, the total number of SAP candidates was reduced from 12 675 to 7756. The further filtration of SAP candidates was based on the read coverage and Phred quality, two major characteristics of SAP prediction reliability. However, one should not rely on raw Phred quality scores “as is” because they may not reflect the real false discovery rate; the results need to be evaluated from different angles to ensure analysis accuracy.33 Each experimental design is characterized by its own false discovery rate. HepG2 cell and liver tissue transcriptomes were sequenced with different platforms at different times. Thus, one should expect different false positive rates for each experiment. PPLine offers a method for evaluating appropriate threshold values of these parameters for accurate SNP identification. We proceeded as follows.
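As a small orientation aid, the Phred relation Q = −10 log10(P) and the ">20" base and mapping quality cutoffs mentioned for SNP calling can be sketched in Python. The function names below are illustrative and are not taken from PPLine; treating the cutoff as a strict inequality follows the ">20" wording in the text.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def passes_quality_cutoffs(base_quality, mapping_quality, cutoff=20):
    """Hypothetical helper: keep a base only if both scores exceed the cutoff."""
    return base_quality > cutoff and mapping_quality > cutoff
```

Under this relation, a quality of 20 corresponds to an error probability of 0.01 and a quality of 30 to 0.001, which is why a cutoff of 20 is often read as "at most 1 error in 100".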

Selection of Thresholds Based on the Proportion of Nonsynonymous SNP in Coding Regions

Nonsynonymous SNP resulting in SAP occur less frequently than would be expected from a random SNP distribution. In the case of random distribution, about 74% of SNP or mutations occurring in coding regions should be nonsynonymous. However, in reality, the nonsynonymous


Figure 2. (A) Percentage of nonsynonymous SNP (nsSNP) among all SNP in the coding regions (all chromosomes), depending on the predicted Phred quality threshold. nsSNP occur less frequently than would be expected from a random SNP distribution (74%), and a higher proportion indicates low data quality. (B) Percentage of nsSNP among all SNP in the coding regions (all chromosomes), depending on the read coverage.
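The random-distribution expectation can be approximated with a simple count over the standard genetic code. The sketch below assumes that every sense codon and every single-nucleotide substitution is equally likely, which ignores codon usage and transition/transversion bias; it therefore gives a value close to, but not identical with, the 74% quoted in the text. Substitutions that create a stop codon are counted as nonsynonymous here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions():
    """Count synonymous and nonsynonymous single-base changes in sense codons."""
    synonymous = nonsynonymous = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # start only from sense codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa:
                    synonymous += 1
                else:
                    nonsynonymous += 1
    return synonymous, nonsynonymous


syn, nonsyn = count_substitutions()
print(f"nonsynonymous fraction: {nonsyn / (syn + nonsyn):.3f}")
```

Under these uniform assumptions, 415 of the 549 possible substitutions (about 76%) are nonsynonymous, in the same range as the figure used as the upper reference line for RnsSNP.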

raw data. Although the SOLiD sequencer reads each nucleotide twice, suggesting higher read quality, Illumina proves to be a more reliable source of data for SNP calling. The desired RnsSNP was evaluated as 40−43% for clearly sufficient coverage thresholds (e.g., 20 reads). The BQS recalibration procedure significantly increases the quality of the SAP and SNP prediction, as indicated by much lower values of RnsSNP for the recalibrated data. We chose a minimum CT that excludes areas with rapidly growing RnsSNP values and fits the desired range of 40−43% (Figure 2). Generally, the acceptable thresholds for high-confidence SAP (FDR < 0.02) can be selected as 15 for QT and 4 for CT with Illumina and 40 for QT and 12 for CT with SOLiD data. Filtration with CT rather than QT demonstrates better results for SOLiD data (Figure 2). On the basis of a comparison of RnsSNP values between the selected CT or QT and the obviously high coverage and quality threshold values, one should expect FDR values of [...] 90% of SNP are known according to the dbSNP data. Exome and genome sequencing results, not transcriptome sequencing results, are the major source of information for the dbSNP database, which is composed of data from the Exome Sequencing Project, 1000 Genomes, and other resources. This suggests RNA editing as the primary mechanism of the occurrence of these indels. However, to prove this hypothesis, exome sequencing data for these samples are needed. This remains a task for future research.
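The platform-specific thresholds quoted above (QT 15 and CT 4 for Illumina; QT 40 and CT 12 for SOLiD) and the RnsSNP ratio can be illustrated with a minimal Python sketch. The record layout (`qual`, `depth`, `effect`) is hypothetical and does not reflect PPLine's internal tables, and applying the thresholds as inclusive ("at least") is an assumption.

```python
# QT = Phred quality score threshold, CT = read coverage threshold (values from the text).
THRESHOLDS = {
    "illumina": {"QT": 15, "CT": 4},
    "solid": {"QT": 40, "CT": 12},
}


def passes_thresholds(variant, platform):
    """Assumed rule: keep a variant whose quality and coverage reach both thresholds."""
    limits = THRESHOLDS[platform]
    return variant["qual"] >= limits["QT"] and variant["depth"] >= limits["CT"]


def nonsynonymous_ratio(variants):
    """RnsSNP: share of nonsynonymous SNP among all coding SNP (None if there are none)."""
    coding = [v for v in variants if v["effect"] in ("synonymous", "nonsynonymous")]
    if not coding:
        return None
    return sum(v["effect"] == "nonsynonymous" for v in coding) / len(coding)


# Toy records for illustration only; these are not data from the study.
toy = [
    {"qual": 50, "depth": 20, "effect": "synonymous"},
    {"qual": 45, "depth": 15, "effect": "nonsynonymous"},
    {"qual": 12, "depth": 2, "effect": "nonsynonymous"},
]
kept = [v for v in toy if passes_thresholds(v, "illumina")]
print(len(kept), nonsynonymous_ratio(kept))
```

Scanning a range of candidate thresholds and plotting the resulting ratio is one way to look for the region where RnsSNP stops falling, which is the idea behind Figure 2.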



ASSOCIATED CONTENT

Supporting Information

Supporting table S1. Catalogue of SAP (single amino acid residue variations and indels) for liver tissues and HepG2 cells analyzed in this study. Annotations from the dbSNP, 1000 Genomes Project, Exome Sequencing Project, and ClinVar databases are provided. Predicted functional impact scores of SAP/mutations are presented according to Polyphen2, SIFT, LRT, PhyloP, and other algorithms.

Supporting material S2. FASTA databases for SAP-containing (Proteins.Seq Alt.SAP.fa) and reference proteins (*.ref.fa). Sequence identifiers are presented in the format: Uniprot ID | Ensembl protein ID | chromosome: position in chr.: ref.variant (nucl.): alt.variant (nucl.): strand: ref.variant (a.a.): position in the protein: alt.variant (a.a.).

Supporting table S3. Novel isoforms predicted for liver tissues and HepG2 cells on the basis of Illumina HiSeq and Applied Biosystems SOLiD RNA-seq data. Read coverage, FPKM (fragments per kbp per million) values, predicted protein sequences, and isoform-specific proteotypic peptides are provided.

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.5b00490.
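The sequence identifier layout described for Supporting material S2 could be read with a small parser such as the sketch below. The exact delimiters, whitespace, and value encodings in the real files are not shown in the text, so the header used in the usage example is hypothetical, as are all field and function names.

```python
from typing import NamedTuple


class SapHeader(NamedTuple):
    uniprot_id: str
    ensembl_protein_id: str
    chromosome: str
    position: int
    ref_nucleotide: str
    alt_nucleotide: str
    strand: str
    ref_amino_acid: str
    protein_position: int
    alt_amino_acid: str


def parse_sap_header(header):
    """Parse 'UniProt | Ensembl | chr:pos:ref:alt:strand:refAA:protPos:altAA' (assumed layout)."""
    text = header.lstrip(">").strip()
    uniprot_id, ensembl_id, variant = (part.strip() for part in text.split("|"))
    fields = [field.strip() for field in variant.split(":")]
    if len(fields) != 8:
        raise ValueError(f"expected 8 colon-separated fields, got {len(fields)}")
    chrom, pos, ref_nt, alt_nt, strand, ref_aa, prot_pos, alt_aa = fields
    return SapHeader(uniprot_id, ensembl_id, chrom, int(pos), ref_nt, alt_nt,
                     strand, ref_aa, int(prot_pos), alt_aa)


# Hypothetical header, written only to match the described field order.
example = parse_sap_header(">P00000 | ENSP00000000000 | 18:1000:A:G:+:K:42:R")
print(example.chromosome, example.protein_position, example.alt_amino_acid)
```

A named tuple keeps the ten fields addressable by name, which is convenient when joining the FASTA entries back to the variant tables.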

Novel Splice Variants in Normal Liver Tissues and HepG2 Cells

Alternative splicing introduces an additional level of proteome complexity. The results from combining proteoENCODEdb searches with experimental mass spectral data indicate that some alternative splicing forms detected at the transcript level are in fact translated to proteins.55 Alternative splicing profiles demonstrate pronounced tissue specificity.56 In our previous study, a meta-analysis of transcriptomic NCBI Sequence Read Archive data for eight human tissues showed that liver, along with brain, has the highest complexity of transcript isoforms arising from alternative splicing.11 The present study is aimed at predicting the possible proteoforms translated from alternatively spliced transcripts in normal liver tissues and HepG2 cells based on the results of deep transcriptome sequencing performed using two platforms, Illumina HiSeq and Applied Biosystems SOLiD. An alternative splicing analysis with the discovery of novel exon junctions allowed us to predict proteoforms previously unannotated in UniProtKB for 642 and 1001 genes based on liver RNA-seq data from Illumina and SOLiD platforms, respectively (10 and 12 genes from chromosome 18), and 1681 and 1389 genes (30 and 16 genes of chromosome 18) for HepG2 cells. Proteoforms for 204 (liver tissues) and 475 (HepG2 cell) genes were predicted based on both Illumina and SOLiD data. In chromosome 18, Illumina and SOLiD data simultaneously suggested the presence of novel splice variants for PIEZO2, SS18, NEDD4L, and CCBE1 genes in liver tissues and PIEZO2, SPIRE1, CEP192, ROCK1, MIB1, SMAD7, NARS, and KIAA1468 genes in HepG2 cells. The results are summarized in Table S3 in the Supporting Information.
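Table S3 lists isoform-specific proteotypic peptides. One common way to obtain such candidates is an in-silico tryptic digest followed by a comparison against the reference proteins; the sketch below illustrates that general idea and is not PPLine's actual implementation. The cleavage rule (after K or R, but not before P), the minimum peptide length, and the toy sequences are all assumptions.

```python
def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (a simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def isoform_specific_peptides(novel_isoform, reference_proteins, min_length=6):
    """Peptides of the novel isoform that no reference protein produces on digestion."""
    reference = set()
    for protein in reference_proteins:
        reference.update(tryptic_peptides(protein))
    return {
        peptide
        for peptide in tryptic_peptides(novel_isoform)
        if len(peptide) >= min_length and peptide not in reference
    }


# Toy sequences for illustration only (not real proteins).
novel = "MAAGKLLSEERPTTWK"
references = ["MAAGKLLSDERPTTWK"]
print(isoform_specific_peptides(novel, references))
```

A production pipeline would also need to handle missed cleavages and check uniqueness against the whole reference database rather than a single protein.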



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Tel: +7-916-983-6952.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work was supported by the Russian Science Foundation, grant no. 14-25-00132. Part of this work (the computational analysis) was performed using the equipment of the EIMB RAS Genome Center (http://www.eimb.ru/RUSSIAN_NEW/INSTITUTE/ccu_genome_c.php) under the financial support of the Ministry of Education and Science of the Russian Federation (contract no. 14.621.21.0001, project unique identifier RFMEFI6114X0001).



CONCLUSIONS

The Chromosome-Centric Human Proteome Project aims to study the diversity and functions of human proteins. One of the most powerful techniques, the two-stage proteogenomic approach, can identify protein variations arising from genomic polymorphisms and mRNA editing as well as alternative transcript splicing, three major sources of proteome diversity. The results of the present work should serve as a basis for the second phase of proteogenomic studies being performed by the members of the C-HPP consortium. In the present work, we present PPLine, a pipeline developed as a bridge between RNA-seq or exome sequencing data and proteomic analysis. PPLine is designed to process raw sequencing data and aimed at identifying protein SAP and splice variants and predicting SAP and splice-specific



ABBREVIATIONS

BQS, base quality score; SAP, single-amino-acid polymorphism; CT, read coverage threshold; QT, Phred quality score threshold; synSNP, synonymous SNP; nsSNP, nonsynonymous SNP; Ti, transitions; Tv, transversions



REFERENCES

(1) Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30 (3), 221−3.
(2) Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S. Metrics for the Human Proteome Project 2013−2014 and strategies for finding missing proteins. J. Proteome Res. 2014, 13 (1), 15−20.
(3) Ponomarenko, E. A.; Kopylov, A. T.; Lisitsa, A. V.; Radko, S. P.; Kiseleva, Y. Y.; Kurbatov, L. K.; Ptitsyn, K. G.; Tikhonova, O. V.; Moisa, A. A.; Novikova, S. E.; Poverennaya, E. V.; Ilgisonis, E. V.; Filimonov, A. D.; Bogolubova, N. A.; Averchuk, V. V.; Karalkin, P. A.; Vakhrushev, I. V.; Yarygin, K. N.; Moshkovskii, S. A.; Zgoda, V. G.; Sokolov, A. S.; Mazur, A. M.; Prokhortchouck, E. B.; Skryabin, K. G.; Ilina, E. N.; Kostrjukova, E. S.; Alexeev, D. G.; Tyakht, A. V.; Gorbachev, A. Y.; Govorun, V. M.; Archakov, A. I. Chromosome 18 transcriptoproteome of liver tissue and HepG2 cells and targeted proteome mapping in depleted plasma: update 2013. J. Proteome Res. 2014, 13 (1), 183−90.
(4) Li, H. D.; Menon, R.; Omenn, G. S.; Guan, Y. Revisiting the identification of canonical splice isoforms through integration of functional genomics and proteomics evidence. Proteomics 2014, 14 (23−24), 2709−18.
(5) Karpova, M. A.; Karpov, D. S.; Ivanov, M. V.; Pyatnitskiy, M. A.; Chernobrovkin, A. L.; Lobas, A. A.; Lisitsa, A. V.; Archakov, A. I.; Gorshkov, M. V.; Moshkovskii, S. A. Exome-driven characterization of the cancer cell lines at the proteome level: the NCI-60 case study. J. Proteome Res. 2014, 13 (12), 5551−60.
(6) Fanayan, S.; Smith, J. T.; Lee, L. Y.; Yan, F.; Snyder, M.; Hancock, W. S.; Nice, E. Proteogenomic analysis of human colon carcinoma cell lines LIM1215, LIM1899, and LIM2405. J. Proteome Res. 2013, 12 (4), 1732−42.
(7) Zhang, B.; Wang, J.; Wang, X.; Zhu, J.; Liu, Q.; Shi, Z.; Chambers, M. C.; Zimmerman, L. J.; Shaddox, K. F.; Kim, S.; Davies, S. R.; Wang, S.; Wang, P.; Kinsinger, C. R.; Rivers, R. C.; Rodriguez, H.; Townsend, R. R.; Ellis, M. J.; Carr, S. A.; Tabb, D. L.; Coffey, R. J.; Slebos, R. J.; Liebler, D. C.; Nci, C. Proteogenomic characterization of human colon and rectal cancer. Nature 2014, 513 (7518), 382−7.
(8) Gott, J. M. Expanding genome capacity via RNA editing. C. R. Biol. 2003, 326 (10−11), 901−8.
(9) Korff, S.; Woerner, S. M.; Yuan, Y. P.; Bork, P.; von Knebel Doeberitz, M.; Gebert, J. Frameshift mutations in coding repeats of protein tyrosine phosphatase genes in colorectal tumors with microsatellite instability. BMC Cancer 2008, 8, 329.
(10) Delgado, A. P.; Brandao, P.; Chapado, M. J.; Hamid, S.; Narayanan, R. Open reading frames associated with cancer in the dark matter of the human genome. Cancer Genomics Proteomics 2014, 11 (4), 201−13.
(11) Shargunov, A. V.; Krasnov, G. S.; Ponomarenko, E. A.; Lisitsa, A. V.; Shurdov, M. A.; Zverev, V. V.; Archakov, A. I.; Blinov, V. M. Tissue-specific alternative splicing analysis reveals the diversity of chromosome 18 transcriptome. J. Proteome Res. 2014, 13 (1), 173−82.
(12) Bai, Y.; Hassler, J.; Ziyar, A.; Li, P.; Wright, Z.; Menon, R.; Omenn, G. S.; Cavalcoli, J. D.; Kaufman, R. J.; Sartor, M. A. Novel bioinformatics method for identification of genome-wide noncanonical spliced regions using RNA-Seq data. PLoS One 2014, 9 (7), e100864.
(13) Gaudet, P.; Argoud-Puy, G.; Cusin, I.; Duek, P.; Evalet, O.; Gateau, A.; Gleizes, A.; Pereira, M.; Zahn-Zabal, M.; Zwahlen, C.; Bairoch, A.; Lane, L. neXtProt: organizing protein knowledge in the context of human proteome projects. J. Proteome Res. 2013, 12 (1), 293−8.
(14) Yu, X.; Sun, S. Comparing a few SNP calling algorithms using low-coverage sequencing data. BMC Bioinf. 2013, 14, 274.
(15) Raineri, E.; Ferretti, L.; Esteve-Codina, A.; Nevado, B.; Heath, S.; Perez-Enciso, M. SNP calling by sequencing pooled samples. BMC Bioinf. 2012, 13, 239.
(16) McCormick, R. F.; Truong, S. K.; Mullet, J. E. RIG: Recalibration and Interrelation of Genomic Sequence Data with the GATK. G3: Genes, Genomes, Genet. 2015, 5 (4), 655−65.

(17) Kodama, Y.; Shumway, M.; Leinonen, R. The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012, 40, D54−6.
(18) Pirooznia, M.; Kramer, M.; Parla, J.; Goes, F. S.; Potash, J. B.; McCombie, W. R.; Zandi, P. P. Validation and assessment of variant calling pipelines for next-generation sequencing. Hum. Genomics 2014, 8, 14.
(19) Torres-Garcia, W.; Zheng, S.; Sivachenko, A.; Vegesna, R.; Wang, Q.; Yao, R.; Berger, M. F.; Weinstein, J. N.; Getz, G.; Verhaak, R. G. PRADA: pipeline for RNA sequencing data analysis. Bioinformatics 2014, 30 (15), 2224−6.
(20) Zhu, P.; He, L.; Li, Y.; Huang, W.; Xi, F.; Lin, L.; Zhi, Q.; Zhang, W.; Tang, Y. T.; Geng, C.; Lu, Z.; Xu, X. OTG-snpcaller: an optimized pipeline based on TMAP and GATK for SNP calling from ion torrent data. PLoS One 2014, 9 (5), e97507.
(21) Nagaraj, S. H.; Waddell, N.; Madugundu, A. K.; Wood, S.; Jones, A.; Mandyam, R. A.; Nones, K.; Pearson, J. V.; Grimmond, S. M. PGTools: A Software Suite for Proteogenomic Data Analysis and Visualization. J. Proteome Res. 2015, 14 (5), 2255−66.
(22) Yu, K.; Salomon, A. R. HTAPP: high-throughput autonomous proteomic pipeline. Proteomics 2010, 10 (11), 2113−22.
(23) Jagtap, P. D.; Johnson, J. E.; Onsongo, G.; Sadler, F. W.; Murray, K.; Wang, Y.; Shenykman, G. M.; Bandhakavi, S.; Smith, L. M.; Griffin, T. J. Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. J. Proteome Res. 2014, 13 (12), 5898−908.
(24) Zgoda, V. G.; Kopylov, A. T.; Tikhonova, O. V.; Moisa, A. A.; Pyndyk, N. V.; Farafonova, T. E.; Novikova, S. E.; Lisitsa, A. V.; Ponomarenko, E. A.; Poverennaya, E. V.; Radko, S. P.; Khmeleva, S. A.; Kurbatov, L. K.; Filimonov, A. D.; Bogolyubova, N. A.; Ilgisonis, E. V.; Chernobrovkin, A. L.; Ivanov, A. S.; Medvedev, A. E.; Mezentsev, Y. V.; Moshkovskii, S. A.; Naryzhny, S. N.; Ilina, E. N.; Kostrjukova, E. S.; Alexeev, D. G.; Tyakht, A. V.; Govorun, V. M.; Archakov, A. I. Chromosome 18 transcriptome profiling and targeted proteome mapping in depleted plasma, liver tissue and HepG2 cells. J. Proteome Res. 2013, 12 (1), 123−34.
(25) Nagaraj, N.; Wisniewski, J. R.; Geiger, T.; Cox, J.; Kircher, M.; Kelso, J.; Paabo, S.; Mann, M. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 2011, 7, 548.
(26) Segura, V.; Medina-Aunon, J. A.; Mora, M. I.; Martinez-Bartolome, S.; Abian, J.; Aloria, K.; Antunez, O.; Arizmendi, J. M.; Azkargorta, M.; Barcelo-Batllori, S.; Beaskoetxea, J.; Bech-Serra, J. J.; Blanco, F.; Monteiro, M. B.; Caceres, D.; Canals, F.; Carrascal, M.; Casal, J. I.; Clemente, F.; Colome, N.; Dasilva, N.; Diaz, P.; Elortza, F.; Fernandez-Puente, P.; Fuentes, M.; Gallardo, O.; Gharbi, S. I.; Gil, C.; Gonzalez-Tejedo, C.; Hernaez, M. L.; Lombardia, M.; Lopez-Lucendo, M.; Marcilla, M.; Mato, J. M.; Mendes, M.; Oliveira, E.; Orera, I.; Pascual-Montano, A.; Prieto, G.; Ruiz-Romero, C.; Sanchez del Pino, M. M.; Tabas-Madrid, D.; Valero, M. L.; Vialas, V.; Villanueva, J.; Albar, J. P.; Corrales, F. J. Surfing transcriptomic landscapes. A step beyond the annotation of chromosome 16 proteome. J. Proteome Res. 2014, 13 (1), 158−72.
(27) Bolger, A. M.; Lohse, M.; Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014, 30 (15), 2114−20.
(28) Kim, D.; Pertea, G.; Trapnell, C.; Pimentel, H.; Kelley, R.; Salzberg, S. L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14 (4), R36.
(29) Trapnell, C.; Roberts, A.; Goff, L.; Pertea, G.; Kim, D.; Kelley, D. R.; Pimentel, H.; Salzberg, S. L.; Rinn, J. L.; Pachter, L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 2012, 7 (3), 562−78.
(30) Warden, C. D.; Adamson, A. W.; Neuhausen, S. L.; Wu, X. Detailed comparison of two popular variant calling packages for exome and targeted exon studies. PeerJ 2014, 2, e600.
(31) Cheng, A. Y.; Teo, Y. Y.; Ong, R. T. Assessing single nucleotide variant detection and genotype calling on whole-genome sequenced individuals. Bioinformatics 2014, 30 (12), 1707−13.


(32) Wang, K.; Li, M.; Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010, 38 (16), e164.
(33) Liu, Q.; Guo, Y.; Li, J.; Long, J.; Zhang, B.; Shyr, Y. Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data. BMC Genomics 2012, 13 (Suppl 8), S8.
(34) Liu, H.; Xie, Z.; Tan, S.; Zhang, X.; Yang, S. Relationship between amino acid usage and amino acid evolution in primates. Gene 2015, 557 (2), 182−7.
(35) Lin, H.; Moghe, G.; Ouyang, S.; Iezzoni, A.; Shiu, S. H.; Gu, X.; Buell, C. R. Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana. BMC Evol. Biol. 2010, 10, 41.
(36) Supply, P.; Marceau, M.; Mangenot, S.; Roche, D.; Rouanet, C.; Khanna, V.; Majlessi, L.; Criscuolo, A.; Tap, J.; Pawlik, A.; Fiette, L.; Orgeur, M.; Fabre, M.; Parmentier, C.; Frigui, W.; Simeone, R.; Boritsch, E. C.; Debrie, A. S.; Willery, E.; Walker, D.; Quail, M. A.; Ma, L.; Bouchier, C.; Salvignol, G.; Sayes, F.; Cascioferro, A.; Seemann, T.; Barbe, V.; Locht, C.; Gutierrez, M. C.; Leclerc, C.; Bentley, S. D.; Stinear, T. P.; Brisse, S.; Medigue, C.; Parkhill, J.; Cruveiller, S.; Brosch, R. Genomic analysis of smooth tubercle bacilli provides insights into ancestry and pathoadaptation of Mycobacterium tuberculosis. Nat. Genet. 2013, 45 (2), 172−9.
(37) Carson, A. R.; Smith, E. N.; Matsui, H.; Braekkan, S. K.; Jepsen, K.; Hansen, J. B.; Frazer, K. A. Effective filtering strategies to improve data quality from population-based whole exome sequencing studies. BMC Bioinf. 2014, 15, 125.
(38) Wang, C.; Davila, J. I.; Baheti, S.; Bhagwate, A. V.; Wang, X.; Kocher, J. P.; Slager, S. L.; Feldman, A. L.; Novak, A. J.; Cerhan, J. R.; Thompson, E. A.; Asmann, Y. W. RVboost: RNA-seq variants prioritization using a boosting method. Bioinformatics 2014, 30 (23), 3414−6.
(39) Zook, J. M.; Samarov, D.; McDaniel, J.; Sen, S. K.; Salit, M. Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing. PLoS One 2012, 7 (7), e41356.
(40) Vallania, F. L.; Druley, T. E.; Ramos, E.; Wang, J.; Borecki, I.; Province, M.; Mitra, R. D. High-throughput discovery of rare insertions and deletions in large cohorts. Genome Res. 2010, 20 (12), 1711−8.
(41) Kircher, M.; Stenzel, U.; Kelso, J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 2009, 10 (8), R83.
(42) Maruyama, Y.; Wakamatsu, A.; Kawamura, Y.; Kimura, K.; Yamamoto, J.; Nishikawa, T.; Kisu, Y.; Sugano, S.; Goshima, N.; Isogai, T.; Nomura, N. Human Gene and Protein Database (HGPD): a novel database presenting a large quantity of experiment-based results in human proteomics. Nucleic Acids Res. 2009, 37 (Database), D762−6.
(43) Vanderperre, B.; Lucier, J. F.; Roucou, X. HAltORF: a database of predicted out-of-frame alternative open reading frames in human. Database 2012, 2012, bas025.
(44) Vanderperre, B.; Lucier, J. F.; Bissonnette, C.; Motard, J.; Tremblay, G.; Vanderperre, S.; Wisztorski, M.; Salzet, M.; Boisvert, F. M.; Roucou, X. Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PLoS One 2013, 8 (8), e70698.
(45) Menschaert, G.; Van Criekinge, W.; Notelaers, T.; Koch, A.; Crappe, J.; Gevaert, K.; Van Damme, P. Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. Mol. Cell. Proteomics 2013, 12 (7), 1780−90.
(46) Niu, S.; Cao, S.; Wong, S. M. An infectious RNA with a heptaadenosine stretch responsible for programmed −1 ribosomal frameshift derived from a full-length cDNA clone of Hibiscus latent Singapore virus. Virology 2014, 449, 229−34.
(47) Melian, E. B.; Hall-Mendelin, S.; Du, F.; Owens, N.; Bosco-Lauth, A. [...] replication in birds and mosquitoes. PLoS Pathog. 2014, 10 (11), e1004447.
(48) Firth, A. E.; Jagger, B. W.; Wise, H. M.; Nelson, C. C.; Parsawar, K.; Wills, N. M.; Napthine, S.; Taubenberger, J. K.; Digard, P.; Atkins, J. F. Ribosomal frameshifting used in influenza A virus expression occurs within the sequence UCC_UUU_CGU and is in the +1 direction. Open Biol. 2012, 2 (10), 120109.
(49) Ivanov, I. P.; Atkins, J. F. Ribosomal frameshifting in decoding antizyme mRNAs from yeast and protists to humans: close to 300 cases reveal remarkable diversity despite underlying conservation. Nucleic Acids Res. 2007, 35 (6), 1842−58.
(50) Wills, N. M.; Atkins, J. F. The potential role of ribosomal frameshifting in generating aberrant proteins implicated in neurodegenerative diseases. RNA 2006, 12 (7), 1149−53.
(51) Gray, M. W. Evolutionary origin of RNA editing. Biochemistry 2012, 51 (26), 5235−42.
(52) Picardi, E.; D’Erchia, A. M.; Montalvo, A.; Pesole, G. Using REDItools to Detect RNA Editing Events in NGS Datasets. Front. Bioeng. Biotechnol. 2015, 49, 12.12.1−12.12.15.
(53) Jiang, Y.; Fan, S. L.; Song, M. Z.; Yu, J. N.; Yu, S. X. Identification of RNA editing sites in cotton (Gossypium hirsutum) chloroplasts and editing events that affect secondary and three-dimensional protein structures. GMR, Genet. Mol. Res. 2012, 11 (2), 987−1001.
(54) Nigita, G.; Veneziano, D.; Ferro, A. A-to-I RNA Editing: Current Knowledge Sources and Computational Approaches with Special Emphasis on Non-Coding RNA Molecules. Front. Bioeng. Biotechnol. 2015, 3, 37.
(55) Nilsson, C. L.; Mostovenko, E.; Lichti, C. F.; Ruggles, K.; Fenyo, D.; Rosenbloom, K. R.; Hancock, W. S.; Paik, Y. K.; Omenn, G. S.; LaBaer, J.; Kroes, R. A.; Uhlen, M.; Hober, S.; Vegvari, A.; Andren, P. E.; Sulman, E. P.; Lang, F. F.; Fuentes, M.; Carlsohn, E.; Emmett, M. R.; Moskal, J. R.; Berven, F. S.; Fehniger, T. E.; Marko-Varga, G. Use of ENCODE resources to characterize novel proteoforms and missing proteins in the human proteome. J. Proteome Res. 2015, 14 (2), 603−8.
(56) Ellis, J. D.; Barrios-Rodiles, M.; Colak, R.; Irimia, M.; Kim, T.; Calarco, J. A.; Wang, X.; Pan, Q.; O’Hanlon, D.; Kim, P. M.; Wrana, J. L.; Blencowe, B. J. Tissue-specific alternative splicing remodels protein-protein interaction networks. Mol. Cell 2012, 46 (6), 884−92.
M.; Nagasaki, T.; Rudd, S.; Brault, A. C.; Bowen, R. A.; Hall, R. A.; van den Hurk, A. F.; Khromykh, A. A. Programmed ribosomal frameshift alters expression of west nile virus genes and facilitates virus 3737

DOI: 10.1021/acs.jproteome.5b00490 J. Proteome Res. 2015, 14, 3729−3737