PPLine: An Automated Pipeline for SNP, SAP, and Splice Variant


Article

PPLine: an automated pipeline for SNP/SAP and splice variant detection in the context of proteogenomics George Sergeevich Krasnov, Alexey Alexandrovich Dmitriev, Anna Viktorovna Kudryavtseva, Alexander Valerievich Shargunov, Dmitry Sergeevich Karpov, Leonid Andreevich Uroshlev, Natalya Vladimirovna Melnikova, Vladimir Mikhailovich Blinov, Ekaterina V Poverennaya, Alexander I. Archakov, Andrey V Lisitsa, and Elena A Ponomarenko J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00490 • Publication Date (Web): 06 Jul 2015 Downloaded from http://pubs.acs.org on July 11, 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


PPLine: an automated pipeline for SNP/SAP and splice variant detection in the context of proteogenomics

George Sergeevich Krasnov§¶¦*, Alexey Alexandrovich Dmitriev§, Anna Viktorovna Kudryavtseva§ǁ, Alexander Valerievich Shargunov¶¦, Dmitry Sergeevich Karpov§¶, Leonid Andreevich Uroshlev§, Natalya Vladimirovna Melnikova§, Vladimir Mikhailovich Blinov¶¦, Ekaterina Vladimirovna Poverennaya¶, Alexander Ivanovich Archakov¶, Andrey Valerievich Lisitsa¶, Elena Alexandrovna Ponomarenko¶

§ Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 111991 Russia

¶ Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119121 Russia

¦ Mechnikov Research Institute of Vaccines and Sera, Moscow, 105064 Russia

ǁ Herzen Moscow Cancer Research Institute, Ministry of Healthcare of the Russian Federation, Moscow, 125284 Russia

*[email protected]

The fundamental mission of the Chromosome-centric Human Proteome Project (C-HPP) is the study of human proteome diversity, including rare variants. Liver tissue, HepG2 cells, and plasma were selected as major objects of C-HPP studies. The proteogenomic approach, a recently introduced technique, is a powerful method to predict and validate proteoforms arising from alternative splicing, mutations, and transcript editing. We developed PPLine, a Python-based proteogenomic pipeline that automates the discovery of SAP/indels and alternatively spliced variants from raw transcriptome/exome sequence data, SNP annotation and filtration, and prediction of proteotypic peptides (available at https://sourceforge.net/projects/ppline). In this work, we performed deep transcriptome sequencing of HepG2 cells and liver tissues using two platforms, Illumina HiSeq and Applied Biosystems SOLiD. Using PPLine, we revealed 7756 SAP and indels for HepG2 and liver (including 659 variants not annotated in dbSNP). We found 17 indels in transcripts associated with translation of alternate reading frames (ARF) longer than 300 bp. ARF products of two genes, SLMO1 and TMEM8A, demonstrate signatures of a caspase binding domain and a Gcn5-related N-acetyltransferase, respectively. Alternative splicing analysis predicted novel proteoforms encoded by 203 (liver) and 475 (HepG2) genes according to both Illumina and SOLiD data. The results of the present work provide a basis for subsequent proteomic studies of the C-HPP consortium.

Keywords: C-HPP, RNA-Seq, SNP, SAP, indel, alternative reading frames, alternative splicing, proteotypic peptides

INTRODUCTION

The Chromosome-centric Human Proteome Project (C-HPP) is one of the largest international projects, driven by a consortium of scientific groups from 25 countries. The final goal of C-HPP is the comprehensive annotation of human proteome diversity arising from alternative splicing, gene allele variants, mRNA editing, and posttranslational modifications.1, 2 The Russian part of C-HPP is devoted to the study of proteins encoded by the genes of chromosome 18. Three types of biomaterial (HepG2 cells, liver tissue, and depleted human plasma) were selected for the pilot series of C-HPP.3

The proteogenomic approach is a powerful two-stage technique enabling identification of modified protein species (proteoforms) that contain single amino acid polymorphisms (SAP) or are translated from alternatively spliced transcripts. In the first stage, exome or transcriptome sequencing is performed to reveal mutations and, in the case of RNA-Seq, splicing alterations, mRNA editing, and fusion transcripts. Protein prediction and construction of a mass-spectrometric database are then performed on the basis of the transcriptomic data. The second stage is mass-spectrometric analysis aimed at the detection or quantification of the predicted proteoforms. This approach has been successfully used for the analysis of cancer cell lines and clinical tumor samples.4-7

The pilot phase of C-HPP (2010-2015) was devoted to the identification and quantification of canonical proteins, regardless of the presence of SAP, mRNA editing, transcript splice variants, or protein posttranslational modifications.1 Taking these characteristics into account greatly enlarges the field of study: while the total number of protein-coding genes and, therefore, canonical proteins only slightly exceeds 20 thousand, the number of proteoforms can reach several million. Within the framework of C-HPP, the proteogenomic approach will be applied for the deep analysis of proteome diversity of HepG2 cells and liver tissues. In this work, we present the results of the first phase of this study: transcriptomic characterization of the biomaterial and prediction of proteotypic peptides specific for splice variants and for proteoforms translated from SNP-containing alleles or edited transcripts. These variations include frameshift mutations or transcript editing events that can lead to the translation of long alternate reading frames (more than 300 bp). The phenomenon of alternate reading frame translation is essential for many viruses, but is also known to play an important role in mammals, both under normal conditions and during the development of cancer.8-10 Together with the results of alternative splicing profiling of 8 human tissues reported earlier11 and the results obtained by other members of C-HPP,12 these data provide the basis for the proteogenomic analysis that will be performed in the next phase of C-HPP.


The results of these studies will be included in the neXtProt knowledge base, which aims to complement UniProtKB.13

The development of exome and transcriptome sequencing pipelines is one of the most rapidly growing areas of bioinformatics.14-16 The explosive growth of experimental sequencing data requires robust, accurate, and efficient analysis.17 Several pipelines that partially share functionality with PPLine have been developed. Pirooznia et al. recently presented a unified pipeline for processing NGS data that encompasses four modules (mapping, filtering, realignment and recalibration, and variant calling) and combines a set of popular tools: the BWA aligner, picard, samtools, and GATK.18 PRADA is a Python-based pipeline that offers gene expression profiling, quality metrics, detection of unsupervised and supervised fusion transcripts, and detection of intragenic fusion variants.19 One should also mention OTG-snpcaller, an SNP caller pipeline based on TMAP and GATK and designed especially for Ion Torrent/Proton.20 On the other hand, several software suites have been developed to analyze proteomic data: PGTools,21 HTAPP,22 and Galaxy-based approaches.23 In this study, we present PPLine, a Python-based proteogenomic application providing a complete pipeline from read trimming to prediction of unique proteotypic peptides for detection of SAP-containing proteins and quantification of alternatively spliced proteoforms. It is a bridge between raw sequence data and further proteomic analysis.

MATERIALS AND METHODS

Tissue samples and cell lines

We analyzed pooled liver tissue samples and HepG2 cultures previously used in other studies of the C-HPP consortium.3, 24 Liver samples were purchased from the ILSBio Biobank; HepG2 cells were grown, harvested, and processed by the standard protocol (see details in the article by Zgoda et al.3).

RNA purification, reverse transcription, library preparation, sequencing

RNA purification, reverse transcription, library preparation, and sequencing were performed using techniques described in the transcriptomic articles of other members of the C-HPP team.25, 26 Briefly, total RNA was extracted using the RNeasy Mini kit (Qiagen, Germany) and tested for integrity using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Enrichment of RNA for poly-A transcripts and library preparation were performed using the TruSeq RNA Sample Prep Kit (Illumina, USA) and the SureSelect RNA Capture Enrichment System (Applied Biosystems, USA). Sequencing was performed on the Illumina HiSeq 2000 and Applied Biosystems SOLiD 4 platforms with single-end 50 bp reads (except for Illumina liver transcriptome sequencing, which used paired-end 50 bp reads).

Bioinformatics

Here we present our novel tool, PPLine, which performs all steps of the bioinformatics analysis: read filtration and trimming, read alignment, base quality score (BQS) recalibration, SNP calling, indel realignment, SAP prediction and annotation, alternative splicing profiling, data integration, proteomic FASTA database generation, and prediction of proteotypic peptides (Figure 1). PPLine integrates a set of popular tools into a single pipeline: Trimmomatic, Tophat2, samtools, GATK, Cufflinks, and Annovar.


Initially, adapters are removed and reads are 5'-trimmed using Trimmomatic.27 Colorspace SOLiD reads are trimmed using our in-house Python script. Reads are then mapped to the transcriptome and genome (GRCh37.p10) with the splice-aware aligner Tophat2.28 Run parameters are adjusted automatically by PPLine (2 mismatches/gaps for reads shorter than 60 bp, and forced read remapping if at least one mismatch/gap is found). Discovery of novel splice junctions using the segment mapping method can be enabled. Novel splice variants for HepG2 cells and liver tissues were then predicted with Cufflinks.29 To improve the accuracy of subsequent SNP calling, read deduplication and BQS recalibration are performed using picard tools, GATK, and dbSNP release 142 (2 Jan 2015).30 SNP calling is performed using samtools and bcftools with minimum BQS > 20 and mapping quality > 20.31 Additionally, insertion/deletion (indel) realignment is done using GATK. Using Annovar, the derived SNP were scanned for annotations in dbSNP, 1000 Genomes, Exome Sequencing Project (ESP), ClinVar, COSMIC, and NCI60.32 SIFT, Polyphen, LRT, and MetaLRT scores indicating the functional impact of non-synonymous SNP (nsSNP) were also evaluated using Annovar. Integration of the results and translation of predicted mRNA splice variants and SNP-containing transcripts are performed by PPLine. To avoid excessive false-positive prediction rates, we used only reference start codons annotated in Ensembl to translate predicted alternatively spliced transcripts. Generation of target and decoy protein FASTA databases and prediction of unique proteotypic peptides were also done using PPLine. The reference protein database was derived by merging UniProtKB proteins (both Swiss-Prot and TrEMBL, Oct 2014) and translated Ensembl proteins (GRCh37 release 74). Tables containing SNP/SAP information, proteotypic peptide sequences, dbSNP, ESP, and COSMIC annotations, and other information were generated by PPLine.
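The last two steps of this paragraph, decoy generation and selection of unique proteotypic peptides, can be illustrated with a minimal sketch. It assumes trypsin-like cleavage (after K or R, but not before P), reversed-sequence decoys, and a peptide length window of 7-25 residues. None of these choices is stated in the text, and the function names are hypothetical rather than part of PPLine.

```python
from collections import defaultdict


def tryptic_peptides(seq, min_len=7, max_len=25):
    """Cleave after K or R unless the next residue is P (assumed trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        last = i == len(seq) - 1
        if last or (aa in "KR" and seq[i + 1] != "P"):
            pep = seq[start:i + 1]
            if min_len <= len(pep) <= max_len:
                peptides.append(pep)
            start = i + 1
    return peptides


def unique_peptides(proteins, min_len=7, max_len=25):
    """Map each protein ID to the peptides that occur in no other protein."""
    owners = defaultdict(set)
    for pid, seq in proteins.items():
        for pep in tryptic_peptides(seq, min_len, max_len):
            owners[pep].add(pid)
    result = {pid: [] for pid in proteins}
    for pep, pids in owners.items():
        if len(pids) == 1:
            result[next(iter(pids))].append(pep)
    return result


def decoy(seq):
    """Reversed-sequence decoy (an assumed, commonly used decoy scheme)."""
    return seq[::-1]


# Toy example: a reference protein and a variant carrying one SAP (E -> Q).
proteins = {
    "ref": "MASGTLLVEKAAPELSGDWR",
    "var": "MASGTLLVEKAAPQLSGDWR",
}
proteotypic = unique_peptides(proteins)
```

In this toy case the shared N-terminal peptide is discarded, and each proteoform keeps only the peptide that spans the substituted residue.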
All bioinformatics steps were performed with the computational resources of the “Genome” Center (Engelhardt Institute of Molecular Biology, Moscow).

The crucial step of SNP analysis is the selection of appropriate threshold values for read coverage and for the Phred quality score of SNP detection. The Phred quality score Q is calculated with the formula $Q = -10 \log_{10} P$, where P is the error probability. PPLine allows these values to be assessed on the basis of the dependence of the proportion of known SNP and of non-synonymous SNP on the read coverage and Phred quality thresholds. The details are described in the Results and Discussion section.
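The Phred relation can be made concrete with a short sketch. The two helper names are hypothetical; only the formula itself comes from the text.

```python
import math


def phred_from_error(p):
    """Phred quality Q = -10 * log10(P) for an error probability P in (0, 1]."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# A quality of 20 corresponds to a 1% error probability.
p_at_q20 = error_from_phred(20)
```

Under this relation, a threshold of Q > 20, as used for base and mapping quality during SNP calling, admits only bases with a nominal error probability below 1%.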

RESULTS AND DISCUSSION

SNP calling

HepG2 and normal liver transcriptomes were sequenced using two platforms, Illumina HiSeq 2000 and Applied Biosystems SOLiD 4. Initially, we found 168 thousand SNP candidates (all chromosomes; 12675 non-synonymous SNP, or SAP) in transcribed fragments for both liver tissues and HepG2 cells with a coverage of at least 1 read. After base quality score (BQS) recalibration, the total number of SAP candidates was reduced from 12675 to 7756. Further filtration of SAP candidates was based on read coverage and Phred quality, the two major characteristics of SAP prediction reliability. However, one should not rely on raw Phred quality scores “as is”, because they may not reflect the real false discovery rate; instead, the results need to be evaluated from different angles to ensure the accuracy of the analysis.33 Each experimental design is characterized by its own false discovery rate. The HepG2 cell and liver tissue transcriptomes were sequenced on different platforms at different times. Thus, one should expect different false positive rates for each experiment. PPLine offers a method to evaluate appropriate threshold values of these parameters for accurate SNP identification. We proceed as follows.

Selection of thresholds based on the proportion of non-synonymous SNP in coding regions

Non-synonymous SNP resulting in SAP occur less frequently than would be expected from a random SNP distribution. In the case of a random distribution, about 74% of SNP/mutations occurring in coding regions should be non-synonymous. In reality, however, the non-synonymous mutation rate is significantly lower because of the high likelihood of functional impairment of the encoded proteins and negative selection against these broken allele variants. This has been shown for various organisms in cross-taxon and cross-lineage comparison studies.34-36 We evaluated the relative nsSNP frequency $f_{\mathrm{ns}} = N_{\mathrm{nsSNP}} / (N_{\mathrm{nsSNP}} + N_{\mathrm{synSNP}})$ for various read coverage thresholds (CT) and quality thresholds (QT); the ratio of non-synonymous to synonymous mutation rates, dN/dS or Ka/Ks, is sometimes used instead. Here, N is the number of SNP found in the coding sequences (nsSNP, non-synonymous SNP; synSNP, synonymous SNP). For ideal data with no false positives, $f_{\mathrm{ns}}$ is expected to be nearly constant, independent of QT and CT. For real experimental data, it is expected to be increased for low-coverage SNP because of the higher false positive rate. The observed dependencies are illustrated in Figure 2. Indeed, we observed high values of $f_{\mathrm{ns}}$ at low CT, especially for non-recalibrated data (liver tissues), indicating their poor quality. In contrast, recalibrated data demonstrated a much smoother threshold dependence. Of course, Phred quality and read coverage values are interconnected: the higher the coverage, the greater the Phred quality. Applied Biosystems SOLiD and Illumina HiSeq data behave differently. The general trend toward reduction and stabilization of $f_{\mathrm{ns}}$ at higher QT and CT values was more pronounced for Illumina. SOLiD reads demonstrated a slower reduction of $f_{\mathrm{ns}}$; stabilization of this value was observed only at QT > 35 for BQS-recalibrated data, whereas no stabilization was seen even at QT = 55 for the raw data. Although the SOLiD sequencer reads each nucleotide twice, which suggests higher read quality, Illumina proved to be the more reliable source of data for SNP calling. The desired $f_{\mathrm{ns}}$ was estimated as 40-43% (for clearly sufficient coverage thresholds, e.g. 20 reads).
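The expectation for randomly distributed substitutions can be checked by enumerating every single-nucleotide substitution in the standard genetic code. The sketch below assumes equal codon usage, equal probability of all substitutions, and counts substitutions to stop codons as non-synonymous. These are simplifying assumptions, so the result should only fall in the same range as the value quoted above, not reproduce it exactly.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def substitution_counts():
    """Count synonymous and non-synonymous single-base changes of sense codons."""
    syn = nonsyn = 0
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # mutations are only considered within sense codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == aa:
                    syn += 1
                else:
                    nonsyn += 1
    return syn, nonsyn


syn, nonsyn = substitution_counts()
random_ns_fraction = nonsyn / (syn + nonsyn)
```

Under these assumptions, 415 of the 549 possible substitutions (about 76%) are non-synonymous. The exact figure shifts with codon usage and with the mutation model, for example when transitions are weighted more heavily than transversions.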
BQS recalibration significantly increases the quality of SAP/SNP prediction, as indicated by the much lower values of the relative nsSNP frequency $f_{\mathrm{ns}}$ for the recalibrated data. We chose the minimum CT so as to exclude the regions where $f_{\mathrm{ns}}$ grows rapidly and to fit the desired range of 40-43% (Figure 2). Generally, the acceptable thresholds for high-confidence SAP (FDR < 0.02) can be selected as QT = 15 / CT = 4 for Illumina and QT = 40 / CT = 12 for SOLiD data. Filtration by CT rather than QT gives better results for SOLiD data (Figure 2). Based on a comparison of $f_{\mathrm{ns}}$ between the selected CT/QT and clearly high coverage/quality threshold values, one should expect FDR < 1% for Illumina data and < 2.5% for SOLiD data.
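The threshold selection described in this section can be sketched as a scan over coverage thresholds. The sketch assumes each coding SNP is given as a tuple of read coverage, Phred quality, and a flag for non-synonymous changes. The function names and toy data are hypothetical; only the 40-43% target range is taken from the text.

```python
def ns_fraction(snps, ct=0, qt=0):
    """Relative nsSNP frequency among coding SNP passing both thresholds."""
    kept = [is_ns for cov, qual, is_ns in snps if cov >= ct and qual >= qt]
    if not kept:
        return None  # no SNP left at these thresholds
    return sum(kept) / len(kept)


def scan_coverage(snps, thresholds, qt=0):
    """Relative nsSNP frequency for each candidate coverage threshold."""
    return {ct: ns_fraction(snps, ct, qt) for ct in thresholds}


def smallest_ct_in_range(curve, low=0.40, high=0.43):
    """Lowest coverage threshold whose nsSNP frequency lies in the target range."""
    for ct in sorted(curve):
        value = curve[ct]
        if value is not None and low <= value <= high:
            return ct
    return None


# Toy data: low-coverage calls are enriched in (likely false) nsSNP.
snps = (
    [(1, 10, True)] * 6 + [(1, 10, False)] * 2
    + [(5, 30, True)] * 3 + [(5, 30, False)] * 4
    + [(20, 50, True)] * 5 + [(20, 50, False)] * 7
)
curve = scan_coverage(snps, thresholds=[1, 5, 20])
chosen_ct = smallest_ct_in_range(curve)
```

With the toy data, the frequency is inflated at CT = 1 and enters the target range at CT = 5, which the scan returns as the minimal acceptable coverage threshold. The same scan can be repeated over QT at a fixed CT.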

Selection of thresholds based on the proportion of known variants

Another indicator of SAP prediction quality is the share of known variants within the pool of predicted SNP. Using PPLine, we calculated the dependence of the proportion of known SAP (annotated in dbSNP) on the minimal coverage (Figure 3). We found a strong increase in the fraction of unknown SNP for liver tissue at low coverage, which also indicates poor data quality or mis-evaluated BQS. Moreover, liver tissues demonstrated a higher mutation rate than HepG2 at low CT. This is unexpected and suggests a systematic error. BQS recalibration significantly improved the accuracy of SNP calling, as evidenced by the elimination of the sharp rise in unannotated SNP counts at low CT. The ratio of liver/HepG2 SAP counts also became biologically relevant (Figure 3). For our data, BQS recalibration works better with Illumina than with SOLiD data: the relative nsSNP frequency $f_{\mathrm{ns}}$ and the share of unannotated SNP variants decrease more markedly for Illumina predictions, and their absolute values are generally lower. This is not surprising: the two-base encoding system of SOLiD sequencing is a source of peculiar read errors that complicate BQS recalibration. On the basis of these data, we consider the previously selected coverage threshold values (CT = 4 for Illumina and CT = 12 for SOLiD) acceptable.

We marked SAP that passed the coverage and quality thresholds as high-confidence. However, many polymorphisms were found by both the Illumina and SOLiD platforms with coverage below the CT of each platform. These SAP were also accepted as reliable if the summed read count and Phred quality exceeded the Illumina thresholds. Additionally, all known polymorphisms annotated in dbSNP, Exome Sequencing Project, NCI60, or COSMIC were marked as high-confidence. Finally, 7756 high-confidence SAP were found (105 SAP for 63 genes of chromosome 18; Figure 4). Of these, 659 of 7756 SAP (8.5%, complete genome) and 8 of 105 (8%, chromosome 18) were not annotated earlier in dbSNP, 1000 Genomes, Exome Sequencing Project (ESP), COSMIC, or NCI60 whole exome sequencing databases (see Supporting table S1). Potential proteotypic peptides, unique across the human genome, for subsequent mass-spectrometric detection of the amino acid substitutions were identified for 63 polymorphic variants (60%) of chromosome 18 proteins. FASTA protein databases were generated (see Supporting material S2-S3).
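The rules for marking a SAP as high-confidence can be summarized in a small decision function. The thresholds are those given in the text. The data structure and the function name are hypothetical, and two details are assumptions: thresholds are treated as inclusive, and the "summed" quality of a SAP seen on both platforms is taken as the sum of the two Phred values.

```python
from dataclasses import dataclass
from typing import Optional

# (coverage threshold CT, quality threshold QT) per platform, from the text.
THRESHOLDS = {"illumina": (4, 15), "solid": (12, 40)}


@dataclass
class Call:
    coverage: int
    quality: float


def is_high_confidence(illumina: Optional[Call], solid: Optional[Call],
                       annotated: bool) -> bool:
    # Rule 1: variants already annotated in a reference database are accepted.
    if annotated:
        return True
    # Rule 2: the call passes the thresholds of its own platform.
    for platform, call in (("illumina", illumina), ("solid", solid)):
        ct, qt = THRESHOLDS[platform]
        if call is not None and call.coverage >= ct and call.quality >= qt:
            return True
    # Rule 3: seen on both platforms, and the summed values pass the
    # Illumina thresholds (summing qualities is an assumption).
    if illumina is not None and solid is not None:
        ct, qt = THRESHOLDS["illumina"]
        return (illumina.coverage + solid.coverage >= ct
                and illumina.quality + solid.quality >= qt)
    return False
```

For example, a novel SAP with 2 Illumina reads and 3 SOLiD reads fails both platform thresholds on its own but is rescued by the combined rule, whereas the same 3 SOLiD reads alone are rejected.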

PPLine as proteogenomic pipeline

Both approaches to CT/QT selection implemented in PPLine work better with exome sequencing data than with RNA-Seq data. For RNA-Seq, one should expect a biological tendency toward a lower relative nsSNP frequency $f_{\mathrm{ns}}$ for highly expressed genes with many reads. The fraction of highly expressed genes contains a greater proportion of genes encoding crucial cellular regulators, important metabolic enzymes, or structural proteins. Non-synonymous mutations impairing the functionality of the protein products of these genes are under strong negative selection. In contrast, the fraction of genes with low expression contains many minor regulators, predicted genes, or transcribed pseudogenes with no protein product. Thus, the proportion of nsSNP is expected to be greater for genes with low coverage. In the case of exome sequencing, $f_{\mathrm{ns}}$ would be nearly constant, with no tendency to decrease at high CT/QT.

To date, several SNP calling tools are known (GATK, samtools, CASAVA, VarScan, glfTools, SOAPsnp). They demonstrate greatly different specificity and sensitivity values at various read coverages.14 PPLine uses the samtools SNP caller, which along with GATK and VarScan shows the greatest accuracy for low-coverage data (5x) with several technical repeats.30, 31 However, CASAVA shows the optimal combination of specificity and sensitivity for high-coverage exome sequencing (>30x).31 RNA-Seq does not give uniform coverage for all genes; coverage depends directly on gene expression. Samtools should therefore be more suitable for RNA-Seq, which gives low coverage for many genes, whereas GATK has issues with SNP calling for spliced reads. Generally, BQS recalibration proves to be a mandatory step of an SAP calling pipeline.33, 37 PPLine uses GATK BaseRecalibrator, the most popular tool for BQS recalibration.30 Note that there are other tools for BQS recalibration, for example RVboost, which demonstrates higher accuracy than GATK and than another RNA-Seq variant-calling pipeline, SNPiR.38 However, installation of additional uncommon software could be challenging for a user, so PPLine uses GATK.

PPLine represents a bridge between raw RNA-Seq or exome sequencing data and proteomic analysis. It offers the advantages of a single pipeline: splice-aware read alignment, an original approach to SAP filtration, automated SAP annotation, alternative splicing profiling, combination of SAP and splice variants for FASTA database generation, and prediction of unique proteotypic peptides. Thus, PPLine analyzes experimental RNA-Seq or exome sequencing data and prepares a basis for the second phase of a proteogenomic study: mass-spectrometric detection of the predicted proteoforms.

In addition to the SNP filtration methods implemented in PPLine, other approaches and experimental techniques are known to improve or filter exome/transcriptome sequencing results. Two of them deserve special attention. The first is based on the ratio between transition (interchanges between the two-ring purines, A and G, or between the one-ring pyrimidines, C and T) and transversion (interchanges between a purine and a pyrimidine) mutations/SNP.33 The Ti/Tv ratio is an indicator of result quality; a higher Ti/Tv ratio generally suggests higher accuracy, and it should be near 2.1 for genome and 3.2 for exome sequencing. However, distinct sequencing platforms may show a preference for Ti- or Tv-type errors, so the approach based on the relative nsSNP frequency $f_{\mathrm{ns}}$ seems to be more universal and platform-independent. Another SNP filtration strategy involves adding synthetic RNA spike-in standards, which lack homology to almost all biological organisms, to the RNA being analyzed.39-41
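The Ti/Tv ratio mentioned above is straightforward to compute from a list of single-nucleotide variants. The following minimal sketch assumes variants are given as (reference base, alternative base) pairs; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions; entries with ref == alt are skipped."""
    ti = tv = 0
    for ref, alt in snvs:
        if ref == alt:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
ratio = ti_tv_ratio(snvs)  # three transitions, one transversion
```

A call set whose ratio falls well below the expected value for its sequencing type is likely to contain an excess of false positives, since random errors produce twice as many transversion types as transition types.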

Indels associated with long alternate reading frames

Among 348 indels in transcripts, we found 40 variations associated with possible translation of alternate reading frames (ARF) longer than 150 bp (50 aa) and 17 indels associated with reading frames longer than 300 bp. Indel quality scoring is fundamentally different from SNV scoring, so the methods described above cannot be applied to classify the indels found. However, the number of indels for HepG2 (293) was significantly higher than for normal liver (70), which is biologically relevant. Nine of these indels were common to liver tissues and HepG2 cells; 5 of these 9 variations were associated with ARF longer than 120 bp. The abnormal distribution of ARF lengths (enriched in long ARF), unexpected under the assumption of their random occurrence, suggests their functional impact. PROSITE and Pfam analysis revealed the presence of the following in ARF products:

• Gcn5-related N-acetyltransferase signature (TMEM8A gene, e-value < $10^{-10}$);

• Signature of caspase binding domain (SLMO1 gene, e-value < $10^{-20}$);

• 6 glycosylation sites, 14 amidation sites, 84 protein kinase C phosphorylation sites, 22 cAMP- and cGMP-dependent protein kinase phosphorylation sites.

Surprisingly, some ARF-encoded protein products have striking homology to their orthologs in other organisms (in the reference proteomes of those organisms). This suggests the occurrence of similar indels during phylogenesis of the organisms. For example, the ARF product of the potassium voltage-gated channel KCNA3 (126 aa) has 77% local homology (62 aa) with its ortholog in the Myotis davidii proteome; the FOXC1 ARF (205 aa) shows 84% homology (56 aa) to the mCG11671 protein of Mus musculus and 67% to Chelonia mydas forkhead box D3. We also noticed that a part of SLMO1 shares high homology (60-70%) with LYR motif-containing protein, B-cell CLL/lymphoma 7 protein, PDZ and LIM domain protein, BEN domain-containing protein, and a few other human reference proteins, which suggests the presence of a functional motif in the encoded ARF product; however, this motif is not annotated in PROSITE or Pfam.

Some of the predicted ARF (IQSEC1, RYBP, MPV17) could result from GRCh37.p10 misannotation, which is fixed in the latest human genome and transcriptome assembly, GRCh38.p3. Parts of the CDK11A and MFSD12 ARFs are known to be normally included in the transcript due to alternative splicing. A one-nucleotide insertion in SORBS2 leads to translation of an alternative reading frame that was reported earlier in the NEDO human cDNA sequencing and ORF prediction project.42 One should also mention another human ARF database, HAltORF, a catalogue of out-of-frame alternative ORFs predicted on the basis of the genomic sequence.43 Recent proteomic analysis enabled direct detection of many protein products of ARF localized in untranslated regions or overlapping the reference ORFs.44, 45

Transcript insertions/deletions associated with activation of ARF may result either from genomic mutations or from transcript editing. Activation of alternate reading frames via programmed ribosomal frameshifts is a mechanism essential for the replication of many viruses.46-48 In cellular organisms, ribosomal frameshifting also takes place,49, 50 but RNA editing plays a significant role in ARF activation.8, 51 RNA editing, an important post-transcriptional process causing base substitutions, insertions, and deletions, has many biological effects, including regulation of alternative splicing and gene expression.8, 51, 52 RNA editing is best studied in plant chloroplasts, where some cytosines are deaminated to uridines.53 In mammals, the most studied RNA editing event is adenosine-to-inosine substitution.54 Translation of alternate reading frames due to deletions and insertions in transcripts has been studied less intensively.51 Only 4 of the 100 top indels with the longest ARFs are annotated in dbSNP, whereas >90% of SNP are known according to dbSNP data. Exome and genome sequencing results, rather than transcriptome sequencing results, are the major source of information for the dbSNP database, which comprises data from the Exome Sequencing Project, 1000 Genomes, and other resources. This suggests RNA editing as the primary mechanism behind these indels. However, to prove this hypothesis, exome sequencing data for these samples are needed. This remains a task for future research.
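How a frameshifting indel exposes an alternate reading frame can be illustrated with a minimal sketch. It assumes that the ARF length is measured from the first codon affected by the indel to the first stop codon in the shifted frame, and that only the coding sequence is available (the frame is truncated at the sequence end if no stop is met). The function names are hypothetical, and the text does not describe how PPLine itself computes ARF lengths; only the 300 bp cutoff is taken from it.

```python
STOPS = {"TAA", "TAG", "TGA"}
LONG_ARF_BP = 300  # cutoff for "long" ARF used in the text


def apply_indel(cds, pos, ref, alt):
    """Replace allele `ref` at 0-based position `pos` of the CDS with `alt`."""
    if cds[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return cds[:pos] + alt + cds[pos + len(ref):]


def arf_length(cds, pos, ref, alt):
    """Length in bp of the shifted frame downstream of a frameshifting indel.

    In-frame indels (length change divisible by 3) return 0.
    """
    if (len(alt) - len(ref)) % 3 == 0:
        return 0
    mutant = apply_indel(cds, pos, ref, alt)
    start = pos - pos % 3  # first codon affected by the indel
    length = 0
    for i in range(start, len(mutant) - 2, 3):
        if mutant[i:i + 3] in STOPS:
            break
        length += 3
    return length


def is_long_arf(cds, pos, ref, alt):
    return arf_length(cds, pos, ref, alt) > LONG_ARF_BP


# Toy example: deleting one G shifts the frame onto an early TGA stop.
cds = "ATGGCACTGACCGGTTAA"
shifted = arf_length(cds, 3, "G", "")
```

In the toy example the shifted frame reads CAC and then stops at TGA, so the ARF is only 3 bp long. Real candidates would be evaluated on the full transcript so that frames running past the annotated stop codon are not truncated.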

Novel splice variants in normal liver tissues and HepG2 cells

Alternative splicing introduces an additional level of proteome complexity. Results from combining proteoENCODEdb searches with experimental mass spectral data indicate that some alternative splice forms detected at the transcript level are in fact translated into proteins.55 Alternative splicing profiles show pronounced tissue specificity.56 In our previous study, a meta-analysis of transcriptomic NCBI Sequence Read Archive data for eight human tissues, liver along with brain demonstrated the highest complexity of transcript isoforms arising from alternative splicing.11 The present study aims to predict possible proteoforms translated from alternatively spliced transcripts in normal liver tissues and HepG2 cells, based on deep transcriptome sequencing performed on two platforms, Illumina HiSeq and Applied Biosystems SOLiD. Alternative splicing analysis with discovery of novel exon junctions allowed us to predict proteoforms not previously annotated in UniProtKB for 642 and 1001 genes from liver RNA-Seq data on the Illumina and SOLiD platforms, respectively (10 and 12 genes of chromosome 18), and for 1681 and 1389 genes (30 and 16 genes of chromosome 18) in HepG2 cells. Proteoforms for 204 (liver tissues) and 475 (HepG2 cells) genes were predicted from both Illumina and SOLiD data. On chromosome 18, Illumina and SOLiD data concordantly suggested novel splice variants for the PIEZO2, SS18, NEDD4L, and CCBE1 genes in liver tissues and for the PIEZO2, SPIRE1, CEP192, ROCK1, MIB1, SMAD7, NARS, and KIAA1468 genes in HepG2 cells. The results are summarized in Supporting table S3.
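The cross-platform agreement reported for chromosome 18 amounts to a set intersection. As a minimal sketch, using only the gene lists given in the text (the variable names are illustrative and are not part of PPLine), the genes with platform-concordant novel splice variants in both sample types can be found as follows:

```python
# Chromosome 18 genes for which both Illumina and SOLiD data suggested
# novel splice variants (lists copied from the text).
liver_both_platforms = {"PIEZO2", "SS18", "NEDD4L", "CCBE1"}
hepg2_both_platforms = {
    "PIEZO2", "SPIRE1", "CEP192", "ROCK1",
    "MIB1", "SMAD7", "NARS", "KIAA1468",
}

# Genes supported by both platforms in both liver tissue and HepG2 cells.
shared = liver_both_platforms & hepg2_both_platforms
# Genes supported by both platforms in only one of the two sample types.
liver_only = liver_both_platforms - hepg2_both_platforms
hepg2_only = hepg2_both_platforms - liver_both_platforms

print(sorted(shared))
```

With these lists, PIEZO2 is the only chromosome 18 gene supported by both platforms in both liver tissues and HepG2 cells.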

CONCLUSIONS

The Chromosome-Centric Human Proteome Project aims to study the diversity and functions of human proteins. One of the most powerful techniques, the two-stage proteogenomic approach, can identify protein variations arising from genomic polymorphisms, mRNA editing, and alternative transcript splicing, the three major sources of proteome diversity. The results of the present work should serve as a basis for the second phase of proteogenomic studies being performed by members of the C-HPP consortium. Here we present PPLine, a pipeline developed as a bridge between RNA-Seq or exome sequencing data and proteomic analysis. PPLine is designed to process raw sequencing data, identify protein SAPs and splice variants, and predict SAP- and splice-specific proteotypic peptides for subsequent proteomic studies. Using PPLine, we found 7756 variations in HepG2 cells and liver tissues at the transcriptome level (659 of them not annotated in dbSNP v139), including 17 indels in mRNA that can be associated with translation of alternative reading frames longer than 300 bp (maximum 687 bp). PROSITE and Pfam analysis revealed a caspase binding domain motif and a Gcn5-related N-acetyltransferase signature in the SLMO1 and TMEM8A genes. Ortholog screening indicated translation of the KCNA3 and FOXC1 ARFs in other organisms.
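To illustrate how a frameshift indel changes the reading frame downstream of the variant, the following minimal sketch applies a single-base deletion to a toy coding sequence and measures the distance to the first in-frame stop codon. The sequence, function names, and coordinates are hypothetical, and this is not PPLine's implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def apply_deletion(seq, pos, length=1):
    """Return seq with `length` bases removed starting at 0-based `pos`."""
    return seq[:pos] + seq[pos + length:]

def frame_length_from(seq, pos):
    """Length in bp of the open frame from the codon containing `pos`
    up to (not including) the first in-frame stop codon.
    The reading frame is anchored at position 0 (the start codon)."""
    start = pos - pos % 3
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i - start
    # No stop codon found: count only complete codons.
    return (len(seq) - start) // 3 * 3

reference = "ATGGCATTAGCCTAA"          # toy CDS: ATG GCA TTA GCC TAA
mutant = apply_deletion(reference, 3)  # shifted frame: ATG CAT TAG CCT AA

print(frame_length_from(reference, 3))
print(frame_length_from(mutant, 3))
```

In this toy case the shifted frame reaches a stop codon after a single codon. The ARFs highlighted in this work are the opposite situation, where the shifted frame stays open for more than 300 bp.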


DECLARATION OF INTEREST

The authors declare no conflict of interest.

AUTHOR INFORMATION

Corresponding Author

Dr. George S. Krasnov, [email protected], +7-916-983-6952

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

SUPPORTING INFORMATION

Supporting table S1. Catalogue of SAPs (single amino acid residue variations and indels) for the liver tissues and HepG2 cells analyzed in this study. Annotations from the dbSNP, 1000 Genomes Project, Exome Sequencing Project, and ClinVar databases are provided. Predicted functional impact scores of SAPs/mutations are given according to PolyPhen-2, SIFT, LRT, PhyloP, and other algorithms.

Supporting material S2. FASTA databases for SAP-containing (Proteins.Seq.Alt.SAP.fa) and reference proteins (*.Ref.fa). Sequence identifiers are presented in the format: Uniprot ID | Ensembl protein ID | chromosome : position in chr. : ref.variant (nucl.) : alt.variant (nucl.) : strand : ref.variant (a.a.) : position in the protein : alt.variant (a.a.).

Supporting table S3. Novel isoforms predicted for liver tissues and HepG2 cells on the basis of Illumina HiSeq and Applied Biosystems SOLiD RNA-Seq data. Read coverage, FPKM (fragments per kbp per million) values, predicted protein sequences, and isoform-specific proteotypic peptides are provided.

This material is available free of charge via http://pubs.acs.org/
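The sequence identifier layout described for Supporting material S2 can be read with a few lines of code. The sketch below is a minimal illustration under stated assumptions: it assumes that `|` and `:` are used exactly as in the description, that whitespace around them is optional, and that positions are integers. The class name, function name, and example identifier are made up for illustration and are not taken from the data set.

```python
from typing import NamedTuple

class SapRecord(NamedTuple):
    uniprot_id: str
    ensembl_protein_id: str
    chromosome: str
    chrom_position: int
    ref_nucl: str
    alt_nucl: str
    strand: str
    ref_aa: str
    protein_position: int
    alt_aa: str

def parse_sap_header(header: str) -> SapRecord:
    """Parse one FASTA identifier line in the S2 layout:
    Uniprot ID | Ensembl protein ID | chr : pos : ref : alt : strand : ref aa : pos : alt aa
    """
    uniprot_id, ensembl_id, variant = (
        part.strip() for part in header.lstrip(">").split("|")
    )
    chrom, pos, ref_n, alt_n, strand, ref_a, prot_pos, alt_a = (
        field.strip() for field in variant.split(":")
    )
    return SapRecord(
        uniprot_id, ensembl_id, chrom, int(pos),
        ref_n, alt_n, strand, ref_a, int(prot_pos), alt_a,
    )

# Made-up identifier for illustration only (hypothetical IDs and coordinates).
example = ">P00000|ENSP00000000000|18:1234567:G:A:+:R:42:H"
record = parse_sap_header(example)
print(record.chromosome, record.ref_aa, record.protein_position, record.alt_aa)
```

A header that does not have exactly three `|`-separated parts and eight `:`-separated variant fields raises a `ValueError`, which is a reasonable way to catch files that deviate from the assumed layout.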

ACKNOWLEDGEMENT

This work was supported by the Russian Science Foundation (grant no. 14-25-00132).

Part of this work (computational analysis) was performed using the equipment of the EIMB RAS "Genome" center (http://www.eimb.ru/RUSSIAN_NEW/INSTITUTE/ccu_genome_c.php) with financial support from the Ministry of Education and Science of the Russian Federation (Contract 14.621.21.0001, project's unique identifier RFMEFI6114X0001).

ABBREVIATIONS

BQS, base quality score; SAP, single amino acid polymorphism; CT, read coverage threshold; QT, Phred quality score threshold; synSNP, synonymous SNP; nsSNP, non-synonymous SNP; Ti, transitions; Tv, transversions.
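Two of these abbreviations have simple standard definitions that can be written down directly. A Phred quality score Q corresponds to an error probability of 10^(-Q/10), and a substitution is a transition (Ti) when it exchanges two purines or two pyrimidines and a transversion (Tv) otherwise. The helper names below are illustrative only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score into the probability that the call is wrong."""
    return 10 ** (-q / 10)

def substitution_class(ref: str, alt: str) -> str:
    """Return 'Ti' for a transition and 'Tv' for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt are identical; not a substitution")
    pair = {ref, alt}
    if not pair <= PURINES | PYRIMIDINES:
        raise ValueError("expected bases from A, C, G, T")
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "Ti" if same_class else "Tv"

print(phred_to_error_probability(20))  # Q20: 1 error in 100
print(substitution_class("A", "G"))    # purine to purine: transition
print(substitution_class("A", "C"))    # purine to pyrimidine: transversion
```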

REFERENCES

1. Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nature biotechnology 2012, 30, (3), 221-3.

2. Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S., Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. Journal of proteome research 2014, 13, (1), 15-20.

3. Ponomarenko, E. A.; Kopylov, A. T.; Lisitsa, A. V.; Radko, S. P.; Kiseleva, Y. Y.; Kurbatov, L. K.; Ptitsyn, K. G.; Tikhonova, O. V.; Moisa, A. A.; Novikova, S. E.; Poverennaya, E. V.; Ilgisonis, E. V.; Filimonov, A. D.; Bogolubova, N. A.; Averchuk, V. V.; Karalkin, P. A.; Vakhrushev, I. V.; Yarygin, K. N.; Moshkovskii, S. A.; Zgoda, V. G.; Sokolov, A. S.; Mazur, A. M.; Prokhortchouck, E. B.; Skryabin, K. G.; Ilina, E. N.; Kostrjukova, E. S.; Alexeev, D. G.; Tyakht, A. V.; Gorbachev, A. Y.; Govorun, V. M.; Archakov, A. I., Chromosome 18 transcriptoproteome of liver tissue and HepG2 cells and targeted proteome mapping in depleted plasma: update 2013. Journal of proteome research 2014, 13, (1), 183-90.

4. Li, H. D.; Menon, R.; Omenn, G. S.; Guan, Y., Revisiting the identification of canonical splice isoforms through integration of functional genomics and proteomics evidence. Proteomics 2014, 14, (23-24), 2709-18.

5. Karpova, M. A.; Karpov, D. S.; Ivanov, M. V.; Pyatnitskiy, M. A.; Chernobrovkin, A. L.; Lobas, A. A.; Lisitsa, A. V.; Archakov, A. I.; Gorshkov, M. V.; Moshkovskii, S. A., Exome-driven characterization of the cancer cell lines at the proteome level: the NCI-60 case study. Journal of proteome research 2014, 13, (12), 5551-60.

6. Fanayan, S.; Smith, J. T.; Lee, L. Y.; Yan, F.; Snyder, M.; Hancock, W. S.; Nice, E., Proteogenomic analysis of human colon carcinoma cell lines LIM1215, LIM1899, and LIM2405. Journal of proteome research 2013, 12, (4), 1732-42.

7. Zhang, B.; Wang, J.; Wang, X.; Zhu, J.; Liu, Q.; Shi, Z.; Chambers, M. C.; Zimmerman, L. J.; Shaddox, K. F.; Kim, S.; Davies, S. R.; Wang, S.; Wang, P.; Kinsinger, C. R.; Rivers, R. C.; Rodriguez, H.; Townsend, R. R.; Ellis, M. J.; Carr, S. A.; Tabb, D. L.; Coffey, R. J.; Slebos, R. J.; Liebler, D. C.; Nci, C., Proteogenomic characterization of human colon and rectal cancer. Nature 2014, 513, (7518), 382-7.

8. Gott, J. M., Expanding genome capacity via RNA editing. Comptes rendus biologies 2003, 326, (10-11), 901-8.

9. Korff, S.; Woerner, S. M.; Yuan, Y. P.; Bork, P.; von Knebel Doeberitz, M.; Gebert, J., Frameshift mutations in coding repeats of protein tyrosine phosphatase genes in colorectal tumors with microsatellite instability. BMC cancer 2008, 8, 329.

10. Delgado, A. P.; Brandao, P.; Chapado, M. J.; Hamid, S.; Narayanan, R., Open reading frames associated with cancer in the dark matter of the human genome. Cancer genomics & proteomics 2014, 11, (4), 201-13.

11. Shargunov, A. V.; Krasnov, G. S.; Ponomarenko, E. A.; Lisitsa, A. V.; Shurdov, M. A.; Zverev, V. V.; Archakov, A. I.; Blinov, V. M., Tissue-specific alternative splicing analysis reveals the diversity of chromosome 18 transcriptome. Journal of proteome research 2014, 13, (1), 173-82.

12. Bai, Y.; Hassler, J.; Ziyar, A.; Li, P.; Wright, Z.; Menon, R.; Omenn, G. S.; Cavalcoli, J. D.; Kaufman, R. J.; Sartor, M. A., Novel bioinformatics method for identification of genome-wide non-canonical spliced regions using RNA-Seq data. PloS one 2014, 9, (7), e100864.

13. Gaudet, P.; Argoud-Puy, G.; Cusin, I.; Duek, P.; Evalet, O.; Gateau, A.; Gleizes, A.; Pereira, M.; Zahn-Zabal, M.; Zwahlen, C.; Bairoch, A.; Lane, L., neXtProt: organizing protein knowledge in the context of human proteome projects. Journal of proteome research 2013, 12, (1), 293-8.

14. Yu, X.; Sun, S., Comparing a few SNP calling algorithms using low-coverage sequencing data. BMC bioinformatics 2013, 14, 274.

15. Raineri, E.; Ferretti, L.; Esteve-Codina, A.; Nevado, B.; Heath, S.; Perez-Enciso, M., SNP calling by sequencing pooled samples. BMC bioinformatics 2012, 13, 239.


16. McCormick, R. F.; Truong, S. K.; Mullet, J. E., RIG: Recalibration and Interrelation of Genomic Sequence Data with the GATK. G3 2015, 5, (4), 655-65.

17. Kodama, Y.; Shumway, M.; Leinonen, R.; International Nucleotide Sequence Database, C., The Sequence Read Archive: explosive growth of sequencing data. Nucleic acids research 2012, 40, (Database issue), D54-6.

18. Pirooznia, M.; Kramer, M.; Parla, J.; Goes, F. S.; Potash, J. B.; McCombie, W. R.; Zandi, P. P., Validation and assessment of variant calling pipelines for next-generation sequencing. Human genomics 2014, 8, 14.

19. Torres-Garcia, W.; Zheng, S.; Sivachenko, A.; Vegesna, R.; Wang, Q.; Yao, R.; Berger, M. F.; Weinstein, J. N.; Getz, G.; Verhaak, R. G., PRADA: pipeline for RNA sequencing data analysis. Bioinformatics 2014, 30, (15), 2224-6.

20. Zhu, P.; He, L.; Li, Y.; Huang, W.; Xi, F.; Lin, L.; Zhi, Q.; Zhang, W.; Tang, Y. T.; Geng, C.; Lu, Z.; Xu, X., OTG-snpcaller: an optimized pipeline based on TMAP and GATK for SNP calling from ion torrent data. PloS one 2014, 9, (5), e97507.

21. Nagaraj, S. H.; Waddell, N.; Madugundu, A. K.; Wood, S.; Jones, A.; Mandyam, R. A.; Nones, K.; Pearson, J. V.; Grimmond, S. M., PGTools: A Software Suite for Proteogenomic Data Analysis and Visualization. Journal of proteome research 2015, 14, (5), 2255-66.

22. Yu, K.; Salomon, A. R., HTAPP: high-throughput autonomous proteomic pipeline. Proteomics 2010, 10, (11), 2113-22.

23. Jagtap, P. D.; Johnson, J. E.; Onsongo, G.; Sadler, F. W.; Murray, K.; Wang, Y.; Shenykman, G. M.; Bandhakavi, S.; Smith, L. M.; Griffin, T. J., Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. Journal of proteome research 2014, 13, (12), 5898-908.

24. Zgoda, V. G.; Kopylov, A. T.; Tikhonova, O. V.; Moisa, A. A.; Pyndyk, N. V.; Farafonova, T. E.; Novikova, S. E.; Lisitsa, A. V.; Ponomarenko, E. A.; Poverennaya, E. V.; Radko, S. P.; Khmeleva, S. A.; Kurbatov, L. K.; Filimonov, A. D.; Bogolyubova, N. A.; Ilgisonis, E. V.; Chernobrovkin, A. L.; Ivanov, A. S.; Medvedev, A. E.; Mezentsev, Y. V.; Moshkovskii, S. A.; Naryzhny, S. N.; Ilina, E. N.; Kostrjukova, E. S.; Alexeev, D. G.; Tyakht, A. V.; Govorun, V. M.; Archakov, A. I., Chromosome 18 transcriptome profiling and targeted proteome mapping in depleted plasma, liver tissue and HepG2 cells. Journal of proteome research 2013, 12, (1), 123-34.

25. Nagaraj, N.; Wisniewski, J. R.; Geiger, T.; Cox, J.; Kircher, M.; Kelso, J.; Paabo, S.; Mann, M., Deep proteome and transcriptome mapping of a human cancer cell line. Molecular systems biology 2011, 7, 548.

26. Segura, V.; Medina-Aunon, J. A.; Mora, M. I.; Martinez-Bartolome, S.; Abian, J.; Aloria, K.; Antunez, O.; Arizmendi, J. M.; Azkargorta, M.; Barcelo-Batllori, S.; Beaskoetxea, J.; Bech-Serra, J. J.; Blanco, F.; Monteiro, M. B.; Caceres, D.; Canals, F.; Carrascal, M.; Casal, J. I.; Clemente, F.; Colome, N.; Dasilva, N.; Diaz, P.; Elortza, F.; Fernandez-Puente, P.; Fuentes, M.; Gallardo, O.; Gharbi, S. I.; Gil, C.; Gonzalez-Tejedo, C.; Hernaez, M. L.; Lombardia, M.; Lopez-Lucendo, M.; Marcilla, M.; Mato, J. M.; Mendes, M.; Oliveira, E.; Orera, I.; Pascual-Montano, A.; Prieto, G.; Ruiz-Romero, C.; Sanchez del Pino, M. M.; Tabas-Madrid, D.; Valero, M. L.; Vialas, V.; Villanueva, J.; Albar, J. P.; Corrales, F. J., Surfing transcriptomic landscapes. A step beyond the annotation of chromosome 16 proteome. Journal of proteome research 2014, 13, (1), 158-72.

27. Bolger, A. M.; Lohse, M.; Usadel, B., Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014, 30, (15), 2114-20.

28. Kim, D.; Pertea, G.; Trapnell, C.; Pimentel, H.; Kelley, R.; Salzberg, S. L., TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology 2013, 14, (4), R36.

29. Trapnell, C.; Roberts, A.; Goff, L.; Pertea, G.; Kim, D.; Kelley, D. R.; Pimentel, H.; Salzberg, S. L.; Rinn, J. L.; Pachter, L., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols 2012, 7, (3), 562-78.


30. Warden, C. D.; Adamson, A. W.; Neuhausen, S. L.; Wu, X., Detailed comparison of two popular variant calling packages for exome and targeted exon studies. PeerJ 2014, 2, e600.

31. Cheng, A. Y.; Teo, Y. Y.; Ong, R. T., Assessing single nucleotide variant detection and genotype calling on whole-genome sequenced individuals. Bioinformatics 2014, 30, (12), 1707-13.

32. Wang, K.; Li, M.; Hakonarson, H., ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research 2010, 38, (16), e164.

33. Liu, Q.; Guo, Y.; Li, J.; Long, J.; Zhang, B.; Shyr, Y., Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data. BMC genomics 2012, 13 Suppl 8, S8.

34. Liu, H.; Xie, Z.; Tan, S.; Zhang, X.; Yang, S., Relationship between amino acid usage and amino acid evolution in primates. Gene 2015, 557, (2), 182-7.

35. Lin, H.; Moghe, G.; Ouyang, S.; Iezzoni, A.; Shiu, S. H.; Gu, X.; Buell, C. R., Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana. BMC evolutionary biology 2010, 10, 41.

36. Supply, P.; Marceau, M.; Mangenot, S.; Roche, D.; Rouanet, C.; Khanna, V.; Majlessi, L.; Criscuolo, A.; Tap, J.; Pawlik, A.; Fiette, L.; Orgeur, M.; Fabre, M.; Parmentier, C.; Frigui, W.; Simeone, R.; Boritsch, E. C.; Debrie, A. S.; Willery, E.; Walker, D.; Quail, M. A.; Ma, L.; Bouchier, C.; Salvignol, G.; Sayes, F.; Cascioferro, A.; Seemann, T.; Barbe, V.; Locht, C.; Gutierrez, M. C.; Leclerc, C.; Bentley, S. D.; Stinear, T. P.; Brisse, S.; Medigue, C.; Parkhill, J.; Cruveiller, S.; Brosch, R., Genomic analysis of smooth tubercle bacilli provides insights into ancestry and pathoadaptation of Mycobacterium tuberculosis. Nature genetics 2013, 45, (2), 172-9.

37. Carson, A. R.; Smith, E. N.; Matsui, H.; Braekkan, S. K.; Jepsen, K.; Hansen, J. B.; Frazer, K. A., Effective filtering strategies to improve data quality from population-based whole exome sequencing studies. BMC bioinformatics 2014, 15, 125.


38. Wang, C.; Davila, J. I.; Baheti, S.; Bhagwate, A. V.; Wang, X.; Kocher, J. P.; Slager, S. L.; Feldman, A. L.; Novak, A. J.; Cerhan, J. R.; Thompson, E. A.; Asmann, Y. W., RVboost: RNA-seq variants prioritization using a boosting method. Bioinformatics 2014, 30, (23), 3414-6.

39. Zook, J. M.; Samarov, D.; McDaniel, J.; Sen, S. K.; Salit, M., Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing. PloS one 2012, 7, (7), e41356.

40. Vallania, F. L.; Druley, T. E.; Ramos, E.; Wang, J.; Borecki, I.; Province, M.; Mitra, R. D., High-throughput discovery of rare insertions and deletions in large cohorts. Genome research 2010, 20, (12), 1711-8.

41. Kircher, M.; Stenzel, U.; Kelso, J., Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome biology 2009, 10, (8), R83.

42. Maruyama, Y.; Wakamatsu, A.; Kawamura, Y.; Kimura, K.; Yamamoto, J.; Nishikawa, T.; Kisu, Y.; Sugano, S.; Goshima, N.; Isogai, T.; Nomura, N., Human Gene and Protein Database (HGPD): a novel database presenting a large quantity of experiment-based results in human proteomics. Nucleic acids research 2009, 37, (Database issue), D762-6.

43. Vanderperre, B.; Lucier, J. F.; Roucou, X., HAltORF: a database of predicted out-of-frame alternative open reading frames in human. Database : the journal of biological databases and curation 2012, 2012, bas025.

44. Vanderperre, B.; Lucier, J. F.; Bissonnette, C.; Motard, J.; Tremblay, G.; Vanderperre, S.; Wisztorski, M.; Salzet, M.; Boisvert, F. M.; Roucou, X., Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PloS one 2013, 8, (8), e70698.

45. Menschaert, G.; Van Criekinge, W.; Notelaers, T.; Koch, A.; Crappe, J.; Gevaert, K.; Van Damme, P., Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. Molecular & cellular proteomics : MCP 2013, 12, (7), 1780-90.

46. Niu, S.; Cao, S.; Wong, S. M., An infectious RNA with a hepta-adenosine stretch responsible for programmed -1 ribosomal frameshift derived from a full-length cDNA clone of Hibiscus latent Singapore virus. Virology 2014, 449, 229-34.

47. Melian, E. B.; Hall-Mendelin, S.; Du, F.; Owens, N.; Bosco-Lauth, A. M.; Nagasaki, T.; Rudd, S.; Brault, A. C.; Bowen, R. A.; Hall, R. A.; van den Hurk, A. F.; Khromykh, A. A., Programmed ribosomal frameshift alters expression of West Nile virus genes and facilitates virus replication in birds and mosquitoes. PLoS pathogens 2014, 10, (11), e1004447.

48. Firth, A. E.; Jagger, B. W.; Wise, H. M.; Nelson, C. C.; Parsawar, K.; Wills, N. M.; Napthine, S.; Taubenberger, J. K.; Digard, P.; Atkins, J. F., Ribosomal frameshifting used in influenza A virus expression occurs within the sequence UCC_UUU_CGU and is in the +1 direction. Open biology 2012, 2, (10), 120109.

49. Ivanov, I. P.; Atkins, J. F., Ribosomal frameshifting in decoding antizyme mRNAs from yeast and protists to humans: close to 300 cases reveal remarkable diversity despite underlying conservation. Nucleic acids research 2007, 35, (6), 1842-58.

50. Wills, N. M.; Atkins, J. F., The potential role of ribosomal frameshifting in generating aberrant proteins implicated in neurodegenerative diseases. Rna 2006, 12, (7), 1149-53.

51. Gray, M. W., Evolutionary origin of RNA editing. Biochemistry 2012, 51, (26), 5235-42.

52. Picardi, E.; D'Erchia, A. M.; Montalvo, A.; Pesole, G., Using REDItools to Detect RNA Editing Events in NGS Datasets. Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] 2015, 49, 12.12.1-12.12.15.

53. Jiang, Y.; Fan, S. L.; Song, M. Z.; Yu, J. N.; Yu, S. X., Identification of RNA editing sites in cotton (Gossypium hirsutum) chloroplasts and editing events that affect secondary and three-dimensional protein structures. Genetics and molecular research : GMR 2012, 11, (2), 987-1001.


54. Nigita, G.; Veneziano, D.; Ferro, A., A-to-I RNA Editing: Current Knowledge Sources and Computational Approaches with Special Emphasis on Non-Coding RNA Molecules. Frontiers in bioengineering and biotechnology 2015, 3, 37.

55. Nilsson, C. L.; Mostovenko, E.; Lichti, C. F.; Ruggles, K.; Fenyo, D.; Rosenbloom, K. R.; Hancock, W. S.; Paik, Y. K.; Omenn, G. S.; LaBaer, J.; Kroes, R. A.; Uhlen, M.; Hober, S.; Vegvari, A.; Andren, P. E.; Sulman, E. P.; Lang, F. F.; Fuentes, M.; Carlsohn, E.; Emmett, M. R.; Moskal, J. R.; Berven, F. S.; Fehniger, T. E.; Marko-Varga, G., Use of ENCODE resources to characterize novel proteoforms and missing proteins in the human proteome. Journal of proteome research 2015, 14, (2), 603-8.

56. Ellis, J. D.; Barrios-Rodiles, M.; Colak, R.; Irimia, M.; Kim, T.; Calarco, J. A.; Wang, X.; Pan, Q.; O'Hanlon, D.; Kim, P. M.; Wrana, J. L.; Blencowe, B. J., Tissue-specific alternative splicing remodels protein-protein interaction networks. Molecular cell 2012, 46, (6), 884-92.


Figure 1. PPLine workflow. PPLine combines several RNA-Seq/exome sequencing tools to perform raw data processing, read alignment, deduplication, BQS recalibration, SNP calling and indel realignment, annotation of the detected variants, and alternative splicing profiling, and integrates the derived data into a single database and tables.


Figure 2. A. Percentage of non-synonymous SNPs (nsSNPs) among all SNPs in the coding regions (all chromosomes) as a function of the predicted Phred quality threshold. nsSNPs occur less frequently than would be expected from a random SNP distribution (74%), so a higher nsSNP content indicates low data quality. B. Percentage of nsSNPs among all SNPs in the coding regions (all chromosomes) as a function of read coverage.
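The random-expectation baseline mentioned in this caption can be approximated from the standard genetic code alone. The sketch below counts every single-nucleotide substitution in the 61 sense codons and reports the share that changes the encoded amino acid or introduces a stop codon. It assumes uniform codon usage and equal rates for all substitutions, which is a simplification: the exact model behind the 74% figure is not specified in the caption, so a small difference is to be expected.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third base varies fastest); '*' marks stops.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def count_substitutions():
    """Count synonymous and non-synonymous single-base changes over sense codons."""
    synonymous = nonsynonymous = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # start only from sense codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa:
                    synonymous += 1
                else:
                    nonsynonymous += 1  # missense or stop-gain
    return synonymous, nonsynonymous

syn, nonsyn = count_substitutions()
fraction = nonsyn / (syn + nonsyn)
print(f"{nonsyn} of {syn + nonsyn} substitutions are non-synonymous ({fraction:.1%})")
```

Under these assumptions 415 of the 549 possible substitutions (about 75.6%) are non-synonymous, which is close to the 74% baseline quoted in the caption.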


Figure 3. Distribution of SAP counts (across all chromosomes) depending on the minimum read coverage threshold for Illumina data (A) and SOLiD data (B): before BQS recalibration (thin line) and after recalibration (bold line). The corresponding numbers of known variants (dbSNP) are marked with dotted lines.


Figure 4. Venn diagram illustrating high-confidence SAP counts for liver tissues and HepG2 cells using the Illumina and SOLiD platforms for proteins of chromosome 18 (large digits) and the complete genome (small digits in brackets).


Graphic abstract. SAP counts for the HepG2 cells and liver tissue studied in the Chromosome-centric Human Proteome Project, derived using the Illumina HiSeq and Applied Biosystems SOLiD platforms. Large digits: proteins of chromosome 18; small digits: complete genome.
