Article pubs.acs.org/jpr

Tissue-Specific Alternative Splicing Analysis Reveals the Diversity of Chromosome 18 Transcriptome

Alexander V. Shargunov,†,‡ George S. Krasnov,*,†,‡ Elena A. Ponomarenko,‡,§ Andrey V. Lisitsa,‡,§ Mikhail A. Shurdov,∥ Vitaliy V. Zverev,† Alexander I. Archakov,‡ and Vladimir M. Blinov†,‡

† I. I. Mechnikov Institute of Vaccines and Sera of the Russian Academy of Medical Sciences, 5A, Maly Kazenny per., 105064 Moscow, Russia
‡ Bioinformatics and Postgenome Research, V. N. Orekhovich Institute of Biomedical Chemistry of the Russian Academy of Medical Sciences, 10, Pogodinskaya Street, 119121 Moscow, Russia
§ LLC PostGenTech, 10, Pogodinskaya Street, 119121 Moscow, Russia
∥ LLC Panagen, 16/1, Dockukina street, 129226 Moscow, Russia

ABSTRACT: The Chromosome-centric Human Proteome Project (C-HPP) aims to identify the variety of protein products and transcripts of a number of chromosomes. The Russian part of C-HPP is devoted to the study of human chromosome 18. Using the widely accepted TopHat and SpliceGrapher, a tool for accurate prediction of splice sites and alternative mRNA isoforms, we performed extensive mining of the splice variants of chromosome 18 transcripts and the encoded protein products in liver, brain, lung, kidney, blood, testis, derma, and skeletal muscles. About 6.1 billion reads comprising 450 billion bases were analyzed. The relative frequencies of splice events as well as gene expression profiles in normal tissues were evaluated. Using ExPASy PROSITE, novel features and possible functional sites of previously unknown splice variants were highlighted. A set of unique proteotypic peptides enabling the identification of novel alternative protein species by mass spectrometry was constructed. The data will be integrated into the gene-centric knowledgebase of the Russian part of C-HPP available at http://kb18.ru and http://www.splicing.zz.mu/.

KEYWORDS: alternative splicing, next-generation sequencing, functional sites, proteotypic peptides, exon skipping, intron retention, splice junctions



© XXXX American Chemical Society

INTRODUCTION

Recent years have brought the widespread availability of next-generation sequencing (NGS) techniques, which allow extensive surveys of the genome and transcriptome as well as analyses such as ChIP-seq and HITS-CLIP that elucidate features of gene regulation. Alternative splicing provides immense proteome diversity from a limited number of genes, and the mining of tissue-specific splicing is important: knowledge of these mRNA variants and protein products, as well as of tissue-specifically expressed genes, sheds light on the developmental and functional programs employed in a given cell type. Examples of such genes include neurofascin (NFASC) and other neuronal cell adhesion molecules that are silenced in tissues other than neurons.1 Since NGS became widespread in the late 2000s, many efforts have been made to discover the variety of gene transcripts. For example, the AS-ALPS database provides alternative mRNA splice events that cause 3-D alterations of the protein structure.2 One should also mention SpliceAid3 and ASPicDB4 as the most informative databases of transcriptome splice variants, and the newly created HEXEvent,5 which makes it easy to extract alternative splicing (AS) information for individual or genome-wide human internal exons. Unfortunately, the source data sets underlying most of the available resources are not sufficiently representative.

Exhaustive mining of alternative splicing is becoming increasingly relevant because new bioinformatics tools for the accurate prediction of splicing events have been developed. To identify possible transcripts, we used SpliceGrapher as one of the most appropriate tools for this purpose. The machine learning approaches implemented in this application distinguish it favorably from widely used engines such as Cufflinks,6 Scripture,7 and TAU.8,9

One of the most common issues in alternative splicing analysis comes from the large dynamic range of mRNA content

Special Issue: Chromosome-centric Human Proteome Project Received: August 7, 2013


dx.doi.org/10.1021/pr400808u | J. Proteome Res. XXXX, XXX, XXX−XXX

Journal of Proteome Research


Table 1. Representativeness of the Alternative Splicing Analysis Source Data Across Various Tissues

tissue        single-end reads    paired-end reads    bases analyzed,   average read
              analyzed, M(a)      analyzed, M(a)      G(b)              length
brain         478.4               561.6               70.5              67.8
liver         164.9               306.3               33.6              71.2
blood         389.3               331.6               54.4              75.4
kidney        198.4               223.8               27.7              65.7
muscle        784.1               422.2               102.9             85.3
derma         472.8               805.2               100.2             78.4
testis        297.3               263.7               35.8              63.8
lung          540.7               266.3               55.9              69.3
all tissues   3190.9              2950.6              452.3             73.7

(a) M: millions. (b) G: billions.

nlm.nih.gov/Traces/sra/). The list of data sets used is provided in Supporting Table S1 in the Supporting Information. The volume of data in the NCBI SRA has been growing exponentially, by about five times every year, and continues to grow steadily. We used sequences derived from normal tissue samples, from both total RNA and the poly-A RNA fraction. No tumor samples or cell lines were used. Information about the sequences used is summarized in Table 1. The total size of the source data sets is ∼613 Gbases.

To minimize sequencing errors, we truncated the reads at the 3′-ends using FasTrunc, a tool developed in-house for smart sequence trimming. This application analyzes the distribution of read quality along the sequenced fragments and determines the optimal read length that excludes 3′-end regions with a high sequencing error rate but preserves highly informative regions. FasTrunc also eliminates low-quality reads and enables merging and splitting of fastq files. Unlike FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), FasTrunc performs completely automated analysis and processing.

Splice junction mapping was performed using the widely accepted TopHat2,17 which relies on the Bowtie2 aligner.18 The workflow of the analysis is illustrated in Figure 1. To speed up the analysis, we first mapped the reads to the human transcriptome (RefSeq, release 56, November 2012) using Bowtie2. The fraction of unaligned reads included sequences derived from transcribed introns, intergenic regions, or non-RefSeq spliced RNAs. This group of reads was then mapped to the human genome (without a search for splice junctions). At this stage, the unaligned reads included only sequences derived from non-RefSeq RNAs containing splice junctions. These reads were mapped to the genome with a search for splice boundaries.
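The 3′-end truncation step can be pictured with a short sketch. FasTrunc is an in-house tool whose algorithm is not described in the text, so the Python code below is only an illustration under stated assumptions: Phred+33 quality encoding, a cutoff defined as the longest prefix whose per-position mean quality stays at or above a threshold, and hypothetical function names and threshold values.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to numeric Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def optimal_length(quality_lines, min_mean_quality=20.0):
    """Longest prefix length whose per-position mean quality stays >= threshold."""
    if not quality_lines:
        return 0
    max_len = max(len(q) for q in quality_lines)
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute to the mean.
        column = [ord(q[pos]) - 33 for q in quality_lines if len(q) > pos]
        if mean(column) < min_mean_quality:
            return pos
    return max_len


def trim_reads(records, min_mean_quality=20.0, min_read_quality=15.0):
    """Truncate all reads at a common 3'-end cutoff and drop low-quality reads.

    records: list of (name, sequence, quality) tuples.
    Both thresholds are illustrative values, not those used by FasTrunc.
    """
    cut = optimal_length([q for _, _, q in records], min_mean_quality)
    kept = []
    for name, seq, qual in records:
        seq, qual = seq[:cut], qual[:cut]
        if qual and mean(phred_scores(qual)) >= min_read_quality:
            kept.append((name, seq, qual))
    return cut, kept
```

A single data-driven cutoff per data set keeps all reads at a comparable length; per-read trimming would be an equally plausible design.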
Finally, the alternative splicing analysis was done using the Python-based SpliceGrapher, which allows both identification of splice variants and visualization of the results.9 The search was performed using several thresholds: the number of reads mapped to the splice junctions or the number of reads entirely aligned to the included exons. The genomic maps illustrating exon skipping (ES) and intron retention (IR) events along with alternative transcription starts/stops, read density, and splice junction frequencies were prepared using the built-in SpliceGrapher plotter.

Differential expression patterns of the chromosome 18 genes were evaluated using Cufflinks, which also allows de novo assembly of transcripts and estimation of their abundances.6,19 We calculated FPKM (fragments per kilobase of transcript per million aligned reads) for each gene (or gene candidate) and each subset/sample within the source tissue data sets. These values were averaged with weights depending on the size of each subset. We removed the outliers with FPKM that deviated from the

of the various genes in a cell. Highly expressed genes, such as beta-actin, which is constitutively active in most tissues, or guanylate kinase-associated protein (DLGAP1), which is highly enriched in synaptosomal preparations of the brain, can be transcribed into thousands of mRNA copies or more.10 Most of the genes have

0; and maximum miscleavages: 0, 1, or 2.

We used ExPASy PROSITE21 to identify possible functional domains and motifs in the regions included in the proteins as a result of pre-mRNA alternative splicing. In particular, we extracted fragments of translated mRNA sequences not previously described in RefSeq (both intron retention and novel splice junction regions). To identify the signatures of protein domains and functional sites, we used pfsearch (ftp://ftp.expasy.org/databases/prosite/ps_scan/) to compare these translated fragments against the PROSITE library. These signatures serve as markers of the functional significance of such regions.
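To make the idea of scanning translated fragments for signatures concrete, the sketch below converts a pattern-type PROSITE signature into a regular expression and reports every match in a peptide fragment. It is an illustration only and does not reproduce how pfsearch or ps_scan work internally (pfsearch operates on profiles); the function names and the example fragment are hypothetical, sequence anchors (`<`, `>`) are not handled, and the pattern `N-{P}-[ST]-{P}` is used as a familiar example of PROSITE pattern syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat: x(2), x(2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                       # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        # '[...]' classes and single residues are already valid regex
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (start, matched_text) for all matches, including overlapping ones."""
    regex = re.compile("(?=(%s))" % prosite_to_regex(pattern))
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    fragment = "MKNGSAANPTL"  # hypothetical translated fragment
    print(scan(fragment, "N-{P}-[ST]-{P}"))
```

The lookahead wrapper is used so that overlapping occurrences of a short motif are all reported rather than only the leftmost non-overlapping ones.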

Identification of Alternative Transcript Variants

There are several generally recognized basic types of alternative splicing: alternative donor or acceptor sites leading to the partial loss of an exon or the incorporation of intronic sequences, and complete intron retention or exon skipping. Moreover, alternative transcription starts from different promoters and alternative polyadenylation sites introduce more diversity into the human transcriptome.27 In total, the analysis of chromosome 18 alternative splicing patterns revealed 2248 novel exons with at least one boundary not described in RefSeq (release 56, November 2012) and 2034 not described in Ensembl (release 73, 23 August 2013). The results are summarized in Table 2. Figure 2 illustrates the number of genes having alternatively spliced transcripts across various tissues. Figure 3 demonstrates the distribution of the numbers of alternative transcript isoforms.

Table 2. Summary of the Alternative Splicing Mining Results of the Chromosome 18(a)

tissue        novel exons    novel introns   novel 3′- and 5′-   novel transcription   novel poly-A   possible protein
                                             splice sites        starts                sites          species
brain         815 / 706      1610 / 1346     863 / 638           28 / 28               15 / 15        7236 / 7131
liver         258 / 198      495 / 358       263 / 161           12 / 12               7 / 7          1713 / 1658
blood         319 / 281      703 / 590       285 / 209           14 / 14               11 / 11        2512 / 2470
kidney        103 / 68       154 / 101       87 / 37             0 / 0                 3 / 3          915 / 878
muscle        601 / 502      1185 / 940      654 / 453           13 / 13               10 / 10        9295 / 9202
derma         491 / 431      808 / 643       395 / 269           20 / 20               24 / 23        2750 / 2688
testis        229 / 171      367 / 252       256 / 152           13 / 13               7 / 7          1783 / 1734
lung          540 / 442      978 / 751       582 / 399           10 / 10               12 / 12        4718 / 4636
all tissues   2248 / 2034    3763 / 3287     2205 / 1766         107 / 107             90 / 89        22040 / 21880
RefSeq        3677           3374            7279                481                   424            396
Ensembl       3748           3508            7416                481                   424            558

(a) The analysis identified a number of ES and IR events and alternative donor and acceptor sites not described in RefSeq (first value in each cell) and Ensembl (second value). The second column shows the number of newly identified exons with at least one boundary not described in RefSeq or Ensembl. Similarly, the third column shows the number of introns. The next column shows the number of newly identified 5′- (donor) and 3′- (acceptor) splice sites. The last column shows the number of possible protein species (Archakov et al., 2014, in press) derived from different permitted combinations of cassette exons and other AS events.

Not surprisingly, we observed the maximum abundance of splicing alterations in brain: about half of the AS events can be found in this tissue (Table 2). Four decades ago, the brain, in particular the cerebral cortex, was observed to have much higher expression of single-copy mRNAs than other human tissues (liver, kidney, and spleen).28 Relatively recent studies identified the brain as the organ with the maximum transcriptome diversity in the entire human organism.29,30 According to the results of the present study, most of the ES and IR events are partial because of alternative donor/acceptor splice sites, while complete exon skipping and complete retention of introns were rarer. Across different tissues, the greatest diversity of alternative splicing products was observed for the MBD1 gene (methyl-CpG binding domain protein 1; involved in transcription repression), CEP192 (centrosomal protein 192 kDa; a major regulator of pericentriolar material recruitment, centrosome maturation, and centriole duplication), TCF4 (a basic helix−loop−helix transcription factor 4), and other genes (Supporting Table S2 in the Supporting Information). The most prominent tissue-specific patterns of alternative splicing were registered for METTL4 (methyltransferase-like), CNDP1 (homodimeric dipeptidase), NOL4 (unknown function), CABLES1 (cyclin-dependent kinase binding protein; positively affects neuronal outgrowth), and also CELF4, which shows the maximum number of protein isoforms in brain. This gene encodes a protein product involved in the regulation of pre-mRNA alternative splicing;31 it may also be involved in mRNA editing and translation regulation of a number of genes, including ones associated with the regulation of synaptic function.32 Another set of genes, ATP9B (a subunit of ATPase), FHOD3 (an actin-organizing protein that may cause stress fiber formation), and TRAPPC8 (a member of the trafficking protein particle complex), demonstrated the greatest AS isoform diversity in muscle tissues.

We consider SpliceGrapher 0.2.3 one of the most accurate tools for predicting the variety of mRNA splice variants. It was shown previously that SpliceGrapher predictions of transcript isoforms are more consistent with curated gene models (such as RefSeq) than predictions made by other known tools, TAU8 and Cufflinks.6,9 TAU predicts isoforms by assembling all feasible combinations of exons identified in the preceding alignment step. These predictions are exhaustive: no possible mRNA isoform is ignored, so at first sight this strategy may provide valuable results for further PROSITE analysis, enabling identification of all functional amino acid motifs that could cover the splice junctions in the corresponding mRNA.21 However, most of the IR events predicted by TAU (which comprise ∼80% of all the splice events predicted by TAU) seem to have a high probability of being false positives,9 so the suitability of TAU IR predictions for PROSITE analysis is limited. In contrast with TAU, Cufflinks aims to identify the smallest set of mRNA transcripts that “explains” the observed data. For the Arabidopsis thaliana test sample, the fraction of IR events reaches only 10% of all alternative splicing events predicted by Cufflinks.9

The identification of splice junctions is the crucial step of alternative splicing analysis.33 We used TopHat217 to map sequences transcribed in various tissues to the human genome and transcriptome. Powered by Bowtie2, TopHat2 is a rapid all-in-one tool for alignment and splice junction prediction. However, the results of TopHat2 should be further filtered by SpliceGrapher, which implements a system of splice site classifiers providing very high prediction accuracy. In some cases, up to 85% of the splice junctions recognized by TopHat2, supersplat,34 or other tools may be rejected by SpliceGrapher, as was shown for the A. thaliana test sample.9

We focused on the deep discovery of alternative transcripts of chromosome 18 and the identification of tissue-specific splicing patterns and expression profiles. This work was performed within the framework of the C-HPP. kb18.ru is the main web resource of the Russian part of the C-HPP. Like ASPicDB4 and HEXEvent,5 kb18.ru provides gene-centric data indicating the splice events identified for a given gene. The evaluated frequencies of splice events along with gene expression profiles in various tissues are also provided. As with HEXEvent, all data can be downloaded for further analysis.


Figure 2. Number of the genes of chromosome 18 with novel alternatively spliced transcripts (blue bars, left vertical axis). The representativeness of the source sequence data for various tissues (orange bars, right vertical axis).
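The counts of novel exons rest on a simple rule stated for Table 2: an exon is counted as novel when at least one of its boundaries is not described in the reference annotation. A minimal Python sketch of that rule follows; the coordinates are invented and the function name is hypothetical, and this is not the SpliceGrapher implementation.

```python
def novel_exons(predicted, annotated):
    """Return predicted exons with at least one boundary absent from the annotation.

    predicted, annotated: iterables of (start, end) coordinates on one strand
    of one chromosome. An exon whose start and end are both annotated
    boundaries is not counted, even if that exact pair is not annotated.
    """
    known_starts = {start for start, _ in annotated}
    known_ends = {end for _, end in annotated}
    return [
        (start, end)
        for start, end in predicted
        if start not in known_starts or end not in known_ends
    ]


if __name__ == "__main__":
    annotation = [(100, 200), (300, 400)]                        # hypothetical reference exons
    prediction = [(100, 200), (100, 250), (300, 400), (500, 600)]
    print(novel_exons(prediction, annotation))
```

Running the same comparison against two annotation sets would give the paired RefSeq and Ensembl counts reported per tissue.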

Figure 3. Distribution of the alternatively spliced transcript numbers for various tissues.
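The number of possible protein species in Table 2 is described as coming from permitted combinations of cassette exons. The sketch below shows one way such a count could be obtained: enumerate every inclusion/skipping combination and keep only those whose exon-exon junctions were observed. The exon names, the junction set, and the function names are hypothetical. The last helper shows the multiplicative count for clusters of mutually exclusive exons; for instance, 12 × 48 × 33 = 19 008, the decomposition usually given for the Dscam isoform number quoted in this article.

```python
from itertools import product
from math import prod


def enumerate_isoforms(exons, cassette):
    """All exon chains obtained by including or skipping each cassette exon."""
    choices = [(name, None) if name in cassette else (name,) for name in exons]
    return [
        tuple(name for name in combo if name is not None)
        for combo in product(*choices)
    ]


def permitted(isoform, junctions):
    """True if every adjacent exon pair of the isoform is a supported junction."""
    return all(pair in junctions for pair in zip(isoform, isoform[1:]))


def count_mutually_exclusive(cluster_sizes):
    """Isoform count when exactly one exon is chosen from each cluster."""
    return prod(cluster_sizes)


if __name__ == "__main__":
    exons = ["E1", "E2", "E3", "E4"]
    observed = {("E1", "E2"), ("E2", "E3"), ("E3", "E4"), ("E1", "E3")}
    candidates = enumerate_isoforms(exons, cassette={"E2", "E3"})
    print([iso for iso in candidates if permitted(iso, observed)])
    print(count_mutually_exclusive([12, 48, 33]))
```

Filtering by observed junctions is what keeps the theoretical count from growing as 2^n with the number of cassette exons.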

However, there are several other alternative splicing databases available online. We consider AS-ALPS one of the outstanding resources.2 Being regularly updated, AS-ALPS integrates the results of Ensembl alternative splicing studies with its own RNA-seq data analysis. AS-ALPS is able to indicate the changes in protein structure that result from pre-mRNA splicing events. Moreover, AS-EAST (alternative splicing effects assessment tool), a module of AS-ALPS, allows users to analyze and annotate custom transcript sequences and to predict possible splicing events as well as their effect on the protein product.35 Like our results, AS-EAST also provides information about the inclusion of functional domains within the affected regions. It uses InterProScan36 to reveal specific protein signatures in alternatively spliced regions or custom peptide sequences. However, the coverage of AS-ALPS is insufficient; most of the splice isoforms originate from Ensembl. Our data set includes up to 10 times more aligned RNA-seq data.

The number of protein species predicted in the present study varies considerably among the chromosome 18 genes. (See Supporting Table 2 in the Supporting Information.) This is the theoretically predicted number of isoforms whose existence does not contradict the NGS data used. Cassette exons make a great contribution to transcriptome diversity. Interestingly, the Drosophila melanogaster Dscam (Down syndrome cell-adhesion molecule) gene is predicted to be transcribed into 19 008 distinct mRNA isoforms, and recent experimental studies detected 18 496 of them.37 These variants were expressed across a broad dynamic range, with high specificity for cell/tissue types and developmental stages.37 Enrichment of the source data sets with paired-end reads, long reads (more than 400 bp), or ESTs would highlight the most frequent combinations of observed splice events (e.g., inclusions of exons). Unfortunately, despite the ability of SpliceGrapher to integrate EST and high-throughput data in one stream, the coverage of dbEST remains too low to reveal the whole spectrum of splice events, so this remains a goal for targeted gene splicing studies. Moreover, dbEST contains many clone libraries derived from tumor tissues, while a significant part of the libraries lacks correct annotation.15 Incorporating the EST data could therefore lead to the inclusion of tumor transcripts in the source data set.

Estimation of Relative Diversity of Transcriptome Across the Tissues

In many ways, the distribution of splice events among the tissues reflects the differences in the volume of analyzed data (Figures 1 and 2, Table 2): the size of the data sets freely available in the NCBI SRA (at the time of the present analysis) differed dramatically between tissues. The brain and muscle data sets exceed the volume of source data used for liver, kidney, or testis transcriptome mining twofold. We therefore tried to evaluate the relative diversity of alternatively spliced products generated in the tissues. To perform this assessment, we predicted transcript variants for data subsets of various sizes, derived from the original data sets by random selection of reads. For these predictions, we used Cufflinks instead of SpliceGrapher because of its greater speed and ease of use. Despite the lower accuracy of Cufflinks, it is quite suitable here because the analysis is comparative. The results are shown in Figure 4B.


Figure 4. Number of novel exons predicted by Cufflinks depending on the size of the source data. The subsets were derived from the initial full-size tissue data sets by random selection of reads.
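Random selection of reads to build subsets of a fixed size can be sketched with reservoir sampling, which draws a uniform sample in a single pass without loading a whole FASTQ file into memory. The text does not say which procedure was actually used, so the Python code below is an assumption-laden illustration with hypothetical function names; for paired-end data, mates would have to be sampled together, which this sketch does not do.

```python
import random
from itertools import islice


def fastq_records(lines):
    """Group an iterable of FASTQ lines into 4-line records (tuples)."""
    iterator = iter(lines)
    while True:
        record = tuple(islice(iterator, 4))
        if len(record) < 4:
            return
        yield record


def reservoir_sample(records, k, seed=0):
    """Uniformly sample k records from a stream (Algorithm R).

    If the stream holds fewer than k records, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            sample.append(record)
        else:
            # Replace a reservoir slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample
```

Drawing nested subsets (for example 1, 10, and 100 Mb) from the same data set with a fixed seed makes the exon counts at different depths directly comparable.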

As expected, Cufflinks revealed ∼7000 novel exons against the 2248 suggested by SpliceGrapher. Similar ratios were obtained for the other tissues. Analogous results were previously described by Rogers et al.9 However, the final decision and the exact false-positive rate can only come from targeted PCR amplification and detection of the specified mRNA fragments. As can be seen from Figure 4B, the ratio of predicted exon counts across the tissues varies with sequencing depth. This may come from differences in the distribution of mRNA expression across tissues, for example, activation of rare splice variants in brain.28−30 For example, the brain shows the highest diversity of splice events with the 100 Mb (per chromosome 18; 100 Mb of chromosome 18 reads can be extracted from ∼10 Gb of full-genome sequences) and 10 Mb data sets but not with 1 Mb, which is completely insufficient to uncover the transcriptome variety. A similar situation is observed for blood and skin. Figure 4A shows the relative diversity of alternative transcripts uncovered at 10 Mb and 100 Mb depths per chromosome 18. As we can see, along with brain, liver tissues show a great variety of transcripts. The increase in splicing diversity for kidney suggests an enrichment of the transcriptome with alternatively spliced mRNAs with low expression.

Identification of Functional Amino Acid Signatures

Despite the great attention paid to alternative splicing analysis, the functions and effects of AS events in many genes are still poorly understood.38 While the splicing alterations may be subtle, their effect can be dramatic: a change in alternative splicing may lead to the development of various diseases.39−41 However, for most AS observations, the effects and functions are unknown or not well studied.38 Here tools for protein motif analysis can be useful: PROSITE,21 Pfam,42 ProDom,43 TIGRFAMs,44 and so on. In the present work, we attempted to reveal the maximum number of novel alternative exons or 5′-/3′-splice junctions and to uncover possible new protein functions resulting from AS. PROSITE analysis identified new functional motifs in the proteins encoded by eight genes: CEP192, CTDP1, EPB41L3, IMPACT, PSTPIP2, RIT2, TWSG1, and ZADH2. The data are provided in Supporting Table S3 in the Supporting Information. In particular, we found the insertion of a region with NADPH-dependent aldo/keto reductase signatures in the alternatively spliced C-terminal domain phosphatase 1 (CTDP1) mRNA. This protein dephosphorylates the C-terminus of POLR2A, a subunit of RNA polymerase II, providing positive regulation of gene expression. Mutations in this gene are associated with congenital cataracts, facial dysmorphism, and neuropathy syndrome.45 The insertion of new elements into CTDP1 does not seem to grant it aldo/keto reductase enzymatic activity, but it may provide a link between intracellular conditions and gene expression. We also found the insertion of a so-called EF-hand calcium-binding domain into IMPACT, an expression regulator that ensures constantly high levels of mRNA translation under amino acid starvation (information provided by GeneCards v.3.10, http://genecards.org); this splice event may provide another link between intracellular conditions (in particular, Ca2+ concentrations) and the regulation of gene expression.

In the present work, we used only the RefSeq translation starts to predict possible protein species. However, this could reduce the uncovered diversity of the proteome: we found 107 possible alternative transcription starts not previously described in RefSeq (481 for chromosome 18), but only 3 common transcription starts were found for two or more tissues.


For the proteotypic peptides listed in Supporting Table S5 in the Supporting Information, STEPP scores20 illustrating the expected suitability of a peptide as proteotypic are shown.
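Candidate proteotypic peptides that cover a novel junction can be generated by in silico tryptic digestion with 0, 1, or 2 missed cleavages, the values mentioned in the methods. The Python sketch below is an assumption-based illustration: it uses the common trypsin rule (cleavage after K or R unless followed by P), hypothetical length limits and function names, and an invented protein sequence. It does not compute STEPP scores.

```python
import re


def tryptic_sites(protein):
    """Positions after which trypsin cleaves: C-terminal to K/R, not before P."""
    return [m.end() for m in re.finditer(r"[KR](?!P)", protein)]


def digest(protein, max_missed=2, min_len=6, max_len=30):
    """Return (start, end, missed_cleavages, peptide) tuples.

    min_len and max_len are illustrative limits, not values from the article.
    """
    inner = [c for c in tryptic_sites(protein) if c < len(protein)]
    cuts = [0] + inner + [len(protein)]
    peptides = []
    for i in range(len(cuts) - 1):
        for missed in range(max_missed + 1):
            j = i + 1 + missed
            if j >= len(cuts):
                break
            peptide = protein[cuts[i]:cuts[j]]
            if min_len <= len(peptide) <= max_len:
                peptides.append((cuts[i], cuts[j], missed, peptide))
    return peptides


def spanning(peptides, junction):
    """Peptides covering the boundary before residue index `junction`."""
    return [p for p in peptides if p[0] < junction < p[1]]


if __name__ == "__main__":
    protein = "MAGTKLLSPERPQWDKAAGR"   # hypothetical isoform fragment
    for start, end, missed, peptide in spanning(digest(protein), junction=8):
        print(start, end, missed, peptide)
```

Only peptides that span the junction or lie inside a retained intron region are specific to the novel isoform; a further check against the reference proteome would be needed to confirm uniqueness.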

The small number of transcription starts shared by two or more tissues (only 3) raises the question of what mainly causes this disparity: a tissue-specific choice of alternative transcription starts or possible analysis artifacts. Experimental studies are needed to clarify this question. Several methods may be used to predict translation starts: StackTIS,46 MetWAMer,47 and MetaProdigal.48

The occurrence of alternative transcripts can be associated with the possibility of translation in alternative reading frames (ARFs), which may lead to the synthesis of completely distinct proteins. This mechanism is widely used by viruses to enhance the coding potential of their small genomes, for example, hepatitis C49 or influenza.50 This phenomenon had previously been described for humans as well,51 while recent studies showed the abundance of such proteins in human tissues.52 This aspect deserves attention because of the high potential of ARF proteins as disease biomarkers. HAltORF, a database devoted to alternative human ORFs, has been developed.53

Another question that may arise here concerns the emergence of new stop codons and the possibility of insertion of selenocysteine (Sec). This amino acid is an analogue of cysteine with a selenium-containing group instead of the sulfur-containing thiol group. It is present in several crucial enzymes: thioredoxin reductases, glutathione peroxidases, and so on.54 The incorporation of Sec into a protein occurs when UGA is recognized as a sense codon. This event requires the presence of a Sec insertion sequence (SECIS) in the 3′-UTR of the transcript.54 Several tools allow the identification of SECIS elements in eukaryotic genomes: Covels, Erpin, and SECISearch3.55,56
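The notion of alternative reading frames can be illustrated by scanning the three forward frames of a transcript for open reading frames. The Python sketch below is a deliberately naive illustration, not a substitute for the translation start predictors cited above: it takes the first ATG in each frame, treats every UGA as a stop (so selenocysteine recoding is not modeled), and uses a hypothetical function name and minimum length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(mrna, min_codons=3):
    """Return (frame, start, end) for ATG...stop ORFs in the three forward frames.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    min_codons counts sense codons (including ATG) and is an arbitrary cutoff.
    """
    seq = mrna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("AUGAAACCCUAG"))
```

An insertion or deletion that is not a multiple of three, as can result from an alternative donor or acceptor site, shifts the downstream sequence into another of these frames.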

Prospects for Research in This Area

Alternative splicing is significantly altered under disease conditions, resulting in the occurrence of functional splice variants that are not normally expressed.60 In particular, splicing aberrations are a hallmark of cancer,61 so analyses aimed at identifying tumor-specific splice aberrations are highly important.62 In turn, this will help to reveal still unobserved molecular features of tumor cell transformation and to identify tumor-specific mRNA isoforms encoding protein products that may play a proto-oncogenic role. One of the best known examples is cyclin D1 (CCND1) with two alternative transcripts, the common D1a and D1b, which is frequently overexpressed in tumors.63 CCND1b has a higher oncogenic potential: unlike cyclin D1a, CCND1b is capable of maintaining nuclear localization throughout the cell cycle.64 Moreover, tumor suppressor genes such as p53, BRCA1, and PTEN have splicing variants associated with cancer.65

Alternative splicing is regulated by several distinct factors: RNA secondary structure,66 availability of splicing factors in a particular cell type,67 histone modifications,68 and intragenic DNA methylation.69 Recently, it has been reported that DNA methylation inhibits CTCF binding to exons, which prevents CTCF-mediated Pol II pausing and spliceosome assembly and thus facilitates exon exclusion.70 DNA methylation may also facilitate exon inclusion via recruitment of the multifunctional CpG-binding protein MeCP2, which maintains local histone hypoacetylation.71 The second mechanism is predominant: DNA methylation is enriched in included alternatively spliced exons (ASEs), and abnormalities of DNA methylation result in their aberrant splicing.71 Thus all three major sources of deregulation of splicing mechanisms (aberrations of DNA methylation,69−71 histone modifications,68 and the occurrence of mutations72) emerge in the earliest steps of carcinogenesis.73−75 Therefore, tumor-specific protein variants with altered structure can be good cancer biomarkers that are effective in the early stages of the disease.76,77 Nevertheless, there is currently no database devoted entirely to the cancer-specific splicing aberrations observed in various tumors. Creating one should be a task for further research.

Estimation of Gene Expression

Transcription patterns of 270 genes, 128 pseudogenes, 29 microRNA genes, and 144 gene (or pseudogene) candidates of chromosome 18 were evaluated across 8 tissues. Not surprisingly, the most striking expression differences were observed for genes with tissue-specific protein products. For example, NETO1 (neuropilin and tolloid-like 1)57 showed a 20-fold higher expression level in brain compared with the other tissues; liver-specific expression was observed for ONECUT2, which encodes a regulatory protein that binds to specific DNA sequences and stimulates expression of target genes, including genes involved in melanocyte and hepatocyte differentiation.58 One can also mention cerebellin 2 (CBLN2), which was previously found to be predominantly expressed in synaptically connected neurons and spinal cord; CBLN2 is necessary for the formation and maintenance of parallel fiber−Purkinje cell synapses.59 According to the results of the present work, CBLN2 expression in brain is more than 200 times higher than in the other studied tissues. In general, brain tissues rank second after blood in terms of the specificity of gene expression, calculated as the average of deviations for each gene. Gene expression data are included in Supporting Table S4 in the Supporting Information.
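The expression values compared here are FPKM values averaged over the subsets of each tissue data set. A minimal Python sketch of the arithmetic follows. The FPKM formula is the standard definition; the choice of the number of aligned fragments as the averaging weight and the simple fold-enrichment measure are assumptions made for illustration, and all function names and numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def weighted_mean(values, weights):
    """Average of subset FPKM values, weighted here by subset size (assumption)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def fold_enrichment(tissue_value, other_values):
    """Expression in one tissue relative to the mean of the other tissues."""
    return tissue_value / (sum(other_values) / len(other_values))


if __name__ == "__main__":
    # Two hypothetical subsets of one tissue data set for a 2 kb transcript.
    subset_fpkm = [fpkm(500, 2000, 10_000_000), fpkm(2100, 2000, 30_000_000)]
    tissue_fpkm = weighted_mean(subset_fpkm, [10_000_000, 30_000_000])
    print(tissue_fpkm, fold_enrichment(tissue_fpkm, [1.3, 2.0, 1.6]))
```

Weighting by subset size lets deeply sequenced samples dominate the tissue estimate; outlier removal, mentioned in the methods, would be applied before this averaging.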



CONCLUSIONS

Alternative splicing, one of the most complex processes in eukaryotic cells, adds a further layer of proteome diversity and acts as the next level of gene expression regulation. In the present work, we analyzed both alternative splicing patterns and differential gene expression profiles across different tissues: liver, brain, lung, kidney, blood, testis, derma, and skeletal muscles. As expected, brain tissues revealed the maximum transcriptome and proteome diversity. The analysis of specific PROSITE patterns revealed new probable functional domains within alternative isoforms of POLR2A, CTDP1, IMPACT, PSTPIP2, and other proteins encoded by the genes of chromosome 18. The set of proteotypic peptides enables mass spectrometry identification of the predicted protein species.

Identification of Splice-Specific Proteotypic Peptides



A set of proteotypic peptides was created. These peptides permit the identification of predicted protein species translated from alternatively spliced transcripts not described in RefSeq. The peptides include unique motifs: sequences corresponding to either intron retention regions or novel exon junction regions, including cases of alternative donor/acceptor splice sites. The list of peptides is provided in Supporting Table S5 in the Supporting Information.

ASSOCIATED CONTENT

Supporting Information

List of data sets used from the NCBI Sequence Read Archive (SRA). Numbers of novel splice events (alternate 5′/3′-splice sites, transcription starts, etc.) not described in the Ensembl or



RefSeq for the genes of chromosome 18. Genes encoding protein products with novel functional signatures inserted as a result of alternative splicing. Expression profiles of chromosome 18 genes across eight tissues. The list of proteotypic peptides enabling the identification of predicted splice events that are not described in RefSeq. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Tel: +7-916-983-6952.

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work was done in accordance with the “Human Proteome” program of the Russian Academy of Medical Sciences and was funded by the Ministry of Education and Science of the Russian Federation (agreement no. 8274).

ABBREVIATIONS

AS, alternative splicing; IR, intron retention; ES, exon skipping