Identification of HPV Integration and Gene ... - ACS Publications

Feb 20, 2015 - Identification of HPV Integration and Gene Mutation in HeLa Cell. Line by Integrated Analysis of RNA-Seq and MS/MS Data. Han Sun,. †,...
1 downloads 12 Views 738KB Size
Subscriber access provided by STEVENS INST OF TECH

Article

Identification of HPV integration and gene mutation in HeLa cell line by integrated analysis of RNA-Seq and MS/MS data Han Sun, Chen Chen, Baofeng Lian, Menghuan Zhang, Xiaojing Wang, Bing Zhang, Yi-Xue Li, Pengyuan Yang, and Lu Xie J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 20 Feb 2015 Downloaded from http://pubs.acs.org on February 20, 2015

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Identification of HPV integration and gene mutation in HeLa cell line by integrated analysis of RNA-Seq and MS/MS data

Han Sun1,2§, Chen Chen3§, Baofeng Lian1, Menghuan Zhang1, Xiaojing Wang4, Bing Zhang4, Yixue Li1,2, Pengyuan Yang3* and Lu Xie1*

1

Shanghai Center for Bioinformation Technology, Shanghai Academy of Science and Technology, Shanghai, 201203, China

2

Key Laboratory of Systems Biology, Shanghai Institutes for Biological Science, Chinese Academy of Sciences, Shanghai, 200031, China

3

Department of Chemistry, Institutes of Biomedical Sciences, Fudan University, Shanghai, 200433, China

4

Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, 37232, USA

* Correspondence should be addressed to Dr. Lu Xie (Email: [email protected], Phone: +86 13301670946) or to Dr. Pengyuan Yang (Email: [email protected], Phone: +86 13917399418) § These authors contributed equally to this work.

1

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Abstract HeLa cell line, which was derived from cervical carcinoma, provides an idea platform to study both the integration of human papillomavirus and the massive mutations occurring on cancer cell genome. Proteogenomics is a field with the intersection of proteomics and genomics to perform gene annotation and identify gene mutation. In this work, we first identified the SNV/INDEL, structural variation (SV) and virus infection/integration events from RNA-Seq data of HeLa cell line, then by applying proteogenomics strategy we were able to detect some of the genomic events with the tandem mass spectrometry (MS/MS) data from the same sample. Furthermore, some of the mutated peptides were experimentally validated using multiple reaction monitoring (MRM) technology. The integrated analysis of the RNA-Seq and MS/MS data not only renders the discovery of HeLa cell genome variations more credible, but also illustrates a practical workflow for protein-coding mutation discovery in cancer related studies.

Keywords HeLa, HPV, MS/MS, MRM, Proteogenomics, RNA-Seq

Introduction A large majority of cervical carcinomas were found to have human papillomavirus (HPV) infection or integration, with HPV-16 and HPV-18 being the most prevalent types (1). Although it has been manifested that the expression of virus protein E6 and 2

ACS Paragon Plus Environment

Page 2 of 39

Page 3 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

E7, and in the meanwhile the absence of E2, are crucial for the loss of function of cellular protein p53 and pRb (2), the observations, such as not all the individuals infected or integrated with HPV develop carcinoma and even those who do typically take years to decades for the disease to demonstrate, raise the debate whether the deregulations of p53 and pRb are enough molecular factors for the cancer progression, or, other types of mutation of the cell genome are also indispensable (3). HeLa is the oldest and most commonly used human cell line (4), and as was derived from cervical carcinoma cells, it has many properties characteristic of carcinoma in general and of cervical carcinoma specifically. For example, HeLa is immortal and it can divide an unlimited number of times; HeLa possesses the infection or integration of human papillomavirus 18 (HPV-18), whose genome is a circle of double-stranded DNA; and HeLa is aneuploid with about 80 chromosomes rather than the normal diploid number of 46. All these characters make HeLa the promising candidate to identify HPV integration with cell genome and discover massive gene mutations that may contain hidden mechanisms for cervical carcinoma development. Previously, to find HPV infection or integration (5, 6) in HeLa cell genome, most works mainly focused on nucleotide level using PCR technology. Recently some works utilizing the next generation sequencing (NGS) made efforts to identify genome wide variation (7), or, using tandem mass spectrometry, to globally quantify the cellular proteins (8) in HeLa cell line. Focusing on only one level of technology analysis shows some limits. Genome level analysis makes it difficult to ascertain whether the changes discovered on nucleotides really function ultimately on protein 3

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

level. On the other hand, focusing on protein level only would probably miss a lot of novel genomic variations, since protein identification from tandem mass spectrometry (MS/MS) data mostly rely on database searching which usually contain only regular known protein sequences. Nagaraj et al. (9) tried integrative analysis of RNA-Seq and MS/MS data of the same HeLa cell line, but they were mainly interested in the expression association of protein coding genes in the transcriptome and proteome level. They paid no attention to the field so called proteogenomics where discovery of novel genomic events can be intended, or mutual validation of genetic variations and protein changes can be realized. Proteogenomics was defined as a field with the intersection of proteomics and genomics (10). In the beginning, it mainly intended to utilize the proteomics data to help improve genome annotation (11). In recent years, with the application of multiple-level high throughput technologies on disease samples, especially with both next-generation sequencing (NGS) and tandem mass spectrometry (MS/MS), proteogenomic strategy has also been taken to study cancers (12). As we all have known, cancers arise as the result of the accumulation of mutations occurring on cell genome that have impact to change protein functions ultimately. The genome mutations contain single nucleotide variation (SNV), alternative splicing (AS), small insertion and deletion (INDEL), structural variation (SV, including deletion, duplication, inversion, translocation etc.), and sometimes also contain virus infection or integration events. Proteogenomics in cancer research may validate or discover some of these types of mutations from both perspectives of proteomics and genomics. 4

ACS Paragon Plus Environment

Page 4 of 39

Page 5 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

For example, Li et al. identified the variant peptides based on SNV data from public database such as dbSNP and COSMIC (13), and following their work, Wang et al. studied the variant peptides using a customized database from RNA-Seq data (14). Sheynkman et al. (15) and Banfai et al. (16) also have tried to discover and validate the alternative splicing and lncRNAs from both MS/MS and RNA-Seq data. In this work, based on HeLa cell line studies [8], we would like to identify the SNV/INDEL, structural variation (SV) and virus infection/integration simultaneously by utilizing proteogenomics strategy, integrating both levels of RNA-Seq and MS/MS data. Eventually we hope to provide a comprehensive view for the mutations occurred on the HeLa genome. As illustrated in Fig. 1, we first tried to identify the virus infection or integration, structural variation and SNV/INDEL events from RNA-Seq data, then re-discovered them through proteogenomics strategy in the MS/MS data (17-19). Finally, we validated part of the peptides using multiple reaction monitoring (MRM) technology. The integrated analysis of the RNA-Seq and MS/MS data not only renders the discovery of HeLa cell genome variations more credible, but also illustrates a practical workflow for protein-coding mutation discovery in cancer related studies.

Methods Experiment Data 55,866,621 single end RNA-Seq reads were generated by Illumina Genome Analyzer IIx with each read length of 76bp. 30,059,731 of them (53.81%) could be mapped to 5

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

hg19 by BWA with those reads crossing splicing points unmapped. The average depth of each chromosome calculated by only the mapped reads was about 18 (Supplementary Fig. 6). 71 trypsin digested RAW MS/MS files (2,057,044 spectra) were generated by LTQ-Orbitrap Velos instrument. All the RNA-Seq and MS/MS data were got from Nagaraj et al.’s work (9).

Identify SNV/INDEL, Structural Variation and Virus Infection/Integration from RNA-Seq As illustrated in Fig. 1, the single end raw reads of HeLa cell line were first filtered with quality control steps such as removing reads containing N and removing reads whose quality score was lower than 20. Then the remaining high quality reads were mapped to human genome reference (hg19) using short read sequence alignment software (semi-global alignment algorithm), BWA (20) with default parameters in practice (21, 22). The results could be divided into two parts, Mapped and un-Mapped. The single nucleotide variant (SNV) or the small insertion and deletion (INDEL) were called from these Mapped reads using SAMtools (23). The un-Mapped reads were further aligned to hg19 using local alignment algorithm such as BLAST. If the two parts of one read, head and tail, could be both blasted to hg19, and their relative positions were appropriate, the read was regarded to be the Blasted, otherwise, it was classified to be the un-Blasted. From the Blasted reads, we could identify two types of events: Splicing or Structural Variation. If the reads were part of known mRNA sequences or their corresponding peptides were part of known protein sequences, the 6

ACS Paragon Plus Environment

Page 6 of 39

Page 7 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

reads characterized splicing events. Otherwise, the reads might characterize SV events, such as Deletion, Duplication, Inversion or Translocation. The un-Blasted reads were further blasted to hg19 and virus genomes simultaneously. If the reads were only from viral genomes, they characterized virus infection events. Otherwise, if the head and tail parts of the reads came from hg19 and viral genomes respectively, they could indicate virus integration events. Calling SNV/INDEL from RNA-Seq The SAM format output file of BWA was first converted to bam format file using view command of SAMtools, and then after several processes such as removing redundancy, sorting by chromosomal coordinates and indexing the sorted file for fast random access, the mpileup format file was generated by the mpileup command of SAMtools. Next, the SNV or INDEL were called by view command of bcftools contained in the SAMtools package. Then, the varFilter command of bcftools and manual checking based on the empirically derived parameters by VarScan2 (especially for INDEL) were also used to remove the false positive discovery (24). Finally, the called SNV and INDEL were further annotated by ANNOVAR (25), so that we would get the non-synonymous SNV, synonymous SNV, frame shift INDEL, non-frame shift INDEL and so on. Identify Structural Variation Events from RNA-Seq The un-Mapped reads were further aligned to hg19 using BLAST, and several filtering steps were taken to make sure the reliability of the results. First, the read should be aligned to two and only two positions among the reference (the low quality 7

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

and the too short alignments which might be caused by repetitive sequence of the reference were not counted). Second, the two alignments should indicate the head and tail parts of this read respectively (the length of most of the reads is 76bp, so the head means that this part roughly ranges from 1 to 35bp, and the tail ranges roughly from 36 to 76bp) and both have high alignment quality (the Identity should be nearly 100%). Third, the two parts (head or tail) should be aligned to different positions or strands of the human genome. Fourth, we regarded one event happened only that the event was supported by at least three unique reads (26, 27). The event should occur either because of the Splicing or the Structural Variation. If the read was part of known mRNA sequences or its corresponding peptide was part of known protein sequences, this was a Splicing event. Otherwise, it was regarded as a Structural Variation event. Focusing of the Structural Variation, we could classify the events into four different types (Deletion, Duplication, Inversion and Translocation) by the relative positions and strands of the chromosomes. It was worth mentioning that although we have tried to filter the Splicing from the Structural Variation events, some alternative splicing events might still be wrongly classified into Deletion, especially those not covered by known mRNA or protein sequences. Discover Virus Infection/Integration Events from RNA-Seq The genome sequences of 3535 viruses were downloaded from the viral genome resource of NCBI (28). The un-Blasted reads were aligned to both hg19 and the 3535 virus genomes simultaneously using blast with default parameters except for the output option which require tabular format. In one hand to identify those virus 8

ACS Paragon Plus Environment

Page 8 of 39

Page 9 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

infection events, the short reads should come from viruses genomes only. And in practice, we required the two indicators of blast (Identity and Alignment Length) to be 100%, 76bp and 90%, 60bp respectively (the length of most of the short reads is 76bp). On the other hand to identify the virus integration events, the process was similar to the identification of structural variation, except that the head and tail of the reads should be aligned to human and viruses genomes. Identify Variant Peptides from MS/MS data based on RNA-Seq Results In order to identify the variant peptides including the SNV/INDEL, Structural Variation and Virus Infection/Integration, a customized database which contains those peptides was needed. We must get the nucleotide sequences of these mutated peptides first and then translate them to the peptides in silico. For the SNV or INDEL, we have known the mutated position in the hg19 and also known the mutated allele from reference to alternative, so we could get the reference nucleotide sequences and their corresponding alternative nucleotide sequences from hg19. There were two conditions need to be given the consideration. If the mutated point was around the exon-intron boundary (40bp in practice), the sequences of the two neighboring exons were concatenated and then 80bp sequence crossing the mutation point was extracted (40bp upstream and 40bp downstream) from this concatenated exon sequence. Otherwise, the 80bp sequence crossing the mutation point (40bp upstream and 40bp downstream) was extracted directly from the reference. In both conditions, substituting the reference alleles with the alternative alleles could generate the corresponding alternative

nucleotide

sequences.

For

the

Structural

9

ACS Paragon Plus Environment

Variation

and

Virus

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

infection/integration, the raw RNA-Seq short reads, which cross on the break points, could be utilized directly. In the translating of the nucleotide sequences to the peptides, the six-frame-translation strategy was taken. It is worth noting that in order to reduce the number of the candidate peptides as far as possible, not all the six peptides for one nucleotide were remained for the following analysis. First, the peptides must not contain any stop codon. Second, if one of the six peptides of the reference nucleotide sequence was the substring of a known protein, then only this peptide and its corresponding variant peptide were remained (We believed that the peptide from this frame translation should be the most likely happened). If all the six peptides of the reference nucleotide sequence were not the substring of any known protein, then these stop free peptides and their corresponding variant peptides were remained. The database used for searching mass spectrua was comprised of three components, the mutated peptides which contain the information of SNV/INDEl, Structural Variation and Virus Infection/Integration. The known and predicted protein sequences from Ensembl database were also included as the competitor of the mutated peptides when searching by the search engines. Also the reversed sequences of all the mutated peptides, the known and predicted protein sequences were included for the FDR controlling. To improve the confidence of the peptide-spectrum matching, in addition to the adoption of the 0.01 value for the local FDR controlling, two popular search engines were also used to take the intersection. Validate MS/MS peptides using MRM Hela cells were cultured in DMEM medium with 10% bovine serum, harvested by 10

ACS Paragon Plus Environment

Page 10 of 39

Page 11 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

trypsinization, rinsed three times with phosphate buffer saline (PBS). Cells were lysed by sonication with a lysis buffer consisting of 0.1M Tris-HCl, PH 8.0, 0.1M DTT, and 2% SDS. After centrifugation at 25 000 g for 30 min at 4 °C, the supernatant was collected and stored at -80 °C. The total protein concentration was measured using PlusOne 2-D Quant Kit (GE, Amersham Biosciences). Peptides were synthesized using Solid phase peptide synthesis technology, lyophilized in a 96-well plate (~1 umol of peptide material per well; Shanghai ZiYu Biotech Co.) and used in an unpurified form. Crude peptides mixture were resuspended in 5% acetonitrile, 0.1% formic acid and analyzed by nano LC-MS/MS on the LTQ-Orbitrap XL mass spectrometer. For each peptide, the most intense six to fifteen peaks in MS/MS spectrum were picked manually as preliminary MRM transitions. MRM experiments were performed on a 6500 hybrid triple quadrupole/linear ion trap mass spectrometer (AB Sciex, CA) interfaced with a Eksigent nano 1D plus system (Waters, Milford, MA) (29).

Results HPV-18 integration into HeLa cell genome and into the upstream of MYC The RNA-Seq reads which could neither be mapped to human genome using BWA nor be aligned to human genome with BLAST (requiring both the head and tail of the reads be aligned, Fig. 1) were further aligned to both human genome and 3535 virus genomes simultaneously. If a read was only aligned to virus genomes with high quality, it could characterize a virus infection event. Otherwise, if the head and tail of 11

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

a read were aligned to the human genome and virus genomes respectively, this read could characterize a virus integration event. For the virus infection events, the most frequently observed virus was HPV-18. When the Identity and Length of BLAST were required to be 100% and 76 (the length of raw reads is 76), 2551 reads in total were aligned to virus genomes and among them 2522 (98.86%) reads were aligned to HPV-18. When the Identity and Length were set to be 90% and 60, 28375 reads in total were aligned to virus genomes and 27261 (96.07%) reads were aligned to HPV-18. The distribution of the reads observed among the genes of the HPV-18 was displayed in Fig. 2. It was consistent with the phenomenon reported by Corden et al. (1) that the transcription of three genes, E4, E5 and L2, could not be observed. The main reason may be that this part of DNA of the virus was lost during the infection or integration process. And also the relatively lower expression of E2 may be one part of reason to the deregulation of p53 and pRb (3). As illustrated in Fig. 3, seven different integration events were finally determined, when we set that each event must be supported by at least three unique RNA-Seq reads. From the aspect of HPV-18, 6 of the 7 break points were located in the gene region (5 break points were in E1, 1 in L1 and only 1 point was in the long control region before E6). For the aspect of human genome, all the 7 break points were located in the q24.21 region of chr8 and located between two cancer-related non-coding RNAs (PCAT1 and CCAT1) and the MYC oncogene. As their names indicate, the PCAT1 (30-32) and CCAT1 (33-35) is the prostate or colon cancer associated transcript respectively. MYC is a famous viral oncogene and as a 12

ACS Paragon Plus Environment

Page 12 of 39

Page 13 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

transcription factor it plays important role in cell cycle progression, apoptosis, and cellular transformation and so on. Recently, Adey et al. (4) also reported 4 integration sites for HPV-18 to the human genome in HeLa cell line and their results have one overlap with our 7 integration sites (chr8:128230632). It is also worth mentioning that the chimeric clone named BC106081 (36) which was identified in the cervical carcinoma had the same integration point (chr8:128241377) as in our results, this might be a further support to the reliability of our results. It was noted that the biological mechanism might require a short overlapping sequence for the HPV-18 to be integrated to the human genome (37). In the 7 integration events, 4 cases showed 2bp long overlapping sequence (3 were CT and 1 was TA). See Supplementary Table 1 for details. All these seven integrations were located in the upstream non-coding region of MYC. It is hard to anticipate that any peptides crossing the integration points would be observed in MS/MS data. However, considering the following two facts, it is still possible for these integration events to generate new fusion peptides. First, from the view of the virus side, most of the events were located in the gene regions of HPV-18 (5 in E1 and 1 in L1). The upstream sequences from the virus genes could lead the translation of the downstream sequences from the human genome and then give the rise of the fusion peptides of integrated virus-human genome sequences. Second, all these integration events were identified from the RNA-Seq data and it indicated that these integrations have already cast effects on the transcription level. Therefore, in the following analysis, we keep on paying attention to see whether the fusion peptides of 13

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

integrated virus-human genome could be identified in the MS/MS data. The structural variation (SV), single nucleotide variation (SNV) and small scale insertion and deletion (INDEL) in HeLa genome The RNA-Seq reads which could not be mapped to the human genome using BWA were further aligned to human genome using BLAST. If both the head and tail of a read could be aligned to the human genome to different positions or strands, it might characterize a structural variation event. When requiring at least three reads to support one event and supplemented with other strict filtering steps, we finally determined 2556 events (1174 genes) that were involved in structural variations. Among the 1174 genes, 787 were with Deletion, 54 with Duplication, 437 with Inversion and 29 with Translocation. Besides for the Deletion events which could contain some wrongly classified novel alternative splicing, it is noticeable that the number of Inversion was relatively large, and it might reflect one of the characters of HeLa genome. See Supplementary Fig. 1 for the Venn diagram of genes among the four different types of structural variation. As illustrated in the Supplementary Fig. 2, while most of these genes possessed only one structural variation; some genes possessed many. Among these genes, the most striking case was the chromosome X gene FLNA which had 2 Deletion, 8 Duplication and 50 Inversion (Supplementary Fig. 3 and Supplementary Table 2). The recurrence of Inversion events in FLNA was suggested to be the result of non-allelic homologous recombination (NAHR) between near-identical inverted duplications flanking the region (38). The reads which could be mapped to the hg19 using BWA were used to call SNV and 14

ACS Paragon Plus Environment

Page 14 of 39

Page 15 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

INDEL. After strict quality controlling, we identified 2205 non-synonymous SNV (6194 SNV in total) and 51 INDEL (Supplementary Table 3). Among them, 1951 (88.48%) non-synonymous SNV and 45 (88.24%) INDEL had already been reported by dbsnp135 (39). MYC was found to have two Inversion events which were both in the 3’ UTR region. As illustrated in Fig. 4, the first inversion was supported by 5 unique reads. The head (1-32bp) and tail (30-76bp) part of these reads could be aligned to the forward and reverse strand uniquely. The second inversion was supported by 3 unique reads. The head (1-32bp, 1-44bp with the identity as 95.5% and 1-32bp with the identity as 100%) and tail (38-76bp) part could also be aligned uniquely. It was interesting to see that there was a 3bp overlap and 5bp gap in the two inversion events respectively. Among the top 10 functional partners of MYC supplied by STRING network (40), two genes were affected by the structural variation events (SP1 with Deletion and EP300 with Duplication) and one gene (UBC) had a non-synonymous SNV (Supplementary Fig. 4). GO process enrichment analysis (41) of all these mutated genes suggested that they significantly occur in cellular component organization, chromatin modification, cell death, virus-host interaction, and so on (Supplementary Table 2).

Mutated peptides identified by MS/MS and validated by MRM All the virus infection/integration, structural variation and SNV/INDEL events discovered from RNA-Seq were utilized to generate the mutated peptides through 15

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

proteogenomics strategy. After some filtering steps to remove those obvious artificial peptides and then control the size of the database, we finally constructed a customized database containing three parts: the mutated peptides, known protein sequences and the reversal sequences of both the above two sequence sets. The mass spectra were searched against this database with two popular search engines, MaxQuant and X!Tandem, respectively. For each of the search engine, the FDR was set to be 0.01 and calculated at the peptide level (42). The result of each search engine could be classified into three groups. First, if the peptides were from the reversed sequences, these peptides were classified as reversed peptides. Second, if the peptides were from the known proteins, they were classified as known peptides. Third, the rest peptides must be from the mutated sequences and these peptides were classified as mutated peptides. To manually check the quality of the matching between the peptide and the spectrum, pFind software was applied to label the spectrum peaks against the peptide (43). Finally, 233 and 567 mutated peptides were finally identified by MaxQuant and X!Tandem respectively and 183 mutated peptides were identified by both the two engines (Supplementary Table 5). All these mutated peptides could not have been discovered easily by the regular MS/MS identification workflow without the integration of customized RNA-Seq data (44, 45). Further, we cultured HeLa cell line ourselves and validated the 183 mutated peptides with the more sensitive and specific mass spectrometry technology, multiple reaction monitoring (MRM). Eventually, 84 peptides (45.90%) could be validated by MRM and were regarded as the most reliable results (Supplementary Table 4). We have 16

ACS Paragon Plus Environment

Page 16 of 39

Page 17 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

expected to validate more or less peptides crossing the HPV integration points or mutated peptides caused by structural variations, however, except for the peptide, HYSDSVYGDTLEK, which was from the E6 gene of HPV-18, and the peptide, SQGVQPIPSQGGK, which was caused by an in-frame deletion of FAM120A, all the remaining peptides were from SNV events. The exclusion of 3bp in the mRNA of FAM120A, which caused 1AA deletion in the protein, may be difficult to detect solely from the analysis of sequencing data, because the site is just located on the side the splicing point. However, the evidences from all the RNA-Seq, MS/MS and MRM data have given us the confidence about the existence of this deletion (Fig. 5). The FAM120A is involved in the oxidative stress induced survival signaling pathway, and the function of this specific deletion which hasn’t been reported by dbSNP merits further functional experiment study.

Discussion In this work, we first tried to identify small-scale gene mutation (such as SNV and INDEL) and large-scale structural variation (including some novel alternative splicing) in the RNA-Seq data of HeLa cell line. In the meanwhile, we also paid attention to identify genomic events in HeLa cells caused by the infection and integration of the virus HPV-18, which was regarded to play important roles in the progression of human cervical carcinoma. We then tried to validate these genomic variation events identified from RNA-Seq data by the strategy of proteogenomics based on tandem mass spectrometry data (MS/MS) of the same HeLa cell line. We translated the 17

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

mutated RNA sequences to corresponding peptides in silico and combined these peptides with known protein sequences to generate a database for MS/MS data searching. After strict FDR controlling and several other filtering steps, such as taking the intersection of two popular search engines, we identified 183 mutated peptides which could not be identified from the mass spectrometry data without the integration of the analysis of RNA-Seq data. Finally, we tried to validate the mutated peptides identified from HeLa MS/MS data by using MRM technology. Among the 183 mutated peptides 84 were successfully validated by MRM experiment. These 84 peptides therefore could be supported by multiple levels of data of RNA-Seq, MS/MS and MRM. We followed the pipeline of proteogenomics in our integrated analysis of both RNA-Seq and MS/MS data. There have been several studies in this direction. For examples, Wang et al. (14) and Sheynkmanet al. (15) have tried to discover new SNV and alternative splicing (AS) respectively through integration of RNA-Seq and MS/MS data. However, to our knowledge, it is rare to promote this integration to the discovery and validation of structural variations and virus infection or integration. There may be two possible reasons for this. First, it is more difficult to analyze RNA-Seq to discover structural variation and virus infection or integration events, rather than the SNV. Second, the destruction of the structural variation and virus infection or integration to host genome is greater than the SNV events. Most sequences affected by these large scale variations lost the capability to translate and generate complete or stable proteins, so that it is hard to identify such events from 18

ACS Paragon Plus Environment

Page 18 of 39

Page 19 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

MS/MS data. Different from previous works which only focused on either SNV or AS, we made an effort to analyze SNV, INDEL, structural variation and virus infection/integration simultaneously. We believe the results from different scales should provide more comprehensive view for the mutations occurring on both nucleotide and protein levels. Throughout the analysis, how to set a proper cut-off to achieve the balance between the reporting of as many as findings and reporting as less as false positives has been a plague for us. For instance, how many unique reads should be required to support one mutation event in RNA-Seq analysis? This cut-off not only affects the reliability of the RNA-Seq analysis results but also affects the size of the corresponding mutated peptide database and further influences the accuracy of the MS/MS peptide identification. Some popular software such as Pindel (26) and Tophat-Fusion (27) take 2 reads or 3 reads as default parameters and Sheynkman et al. (15) regarded at least 6 reads should be required. In our workflow, we determine to require at least 3 reads to support one mutation event when reporting the results of the analysis of RNA-Seq. However in the construction of the mutated peptide database for MS/MS searching, we tried to include some other high quality events even when the number of their RNA-Seq supporting reads is less than three. We relied on final MRM validation results to testify whether our action was rational. As illustrated in Supplementary Fig. 5, we found that in the 84 final successfully validated peptides by MRM, although the median number of RNA-Seq supporting reads is 12, 15 peptides (17.86%) were supported by only one RNA-Seq read, and this number was larger than we have 19

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

imagined. This finding suggests that in integrated analysis of multiple-level data, the cut-off for single-level analysis probably could be relaxed. We noticed that in the 84 validated peptides, 11 peptides were mutated from amino acid N to D. Although the mass difference between D and N is about only 1 da (the amino acid masses for D and N are 115.0886da and 114.1038da, respectively), the data from the MS/MS and MRM technology, especially the fragment ion mass, had already enough power to distinguish these two amino acids. Considering all these peptides could also be supported by the reads from the RNA-Seq data, we believe that this variation from N to D should not be regarded as a mismatch of MS/MS spectra but should be considered a special mutation pattern in the genome of HeLa cell line. We reported relatively large number of mutations, especially for the structural variations such as Inversion. Considering the strict filtering strategy we have taken, those reads which could perfectly support the mutation events seem not to come from mere sequencing noise. Further validation may be needed. The ultimate function of these mutations, such as the integration of HPV-18 to the upstream of MYC and the types of variations of MYC and its partners, is also worth of further functional experiments. We highlight the analysis of the infection and integration of HPV-18 although we couldn’t validate any peptides crossing the integration points. Many other viruses, such as Hepatitis B Virus (HBV) in liver (46) cancer and adenovirus in human embryonic kidney 293 (HEK293) cell line, have been reported to be integrated to the cellular genome and in some cases the integration sites were also in the gene regions. It may be possible to observe the integration of those viruses both in the 20

ACS Paragon Plus Environment

Page 20 of 39

Page 21 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

RNA-Seq reads and in the MS/MS peptides with workflow similar to ours. In conclusion, we have tried to identify and validate several types of genome/proteome mutations by integrated RNA-Seq and MS/MS data analysis, and performed MRM experiments to further validate some mutated peptides. Such kind of integrated analysis could increase the reliability of identifying mutations on nucleotide level with corresponding impact on protein levels. The workflow could be flexibly applied to other datasets to help the research in cancer studies.

Author contribution Han Sun carried out the bioinformatics analysis and Chen Chen validated the mutated peptides in the MRM experiment. Han Sun and Lu Xie conceived the study and wrote the manuscript. Baofeng Lian, Menghuan Zhang,, Xiaojing Wang, Bing Zhang, Pengyuan Yang, Yixue Li helped with bioinformatics or experimental validations. All authors read and

approved the final manuscript.

Competing interests The authors declare no competing financial interests.

Declarations The research was funded by National Key Basic Research Program 2010CB912702.

21

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Acknowledgements We want to thank Henrietta Lacks and her surviving family members for their contributions to biomedical research. We also want to thank Dr. Thiesen and Dr. Glocker for their helpful comments and suggestions, as well as Dr. Geiger and Dr. Nagaraj for sharing their RNA-Seq and MS/MS data. This work was funded by National Key Basic Research Program 2010CB912702; National High Technology Project 2012AA020201; Key Infectious Disease Project (2012ZX10002012-014); and National Natural Science Foundation of China 31070752.

Reference 1.

Corden, S. A.; Sant-Cassia, L. J.; Easton, A. J.; Morris, A. G., The integration of

HPV-18 DNA in cervical carcinoma. Mol Pathol 1999, 52, (5), 275-82. 2. Peter, M.; Stransky, N.; Couturier, J.; Hupe, P.; Barillot, E.; de Cremoux, P.; Cottu, P.; Radvanyi, F.; Sastre-Garau, X., Frequent genomic structural alterations at HPV insertion sites in cervical carcinoma. Journal of Pathology 2010, 221, (3), 320-330. 3.

Goodwin, E. C.; DiMaio, D., Repression of human papillomavirus oncogenes in

HeLa cervical carcinoma cells causes the orderly reactivation of dormant tumor suppressor pathways. Proc Natl Acad Sci U S A 2000, 97, (23), 12513-8. 4.

Adey, A.; Burton, J. N.; Kitzman, J. O.; Hiatt, J. B.; Lewis, A. P.; Martin, B. K.;

Qiu, R.; Lee, C.; Shendure, J., The haplotype-resolved genome and epigenome of the aneuploid HeLa cancer cell line. Nature 2013, 500, (7461), 207-11. 5.

Schmitz, M.; Driesch, C.; Jansen, L.; Runnebaum, I. B.; Durst, M., Non-random 22

ACS Paragon Plus Environment

Page 22 of 39

Page 23 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

integration of the HPV genome in cervical cancer. PLoS One 2012, 7, (6), e39632. 6.

Papachristou, E. K.; Roumeliotis, T. I.; Chrysagi, A.; Trigoni, C.; Charvalos, E.;

Townsend, P. A.; Pavlakis, K.; Garbis, S. D., The shotgun proteomic study of the human

ThinPrep

cervical

smear

using

iTRAQ

mass-tagging

and

2D

LC-FT-Orbitrap-MS: the detection of the human papillomavirus at the protein level. J Proteome Res 2013, 12, (5), 2078-89. 7.

Landry, J. J.; Pyl, P. T.; Rausch, T.; Zichner, T.; Tekkedil, M. M.; Stutz, A. M.;

Jauch, A.; Aiyar, R. S.; Pau, G.; Delhomme, N.; Gagneur, J.; Korbel, J. O.; Huber, W.; Steinmetz, L. M., The genomic and transcriptomic landscape of a HeLa cell line. G3 (Bethesda) 2013, 3, (8), 1213-24. 8.

Thiede, B.; Koehler, C. J.; Strozynski, M.; Treumann, A.; Stein, R.; Zimny-Arndt,

U.; Schmid, M.; Jungblut, P. R., High Resolution Quantitative Proteomics of HeLa Cells Protein Species Using Stable Isotope Labeling with Amino Acids in Cell Culture(SILAC), Two-Dimensional Gel Electrophoresis(2DE) and Nano-Liquid Chromatograpohy Coupled to an LTQ-OrbitrapMass Spectrometer. Molecular & Cellular Proteomics 2013, 12, (2), 529-538. 9.

Nagaraj, N.; Wisniewski, J. R.; Geiger, T.; Cox, J.; Kircher, M.; Kelso, J.; Paabo,

S.; Mann, M., Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol 2011, 7, 548. 10. Castellana, N.; Bafna, V., Proteogenomics to discover the full coding content of genomes: a computational perspective. J Proteomics 2010, 73, (11), 2124-35. 11. Xing, X. B.; Li, Q. R.; Sun, H.; Fu, X.; Zhan, F.; Huang, X.; Li, J.; Chen, C. L.; 23

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Shyr, Y.; Zeng, R.; Li, Y. X.; Xie, L., The discovery of novel protein-coding features in mouse genome based on mass spectrometry data. Genomics 2011, 98, (5), 343-51. 12. Zhang, B.; Wang, J.; Wang, X.; Zhu, J.; Liu, Q.; Shi, Z.; Chambers, M. C.; Zimmerman, L. J.; Shaddox, K. F.; Kim, S.; Davies, S. R.; Wang, S.; Wang, P.; Kinsinger, C. R.; Rivers, R. C.; Rodriguez, H.; Townsend, R. R.; Ellis, M. J.; Carr, S. A.; Tabb, D. L.; Coffey, R. J.; Slebos, R. J.; Liebler, D. C.; the, N. C.; National Cancer Institute Clinical Proteomics Tumor Analysis, C.; National Cancer Institute Clinical Proteomics Tumor Analysis Consortium, N. C.; National Cancer Institute Clinical Proteomics Tumor Analysis Consortium, N. C., Proteogenomic characterization of human colon and rectal cancer. Nature 2014. 13. Li, J.; Su, Z.; Ma, Z. Q.; Slebos, R. J.; Halvey, P.; Tabb, D. L.; Liebler, D. C.; Pao, W.; Zhang, B., A bioinformatics workflow for variant peptide detection in shotgun proteomics. Mol Cell Proteomics 2011, 10, (5), M110 006536. 14. Wang, X. J.; Slebos, R. J. C.; Wang, D.; Halvey, P. J.; Tabb, D. L.; Liebler, D. C.; Zhang, B., Protein Identification Using Customized Protein Sequence Databases Derived from RNA-Seq Data. Journal of Proteome Research 2012, 11, (2), 1009-1017. 15. Sheynkman, G. M.; Shortreed, M. R.; Frey, B. L.; Smith, L. M., Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Mol Cell Proteomics 2013, 12, (8), 2341-53. 16. Banfai, B.; Jia, H.; Khatun, J.; Wood, E.; Risk, B.; Gundling, W. E., Jr.; Kundaje, A.; Gunawardena, H. P.; Yu, Y.; Xie, L.; Krajewski, K.; Strahl, B. D.; Chen, X.; 24

ACS Paragon Plus Environment

Page 24 of 39

Page 25 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Bickel, P.; Giddings, M. C.; Brown, J. B.; Lipovich, L., Long noncoding RNAs are rarely translated in two human cell lines. Genome Res 2012, 22, (9), 1646-57. 17. Hernandez, C.; Waridel, P.; Quadroni, M., Database construction and peptide identification strategies for proteogenomic studies on sequenced genomes. Curr Top Med Chem 2014, 14, (3), 425-34. 18. Woo, S.; Cha, S. W.; Merrihew, G.; He, Y.; Castellana, N.; Guest, C.; MacCoss, M.; Bafna, V., Proteogenomic database construction driven from large scale RNA-seq data. J Proteome Res 2014, 13, (1), 21-8. 19. Jagtap, P.; Goslinga, J.; Kooren, J. A.; McGowan, T.; Wroblewski, M. S.; Seymour, S. L.; Griffin, T. J., A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies. Proteomics 2013, 13, (8), 1352-7. 20. Li, H.; Durbin, R., Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25, (14), 1754-1760. 21. Rudin, C. M.; Durinck, S.; Stawiski, E. W.; Poirier, J. T.; Modrusan, Z.; Shames, D. S.; Bergbower, E. A.; Guan, Y.; Shin, J.; Guillory, J.; Rivers, C. S.; Foo, C. K.; Bhatt, D.; Stinson, J.; Gnad, F.; Haverty, P. M.; Gentleman, R.; Chaudhuri, S.; Janakiraman, V.; Jaiswal, B. S.; Parikh, C.; Yuan, W.; Zhang, Z.; Koeppen, H.; Wu, T. D.; Stern, H. M.; Yauch, R. L.; Huffman, K. E.; Paskulin, D. D.; Illei, P. B.; Varella-Garcia, M.; Gazdar, A. F.; de Sauvage, F. J.; Bourgon, R.; Minna, J. D.; Brock, M. V.; Seshagiri, S., Comprehensive genomic analysis identifies SOX2 as a frequently amplified gene in small-cell lung cancer. Nat Genet 2012, 44, (10), 1111-6. 25

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

22. Cancer Genome Atlas, N., Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012, 487, (7407), 330-7. 23. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; Proc, G. P. D., The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25, (16), 2078-2079. 24. Koboldt, D. C.; Zhang, Q.; Larson, D. E.; Shen, D.; McLellan, M. D.; Lin, L.; Miller, C. A.; Mardis, E. R.; Ding, L.; Wilson, R. K., VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012, 22, (3), 568-76. 25. Wang, K.; Li, M. Y.; Hakonarson, H., ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 2010, 38, (16). 26. Ye, K.; Schulz, M. H.; Long, Q.; Apweiler, R.; Ning, Z. M., Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009, 25, (21), 2865-2871. 27. Kim, D.; Salzberg, S. L., TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biology 2011, 12, (8). 28. Bao, Y. M.; Federhen, S.; Leipe, D.; Pham, V.; Resenchuk, S.; Rozanov, M.; Tatusov, R.; Tatusova, T., National Center for Biotechnology Information Viral Genomes Project. J Virol 2004, 78, (14), 7291-7298. 29. Chen, C.; Liu, X.; Zheng, W.; Zhang, L.; Yao, J.; Yang, P., Screening of Missing Proteins in the Human Liver Proteome by Improved MRM-Approach-Based Targeted 26

ACS Paragon Plus Environment

Page 26 of 39

Page 27 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Proteomics. J Proteome Res 2014, 13, (4), 1969-78. 30. Prensner, J. R.; Iyer, M. K.; Balbin, O. A.; Dhanasekaran, S. M.; Cao, Q.; Brenner, J. C.; Laxman, B.; Asangani, I. A.; Grasso, C. S.; Kominsky, H. D.; Cao, X.; Jing, X.; Wang, X.; Siddiqui, J.; Wei, J. T.; Robinson, D.; Iyer, H. K.; Palanisamy, N.; Maher, C. A.; Chinnaiyan, A. M., Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, an unannotated lincRNA implicated in disease progression. Nat Biotechnol 2011, 29, (8), 742-9. 31. Ge, X.; Chen, Y.; Liao, X.; Liu, D.; Li, F.; Ruan, H.; Jia, W., Overexpression of long noncoding RNA PCAT-1 is a novel biomarker of poor prognosis in patients with colorectal cancer. Med Oncol 2013, 30, (2), 588. 32. Prensner, J. R.; Chen, W.; Iyer, M. K.; Cao, Q.; Ma, T.; Han, S.; Sahu, A.; Malik, R.; Wilder-Romans, K.; Navone, N.; Logothetis, C. J.; Araujo, J. C.; Pisters, L. L.; Tewari, A. K.; Canman, C. E.; Knudsen, K. E.; Kitabayashi, N.; Rubin, M. A.; Demichelis, F.; Lawrence, T. S.; Chinnaiyan, A. M.; Feng, F. Y., PCAT-1, a long noncoding RNA, regulates BRCA2 and controls homologous recombination in cancer. Cancer Res 2014, 74, (6), 1651-60. 33. Nissan, A.; Stojadinovic, A.; Mitrani-Rosenbaum, S.; Halle, D.; Grinbaum, R.; Roistacher, M.; Bochem, A.; Dayanc, B. E.; Ritter, G.; Gomceli, I.; Bostanci, E. B.; Akoglu, M.; Chen, Y. T.; Old, L. J.; Gure, A. O., Colon cancer associated transcript-1: a novel RNA expressed in malignant and pre-malignant human tissues. Int J Cancer 2012, 130, (7), 1598-606. 34. Alaiyan, B.; Ilyayev, N.; Stojadinovic, A.; Izadjoo, M.; Roistacher, M.; Pavlov, V.; 27

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Tzivin, V.; Halle, D.; Pan, H.; Trink, B.; Gure, A. O.; Nissan, A., Differential expression of colon cancer associated transcript1 (CCAT1) along the colonic adenoma-carcinoma sequence. BMC Cancer 2013, 13, 196. 35. Yang, F.; Xue, X.; Bi, J.; Zheng, L.; Zhi, K.; Gu, Y.; Fang, G., Long noncoding RNA CCAT1, which could be activated by c-Myc, promotes the progression of gastric carcinoma. J Cancer Res Clin Oncol 2013, 139, (3), 437-45. 36. Strausberg, R. L.; Feingold, E. A.; Grouse, L. H.; Derge, J. G.; Klausner, R. D.; Collins, F. S.; Wagner, L.; Shenmen, C. M.; Schuler, G. D.; Altschul, S. F.; Zeeberg, B.; Buetow, K. H.; Schaefer, C. F.; Bhat, N. K.; Hopkins, R. F.; Jordan, H.; Moore, T.; Max, S. I.; Wang, J.; Hsieh, F.; Diatchenko, L.; Marusina, K.; Farmer, A. A.; Rubin, G. M.; Hong, L.; Stapleton, M.; Soares, M. B.; Bonaldo, M. F.; Casavant, T. L.; Scheetz, T. E.; Brownstein, M. J.; Usdin, T. B.; Toshiyuki, S.; Carninci, P.; Prange, C.; Raha, S. S.; Loquellano, N. A.; Peters, G. J.; Abramson, R. D.; Mullahy, S. J.; Bosak, S. A.; McEwan, P. J.; McKernan, K. J.; Malek, J. A.; Gunaratne, P. H.; Richards, S.; Worley, K. C.; Hale, S.; Garcia, A. M.; Gay, L. J.; Hulyk, S. W.; Villalon, D. K.; Muzny, D. M.; Sodergren, E. J.; Lu, X. H.; Gibbs, R. A.; Fahey, J.; Helton, E.; Ketteman, M.; Madan, A.; Rodrigues, S.; Sanchez, A.; Whiting, M.; Madan, A.; Young, A. C.; Shevchenko, Y.; Bouffard, G. G.; Blakesley, R. W.; Touchman, J. W.; Green, E. D.; Dickson, M. C.; Rodriguez, A. C.; Grimwood, J.; Schmutz, J.; Myers, R. M.; Butterfield, Y. S. N.; Kryzywinski, M. I.; Skalska, U.; Smailus, D. E.; Schnerch, A.; Schein, J. E.; Jones, S. J. M.; Marra, M. A.; Pro, M. G. C. M., Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc Natl Acad Sci U S A 28

ACS Paragon Plus Environment

Page 28 of 39

Page 29 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

2002, 99, (26), 16899-16903. 37. Ziegert, C.; Wentzensen, N.; Vinokurova, S.; Kisseljov, F.; Einenkel, J.; Hoeckel, M.; von Knebel Doeberitz, M., A comprehensive analysis of HPV integration loci in anogenital lesions combining transcript and genome-based amplification techniques. Oncogene 2003, 22, (25), 3977-84. 38. Caceres, M.; National Institutes of Health Intramural Sequencing Center Comparative Sequencing, P.; Sullivan, R. T.; Thomas, J. W., A recurrent inversion on the eutherian X chromosome. Proc Natl Acad Sci U S A 2007, 104, (47), 18571-6. 39. Sherry, S. T.; Ward, M. H.; Kholodov, M.; Baker, J.; Phan, L.; Smigielski, E. M.; Sirotkin, K., dbSNP: the NCBI database of genetic variation. Nucleic Acids Research 2001, 29, (1), 308-311. 40. Jensen, L. J.; Kuhn, M.; Stark, M.; Chaffron, S.; Creevey, C.; Muller, J.; Doerks, T.; Julien, P.; Roth, A.; Simonovic, M.; Bork, P.; von Mering, C., STRING 8-a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Research 2009, 37, D412-D416. 41. Eden, E.; Navon, R.; Steinfeld, I.; Lipson, D.; Yakhini, Z., GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. Bmc Bioinformatics 2009, 10. 42. Elias, J. E.; Gygi, S. P., Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol Biol 2010, 604, 55-71. 43. Wang, L. H.; Li, D. Q.; Fu, Y.; Wang, H. P.; Zhang, J. F.; Yuan, Z. F.; Sun, R. X.; Zeng, R.; He, S. M.; Gao, W., pFind 2.0: a software package for peptide and protein 29

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

identification via tandem mass spectrometry. Rapid Commun Mass Spectrom 2007, 21, (18), 2985-91. 44. Sun, H.; Xing, X.; Li, J.; Zhou, F.; Chen, Y.; He, Y.; Li, W.; Wei, G.; Chang, X.; Jia, J.; Li, Y.; Xie, L., Identification of gene fusions from human lung cancer mass spectrometry data. BMC Genomics 2013, 14 Suppl 8, S5. 45. Sun, H.; Chen, C.; Shi, M.; Wang, D.; Liu, M.; Li, D.; Yang, P.; Li, Y.; Xie, L., Integration of mass spectrometry and RNA-Seq data to confirm human ab initio predicted genes and lncRNAs. Proteomics 2014, 14, (23-24), 2760-8. 46. Sung, W. K.; Zheng, H. C.; Li, S. Y.; Chen, R. H.; Liu, X.; Li, Y. R.; Lee, N. P.; Lee, W. H.; Ariyaratne, P. N.; Tennakoon, C.; Mulawadi, F. H.; Wong, K. F.; Liu, A. M.; Poon, R. T.; Fan, S. T.; Chan, K. L.; Gong, Z. L.; Hu, Y. J.; Lin, Z.; Wang, G.; Zhang, Q. H.; Barber, T. D.; Chou, W. C.; Aggarwal, A.; Hao, K.; Zhou, W.; Zhang, C. S.; Hardwick, J.; Buser, C.; Xu, J. C.; Kan, Z. Y.; Dai, H. Y.; Mao, M.; Reinhard, C.; Wang, J.; Luk, J. M., Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nature Genetics 2012, 44, (7), 765-U188.

Figure Legends Figure 1

The pipeline to identify SNV/INDEL, Structural Variation (SV) and

Virus Infection/Integration from RNA-Seq in HeLa cell line. SNV/INDEL were called from the reads which could be mapped to the human reference genome; SV were called from the reads which could be aligned human reference genome with different positions or strands; and virus infection/integration were called from the 30

ACS Paragon Plus Environment

Page 30 of 39

Page 31 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

reads which could either be aligned to virus genome uniquely or be aligned to human reference and virus genome simultaneously. Figure 2 The distribution of the reads among the eight genes of HPV-18. The number of reads represents the expression of the genes among HPV-18. The sequences of E4, E5 and L2 may have been lost during the infection or integration of HPV-18, so their expression couldn’t be observed. The lower expression of E2 may be helpful to the deregulation of p53 and pRb. Figure 3 The seven integration events of HPV-18 into the upstream of MYC. The upper and down triangles indicate the breakpoints in human and virus genome respectively. The pair of two breakpoints is connected by a line. All the seven breakpoints in the human genome located in the upstream of MYC. Among them, one (chr8:128230632) is the same with that reported by Adey et al. and another one (chr8:128241377) is consistent with the chimeric clone named BC106081 (colored as red). Figure 4 Two inversion events occurred in 3’ UTR of MYC. The first inversion was supported by 5 unique reads, with the head and tail part being aligned to the forward and reverse strand respectively. The second inversion was supported by 3 unique reads, and the head and tail part could also be aligned to different strands of the reference genome. Figure 5 The evidence from the RNA-Seq, MS/MS and MRM data for the deletion event of FAM120A. Around the alternative splicing point, 3bp (GCG) was deleted and gave rise to the peptide, SQGVQPIPSQGGK, which had 1AA acid (G) 31

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

lost comparing with the normal protein peptide, SQGGVQPIPSQGGK. Supporting information, this material is available free of charge via http://pubs.acs.org. Supplementary Figure 1 The number of genes involved in different types of structural variations. There were 1174 genes in total affected by at least one type of structural variations. Among them, 787 were with Deletion, 54 with Duplication, 437 with Inversion and 29 with Translocation. Apart from the Deletion events which could contain some wrongly classified novel alternative splicing, it was surprising that there were so many Inversion events, and it might reflect one of the characters of HeLa genome. Supplementary Figure 2 The number of structural variation events in genes. While most of these genes had only one structural variation; some genes had many. The most striking case was the chromosome X gene FLNA which had 60 structural variations. Supplementary Figure 3 The distribution of structural variations occurred in FLNA. Although it was unbelievable that almost every exon of this gene was detected to have some structural variations, the sequencing data seems not from the technical artificial after manual checking. The recurrence of Inversion events in FLNA was suggested to be the result of non-allelic homologous recombination (NAHR) between near-identical inverted duplications flanking the region. Supplementary Figure 4 The variations occurred in the functional partners of MYC. Among the top 10 functional partners of MYC supplied by STRING network, 32

ACS Paragon Plus Environment

Page 32 of 39

Page 33 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

SP1 was affected by Deletion, EP300 by Duplication and UBC had a non-synonymous SNV (color as red). Supplementary Figure 5 The number of reads for the validated peptides. 15 peptides which could be validated by MRM have been supported by only one read. The largest number of read is the peptide which could be supported by 137 reads. Supplementary Figure 6 The depth of RNA-Seq among each chromosome. The mean number of reads for the covering sites is about 18.

Supplementary Table 1 The information of those reads which supported virus integration. Supplementary Table 2 The information of those reads which supported structural variation. Supplementary Table 3 The information of those reads which supported SNV/INDEL. Supplementary Table 4 The mutated peptides validated by MRM experiment and the parameters for the experiment. Supplementary Table 5 The mutated peptides identified by MaxQuant and X!Tandem.

33

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

Page 34 of 39

HeLa RNA-Seq Reads quality control map to hg19

un-Mapped Reads

Mapped Reads

blast to hg19

SNV

INDEL

...

...

Synonymous Non-Synonymous

Non Frame Shift

Blasted Reads filter

Filtered Reads

un-Blasted Reads blast to hg19 and virus genomes

Blasted Reads filter

Stop-Gain Stop-Loss Splicing

...

Frame Shift

Splicing

Non-Splicing

...

Filtered Reads

SV Deletion Duplication Inversion Translocation ACS Paragon Plus Environment

Figure 1

Virus Infection

Virus Integration

Page 35 of 39

Identity=100%, Length=76 Identity=90%, Length=60

14

Number of Reads (log2 value)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

Journal of Proteome Research

12

10

8

6

4

2

0

E1

E2

E4

E5 E6 Gene of HPV-18

ACS Paragon Plus Environment

Figure 2

E7

L1

L2

Journal of Proteome Research

Page 36 of 39

chr8:128,000,000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

chr8:128,800,000

BC106081 MYC

PCAT1 CCAT1 hg19 128241548 128241377 128241370 128235913 128231213 128231055 128230632

23

5736

2497

930 929 929 929

HPV-18 E6

E1

E7 1

E4

E5

E2

L1

L2

ACS Paragon Plus Environment

Figure 3

7857

Page 37 of 39

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

Scale chr8:

Journal of Proteome Research 128,749,000

128,749,500

2 kb 128,750,000

128,750,500

128,751,000

128,751,500

128,752,000

hg19

128,752,500

128,753,000

128,753,500

Inversion-2 Inversion-1 MYC MYC MYC

HV975509

Inversion-1: GTCCAACTTGACCCTCTTGGCAGCAGGATAGTGTCAGAGTCCTGAGACAGATCAGCAACAACCGAAAATGCACCAG GTCCAACTTGACCCTCTTGGCAGCAGGATAGTGTCATAGTCCTGAGCCAGATCAGCAACAACCGAAAATGCACCAG GTCCAACTTGACCCTCTTGGCAGCAGGATAGTGTCAGAGTCCTGAGACAGATCAGCAACAACCGAAAATGCA ACAG CACTGTCCAACTTGACCCTCTTGGCAGCAGGATAGTGTCAGAGTCCTGAGACAGATCAGCAACAACCGAAAATGCA TGACACTGTCCAACTTGACCCTCTTGGCAGCAGGATAGTGTCAGAGTCCTGAGACAGATCAGCAACAACCGAAAAT Inversion-2: GTGGAGACGTGGCACCTCTTGAGGACCAGTGGACTCTGAGGAGGAACAAGAAGATGAGG TAGAAATCGATGTTGTT GTGGAGACGTGGCACCTCTTGAGGACCAGTGGACTCTGAGGAGGAACAAGAAGATGAGGAAGAAATCGATGTTGTT GGTGGAGACGTGGCACCTCTTGAGGACCAGTGGACTCTGAGGAGGAACAAGAAGATGAGGAAGAAATCGATGTTGT

ACS Paragon Plus Environment

Figure 4

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

Page 38 of 39

Reference intron: 2390 bp Genome GGCAGTGACAGCAGCAGGACTAGCAAGTCCCAGGGCG......................... GAGTCCAACCTATACCTTCTCAGGGAGGCAAACTAGAAATAG RefGene mRNA GGCAGTGACAGCAGCAGGACTAGCAAGTCCCAGGGCGGAGTCCAACCTATACCTTCTCAGGGAGGCAAACTAGAAATAG Read1

GGCAGTGACAGCAGCAGGACTAGCAAGTCCCAGG

Read2

GCAGTGACAGCAGCAGGACTAGCAAGTCCCAGG

Read3

CAGTGACAGCAGCAGGACTAGCAAGTCCCAGG

A)

Peptide

GAGTCCAACCTATACCCTCTCAGGGAGGCAAACTAGAAATAG GAGTCCAACCTATACCTTCTCAGGGAGGCAAACTAGAAATAGC GAGTCCAACCTATACCTTCTCAGGGAGGCAAACTAGAAATAGCT

RefGene

SQGGVQPIPSQGGK

INDEL

SQG VQPIPSQGGK

B)

ACS Paragon Plus Environment

Figure 5

SQGVQPIPSQGGK

C)

Page 39 of 39

Database Identified Peptides SNV/INDEL ExonA

Mutated Peptides

ExonB

ExonC Gene

Deletion

read

filter

hg19

Duplication

read hg19

Inversion +

six frame translation

read

-

Reversal of all the sequences (for FDR controlling)

read

MaxQuant

MRM Validated Peptides

SNV/INDEL X!Tandem

hg19

Structural Variation

Translocation

chr

chr

read hg19

filter

Ensembl Known Protein + Ensembl Predicted Proteins (for competiton)

Virus Integration HPV-18

Virus Infection read

Relative Intensity

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

Journal of Proteome Research

Virus Infection/Integration

m/z

MS/MS

Abstract Figure ACS Paragon Plus Environment

Function Inference