Proteogenomics of malignant melanoma cell lines - ACS Publications

Mass spectrometry data analysis. Fragmentation spectra were extracted from raw files using in-house developed software. Raw2MGF (200 most intense peak...
0 downloads 10 Views 3MB Size
Subscriber access provided by UNIV OF DURHAM

Proteogenomics of malignant melanoma cell lines: the effect of stringency of exome data filtering on variant peptide identification in shotgun proteomics Anna A. Lobas, Mikhail A. Pyatnitskiy, Alexey L Chernobrovkin, Irina Y. Ilina, Dmitry S Karpov, Elizaveta M. Solovyeva, Ksenia G. Kuznetsova, Mark V. Ivanov, Elena Y. Lyssuk, Anna A. Kliuchnikova, Olga E. Voronko, Sergey S. Larin, Roman A. Zubarev, Mikhail V Gorshkov, and Sergei A Moshkovskii J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00841 • Publication Date (Web): 05 Apr 2018 Downloaded from http://pubs.acs.org on April 6, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Proteogenomics of malignant melanoma cell lines: the effect of stringency of exome data filtering on variant peptide identification in shotgun proteomics Anna A. Lobas1,2*, Mikhail A. Pyatnitskiy3,4, Alexey L. Chernobrovkin5, Irina Y. Ilina3, Dmitry S. Karpov3,6, Elizaveta M. Solovyeva2, Ksenia G. Kuznetsova3, Mark V. Ivanov2, Elena Y. Lyssuk7, Anna A. Kliuchnikova3,8, Olga E. Voronko3, Sergey S. Larin7, Roman A. Zubarev5, Mikhail V. Gorshkov2, Sergei A. Moshkovskii3,8* 1

Moscow Institute of Physics and Technology (State University), Dolgoprudny 141700, Moscow region,

Russia; 2V.L.

Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow

119334, Russia;

3Institute 4Higher

of Biomedical Chemistry, Moscow 119121, Russia;

School of Economics, Moscow 101000, Russia;

5Karolinska

Institutet, Stockholm 17177, Sweden;

6Engelhardt

Institute of Molecular Biology, Russian Academy of Sciences, Moscow 119991, Russia;

7Institute 8Pirogov

of Gene Biology, Russian Academy of Sciences, Moscow 119334, Russia;

Russian National Research Medical University, Moscow 117997, Russia

*correspondence to: Sergei A. Moshkovskii: [email protected], tel. +7 903 1018654 Anna A. Lobas: [email protected], tel. +7 499 1378257

Abstract Identification of genetically encoded variants at the proteome level is an important problem in cancer proteogenomics. Generation of customized protein databases from DNA or RNA sequencing data is a crucial stage of the identification workflow. Genomic data filtering applied at this stage may significantly modify variant search results, yet its effect is 1 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

generally left out of the scope of proteogenomic studies. In this work, we focused on this impact using data of exome sequencing and LC-MS/MS analyses of six replicates for eight melanoma cell lines processed by a proteogenomics workflow. The main objectives were identifying variant peptides and revealing the role of the genomic data filtering in the variant identification. A series of six confidence thresholds for single nucleotide polymorphisms (SNPs) and indels from the exome data were applied to generate customized sequence databases of different stringency. In the searches against unfiltered databases, between 100 and 160 variant peptides were identified for each of the cell lines using X!Tandem and MS-GF+ search engines. The recovery rate for variant peptides was around 1%, which is approximately three times lower than that of the wild-type peptides. Using unfiltered genomic databases for variant searches resulted in higher sensitivity and selectivity of the proteogenomic workflow, and positively affected the ability to distinguish the cell lines based on variant peptide signatures.

Keywords: proteogenomics, melanoma, cell line, cancer genome, next-generation sequencing, shotgun proteomics, data integration, missense mutation

2 ACS Paragon Plus Environment

Page 2 of 32

Page 3 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Introduction Integrated “omics” approaches are gaining interest in recent years due to the promise of a more precise characterization of molecular composition of biological samples, particularly, when genetic variation may be a key factor behind the pathological development of the cell, as it is in cancer.1,2 Proteogenomics is one of such integrated approaches which combines proteomics and genomics analyses, therefore, allowing, for example, re-annotation of genomes based on proteomics data or the identification of genomic variants at the proteome level.3–5 Currently, it is considered a promising strategy in cancer research with important basic science and clinical applications.6,7 Custom genomic variant data improves the overall coverage of the proteome and gives the opportunity to find the so-called missing proteins.8 It also facilitates the identification of important mutations that drive cancer progression2, and further helps to determine potential neo-antigens for cancer vaccine development.9 Results of variant peptide searches obviously depend on the sequence database obtained from the genomics data. The customized sequence databases either include information from public sources (e.g. COSMIC10, dbSNP11, UniProt12), or can be created from sample-specific next-generation sequencing (NGS) analysis of the studied samples.1,2,13 A number of bioinformatic tools have been developed for generating customized protein databases using genomic14 and/or transcriptomic15,16 data. One of the challenges in the NGS data analysis is the identification of true variants while removing the false positives originating from sequencing or data alignment errors. Confidence in reported genomics variants is of special importance for cancer genomes analysis, since they typically carry a lot of mutations and structural variations. Another challenge in cancer samples study is the intratumoral heterogeneity.17 An important step in the genomics data processing pipeline is the filtering procedure of the called genetic variants, which is required for generating sample-specific sequence databases. A number of filtering strategies employing machine learning were developed, for instance, based on Genome Analysis Toolkit (GATK) variant quality score recalibration (VQSR) algorithm.18,19 It is worth noting, however, that such algorithms are generally not applicable for the cancer 3 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

data , since they require prior knowledge of particular likely and unlikely mutations. Upon that, certain mutations may be present in cancer cells that are unlikely to appear in normal tissues, which makes it difficult or even impossible to calibrate the machine learning algorithm behind VQSR. In the context of genomic sequencing this issue is usually addressed by hard filtering,20 which implies setting fixed thresholds for each of the SNP/indel parameters reported by the calling software. For the researchers analyzing mass spectrometric data the problem of database generation and underlying filtering procedures often stay out of focus. It is quite typical that onco-proteogenomics rely on protein databases derived from publicly available genomics or transcriptomics data. In these cases, hard filtering has often been applied for the specific purposes of the original study; for instance, for the NCI-60 panel proteogenomic analyses13,21 based on the corresponding genomic study22 or the analyses of the HEK293 cell line.23,24 While some of the proteogenomic studies were based on hard-filtered DNA or RNA sequencing data for database generation,1 others suggest that such filtering is unnecessary since the LC-MS/MS data can itself serve as a filter.5,7,25 The existing uncertainty in this issue deserves more attention and justifies performing detailed case study. Indeed, the rate of mutations in cancer cells is elevated but still quite low – up to 100 per 1 million base pairs (for melanoma and lung cancer),26 which makes each mutation confirmed at the protein level precious to the researchers, as it may represent a novel drug target. On the other hand, falsely confirmed mutations may result in wasted resources used for their targeting. Therefore, minimizing both false positives and false negatives in cancer proteogenomics is an important goal. In this work, we performed proteogenomic analysis for eight established cell lines derived from malignant skin melanoma by acquiring and processing data on their exomes and deep proteomes. Main objectives of this research were to identify protein variants originating from coding mutations and assess the effect of stringency of exome data filtering on the efficiency of variant peptide identification.

4 ACS Paragon Plus Environment

Page 4 of 32

Page 5 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Experimental section 1. Cell lines Eight cell lines of human malignant melanoma were analyzed. All of them were derived in 2005-2008 from excised tissues of IV stage metastatic malignant skin melanomas, as described elsewhere,27,28 and stored frozen in liquid nitrogen in the biobank of the Institute of Gene Biology, Russian Academy of Sciences. All cell lines were defrosted and cultured. They experienced more than 80 passages by the time of this study. Cell lines KOR, P, G, KIS, SI, and ME were cultured in the RPMI-1640 medium supplemented by 10% (v/v) fetal calf serum, 2 mM L-glutamine, 100 U/mL penicillin and 100 mg/mL streptomycin. Cell lines 82 and 335 were cultured in the DME/F-12 medium supplemented by 10% (v/v) fetal calf serum, 2 mM L-glutamine, 100 U/mL penicillin, 100 mg/mL streptomycin and 15 mM HEPES. Cells were incubated at 37°C and 5% (v/v) CO2 refreshing the media every 3 days. All cell lines in the set were adherent, except the G cell line. Those adherent cell lines were subcultured upon reaching 70-90% confluence. To this end, the medium was withdrawn from Petri dishes with cells. The dishes were washed with warm Dulbecco’s phosphatebuffered saline (PBS) with depleted Ca2+ and Mg2+, then, 0.05% (w/v) trypsin solution containing 0.2 g/L EDTA was added. Dishes were incubated at 37°С for 5-10 minutes until cells were detached from the plastic. Equal volumes of fresh media were then added to the dishes, media were resuspended, and cells were planted out in 1:3 to 1:5 ratio. All reagents for cell culture were purchased from GE Healthcare Life Sciences (HyClone brand, Marlborough, MA). A semi-suspension culture of G cell line was subcultured upon reaching 70-90% confluence. For this culture, conditioned media withdrawn from dishes were not discarded and used in the following procedure. Cells were detached from plastic as above and those cell pellets were joined with cell suspension, the whole cell fraction was then precipitated by centrifugation for 5 minutes at 200 g and room temperature, and the cells were then seeded as above.

5 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Before the proteome analysis, cells were detached without trypsin by 0.2 g/L EDTA solution. For storage before analysis, detached cells were precipitated by centrifugation, washed three times by PBS and stored at -80°C. For the proteome analysis, six independent batches of each cell line were grown, each containing at least 3 million cells. Three of them were subjected to heat shock for the purposes of another study (42°C for 1 h, then back to 37°C for 2 h before collection). In this study, we pooled data of both heat-shocked and intact cells within each cell line to find genomic variants at the protein level. 2. Exome sequencing Exome sequencing of eight cell lines of human malignant melanoma was performed by Genotek (Moscow, Russia). Library preparation was performed using a combination of Illumina's TruSeq DNA PCR-Free Sample Preparation Kit and Agilent SureSelect Human All Exon Enrichment Kit V4 according to manufacturer’s protocols. Captured library was sequenced using Illumina HiSeq2000. The variant calling pipeline was performed according to GATK best practices.18 Briefly, the sequencing reads were mapped to the reference human genome sequence (hg38) using Burrows-Wheeler Aligner29 (BWA) version 0.7.12 followed by marking of duplicates using Picard (http://broadinstitute.github.io/picard/, version 1.133), indel realignment (GATK RealignerTargetCreator), base quality score recalibration (GATK BaseRecalibrator), and variant calling (GATK HaplotypeCaller). We used GATK version 3.5.0.18 3. Customized database generation We utilized six different schemes of GATK parameter settings for removing low-confidence sequence variants: ● “raw” database implying no variant filtering was obtained as an output of GATK HaplotypeCaller. ● “lev0” database was generated using default thresholds according to the recommendations from Broad Institute for SNPs ( QD ≥ 2.0, MQ ≥ 40.0, FS ≤ 60.0, SOR ≤ 4.0, ReadPosRankSum ≥ -8.0, MQRankSum ≥ -12.5) and for indels (QD ≥ 2.0, FS ≤ 200.0, SOR ≤ 10.0, ReadPosRankSum ≥ -20.0) 6 ACS Paragon Plus Environment

Page 6 of 32

Page 7 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

● “lev1” to ”lev4” databases were generated by setting thresholds for the most influential parameters QD, MQ and FS to 20%, 40%, 60% and 80%-quantiles of the corresponding distributions. For other parameters we used fixed thresholds: for SNPs, SOR = 3.5, 3, 2.5, 2; MQRankSum = -10, -8, -6, -4; for indels, SOR = 6, 5, 4, 3; ReadPosRankSum = -6, -5, -4, -3, for levels 1 to 4, respectively. Overall “lev4” approach corresponded to the most rigorous filtering which resulted in low number of variants, yet, presumably, the lowest level of false positives. Respectively, the “raw” database retained most variants with presumably highest level of false positives and lowest level of false negatives. 4. Proteomics sample preparation Six sample replicates were prepared for each cell line under study. The melanoma samples containing 1 million cells frozen in 100-800 µl PBS were thawed and centrifuged at 15,700 g for 10 min at 20°C (Centrifuge 5415R, Eppendorf, Hamburg, Germany). The supernatant (a protein fraction released from frozen and thawed cells in PBS) was collected for concentration into a new tube; the cell pellet was resuspended in lysis solution containing 0.1 % (w/v) Protease MAX Surfactant (Promega, Madison, WI), 50 mM ammonium bicarbonate and 10% (v/v) acetonitrile (ACN) and was shaken for 50 min at 550 rpm at room temperature. The supernatant was added to a filter unit (30 kDa, Millipore, Burlington, MA) and centrifuged at 15,700 g for 2 min at 20°C. After that, 100 µL of 50 mM ammonium bicarbonate were added to the filter unit and the sample was centrifuged at 15,700 g for 2 min at 20°C. This step was repeated 3 times to remove PBS. Then the filter unit was turned upside down to collect the proteins retained on the filter. The reverted filter unit was centrifuged at 1,000 g for 1 min at 20°C (recovery spin) to obtain about 50 µL of solution. Protease MAX was added to 0.1 % (w/v) and the received mixture was combined with cell pellet lysate also containing Protease MAX. The resulting sample was shaken for 10 min at 550 rpm at room temperature. The mixture was then subjected to sonication by Bandelin Sonopuls HD2070 (Bandelin electronic, Berlin, Germany) ultrasonic homogenizer (30% amplitude, short pulses for 5 7 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

min). The supernatant was collected after centrifugation at 15,700 g for 10 min at 20°C (Centrifuge 5415R, Eppendorf). Total protein concentration was measured using a bicinchoninic acid assay (BCA Kit, Sigma-Aldrich, USA). 500 mM dithiothreitol (DTT) in 50 mM ammonium bicarbonate buffer was added to the samples up to the final DTT concentration of 10 mM, followed by incubation for 20 min at 56°C. After that, 500 mM iodoacetamide (IAA) in 50 mM ammonium bicarbonate was added to the sample up to the final IAA concentration of 10 mM. The mixture was incubated in the darkness at the room temperature for 30 min. The protein samples were digested with trypsin (Trypsin Gold, Promega, USA). The enzyme was added at the ratio of 1:40 (w/w) to the total protein content and the mixture was incubated overnight at 37°C. Enzymatic digestion was terminated by addition of acetic acid (up to 5% w/v). After the reaction was stopped, the samples were shaken (500 rpm) for 30 min at 45°C followed by centrifugation at 15,700 g for 10 min at 20°C (Centrifuge 5415R, Eppendorf). The supernatant was then added to the filter unit (30 kDa, Millipore) and centrifuged at 13,200 g for 15 min at 20°C in the same centrifuge. After that, 100 µL of 50% formic acid were added to the filter unit and the samples were centrifuged at 13,200 g for 20 min at 20°C. The total received volume of the sample was divided into two aliquots containing 50 µg of peptides. Samples were dried up using a vacuum concentrator (Eppendorf, Hamburg, Germany) overnight (8 hours) at room temperature. Dried peptides were stored at -80°C until the LC-MS/MS analysis. 5. LC-MS/MS analysis The following buffers were used for chromatographic separation: buffer A - 2% (v/v) acetonitrile (ACN) in 0.1% (v/v) formic acid (FA), buffer B - 98% (v/v) ACN in 0.1% (v/v) FA. Peptide samples were redissolved in buffer A to the final concentration of 0.5 µg/µl. 2 µg peptides were loaded onto the pre-column (Acclaim Pepmap 100, 75 µm x 2 cm, C18, 3um, 100Å). Chromatographic separation of peptides was achieved using homemade C18 column, 25 cm (Silica Tip 360µm OD, 75µm ID, New Objective, Woburn, MA, USA) connected to the Ultimate 3000 RSLCnano chromatographic system (Thermo Scientific, Waltham, MA). Peptides were eluted at 300 nL/min flow rate using a linear gradient from 8 ACS Paragon Plus Environment

Page 8 of 32

Page 9 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

2% to 26% B over 120 min. Orbitrap Q Exactive Plus mass spectrometer (Thermo Fisher Scientific Waltham, MA) was used for the analysis of the eluted peptides that were ionized with electrospray ionization. The survey MS spectrum was acquired at the resolution of 140,000 in the range of m/z 300-1650. For each survey MS scan, MS/MS data acquired at a resolution of 17,500 were obtained using higher-energy collisional dissociation (HCD) in the range of m/z 200-2000 for 16 most intense precursor ions with charge state z>1. Precursor threshold intensity of 1,700, isolation window of 4 m/z with dynamic exclusion period of 90 seconds were used for data-dependent acquisition. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE30 partner repository with the dataset identifier PXD007662. 6. Mass spectrometry data analysis Fragmentation spectra were extracted from raw files using in-house developed software Raw2MGF (200 most intense peaks were selected per MS/MS spectrum). The customized protein databases were merged with human SwissProt database (20,196 entries), and decoys were added by reversing the sequences. The searches were performed against particular cell line sequence databases and the databases of other cell lines ("all vs. all") with different confidence thresholds (raw, lev0-lev4). Data were processed using X!Tandem31 and MS-GF+32 search engines separately. The following search parameters were used: trypsin cleavage rule with maximum of 2 missed cleavage sites, precursor mass error of 10 ppm, fragment mass error was set at 0.02 Da for X!Tandem, and the “Q Exactive” detector type was selected for MS-GF+; cysteine carbamidomethylation was selected as fixed modification, and methionine oxidation as well as N-terminal acetylation as variable ones. For open search, X!Tandem was used with the same settings except precursor mass tolerance which was set at 500 Da and no variable modifications were considered. The following procedures were applied for post-search filtering. When analyzing the overall quality of the data, all identifications including wild-type ones were processed using MP score software33 for protein inference and filtering to 1% false discovery rate (FDR) at protein level based on target-decoy approach34, no additional filters were applied. For 9 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

variant peptide identifications, only peptide-spectrum matches (PSMs) corresponding to variant target and variant decoy sequences were considered and filtered to 1% FDR at PSM level using in-house scripts based on pyteomics library.35 Similar procedure was applied for open search results, the PSMs including decoys were grouped by mass shifts with the accuracy of 0.01 Da and filtered separately within each group. 7. Genotyping to confirm selected SNPs DNA was extracted from cells using the standard phenol-chloroform method. Polymorphic sites of 18 human genes (G211S in IBA57, G108A in SDSL, and E303Q in ACOT9 in cell line P; V1046I in IARS, I298V in CCAR2, L573F in SLC5A6, and R542Q in EZR in cell line SI; G377D in CD109, M645I in TJP2, G190E in ESD, V717I in CDCA2, H759D in LRRFIP1, I251V in ABHD10, I194M in CYP2C8, K439Q in ALDH7A1, I234T in PSMB4, G216E in CARS2 and 178_180del in MPRIP in cell line 82) were genotyped by Sanger sequencing using Applied Biosystems 3500xL genetic analyzer and SeqScape software (ThermoFisher Scientific, Waltham, MA). Initial polymerase chain reactions (PCRs) were performed in a 25 μL volume containing 50 ng genomic DNA template, 10x PCR buffer, 0.5 U of HS Taq DNA Polymerase, 0.2 mM dNTPs (all from Evrogen, Moscow, Russia), and 20 pmol of each primer. PCR cycling conditions were the same for all SNPs and were as follows: 95°C for 5 minutes, followed by 35 cycles of 94°C for 15 seconds, 58°C for 20 seconds, 72°C for 20 seconds and final elongation at 72°C for 6 minutes. Primers were designed using PerlPrimer free software (http://perlprimer.sourceforge.net/) and are shown in the Supplementary Table S1, columns “primer forward” and “primer reverse”. The same primers were used for sequencing reaction. PCR products were then cleaned up by incubation with the mix of 1 U of ExoI and 1 U of SAP (both enzymes from Thermo Fisher Scientific, Waltham, MA) at 37°C for 30 minutes and at 80°C for 15 minutes. The sequencing reactions followed by EDTA/ethanol purification were carried out using BigDye® Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher Scientific, Waltham, MA) according to manufacturer’s instructions.

10 ACS Paragon Plus Environment

Page 10 of 32

Page 11 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Results and discussion 1. Exome sequencing and variant calling Full exome sequencing was performed for eight cell lines of human malignant melanoma and generated an average 104 million 100-bp paired-end reads. After mapping to human reference genome using the BWA tool, median fraction of target covered with at least 20 reads was 92.6% (Supplementary Figure S1). We utilized GATK toolbox, widely used software for NGS data analysis, following GATK’s Best Practices (GBP) to obtain a set of genomic variants. As suggested by GBP the variants were hard filtered,20 since self-adjusting technique (Variant Quality Score Recalibration) should only be used for large datasets including over 30 exomes. We used six schemes for the post-processing of the discovered variants with different filter thresholds varying from the most sensitive “raw”-level, which includes all potential variants with a high predicted fraction of false positives, down to the strictest “lev4” scheme, which minimizes this fraction (see Methods section). Obtained sets of variants were further used to generate customized protein sequence databases for each cell line using in-house R scripts. 2. Variant peptides identified using databases with different filtering Shotgun proteomic analyses were performed for six sample replicates of eight malignant melanoma cell lines. The results of the searches against wild-type human SwissProt database using X!Tandem and post-search filtering with MP score to 1% FDR at protein level yielded 10,000 to 20,000 peptide identifications corresponding to 2,500 of 4,000 identified protein groups per replicate. The results of these searches are summarized in the Supplementary Information section, Table S2 and Figures S2, S3. The variant peptide identification was performed using two search engines, X!Tandem and MS-GF+. The LC-MS/MS data for each cell line were searched against customized databases of all cell lines (“all vs. all” searches). The overall workflow including DNA and protein analysis as well as the proteomic data processing steps is shown in Figure 1. 11 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 1. Experimental and data processing workflow used in this study: eight melanoma cell lines were subject to exome sequencing and shotgun LC-MS/MS analysis after trypsin digestion. Customized databases of six confidence levels (raw databases without SNP/indel filtering, conventional hard filtering to lev0, and more strict filtering to obtain lev1-lev4 databases) were generated from exome sequencing data and used for “all vs. all” searches utilizing two search engines, X!Tandem and MS-GF+. Variant PSMs were filtered separately using target-decoy approach.

We analyzed the variant peptide identifications for a single cell line using customized databases with different confidence thresholds: raw level without filtering, lev0 after applying recommended hard filtering, and lev1-lev4 corresponding to more strict thresholds. The number of variant peptides in a database was used as a metric of its size. These peptides were generated in silico by applying trypsin specificity with up to two missed cleavages to both target and decoy sequences from the database. A peptide was considered a variant if it was only present in mutant or decoy mutant sequence. In Figure 2a the results are demonstrated for the cell line 82 as an example. Variant identifications were counted for all six sample replicates combined. One of the widely-accepted assumptions is that the large databases associated with the low confidence levels for the 12 ACS Paragon Plus Environment

Page 12 of 32

Page 13 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

sequences, such as the “raw” database generated without any filtering, have significant fraction of false positives. These false positives cannot be, consequently, confirmed by the proteome analysis. Moreover, increasing presence of false positives in the database will result in decreased number of identified true variant peptides. However, as shown in Figure 2a, the number of identified variant peptides grows almost linearly with the increase in the database size when we start loosening the SNP/indel filtering parameters. Importantly, most of the variants were identified by both search engines. Similar results were obtained for other cell lines and further shown in Supplementary Figure S4.

Figure 2. Number of variant peptides identified in searches against databases of different filtering levels. (a) The LC-MS/MS data for cell line 82 was searched against its own customized databases of six stringency levels using two search engines, X!Tandem and MS-GF+, the number of identified variant peptides grows with the increase in database size almost linearly. (b) The LCMS/MS data for cell line 82 was also searched against other cell lines’ databases with exclusion of the peptides intersecting with either hard filtered (marked as w/o 82 lev0) or non-filtered (marked as w/o 82 raw) databases of the cell line under study. The number of false matches drops dramatically when excluding the intersection with raw-level database of cell line 82.

We can assume that the observed growth in the number of variant identifications is due to the false matches which become more frequent as the size of the database increases. To investigate this assumption, we studied the effect of the database size on the number of false identifications. In this evaluation, the LC-MS/MS data for the same cell line were searched, firstly, against this cell line’s own databases of different filtering levels and, secondly, against the databases of all other cell lines combined. The variant peptides from other databases, which are also present in the database of cell line 82 (level 0 or raw) were excluded from the evaluation. The results of these searches are shown in Figure 2b. They demonstrate that the exclusion of variants present in lev0 database of cell line 82 from the 13 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

“wrong database search” results (labeled as “w/o 82 lev0”) leaves a substantial number of variant identifications, which are supposed to be false as they are not confirmed by the exome sequencing data for the respective cell line. On the contrary, the exclusion of the variants present in “raw” database of cell line 82 (labeled as “w/o 82 raw” in Figure 2b) resulted in only a few false variant identifications even for variant databases of large sizes. Therefore, this result clearly shows that an increase in database size is not a cause of getting more “false variant” matches, but rather is a consequence of the fact that the list of variant sequences in level 0 database is not complete. Excluding the intersection between other cell lines’ databases and raw-level database of cell line 82 diminishes this “false identification rate”. This further suggests that the use of unfiltered databases is preferable for proteogenomics studies. 3. Recovery rate and effect of replicates Recovery rate was calculated as a ratio between the number of identified peptides and the number of theoretical peptides in the corresponding database. Only peptides without missed cleavages, both among theoretical and identified peptides, were taken into account for this calculation. This limitation is justified because introducing missed cleavages changes the numbers of theoretical wild-type and variant peptides disproportionately, thereby imposing a bias in the difference between the corresponding recovery rates. The calculations were performed for wild-type and variant peptides, which were identified using databases with different filtering criteria (raw, lev0-lev4) for the cell line 82 depending on the number of sample replicates merged (Figure 3). In general, the recovery of wild-type peptides was about three times higher than that of variant ones. This observation is in accordance with the results reported previously for breast cancer2,25 and HEK293 cell line23. It can be explained by accumulation of mutations in genes not expressed in a particular tissue or cell line. The highest variant recovery was observed for the databases with the strictest SNP/indel filtering (lev3 & lev4). This can be due to lower amount of wrong variant sequences in the databases after such a stringent filtering. However, for other databases, including unfiltered ones at the “raw” level, no significant difference in the recovery rates was observed. The reported earlier growth in the variant peptide identification rate with addition of replicates25 has shown the same trend for wild14 ACS Paragon Plus Environment

Page 14 of 32

Page 15 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

type peptides, presumably, because sample replicates and not instrumental ones were used.

Figure 3. Recovery rates (in %) calculated for wild-type peptides and variants identified using databases of different filtering stringency vs. the number of sample replicates merged. The cell line under study is 82, peptide identification was performed using X!Tandem.

4. Open search to reveal the most frequent modifications Single amino acid variants identified in a proteogenomic search against customized database may be false matches due to modifications, which introduce mass shifts similar to the shifts occurring due to amino acid substitutions.13,23 The “open search” strategy employing a wide precursor mass window and small fragment mass tolerance can be used to define the modification pattern in LC-MS/MS datasets.36,37 In this work such search was performed using X!Tandem with precursor mass accuracy of 500 Da in order to find the most frequent mass shifts mimicking single amino acid substitutions. The pattern of obtained mass shifts is shown in Figure 4a. The most frequent mutation-mimicking modification was formylation; the whole list of such mass shifts is provided in Supplementary Information, Table S3. Thus, the identified variant peptides corresponding to the substitutions with the most abundant mass shifts in the pattern have higher probability to be false matches. We excluded the variants with the amino acid substitutions corresponding to the high-ranking mass shifts (having 15 or more PSMs identified in open search, 28 substitutions in total) from the search results for different database filtering 15 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

levels. As shown in Figure 4b, the fraction of excluded variants is similar for all such databases. This suggests that additional identifications from “raw”-level database are not enriched with mutation-mimicking modifications compared to a hard-filtered one (lev0).

Figure 4. (a) Mass shifts obtained in open search performed using X!Tandem with precursor mass tolerance of 500 Da. (b) Number of variant peptides identified in searches against customized databases with different filtering stringency before and after exclusion of the variants with mass shifts gaining 15 or more PSMs in open search (total of 28 substitutions excluded). The fraction of excluded peptides is nearly the same for all database filtering levels, suggesting that the subset of identified variants present only in unfiltered database is not enriched with false matches, occuring due to modifications.

5. Distinguishing the cell lines The results of “all vs. all” searches were used to check the ability to distinguish between the cell lines based on the identified variant peptides. Two database confidence levels were used for this comparison: “lev0” (standard hard filtering) and “raw” (no filtering), since the other filtering levels were too conservative to be considered practical. All variant peptides identified in all-vs-all searches against raw and level 0 databases are listed in the Supplementary Table S4 with corresponding PSM counts for X!Tandem and MS-GF+. For each sample, the number of variant peptides identified in each database was calculated along with the number of unique peptides (after excluding all the variant identifications shared between several cell line databases). Figure 5 presents the results for the intersection of variant peptide identifications obtained using the two search engines. Similar plots for these search engines applied separately are available in the Supplementary Information section, Figure S5.

16 ACS Paragon Plus Environment

Page 16 of 32

Page 17 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

All cell lines were correctly recognized by their variant peptides identified using both search engines for both database filtering levels. However, the recognition was more pronounced when the raw level databases were used rather than the level 0 ones, as shown in Figures 5a and 5c. This observation is confirmed by the calculation of z-scores (Figure 5e) and the corresponding p-values for the outliers. For raw-level identifications the pvalues of the outliers varied from 10-75 to 7·10-11 depending on the cell line, while the range of p-values for variants identified using filtered databases (lev0) was from 8.6·10-13 to 2.9·10-3. Excluding shared peptides and considering only the unique ones further helped improving the cell line distinguishing at both raw level (Figure 5b) and lev0 (Figure 5d, Figure 5f). The most profound recognition was achieved at raw level using the combination of both search engines. Employing both search engines has yielded only few identified peptides unique for the wrong databases, which is similar to the observations shown in Figure 2b. Although most of the additional variants identified at the raw database level compared to the lev0 are non-unique, taking those identifications into consideration resulted in higher selectivity. We also tested the exclusion of top-ranking amino acid substitutions from the variant search output based on open search results. The initial assumption was that these variants can potentially be false matches occurring due to modifications, if the mass shift caused by a substitution corresponds to a frequent mass shift caused by a modification found in the open search. Thus, we excluded 28 substitutions having 15 or more PSMs in the open search (results for X!Tandem search engine are shown in Supplementary Figure S6). These exclusions reduced the amount of unique variant identifications from the wrong cell lines, however, the overall cell line distinguishing pattern, including calculated outlier z-scores, did not change significantly.

17 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 5. Number of variant peptides identified by the combination of X!Tandem and MS-GF+ search engines in each of eight melanoma cell lines against each of the customized databases of (a) raw level (unfiltered), (b) raw level, only unique peptides considered, (c) level 0 (hard filtered), (d) level 0, only unique peptides considered. The z-scores were calculated for the outliers assuming normal distributions for the number of peptides identified in the databases of wrong cell lines. The z-score shows how well the correct cell line can be distinguished from the others based on (e) all variant peptides identified in all-vs-all searches, (f) unique peptides identified in all-vs-all searches. The use of unfiltered (raw) databases improves the distinguishing between the cell lines.

18 ACS Paragon Plus Environment

Page 18 of 32

Page 19 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

6. Sanger sequencing for confirmation of the variants identified at the “raw” level The variant peptides identified in the unfiltered database but not present at level 0 database after filtering were corresponding to SNPs/indels not passing basic quality thresholds in the whole exome sequencing. We suggested that proteomics data can serve itself as an alternative filter for the results of NGS.5,13,25 To test this assumption, we validated the presence of these genetic alterations in the genome and, thus, confirmed that the use of hard filtering of exome sequencing data is indeed not a necessary step in the proteogenomics workflow. We considered four cell lines with exome data of the highest quality and depth, namely P, SI, ME, and 82. For these cell lines in the studied panel we selected variant peptides using following criteria: (1) they were identified in the searches against unfiltered databases; (2) they were not present in databases of level 0; and (3) they were unique for the particular cell line. A list of 19 variant peptides corresponding to 18 polymorphic sites (Supplementary Table S1) was obtained; no peptides matching these criteria were found for cell line ME. For each of these peptides, Sanger sequencing was performed to validate the presence of the corresponding genetic alterations. According to the sequencing data, all polymorphic loci selected for the analysis were confirmed to be in variant homozygous states, except L573F in SLC5A6 and I234T in PSMB4, which were heterozygous. Wild-type peptides corresponding to these loci were not identified in mass spectrometric data. The confirmation of mutations at the selected loci by Sanger sequencing proved that the variant filtering algorithm applied to the NGS data yields a significant fraction of false negatives (FNR), which most likely combine with the initial FNR obtained from variant calling and sequencing procedures, a problem discussed by the community and partially addressed by the use of gold standards.38 The use of unfiltered databases for variant peptide identification allows lowering the FNR, while the FDR is controlled by the proteomics post-search validation.

19 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

7. Identified protein variants relevant to disease Since the analyzed melanoma cell lines were maintained in a biobank for a long period of time, the access to the germline DNA from patients, whom the cells were obtained from, became impossible by the time of this study. Therefore, the sequences from the cell line exomes did not allow us to distinguish somatic mutations from germline variants to further predict their contribution to the pathology. However, big metadata accumulated to date could help suggest which of the identified variants are more likely to be somatic. For example, these can be the sequences, which are not listed in the comprehensive ExAc database of human genomic variation containing data on 60,000+ individuals.39 Moreover, obtained variants may be searched against the COSMIC database of somatic mutations in cancer.10 Notably, some of the variants identified in present work were listed in both above databases. Obviously, these are so-called passenger mutations which can be either germline or somatic without causing a substantial functional impact. In total, variants from 21 genes identified at proteome level were not found in ExAc database (Supplementary Table S5, ExAc column). Most of them are not mentioned elsewhere as cancer-related genes. Only two variants, R282C in the hypoxia up-regulated 1 heat shock protein (HYOU1) and D422N in glycogen phosphorylase B (PYGB) are listed in COSMIC database. However, although both proteins may change their expression levels due to disease, the relevance of their mutations to cancer onset and progression is yet unknown. ABI1 gene product is known to interact with Abelson kinases and is involved in cell invasion in many tumor types.40,41 We identified its P133S variant in KOR cell line, although its functional impact remains unclear. Another cancer-related gene in our list was ANHAK, which codes the desmoyokin protein. Desmoyokin is localized in multiple cell compartments from desmosome to nucleus and was shown to have tumor suppressor features by down-regulation of the TGFβ/Smad pathway.42 The production of desmoyokin was decreased in many cancers including malignant melanoma.43 We found two presumably somatic variants of ANHAK product, S3411L in SI cell line and P5459L in G line, respectively. The CD74 receptor mutated in ME cell line with D19N substitution is known to be involved in antitumor immune response in breast cancer44 and melanoma.45 20 ACS Paragon Plus Environment

Page 20 of 32

Page 21 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

The DDX3X product, which is an RNA helicase, may influence global translation, thereby enhancing cancer cell development.46 In the cancer genome database from the TCGA project, this gene is listed as a pan-cancer gene specifically hypermutated in medulloblastoma.47 We found its variant, K319E, in the ME cell line. Last but not least, a mutation was found in perilipin 3 (PLIN3), a participant of the oncogenic RAS pathway, germline variants of which were recently associated with a risk of other common skin cancer, basal cell carcinoma.48 Among the identified variants, which were likely somatic, the widely-reported cancer genes were not found. It may be explained in part by the way of the cultivation of the cells obtained from the human body. Also, most of these cancer genes are tumor suppressors, which, after being mutated, often become down-regulated49 and may fail to be detected by proteomics technology. We also expected to find mutated RAS family kinases, which are frequently activated oncogenes in malignant melanoma,50 but we failed to observe such cases, presumably, due to the insufficient proteome coverage.

Conclusions We tested the effect of exome data filtering at the stage of customized databases generation on variant peptide identification in cancer proteogenomics. The approaches based on SNPs/indels hard filtering and lack of filtering were evaluated, and the use of unfiltered databases demonstrated higher sensitivity and specificity of searches. It was also shown that the recovery rate is almost the same for variant peptides corresponding to SNPs/indels with different sequencing quality and is three times lower compared with the wild-type peptides, supporting previous reports.2,23 Coding variants identified at proteome level but not passing the hard filtering threshold were genotyped using Sanger sequencing, and the presence of corresponding genetic alterations was confirmed in all cases. These results further support the notion that unfiltered NGS data should be used5,7,25 for database generation in proteogenomic studies, in agreement with a discussed issue of high false negative rate (FNR) in variant calling based on NGS data38 which is largely ignored in favor of the FDR control.

21 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Supporting Information The following files are available free of charge at ACS website http://pubs.acs.org: Supplementary Info.pdf. Contains Figures S1-S6 presenting extended results and Supplementary Tables captions. Table S1. The list of variant peptides selected for genotyping using Sanger sequencing. Table S2. The summary of the searches of 48 LC-MS/MS runs against wild-type human SwissProt database. Table S3. Mass shifts mimicking single amino acid substitutions from open search. Table S4. Full annotated lists of variant peptides identified in “all vs. all” searches against unfiltered (raw) and filtered (lev0) sequence databases. Table S5. Annotated lists of variant peptides identified for each cell line in the searches against its own databases of “raw” and “lev0” filtering levels.

Acknowledgments The work was supported by Russian Science Foundation, grant #17-15-01229 to S.A.M. Proteogenomic data processing tools used in this work were developed with financial support from Russian Foundation for Basic Research, grant #16-54-21006 to M.V.G. Authors thank Prof. Alexey Nesvizhskii (University of Michigan, Ann Arbor, MI) for his valuable comments on the use of open search strategy. Authors also thank Anna Kaznadzey for her kind help with text editing.

22 ACS Paragon Plus Environment

Page 22 of 32

Page 23 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

References (1)

Zhang, B.; Wang, J.; Wang, X.; Zhu, J.; Liu, Q.; Shi, Z.; Chambers, M. C.; Zimmerman, L. J.; Shaddox, K. F.; Kim, S.; et al. Proteogenomic characterization of human colon and rectal cancer. Nature 2014, 513 (7518), 382–387.

(2)

Mertins, P.; Mani, D. R.; Ruggles, K. V.; Gillette, M. A.; Clauser, K. R.; Wang, P.; Wang, X.; Qiao, J. W.; Cao, S.; Petralia, F.; et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 2016, 534 (7605), 55–62.

(3)

Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 2014, 11 (11), 1114–1125.

(4)

Menschaert, G.; Fenyö, D. Proteogenomics from a bioinformatics angle: A growing field. Mass Spectrom. Rev. 2017, 36 (5), 584–599.

(5)

Ruggles, K. V; Fenyö, D. Next generation sequencing data and proteogenomics. In Proteogenomics; Végvári, Á., Ed.; Advances in Experimental Medicine and Biology; Springer International Publishing: Cham, 2016; Vol. 926, pp 11–19.

(6)

Sheynkman, G. M.; Shortreed, M. R.; Cesnik, A. J.; Smith, L. M. Proteogenomics: integrating next-generation sequencing and mass spectrometry to characterize human proteomic variation. Annu. Rev. Anal. Chem. 2016, 9 (1), 521–545.

(7)

Alfaro, J. A.; Sinha, A.; Kislinger, T.; Boutros, P. C. Onco-proteogenomics: cancer proteomics joins forces with genomics. Nat. Methods 2014, 11 (11), 1107–1113.

(8)

El Guoshy, A.; Hirao, Y.; XU, B.; Saito, S.; Quadery, A. F.; Yamamoto, K.; Mitsui, T.; Yamamoto, T. Identification and validation of human missing proteins and peptides in public proteome databases; Data mining strategy. J. Proteome Res. 2017, acs.jproteome.7b00423.

(9)

Bassani-Sternberg, M.; Bräunlein, E.; Klar, R.; Engleitner, T.; Sinitcyn, P.; Audehm, S.; Straub, M.; Weber, J.; Slotta-Huspenina, J.; Specht, K.; et al. Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry. Nat. Commun. 2016, 7, 13404.

(10)

Forbes, S. A.; Beare, D.; Boutselakis, H.; Bamford, S.; Bindal, N.; Tate, J.; Cole, C. G.; Ward, S.; Dawson, E.; Ponting, L.; et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 2017, 45 (D1), D777–D783.

(11)

Sherry, S. T.; Ward, M. H.; Kholodov, M.; Baker, J.; Phan, L.; Smigielski, E. M.; Sirotkin, K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1), 308–311.

(12)

The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017, 45 (D1), D158–D169.

(13)

Alfaro, J. A.; Ignatchenko, A.; Ignatchenko, V.; Sinha, A.; Boutros, P. C.; Kislinger, T. Detecting protein variants by mass spectrometry: a comprehensive study in cancer cell-lines. Genome Med. 2017, 9 (1), 62.

(14)

Zhang, J.; Yang, M.; Zeng, H.; Ge, F. GAPP: A proteogenomic software for genome annotation and global profiling of post-translational modifications in prokaryotes. Mol. Cell. Proteomics 2016, 15 (11), 3529–3539.

(15)

Wen, B.; Xu, S.; Zhou, R.; Zhang, B.; Wang, X.; Liu, X.; Xu, X.; Liu, S. PGA: an R/Bioconductor 23 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

package for identification of novel peptides using a customized database derived from RNASeq. BMC Bioinformatics 2016, 17 (1), 244. (16)

Wang, X.; Zhang, B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics 2013, 29 (24), 3235–3237.

(17)

Boutros, P. C.; Fraser, M.; Harding, N. J.; de Borja, R.; Trudel, D.; Lalonde, E.; Meng, A.; Hennings-Yeomans, P. H.; McPherson, A.; Sabelnykova, V. Y.; et al. Spatial genomic heterogeneity within localized, multifocal prostate cancer. Nat. Genet. 2015, 47 (7), 736–745.

(18)

Van der Auwera, G. A.; Carneiro, M. O.; Hartl, C.; Poplin, R.; del Angel, G.; Levy-Moonshine, A.; Jordan, T.; Shakir, K.; Roazen, D.; Thibault, J.; et al. From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. In Current Protocols in Bioinformatics; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2013; Vol. 11, p 11.10.1-11.10.33.

(19)

Carson, A. R.; Smith, E. N.; Matsui, H.; Brækkan, S. K.; Jepsen, K.; Hansen, J.-B.; Frazer, K. A. Effective filtering strategies to improve data quality from population-based whole exome sequencing studies. BMC Bioinformatics 2014, 15 (1), 125.

(20)

De Summa, S.; Malerba, G.; Pinto, R.; Mori, A.; Mijatovic, V.; Tommasi, S. GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data. BMC Bioinformatics 2017, 18 (S5), 119.

(21)

Karpova, M. A.; Karpov, D. S.; Ivanov, M. V.; Pyatnitskiy, M. A.; Chernobrovkin, A. L.; Lobas, A. A.; Lisitsa, A. V.; Archakov, A. I.; Gorshkov, M. V.; Moshkovskii, S. A. Exome-driven characterization of the cancer cell lines at the proteome level: The NCI-60 case study. J. Proteome Res. 2014, 13 (12), 5551–5560.

(22)

Abaan, O. D.; Polley, E. C.; Davis, S. R.; Zhu, Y. J.; Bilke, S.; Walker, R. L.; Pineda, M.; Gindin, Y.; Jiang, Y.; Reinhold, W. C.; et al. The exomes of the NCI-60 panel: A genomic resource for cancer biology and systems pharmacology. Cancer Res. 2013, 73 (14), 4372–4382.

(23)

Lobas, A. A.; Karpov, D. S.; Kopylov, A. T.; Solovyeva, E. M.; Ivanov, M. V.; Ilina, I. Y.; Lazarev, V. N.; Kuznetsova, K. G.; Ilgisonis, E. V.; Zgoda, V. G.; et al. Exome-based proteogenomics of HEK293 human cell line: Coding genomic variants identified at the level of shotgun proteome. Proteomics 2016, 16 (14), 1980–1991.

(24)

Lin, Y.-C.; Boone, M.; Meuris, L.; Lemmens, I.; Van Roy, N.; Soete, A.; Reumers, J.; Moisse, M.; Plaisance, S.; Drmanac, R.; et al. Genome dynamics of the human embryonic kidney 293 lineage in response to cell biology manipulations. Nat. Commun. 2014, 5 (11), 4767.

(25)

Ruggles, K. V.; Tang, Z.; Wang, X.; Grover, H.; Askenazi, M.; Teubl, J.; Cao, S.; McLellan, M. D.; Clauser, K. R.; Tabb, D. L.; et al. An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer. Mol. Cell. Proteomics 2016, 15 (3), 1060–1071.

(26)

Lawrence, M. S.; Stojanov, P.; Polak, P.; Kryukov, G. V.; Cibulskis, K.; Sivachenko, A.; Carter, S. L.; Stewart, C.; Mermel, C. H.; Roberts, S. A.; et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013, 499 (7457), 214–218.

(27)

Mikhaĭlova, I. N.; Lukashina, M. I.; Baryshnikov, A. I.; Morozova, L. F.; Burova, O. S.; Palkina, T. N.; Kozlov, A. M.; Golubeva, V. A.; Cheremushkin, E. A.; Doroshenko, M. B.; et al. Melanoma cell lines as the basis for antitumor vaccine preparation. Vestn. Ross. Akad. meditsinskikh Nauk 2005, No. 7, 37–40.

(28)

Mikhaylova, I. N.; Kovalevsky, D. A.; Morozova, L. F.; Golubeva, V. A.; Cheremushkin, E. A.; Lukashina, M. I.; Voronina, E. S.; Burova, O. S.; Utyashev, I. A.; Kiselev, S. L.; et al. Cancer/testis 24 ACS Paragon Plus Environment

Page 24 of 32

Page 25 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

genes expression in human melanoma cell lines. Melanoma Res. 2008, 18 (5), 303–313. (29)

Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25 (14), 1754–1760.

(30)

Vizcaíno, J. A.; Csordas, A.; Del-Toro, N.; Dianes, J. A.; Griss, J.; Lavidas, I.; Mayer, G.; PerezRiverol, Y.; Reisinger, F.; Ternent, T.; et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016, 44 (D1), D447–D456.

(31)

Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–1467.

(32)

Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277.

(33)

Ivanov, M. V.; Levitsky, L. I.; Lobas, A. A.; Panic, T.; Laskay, Ü. A.; Mitulovic, G.; Schmid, R.; Pridatchenko, M. L.; Tsybin, Y. O.; Gorshkov, M. V. Empirical multidimensional space for scoring peptide spectrum matches in shotgun proteomics. J. Proteome Res. 2014, 13 (4), 1911–1920.

(34)

Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007, 4 (3), 207–214.

(35)

Goloborodko, A. A.; Levitsky, L. I.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics—a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J. Am. Soc. Mass Spectrom. 2013, 24 (2), 301–304.

(36)

Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. A masstolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 2015, 33 (7), 743–749.

(37)

Kong, A. T.; Leprevost, F. V; Avtonomov, D. M.; Mellacheruvu, D.; Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in shotgun proteomics. Nat. Methods 2017, 14 (5), 513–520.

(38)

Zook, J. M.; Chapman, B.; Wang, J.; Mittelman, D.; Hofmann, O.; Hide, W.; Salit, M. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 2014, 32 (3), 246–251.

(39)

Lek, M.; Karczewski, K. J.; Minikel, E. V.; Samocha, K. E.; Banks, E.; Fennell, T.; O’Donnell-Luria, A. H.; Ware, J. S.; Hill, A. J.; Cummings, B. B.; et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016, 536 (7616), 285–291.

(40)

Fang, D.; Chen, H.; Zhu, J. Y.; Wang, W.; Teng, Y.; Ding, H.-F.; Jing, Q.; Su, S.-B.; Huang, S. Epithelial–mesenchymal transition of ovarian cancer cells is sustained by Rac1 through simultaneous activation of MEK1/2 and Src signaling pathways. Oncogene 2017, 36 (11), 1546–1558.

(41)

Hetmanski, J. H. R.; Zindy, E.; Schwartz, J.-M.; Caswell, P. T. A MAPK-Driven Feedback Loop Suppresses Rac Activity to Promote RhoA-Driven Cancer Cell Invasion. PLOS Comput. Biol. 2016, 12 (5), e1004909.

(42)

Lee, I. H.; Sohn, M.; Lim, H. J.; Yoon, S.; Oh, H.; Shin, S.; Shin, J. H.; Oh, S.-H.; Kim, J.; Lee, D. K.; et al. Ahnak functions as a tumor suppressor via modulation of TGFβ/Smad signaling pathway. Oncogene 2014, 33 (38), 4675–4684.

(43)

Sheppard, H. M.; Feisst, V.; Chen, J.; Print, C.; Dunbar, P. R. AHNAK is downregulated in melanoma, predicts poor outcome, and may be required for the expression of functional 25 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

cadherin-1. Melanoma Res. 2015, 0, 1. (44)

Wang, Z.-Q.; Milne, K.; Webb, J. R.; Watson, P. H. CD74 and intratumoral immune response in breast cancer. Oncotarget 2017, 8 (8), 12664–12674.

(45)

Ekmekcioglu, S.; Davies, M. A.; Tanese, K.; Roszik, J.; Shin-Sim, M.; Bassett, R. L.; Milton, D. R.; Woodman, S. E.; Prieto, V. G.; Gershenwald, J. E.; et al. Inflammatory marker testing identifies CD74 expression in melanoma tumor cells, and its expression associates with favorable survival for stage III melanoma. Clin. Cancer Res. 2016, 22 (12), 3016–3024.

(46)

Valentin-Vega, Y. A.; Wang, Y.-D.; Parker, M.; Patmore, D. M.; Kanagaraj, A.; Moore, J.; Rusch, M.; Finkelstein, D.; Ellison, D. W.; Gilbertson, R. J.; et al. Cancer-associated DDX3X mutations drive stress granule assembly and impair global translation. Sci. Rep. 2016, 6 (1), 25996.

(47)

Lawrence, M. S.; Stojanov, P.; Mermel, C. H.; Robinson, J. T.; Garraway, L. A.; Golub, T. R.; Meyerson, M.; Gabriel, S. B.; Lander, E. S.; Getz, G. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 2014, 505 (7484), 495–501.

(48)

Chahal, H. S.; Wu, W.; Ransohoff, K. J.; Yang, L.; Hedlin, H.; Desai, M.; Lin, Y.; Dai, H.-J.; Qureshi, A. A.; Li, W.-Q.; et al. Genome-wide association study identifies 14 novel risk alleles associated with basal cell carcinoma. Nat. Commun. 2016, 7, 12510.

(49)

Zhao, M.; Zhao, Z. Concordance of copy number loss and down-regulation of tumor suppressor genes: a pan-cancer study. BMC Genomics 2016, 17 (S7), 532.

(50)

Akbani, R.; Akdemir, K. C.; Aksoy, B. A.; Albert, M.; Ally, A.; Amin, S. B.; Arachchi, H.; Arora, A.; Auman, J. T.; Ayala, B.; et al. Genomic classification of cutaneous melanoma. Cell 2015, 161 (7), 1681–1696.

Graphical abstract

26 ACS Paragon Plus Environment

Page 26 of 32

Page 27 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 1. Experimental and data processing workflow used in this study: eight melanoma cell lines were subject to exome sequencing and shotgun LC-MS/MS analysis after trypsin digestion. Customized databases of six confidence levels (raw databases without SNP/indel filtering, conventional hard filtering to lev0, and more strict filtering to obtain lev1-lev4 databases) were generated from exome sequencing data and used for “all vs. all” searches utilizing two search engines, X!Tandem and MS-GF+. Variant PSMs were filtered separately using target-decoy approach. 490x330mm (96 x 96 DPI)

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2. Number of variant peptides identified in searches against databases of different filtering levels. (a) The LC-MS/MS data for cell line 82 was searched against its own customized databases of six stringency levels using two search engines, X!Tandem and MS-GF+, the number of identified variant peptides grows with the increase in database size almost linearly. (b) The LC-MS/MS data for cell line 82 was also searched against other cell lines’ databases with exclusion of the peptides intersecting with either hard filtered (marked as w/o 82 lev0) or non-filtered (marked as w/o 82 raw) databases of the cell line under study. The number of false matches drops dramatically when excluding the intersection with raw-level database of cell line 82. 432x145mm (96 x 96 DPI)

ACS Paragon Plus Environment

Page 28 of 32

Page 29 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 3. Recovery rates (in %) calculated for wild-type peptides and variants identified using databases of different filtering stringency vs. the number of sample replicates merged. The cell line under study is 82, peptide identification was performed using X!Tandem. 250x264mm (96 x 96 DPI)

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 4. (a) Mass shifts obtained in open search performed using X!Tandem with precursor mass tolerance of 500 Da. (b) Number of variant peptides identified in searches against customized databases with different filtering stringency before and after exclusion of the variants with mass shifts gaining 15 or more PSMs in open search (total of 28 substitutions excluded). The fraction of excluded peptides is nearly the same for all database filtering levels, suggesting that the subset of identified variants present only in unfiltered database is not enriched with false matches, occuring due to modifications. 432x145mm (96 x 96 DPI)

ACS Paragon Plus Environment

Page 30 of 32

Page 31 of 32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 5. Number of variant peptides identified by the combination of X!Tandem and MS-GF+ search engines in each of eight melanoma cell lines against each of the customized databases of (a) raw level (unfiltered), (b) raw level, only unique peptides considered, (c) level 0 (hard filtered), (d) level 0, only unique peptides considered. The z-scores were calculated for the outliers assuming normal distributions for the number of peptides identified in the databases of wrong cell lines. The z-score shows how well the correct cell line can be distinguished from the others based on (e) all variant peptides identified in all-vs-all searches, (f) unique peptides identified in all-vs-all searches. The use of unfiltered (raw) databases improves the distinguishing between the cell lines. 432x401mm (96 x 96 DPI)

ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

TOC figure 554x319mm (96 x 96 DPI)

ACS Paragon Plus Environment

Page 32 of 32