Article
Optimal settings of mass spectrometry open search strategy for higher confidence Dehua Li, Shaohua Lu, Wanting Liu, Xinlu Zhao, Zhibiao Mai, and Gong Zhang J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 28 Sep 2018 Downloaded from http://pubs.acs.org on September 28, 2018
Journal of Proteome Research
Optimal settings of mass spectrometry open search strategy for higher confidence
Dehua Li 1, Shaohua Lu 1, Wanting Liu 1, Xinlu Zhao 1, Zhibiao Mai 1, Gong Zhang 1,*
1 Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China.
* Corresponding author. Phone/Fax: +86-20-85224031. E-mail:
[email protected],
[email protected].
Abstract
In most proteome mass spectrometry experiments, more than half of the mass spectra cannot be identified, mainly because of various modifications. The open search strategy allows for a larger precursor tolerance to utilize more spectra, especially those with post-translational modifications; however, thorough quality control based on independent information is lacking. Here, we used the “Suspicious Discovery Rate (SDR)” based on translatome sequencing (RNC-seq) as an independent source to reference the proteome open search results in steady-state cells. We found that the open search strategy increased spectra utilization at the cost of more suspicious identifications that lack translation evidence. We further found that restricting the peptide FDR to below 0.1% efficiently controlled the suspicious identifications of open search methods and thus raised the confidence of identifying modified peptides to a level comparable to that of the traditional narrow window search. We then demonstrated the successful and validated identification of 27 single amino acid variations from the spectra of two cell lines using the open search strategy without a predefined database. These results validated the proper use of open search methods for higher-quality proteome identifications with information on post-translational modifications and single amino acid polymorphisms.
Keywords
Mass spectrometry; open search; suspicious discovery; post-translational modification
Introduction
Shotgun mass spectrometry (MS), which involves searching the spectra against a pre-defined peptide database, is the most efficient method to analyze complex protein samples. However, despite the rapid improvement of MS instruments, only ~25% of spectra can be identified 1. Besides low-quality spectra, the explanation for such a large fraction of unidentified spectra is considered to be peptide variations, including post-translational modifications (PTMs) and single amino-acid polymorphisms (SAPs), which alter the mass of the peptide. These variations lead to a considerable mass shift of the precursor ions in the MS spectra and a systematic b/y ion mass shift in the MS2 spectra. Permuting all possible variations in the database would massively expand the search space, resulting in decreased sensitivity and an exaggerated false discovery rate (FDR). Griss et al. clustered spectra across thousands of shotgun MS experiments to identify 20% of the unidentified spectra. However, no apt FDR assessment could be applied 1. The open search strategy, which increases the mass tolerance of the precursor ions, was proposed to circumvent this limitation. A higher mass tolerance accommodates most PTMs and SAPs, so that these spectra can be subjected to further analyses. Increasing the mass tolerance up to 500 Da matched 50% more modified peptides than the narrow-mass search strategy in HEK293 cells, most of which contained hundreds of rare modifications 2. A number of open search tools have been developed, including PTMap 3, pMatch 4, MODa 5, SpecOMS 6, and MSFragger 7. However, these tools are not as widely used in proteomics as the traditional narrow-mass search tools such as Mascot 8 and X!Tandem 9. One important reason for their lack of use is the difficulty in obtaining an accurate FDR estimation with such approaches.
Although the Target-Decoy principle still applies for open search 1, the number of identified spectra would determine the FDR estimation, i.e., the missing identifications would lead to an incorrect FDR estimation of the identified peptides 7. Because of the different sensitivities of these open search tools, the number of identifications varies, leading to a non-robust quality control that may limit the impact of these open search results in biological investigations. An FDR estimation based on an independent reference is needed. Translatome sequencing (RNC-seq) could be such an independent reference for the proteome, at least in steady-state cells. In steady-state cells, the production and degradation of a protein are balanced, so that the translating mRNAs correspond to the proteins. Taking advantage of the ultra-high throughput of next-generation sequencing, RNC-seq can serve as a highly sensitive reference for the proteome and was proposed as a resource pillar of the Human Proteome Project (HPP) 10. In steady-state cells, the MS-identified proteins without translation evidence would be considered “suspicious identifications” (SI), which are very likely to be false discoveries. Using this method, we assessed the confidence of a number of multiple-engine integration strategies and established the protein-level integration strategy to increase the sensitivity while decreasing the “suspicious discovery rate,” which resulted in higher confidence 11. In this study, we applied the same RNC-seq reference strategy to assess the confidence of open search methods to find their optimal parameters for the confident identification of modifications.
Materials and Methods
Cell Culture
HeLa cells were cultured in 75 cm² culture flasks in complete DMEM (Life Technologies, Carlsbad, CA, USA) supplemented with 10% FBS (Life Technologies, Carlsbad, CA, USA), 100 U/mL penicillin/0.1 mg/mL streptomycin (GBCBIO Technologies Inc, Guangzhou, China), and 10 µg/mL ciprofloxacin. All cells were cultured in a 5% CO₂ incubator at 37 °C and were negative for mycoplasma detection by PCR analysis.
RNC-seq
The ribosome-nascent chain complex (RNC) extraction was performed as we previously reported [8]. In brief, HeLa cells were grown in culture flasks to 80-90% confluence and then pre-treated with 100 mg/mL cycloheximide for 15 min. The cells were then washed twice with pre-chilled phosphate buffered saline, followed by the addition of 2 mL cell lysis buffer [1% Triton X-100 in ribosome buffer (RB buffer) [20 mM HEPES-KOH (pH 7.4), 15 mM MgCl2, 200 mM KCl, 100 mg/mL cycloheximide, and 2 mM dithiothreitol]]. After incubation for 30 min in an ice-bath, cell lysates were scraped and transferred to pre-chilled 1.5 mL tubes. Cell debris was removed by centrifuging at 16200 × g for 10 min at 4 °C. Supernatants were transferred to the surface of 20 mL of sucrose buffer (30% sucrose in RB buffer). RNCs were pelleted after ultra-centrifugation at 185000 × g for 5 h at 4 °C. Then, RNC-mRNAs were isolated using TRIzol RNA extraction reagent (Ambion, Austin, TX) following the manufacturer’s instructions. The sequencing libraries of the HeLa cell line were prepared according to the instructions accompanying the MGIEasy mRNA Library Prep Kit V2 (Part# 85-05536-01; MGI, Shenzhen, China). The polyA+ mRNA in the RNC-mRNA samples was selected using the NEB Poly(A) mRNA Magnetic Isolation Module (New England Biolabs), and the mRNA libraries were then constructed with the kit following the manufacturer’s protocol. Purified libraries were quantified with a Qubit 2.0 Fluorometer (Invitrogen), and sequencing was performed on a BGISEQ-500 sequencer for 50 cycles in single-ended mode. This dataset was deposited in the GEO database under the accession number GSE112623. The RNC-seq datasets of hepatocellular cell lines (sequenced with an Illumina HiSeq-2000 sequencer at a 50 nt read length) were obtained from Chang et al. (GEO accession number GSE49994) 12. To facilitate detection of single nucleotide variants (SNVs), we performed RNC-seq experiments for the two Chinese-derived MHCC97H and MHCCLM3 cell lines again according to the previous protocol 12 using a longer read length, namely paired-end 150 nt (PE150), on an Illumina HiSeq X Ten sequencer. The raw data were deposited in the GEO database under the accession number GSE112623. The high-quality 50 nt reads were mapped to the NCBI RefSeq-RNA reference sequence (version September 18, 2017) using the FANSe3 algorithm 13 with the parameters -E3 --indel. The mRNAs with at least 10 reads were considered as confident identifications 14. The expression level was calculated using the RPKM method. The longer reads, i.e., PE150 reads for SNV detection, were mapped to the Ensembl v83 transcript reference sequences (ENST) using the FANSe3 algorithm 13 with the parameters -E5%.
After piling up the reads, SNVs were detected using the Fisher exact test at each position. The null hypothesis was that the number of nucleotides deviating from the reference sequence was not significantly higher than 2% of the sequencing depth at that position (the theoretical combined experimental and sequencing error).
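The per-position test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the source does not say how the contingency table was built or which significance cutoff was used, so the 2x2 layout (observed mismatches versus the mismatches expected at a 2% error rate), the `alpha` value, and the function names `fisher_exact_greater` and `call_snv` are all hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X follows the hypergeometric distribution
    obtained by fixing the row and column totals of the table.
    """
    row1 = a + b          # size of the first row (observed reads)
    col1 = a + c          # total mismatches across both rows
    n = a + b + c + d     # grand total
    k_max = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, k_max + 1)
    )
    return numerator / comb(n, row1)


def call_snv(depth: int, mismatches: int,
             error_rate: float = 0.02, alpha: float = 0.05):
    """Test whether mismatches at one position exceed the expected error.

    Assumption: the observed counts are compared with a hypothetical
    position of the same depth carrying exactly error_rate * depth
    mismatches. alpha is an illustrative cutoff, not taken from the source.
    """
    expected = round(depth * error_rate)
    p_value = fisher_exact_greater(
        mismatches, depth - mismatches, expected, depth - expected
    )
    return p_value < alpha, p_value


if __name__ == "__main__":
    print(call_snv(depth=100, mismatches=50))  # heterozygous-like position
    print(call_snv(depth=100, mismatches=2))   # consistent with error only
```

Using the standard library keeps the sketch dependency-free; in practice a statistics package with a tested Fisher exact implementation would be the more usual choice.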
MS datasets and search engines
The MS datasets of the MHCC97H, MHCCLM3, and Hep3B hepatocellular carcinoma cell lines were obtained from Chang et al. (ProteomeXchange accession numbers PXD000529 and PXD000533) 12. The MS datasets of the HeLa cell line were obtained from Bekker-Jensen et al. 15 (ProteomeXchange accession number PXD004452). We used pFind 3.0 16, 17 and MSFragger 7 to search the MS spectra using different parameters (Supplementary Table S1).
Translation-supported and suspicious protein identifications
The definitions of translation-supported identifications (TI) and suspicious identifications (SI) of the MS protein identification results were taken from our previous work 11. In brief, the unique peptides identified by the search engine were mapped to the RNC-seq data. In the steady state, the translating mRNAs should correspond to the proteins. The proteins with translation evidence were considered TI, and the proteins without translation evidence were considered SI. The suspicious discovery rate (SDR) was calculated as SI/(TI+SI), which reflects the potential false discovery rate.
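The SDR definition above reduces to a simple set computation. The sketch below is illustrative only and makes a simplifying assumption: protein identifications and translation evidence are represented as sets of matching identifiers, whereas the source maps unique peptides to the RNC-seq data. The function name and the example identifiers are hypothetical.

```python
def suspicious_discovery_rate(identified, translated):
    """Return SDR = SI / (TI + SI) for a set of MS-identified proteins.

    identified: iterable of protein identifiers reported by the search engine.
    translated: iterable of identifiers with translation evidence (RNC-seq).
    Both are assumed to use the same identifier namespace.
    """
    identified = set(identified)
    if not identified:
        raise ValueError("no identifications supplied")
    ti = identified & set(translated)   # translation-supported identifications
    si = identified - ti                # suspicious identifications
    return len(si) / len(identified)


if __name__ == "__main__":
    ms_hits = {"P1", "P2", "P3", "P4"}       # hypothetical identifiers
    rnc_seq = {"P1", "P2", "P3", "P9"}
    print(suspicious_discovery_rate(ms_hits, rnc_seq))  # 0.25
```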
Results
Near-complete translating mRNA identifications
All RNC-seq experiments were performed at a very high throughput, with 144-188 million mapped reads (Supplementary Table S2). Given that a typical mammalian cell contains 200,000 mRNA molecules 18, a transcript of average length (2.2 kb) present at 1 copy/cell is equivalent to 2.27 RPKM. Taking the quantification criterion of 10 reads per transcript 14, this throughput can quantify transcripts of average length with an abundance of 0.014 copy/cell, suggesting that the translating mRNA identification is near complete. Indeed, we found that RNC-seq quantified low-abundance translating mRNAs down to 0.001 RPKM (Fig. 1A). Moreover, the quantified gene number approaches saturation at this throughput (Fig. 1B), again indicating a near-complete identification of translating mRNAs. This method represents a sensitive resource to independently reference the MS identification.
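The sensitivity estimate above follows from the definition of RPKM (reads per kilobase of transcript per million mapped reads). The short sketch below reproduces the arithmetic; it assumes, for illustration, that every transcript has the average length, so that a transcript's share of reads equals its share of mRNA molecules. The function names are hypothetical.

```python
def rpkm_of_one_copy(mrna_per_cell: int = 200_000,
                     mean_length_kb: float = 2.2) -> float:
    """RPKM of an average-length transcript present at 1 copy per cell."""
    reads_per_million = 1e6 / mrna_per_cell   # 5 reads per million mapped reads
    return reads_per_million / mean_length_kb


def min_quantifiable_copies(mapped_reads_million: float,
                            min_reads: int = 10,
                            mrna_per_cell: int = 200_000,
                            mean_length_kb: float = 2.2) -> float:
    """Lowest copy number per cell that still yields min_reads reads."""
    rpkm_threshold = min_reads / (mean_length_kb * mapped_reads_million)
    return rpkm_threshold / rpkm_of_one_copy(mrna_per_cell, mean_length_kb)


if __name__ == "__main__":
    print(round(rpkm_of_one_copy(), 2))              # 2.27
    print(round(min_quantifiable_copies(144), 3))    # 0.014
```

The value of 0.014 copy/cell corresponds to the lower end of the reported throughput (144 million mapped reads); deeper datasets lower the limit further.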
Open search strategy increases the SDR
Both narrow window search and open search strategies require setting modifications prior to the search. These modifications would be considered during the theoretical spectra generation to directly match the real spectra. Increasing the number of preset modifications would expand the possible database exponentially, resulting in a decrease in the identification power. Therefore, only a few high-probability modifications are usually set. In this study, we set two modifications: cysteine carbamidomethylation and methionine oxidation. Moreover, we set these modifications as a fixed modification and a variable modification, respectively. Considering the extensive computational demand of the open search strategy, we chose two search engines, MSFragger and pFind, because of their speed. These engines are based on different mathematical models, which provides information on the SDR of different search engines. We used the RefSeq protein database (44572 entries) in the search to be consistent with the RefSeq RNA database in the RNC-seq analysis. When using a narrow window search strategy, the preset modifications slightly increased the SDR for pFind but not for MSFragger (Fig. 1C). Setting cysteine carbamidomethylation as a fixed modification and oxidized methionine as a variable modification gave results similar to those of the “both-fixed” and “both-variable” modification settings. However, when using the open search strategy with the common precursor tolerance of 500 Da, the SDR remarkably increased for both algorithms (Fig. 1C and Supplementary Fig. S1). Setting the variable modification further increased the SDR compared to that when setting a fixed modification for both algorithms. These results indicated that the open search strategy tended to give less confident identification results than the narrow search strategy. This trend was also independent of protein digestion and fractionation. When processing HeLa datasets associated with various digestion and fractionation/enrichment methods, the open search strategy resulted in a higher SDR than the narrow search strategy in most cases (Fig. 1D).
Notably, in the pH 8 and pH 10 fractions, the open search almost doubled the SDR compared to that with the narrow window search, emphasizing again the necessity to improve the quality control of the open search strategy.
Controlling the open search SDR using mass tolerance and peptide FDR
The key to the open search strategy is to allow a much larger precursor mass tolerance to accommodate the mass shift caused by modifications. However, this also has the side effect of allowing more false positives. Allowing a smaller precursor tolerance should theoretically decrease the SDR. We validated this hypothesis for both algorithms when allowing for a 100-800 Da precursor mass tolerance: the lower tolerance decreased the SDR (Fig. 2A). We also noted that the SDR almost saturated when the mass tolerance exceeded 400 Da. In comparison, the identification power was almost independent of the window size (Fig. 2B). These results suggested that a larger mass tolerance would be useless when most of the expected modifications were already included. In fact, a 300 Da tolerance already covered 90% of expected modifications, and a 500 Da tolerance covered 94% (Supplementary Fig. S2). The top 20 precursor mass shifts are listed in Supplementary Table S3; 38-70% of the mass shifts were +1 to +3, indicating that the main source of the mass shift was isotopic offsets. The peptide FDR filter is another factor used to control the identification quality. Because of the possible higher error of peptide identification caused by the large precursor tolerance, a more stringent peptide FDR should help control the quality. Indeed, we found that setting more stringent peptide FDR criteria substantially decreased the SDR for both algorithms (Fig. 2C-D). When setting the peptide FDR to 0.001, the open search SDR could be controlled to the level of the narrow window search on average, regardless of whether the cysteine carbamidomethylation was set as a variable or fixed modification (Fig. 2C-D). Notably, there were differences between the algorithms: the pFind open search SDR dropped below the narrow search SDR at a peptide FDR of 0.002, whereas MSFragger required a peptide FDR of 0.001. The enhancement of the identification quality is reflected by the concordance of the IDs obtained with the two search engines. The fractions of concordant IDs at a peptide FDR