Article Cite This: J. Proteome Res. XXXX, XXX, XXX-XXX
pubs.acs.org/jpr
Protein-Level Integration Strategy of Multiengine MS Spectra Search Results for Higher Confidence and Sequence Coverage
Panpan Zhao,† Jiayong Zhong,† Wanting Liu, Jing Zhao, and Gong Zhang*
Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
Supporting Information
ABSTRACT: Multiple search engines based on various models have been developed to search MS/MS spectra against a reference database, providing different results for the same data set. How to integrate these results efficiently with minimal compromise on false discoveries is an open question due to the lack of an independent, reliable, and highly sensitive standard. We took advantage of translating mRNA sequencing (RNC-seq) results as a standard to evaluate the integration strategies of the protein identifications from various search engines. We used seven mainstream search engines (Andromeda, Mascot, OMSSA, X!Tandem, pFind, InsPecT, and ProVerB) to search the same label-free MS data sets of human cell lines Hep3B, MHCCLM3, and MHCC97H from the Chinese C-HPP Consortium for Chromosomes 1, 8, and 20. As expected, the union of seven engines boosted false identifications, whereas the intersection of seven engines remarkably decreased the identification power. We found that requiring identification by at least two of the seven engines maximized the protein identification power while minimizing the ratio of suspicious to translation-supported identifications (STR), as monitored by our STR index based on RNC-seq. Furthermore, this strategy also significantly improves the peptide coverage of the protein amino acid sequence. In summary, we demonstrated a simple strategy to significantly improve the performance of shotgun mass spectrometry by integrating multiple search engines at the protein level, maximizing the utilization of the current MS spectra without additional experimental work.
KEYWORDS: mass spectrometry, protein identification algorithm, translating mRNA sequencing, RNC-seq, peptide coverage, translation-supported identifications, suspicious identifications, false discovery, protein-level, C-HPP, integration strategy, STR
Special Issue: Chromosome-Centric Human Proteome Project 2017
Received: June 30, 2017
Published: October 1, 2017
DOI: 10.1021/acs.jproteome.7b00463
© XXXX American Chemical Society
■
INTRODUCTION
Mass spectrometry (MS) has been established as a standard and reliable tool for proteomics analysis due to its high sensitivity and its ability to analyze complex samples. The protein identification efficiency has improved tremendously over the past decade due to increased MS capabilities, developments in nanoLC (liquid chromatography), and the booming development of protein identification algorithms. In a typical shotgun proteomics study, protein samples are digested into peptides by using one or several proteases (typically trypsin).1−4 The peptides obtained are subsequently analyzed by liquid chromatography coupled to mass spectrometry (LC−MS), whereby a subset of the available precursor ions is sampled by the MS instrument, isolated, and further fragmented in the gas phase to generate thousands of fragment ion spectra (MS/MS spectra). Peptides and proteins are identified from these spectra, and their relative or absolute amounts can be determined by a number of dedicated quantification strategies such as SILAC or isobaric labeling.5,6
New search engines have been continuously developed. Among the many existing engines, the most popular is the commercial tool Mascot;7 plenty of free search engines have also been developed, such as Andromeda (used in MaxQuant),8 OMSSA,9 X!Tandem,10,11 pFind,12−14 InsPecT,15 ProVerB,16 Dispec,17 MassWiz,18 and so on. However, the different algorithms and peptide identification rating criteria among these tools lead to diverse identification standards and result in inconsistent protein identification results.19 Mascot is based on a probability model. X!Tandem uses a hypergeometric scoring model, while OMSSA is based on a Poisson scoring model to assess the significance of peptide matches; OMSSA itself is no longer available and has been integrated into COMPASS.20 ProVerB is based on the binomial probability distribution model, and pFind is based on a parallel spectrum dot product scoring module. InsPecT uses a scoring algorithm that reflects peptide fragmentation patterns, together with a novel quality score based on several complementary features. The logic behind this strategy is that each search engine has a unique preference for the features in tandem mass spectra and covers distinct sets of peptides that other strategies miss. Moreover, search engines use different “peptide spaces”, for example, by using different cleavage rules (Keil rule) and different peak-picking algorithms. Thus researchers in proteomics often find themselves confused when deciding how to choose a reliable and suitable engine. Most researchers use only one engine for protein identification and therefore remain exposed to the weaknesses of that engine.
To maximize the identification efficiency, efforts have been reported to integrate or complement the identification results of multiple search engines. Some researchers take the union or intersection of the identification results generated from multiple search engines,1,2,21,22 intending to maximize the number or confidence of peptide and protein identifications. However, combining the results of multiple engines presents additional technical and computational challenges, including the heterogeneity of search engine scores, the propagation of false discoveries, and informatics challenges related to different data formats. To tackle these hindrances, result integration tools such as iProphet23 and Scaffold24 have been developed. These tools, however, do not necessarily increase the number of peptides per protein.19 How to integrate these results efficiently with minimal compromise on false discoveries is still an open question due to the lack of an independent, reliable, and highly sensitive standard.25 As we previously presented,26 translating mRNA sequencing (RNC-seq) provides an accurate and sensitive reference standard for MS-data integration. Here we took advantage of this standard to evaluate the integration strategies of the protein identifications from various search engines, aiming to establish a simple and optimal peptide/protein identification strategy for maximized protein identification effectiveness and protein sequence coverage.
■
METHODS
MS and RNC-seq Data Sets
MS and RNC-seq data from three human hepatocellular carcinoma (HCC) cell lines (Hep3B, MHCCLM3, and MHCC97H) were described in our previous work from the Chinese C-HPP Consortium for Chromosomes 1, 8, and 20.27 The experimental details are given in that publication.27 The mass spectra generated using an Orbitrap Q Exactive were used in this study. ProteomeXchange accession numbers: PXD000529, PXD000533, and PXD00053524; Gene Expression Omnibus database accession number: GSE42006. RNC-seq reads were mapped to human hg19 RefSeq-RNA reference sequences (downloaded from UCSC Genome Browser on April 7, 2017) using the FANSe3 algorithm28 with the parameters -L55 -E3 -S14. Genes with at least 10 mapped reads were deemed confidently detected genes.29 Alternative splice variants were not merged. The RefSeq IDs were mapped to UniProtKB IDs at http://www.uniprot.org/uploadlists/. For the MS data, the raw format files were merged into Mascot generic format (mgf) using the merge.pl program (http://www.matrixscience.com/downloads/merge.zip), and the mgf files were converted to dta format by MascotGenericFile to DTA Converter 1.0 (http://panomics.pnnl.gov/).
MS Data Search
The mass spectrometry data were searched against the neXtProt database (version April 12, 2017, 20 179 entries)30 using seven search engines (details listed in Table 1) with the following parameters: cysteine carbamidomethylation set as a fixed modification, oxidized methionine set as a variable modification, peptide mass tolerance set to 10 ppm, and fragment mass tolerance set to 0.02 Da. Up to two missed cleavages were permitted for fully tryptic peptides. The FDR threshold was set to 1% at the peptide level for all engines. Peptides that are unique to a specific neXtProt entry (NX number) were considered unique peptides. For Mascot, Andromeda, and pFind, which can control the protein-level FDR, we set protein FDR < 1% in the software. For the remaining engines, which cannot control the protein FDR directly, the MAYU software31 was used to control the FDR at < 1%. The MS search results of Mascot and X!Tandem were integrated by Scaffold and by iProphet in the Trans-Proteomic Pipeline using default settings.
Table 1. Search Engines Used in This Study
engine name | version | download link | reference
Andromeda (MaxQuant) | 1.5.2.8 | http://www.coxdocs.org/doku.php?id=maxquant:common:download_and_installation | 8
Mascot | 2.5.1 | [commercial software] | 7
COMPASS | 1.4.1 | https://github.com/marpello/Compass | 20
X!Tandem | 2009.04.01.1 | http://www.thegpm.org/TANDEM/instructions.html | 10, 11
pFind | 3.0 | http://pfind.ict.ac.cn/ | 12−14
InsPecT | 20120109 | http://proteomics.ucsd.edu/software-tools/inspectms-alignment/ | 15
ProVerB | 64-bit | http://bioinformatics.jnu.edu.cn/software/proverb/ | 16
Identification Criteria
In this study, we used two sets of criteria to filter the protein identification results: the HPP stringent criteria (HPP Guideline 2.1)32 and the previous criteria. The HPP stringent criteria require two or more unique peptides, each at least nine amino acids long. The previous criteria require one or more unique peptides with a peptide length of at least eight amino acids.27
Using RNC-seq Data To Assess the Translation Evidence of Protein Identification
Translation evidence is an independent reference for the synthesis of proteins in steady-state cells. Proteins with translation evidence (i.e., identified in RNC-seq) were considered translation-supported identifications (TIs). Proteins with no translation evidence were considered suspicious identifications (SIs). The ratio of TIs to translated proteins was defined as the true identification rate (TIR), and the ratio of SIs to MS searched proteins was defined as the suspicious identification rate (SIR). The STR is defined as SIR/TIR.
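The TI, SI, TIR, SIR, and STR definitions given in this section can be illustrated with a short sketch. It assumes that "MS searched proteins" means the proteins reported by the MS search after FDR control, which the text does not spell out. The Methods define the STR as SIR/TIR, whereas the Results describe it as SI/TI, so the sketch returns both quantities rather than choosing between them. The function name and the toy protein IDs are hypothetical.

```python
import math


def translation_support(identified, translated):
    """Summarize MS protein identifications against an RNC-seq reference.

    identified: protein IDs reported by an MS search (after FDR control).
    translated: protein IDs with translation evidence in RNC-seq.
    """
    identified, translated = set(identified), set(translated)
    if not identified or not translated:
        raise ValueError("both protein sets must be non-empty")

    ti = len(identified & translated)  # translation-supported identifications
    si = len(identified - translated)  # suspicious identifications

    tir = ti / len(translated)  # TI / translated proteins
    sir = si / len(identified)  # SI / MS-identified proteins (assumed denominator)

    return {
        "TI": ti,
        "SI": si,
        "TIR": tir,
        "SIR": sir,
        "SIR/TIR": sir / tir if ti else math.inf,
        "SI/TI": si / ti if ti else math.inf,
    }


# Toy data only; the protein IDs below are made up.
ms_hits = {"P1", "P2", "P3", "P4", "P5"}
rnc_seq = {"P1", "P2", "P3", "P4", "P6", "P7", "P8", "P9"}
summary = translation_support(ms_hits, rnc_seq)
```

In this toy case four of the five MS hits have translation evidence, so TI = 4 and SI = 1. A lower ratio indicates fewer suspicious identifications relative to translation-supported ones.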
■
RESULTS
Search Engines Provide Remarkably Different Protein IDs
Using the Hep3B MS data set 1 as an example, each of the seven search engines alone identified 4000−6000 proteins (Figure 1A). X!Tandem, pFind, and Andromeda showed better identification power than the other four engines. When we looked deeper into the protein identifications, the various MS/MS search algorithms identified different proteins from each other (Figure 1B). Among the 6362 identified proteins in total, ∼45% of proteins were missed by at least one engine, ∼8% of which were identified by only one engine (Figure 1C). The protein length and pI distributions remained independent of the number of identifying engines (all P > 0.01, two-tailed Kolmogorov−Smirnov test, Figure 1D,E). These results indicated that the physical and chemical properties are not the major reason for missed identifications. In contrast, the peptides identified by only one search engine showed a significant bias in amino acid composition compared with the database reference (Figure 1F). At the peptide level, only 13% of the peptides were identified by all engines, and 28% of the peptides were identified by only one engine (Supplementary Figure S1). These results indicated biased peptide identification of single engines. In summary, the limitation of search engines cannot be overlooked; that is, a single search model cannot be efficient for all spectra and may be biased, necessitating an optimal strategy to integrate their search results.
Figure 1. General descriptions of the identified proteins by seven search engines. (A) Number of identified proteins of the seven search engines for the Hep3B data set. (B) Protein identifications of seven engines. Proteins were alphabetically ordered by their SwissProt IDs. Identified proteins were presented as black boxes, whereas the unidentified proteins were presented as white boxes. (C) Protein distribution according to their number of identification engines. (D) The length distribution of proteins identified by one to seven engines. (E) pI distribution of proteins identified by one to seven engines. The P values were calculated using the two-tailed Kolmogorov−Smirnov test against the population of all identified proteins. (F) Amino acid composition of unique peptides identified solely by each search engine (not by any other search engine) compared with the background of neXtProt database as a reference. Chi-square test details are listed in Supplementary Table S1.
Figure 2. General descriptions of the identified proteins integrated by traditional integration strategies. (A) Average translation-supported identification (TI), suspicious identification (SI), and S/T Ratio (STR) of single search engines. Error bars denote the standard error of mean (SEM) when searching multiple MS data sets. (B,C) Average TI, SI, and STR of intersection (B) and union (C) of multiple search engines. Error bars denote the SEM of all possible combinations of search engines and MS data sets. (D) Average TI, SI, and STR of PSM-level integration strategy (Scaffold and iProphet) compared with single engines (average of X!Tandem and Mascot), union, and intersection of X!Tandem and Mascot. (E) Score distribution of the proteins identified by seven engines, respectively. (F) Score distribution of the proteins identified by only one engine. They are all significantly different from the distributions shown in panel E (P < 0.01, two-tailed Kolmogorov−Smirnov test).
Dilemma of Traditional Integration Strategies
Previously, three major types of integration strategies were widely used: intersection of multiple engines, union of multiple engines, and PSM-level scoring integration. Here we used translatome sequencing (RNC-seq) as an independent reference for the first time to evaluate these three strategies for their translation-supported identifications (TIs) and suspicious identifications (SIs, i.e., protein identifications without translation evidence), because the translating mRNA species correspond to protein species in steady-state cells.26 In this context, the TIs are highly likely to be true identifications because they are experimentally supported at two independent levels. In comparison, the SIs are more likely to be false identifications. We define the suspicious/translation-supported ratio (STR) as SI/TI to reflect the balance between error and sensitivity. All seven engines showed similar STR, ranging from 4 to 5.3% (Figure 2A). Intersection of multiple engines (up to seven) led to little decrease in SI, whereas intersection of more than four engines led to a considerable decrease in TI. Consequently, the STR increased to >5.8% for more than four engines, suggesting that intersecting more engines would decrease the reliability of protein identification as well as the identification power (Figure 2B). Conversely, the union of up to seven engines almost doubled the TI but increased the SI in parallel, which led to an almost constant STR of around 5% (Figure 2C). Using the PSM-level scoring integration strategy, Scaffold and iProphet resulted in an even higher STR than single engines (X!Tandem and Mascot) and the union and intersection of the two engines, indicating the ineffectiveness of such a strategy (Figure 2D). These trends remain similar if we apply the less stringent previous criteria (one or more unique peptides, peptide length ≥8 aa), where the intersection of more engines barely reduced the STR and the union of more engines boosted the STR (Supplementary Figure S2). Indeed, the scoring models of the engines are totally different, and the score distributions showed no similarity (Figure 2E), indicating that integration at the PSM/peptide level is difficult to extend to many engines. In summary, the traditional integration strategies make it difficult to balance higher TI, lower SI, and lower STR, which necessitates a new integration strategy, preferably not at the PSM/peptide level. Notably, the peptides identified by only one engine showed scores significantly biased toward lower quality identifications (Figure 2F), indicating that these peptides may be less reliably identified due to the limitations of each engine. Therefore, combining the protein IDs of at least two engines is necessary.
Figure 3. Optimal integration strategy of multiple engines at protein level. (A) General schematic illustration of the integration strategy. PL/UP = peptide length/unique peptide. (B−D) The TI (B), SI (C), and STR (D) of the protein-level integration strategy when a protein has to be identified by 1−7 out of 7 engines. Error bars denote the SEM of all possible combinations of search engines and MS data sets. Gray line denotes the average values of single search engines.
Our Optimal Integration Strategy at the Protein Level
To maximize the advantages of various search algorithms while controlling the STR, we proposed a simple integration strategy that integrates protein identifications instead of PSMs/peptides: keep the proteins that were identified by any N engines among M engines in total, where N < M. For each search engine, the peptide identification and the protein FDR control were performed independently. The final protein ID outputs of each engine were then integrated (Figure 3A). Note that the peptide length/unique peptide filtering of the protein identifications was also performed independently for each engine (Figure 3A). Using seven engines in total, both the TI and the SI decreased as a protein was required to be identified by more engines (Figure 3B,C).
Figure 4. Peptide sequence coverage of multiple engines and integration strategies. (A) Peptide coverage of the Microsomal glutathione S-transferase 3 protein (Microsomal GST-3, neXtProt ID: NX_O14880) amino acid sequence. Blue bars represent the identified peptides, and the red bars denote the covered regions. Red numbers denote the sequence coverage. (B) Comparing the peptide sequence coverage of single engines/traditional integration strategies versus our protein-level integration strategy for the proteins identified by our protein-level integration (identified by two out of seven engines). (C) Peptide coverage distribution of each engine and strategy. P values represent the two-tailed Kolmogorov−Smirnov test result, comparing our protein-level integration strategy (identified by two out of seven engines) against all of the single engines/traditional integration strategies.
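The protein-level rule described in this section (filter each engine's protein list independently, then keep proteins reported by at least N of the M engines) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the input layout, and the toy protein IDs are assumptions, and "identified by any N engines" is read as "at least N engines". The defaults in the filter follow the HPP stringent criteria quoted in the Methods (two or more unique peptides of at least nine amino acids).

```python
from collections import Counter


def passes_pl_up_filter(unique_peptides, min_unique_peptides=2, min_peptide_length=9):
    """Per-engine peptide length/unique peptide (PL/UP) filter for one protein."""
    long_enough = {p for p in unique_peptides if len(p) >= min_peptide_length}
    return len(long_enough) >= min_unique_peptides


def integrate_protein_ids(proteins_by_engine, min_engines=2):
    """Keep proteins reported by at least `min_engines` of the given engines."""
    total = len(proteins_by_engine)
    if not 1 <= min_engines <= total:
        raise ValueError("min_engines must be between 1 and the number of engines")
    votes = Counter()
    for proteins in proteins_by_engine.values():
        votes.update(set(proteins))  # each engine votes at most once per protein
    return {protein for protein, n in votes.items() if n >= min_engines}


# Toy example with three hypothetical engines and made-up protein IDs.
toy = {
    "engine_a": {"P1", "P2", "P3"},
    "engine_b": {"P2", "P3", "P4"},
    "engine_c": {"P3", "P5"},
}
union = integrate_protein_ids(toy, min_engines=1)
at_least_two = integrate_protein_ids(toy, min_engines=2)
intersection = integrate_protein_ids(toy, min_engines=3)
```

Setting `min_engines=1` reproduces the union and `min_engines` equal to the number of engines reproduces the intersection, so the two traditional strategies appear as the extremes of the same rule.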
When N = 2−4, that is, when a protein needed to be identified by two to four engines, the identification power surpassed the average of single search engines. Importantly, the STR reached its minimum within this range and was lower than the average of single search engines (Figure 3D), indicating higher confidence. Therefore, taking N = 2 would maximize the protein identification power
Table 2. Missing Proteins and Their Unique Peptides Identified by Single Search Engines
engine | MS data set | neXtProt ID | abundance in RNC-seq (rpkM)
X!Tandem | H3b_1 | NX_Q8N4W6 | 29.0
Andromeda | H3b_2 | NX_O60290 | 1.5
Andromeda | H3b_2 | NX_Q6ZU67 | undetected
Mascot | 97H_1 | NX_O43374 | 16.8
Mascot | 97H_2 | NX_O43374 | 16.8
Mascot | LM3_1 | NX_O43374 | 24.5
Mascot | LM3_2 | NX_O43374 | 24.5
Mascot | H3b_2 | NX_P69849 | 21.8
unique peptides: NTLGSAFLTSPIFYGR, QLAYQVLGLSEGATNEEIHR, KEEMGALYVEEPR, TLHALLVSWPALAR, LLDSNHSQSMISCVKQEGSSYNER, TFPSLELSAESRMILDAFAQQCSRVLSLLNCGGK, ELSGGAEAGTVPTSPGK, VSINNTGLLGSYHPGVFR, VVQQEEGWFR, VSINNTGLLGSYHPGVFR, VVQQEEGWFR, AHLGALLSALSR, VSINNTGLLGSYHPGVFR, VVQQEEGWFR, AHLGALLSALSR, FFSPAIMSPK, VSINNTGLLGSYHPGVFR, VVQQEEGWFR, LQGVGALGQAASDNSGPEDAKR, SLEVEVLEDDVSAVEFR
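Peptide sequence coverage, as reported in Figure 4, is the fraction of a protein's residues covered by at least one identified peptide. The sketch below shows one way to compute it and why pooling the peptides of several engines can raise the coverage of a protein. The function name, the exact string matching, and the toy sequence and peptides are assumptions made for illustration only.

```python
def sequence_coverage(protein_sequence, peptides):
    """Fraction of residues covered by at least one exactly matching peptide."""
    if not protein_sequence:
        raise ValueError("protein_sequence must be non-empty")
    covered = [False] * len(protein_sequence)
    for peptide in peptides:
        if not peptide:
            continue
        # Mark every occurrence of the peptide, including overlapping ones.
        start = protein_sequence.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein_sequence.find(peptide, start + 1)
    return sum(covered) / len(protein_sequence)


# Toy 22-residue sequence and peptides; none of these come from the study.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
engine_a_peptides = ["TAYIAK"]
engine_b_peptides = ["QISFVK"]

coverage_a = sequence_coverage(protein, engine_a_peptides)
coverage_b = sequence_coverage(protein, engine_b_peptides)
coverage_pooled = sequence_coverage(protein, engine_a_peptides + engine_b_peptides)
```

Each toy engine covers 6 of the 22 residues on its own, while the pooled peptide list covers 12. Overlapping peptides are counted once per residue, so pooling never double-counts.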
The peptide sequence coverage of seven engines alone was all