Article pubs.acs.org/jpr

Cite This: J. Proteome Res. 2017, 16, 3989-3999

MS/MS-Free Protein Identification in Complex Mixtures Using Multiple Enzymes with Complementary Specificity

Mark V. Ivanov,†,‡ Irina A. Tarasova,† Lev I. Levitsky,†,‡ Elizaveta M. Solovyeva,†,‡ Marina L. Pridatchenko,† Anna A. Lobas,†,‡ Julia A. Bubis,†,‡ and Mikhail V. Gorshkov*,†,‡

† V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, 38 Leninsky Pr., Bld. 2, Moscow 119334, Russia
‡ Moscow Institute of Physics and Technology (State University), 9 Institutsky Per., Dolgoprudny, Moscow 141700, Russia

Supporting Information

ABSTRACT: In this work, we present the results of evaluation of a workflow that employs a multienzyme digestion strategy for MS1-based protein identification in “shotgun” proteomic applications. In the proposed strategy, several cleavage reagents of different specificity were used for parallel digestion of the protein sample followed by MS1 and retention time (RT) based search. Proof of principle for the proposed strategy was performed using experimental data obtained for the annotated 48-protein standard. By using the developed approach, up to 90% of proteins from the standard were unambiguously identified. The approach was further applied to HeLa proteome data. For a sample of this complexity, the proposed MS1-only strategy correctly identified up to 34% of all proteins identified using standard MS/MS-based database search. It was also found that the results of MS1-only search were independent of the chromatographic gradient time in a wide range of gradients from 15−120 min. Potentially, rapid MS1-only proteome characterization can be an alternative or complementary to the MS/MS-based “shotgun” analyses in studies in which the experimental time is more important than the depth of the proteome coverage.

KEYWORDS: proteomics, database search, MS1-only search, peptide mass fingerprinting, shotgun proteomics, protein identification



INTRODUCTION

Shotgun proteomics is a widely used approach for qualitative and quantitative analysis of proteins in complex biological samples. Currently, two basic strategies for identification of proteins are known. The first strategy, developed in the early 1990s, is called peptide mass fingerprinting (PMF).1−6 In PMF, the protein identification process is based on the comparison of measured and theoretically calculated masses of peptides generated from enzymatic digestion of the protein sample. However, this strategy was only found to work for small databases and relatively simple mixtures typically reduced to a few proteins. When the sample complexity grows, PMF search generates increasingly unreliable identifications, especially for relatively short proteins represented by a limited number of peptides.7 The approach based on using accurate mass and retention time tags (AMT) significantly improved the capabilities of the method for tandem mass spectrometry (MS/MS)-free protein identification.8 However, even for the mass accuracy of 1.0 ppm and retention time prediction accuracy close to 1.0 min, the utility of the AMT approach for unambiguous protein identification was limited to relatively small proteomes.9

Currently, the method of choice for protein identification is based on MS/MS.10,11 The eluting peptides are sequentially isolated and fragmented using various dissociation techniques to produce sequence-specific MS/MS spectra. These spectra are compared with the theoretical ones generated from the applicable database followed by ranking of the successful matches according to the probability scores. However, the MS/MS-based proteome analysis suffers from a number of limitations. First, it significantly complicates the identification of low abundance proteins for high dynamic range samples. For example, the reported dynamic range of proteome characterization using state-of-the-art Orbitrap FTMS may not exceed 10⁴.12 Moreover, a number of studies performed for high dynamic range protein standards have shown dynamic range below 10³ for confidently identified proteins.13 Second, only a small fraction of all peptide-like features detected in MS1 spectra was further selected for the MS/MS and subsequent identification.14 This occurs because isolation, accumulation, and fragmentation of precursor ions require prolonged time, especially for low abundance peptides. Finally, these low abundance peptides, while being selected for isolation and fragmentation, typically produce low quality tandem mass spectra.

Recently, the interest in peptide mass fingerprint or similar approaches was renewed as an addition to tandem mass spectrometry for increasing confidence in MS/MS-based peptide identifications.15,16 It has been shown that peptide feature matches, obtained by comparing the mass and retention time tags generated in silico with the ones from MS1 spectra, can be further employed for calculating protein probabilities using both fragmented and nonfragmented precursors.15 In the other study, the approach for determining most probable amino acid compositions and filtering sequence candidates based on neutron encoded (NeuCode) mass signatures was implemented.16−19 One of the unique features of this approach is the possibility of peptide identification without MS/MS.16

In this work, we propose and explore an MS/MS-free method for protein identification based on parallel digestion of the analyzed sample using different proteases or chemical reagents, or their combinations. The main rationale behind this work is the development of a reliable MS1-only method for rapid identification of the major components in large numbers of clinical samples using ultrafast HPLC gradients (below 15 min). We believe that this will allow us to employ simpler and less expensive MS platforms without MS/MS capabilities (e.g., multipath high resolution TOFs). One of the assumptions behind this proposal is that MS1 spectra obtained for proteolytic peptides generated from different proteases bear some level of sequence-specific complementarity. Indeed, it has been known for quite a long time that using multiple proteases for digesting the same protein sample increases both the number of proteins identified from tandem mass spectra and their sequence coverage.20−22 Thus, we further hypothesize that combining MS1 spectra from multiple proteases may improve the efficiency of the standard mass fingerprinting approach. In addition, we add retention time prediction to the MS/MS-free algorithm of protein identification to further enhance its efficiency. The basis for this combination is the sequence specificity of peptide elution times.23−26 Further, we compare the efficiency and attainable protein sequence coverage of the proposed MS1-only protein identification method with standard proteome analyses based on tandem mass spectrometry.

Received: June 1, 2017
Published: September 14, 2017
© 2017 American Chemical Society
DOI: 10.1021/acs.jproteome.7b00365 J. Proteome Res. 2017, 16, 3989−3999



EXPERIMENTAL SECTION

Materials

Equimolar proteome standard UPS1 (Sigma-Aldrich, St. Louis, Missouri, USA) was dissolved in 50 mM ABB (ammonium bicarbonate) at the concentration of 0.1 μg/μL and digested with either trypsin (Promega, Madison, WI, USA), LysC (Promega, Madison, WI, USA), or GluC (Promega, Madison, WI, USA) alone or with a mixture of LysC/GluC (ratio of 1:1, v/v) and trypsin/GluC (ratio of 1:1, v/v). The samples were reduced with 10 mM DTT (dithiothreitol), added in the ratio of 1:1 (v/v), and then incubated for 30 min at 60 °C. The reduced samples were cooled to room temperature and alkylated with IAA (iodoacetamide) added to a final concentration of 15 mM by incubating for 30 min in the dark. The proteases were added to the samples at a ratio of 1:20 (protease/protein, w/w). Then the samples were digested for 18 h at 37 °C. The digestion was stopped with formic acid. The total protein amount injected was 8.64 pmol. The Pierce HeLa Protein Digest Standard (Thermo Fisher Scientific, Waltham, MA, USA) was dissolved in Milli-Q water with 0.1% formic acid up to a concentration of 1 μg/μL. The total protein amount injected was 1 μg.

HeLa Data

Two publicly available data sets obtained for HeLa cell lysates in the earlier studies were used in this work: (1) The “confetti” data set was obtained using separate proteases such as trypsin, LysC, GluC, Asp-N, and Arg-C and protease mixtures such as LysC/GluC and trypsin/GluC.21 Raw data were downloaded from www.proteomexchange.org (data set identifier PXD000900). The single-shot runs were used in the study. (2) Another data set was obtained using different HPLC gradient times varied in a range from 15−120 min.27 Raw data were downloaded from www.proteomexchange.org (data set identifier PXD001695).

LC−MS/MS

Shotgun proteome analyses of the UPS proteolytic digests were performed on Orbitrap Q Exactive HF (Thermo Fisher Scientific, Waltham, MA, USA) coupled to Ultimate 3000 RSLCnano system (Dionex, Sunnyvale, CA, USA). The samples were loaded into a trap column (Acclaim PepMap, 2 cm × 75 μm i.d., C18, 3 μm, 100 Å) (Thermo Fisher Scientific, Waltham, MA, USA) at 2 μL/min. Separations were performed using analytical column Zorbax 300SB-C18, 15 cm × 75 μm i.d., 3.5 μm particles (Agilent, Santa Clara, CA, USA). Mobile phase consisted of A and B solvents: (A) 100% water with 0.1% formic acid, and (B) 80% acetonitrile, 20% water with 0.1% formic acid. Linear gradient from 5% B to 40% B for 120 min at the flow rate of 300 nL/min was used for separations. MS1 settings were as follows: mass range m/z 400−1500; resolving power was 60k at m/z 400, maximum injection time was set to 100 ms, and AGC of 5 × 10⁵. The peptide isolation window for MS/MS was 2.0 Th and high energy collisional dissociation (HCD) method was used for fragmentation. MS/MS scan range was from 200−2000 Th, and the dynamic exclusion time was 10.0 s. The resolving power for MS/MS spectra of 15K at m/z 400, maximum injection time of 100.0 ms, and AGC of 10⁵ were used.

Shotgun proteome analyses of the Pierce HeLa digest standard were performed on LTQ Orbitrap Velos (Thermo Fisher Scientific, Waltham, MA, USA) coupled to Agilent 1100 HPLC System (Agilent, Santa Clara, CA, USA). The samples were loaded into a trap column Zorbax 300SB-C18, 5 × 0.3 mm², 5 μm particles (Agilent, Santa Clara, CA, USA) at 4 μL/min. Separations were performed using analytical column Zorbax 300SB-C18, 15 cm × 75 μm, 3.5 μm particles (Agilent, Santa Clara, CA, USA). Mobile phase A (100% water with 0.1% formic acid) and mobile phase B (80% acetonitrile, 20% water with 0.1% formic acid) were used to establish the 70 min gradient composed of 2 min of 2−5% B, 21 min of 5−30% B, 5 min of 30−45% B, 2 min of 45−95% B, and 10 min of 95% B followed by re-equilibration at 2% B for 20 min. The flow rate of 300 nL/min was used for separations. MS1 settings were as follows: mass range m/z 300−1500, maximum injection time was set to 50 ms, and AGC of 4 × 10⁶. The peptide isolation window for MS/MS was 4.0 Th. Resolving power was varied from 15−100k. For the MS/MS experiment, the 10 and 20 most intense ions above a 5000 counts threshold were selected for fragmentation. For collision-activated dissociation, normalized collision energy was set to 35%. MS/MS scan range was from 200−2000 Th, and the dynamic exclusion time was 15.0 s. The resolving powers for MS/MS spectra of 7.5K, maximum injection time of 250.0 ms, and AGC of 4 × 10⁴ were used. The MS experiments were performed at the “Human Proteome” Core Facility at the Institute of Biomedical Chemistry (IBMC).


we added the so-called “noise” peptides to the in silico data set after applying all filtering described above. The sequences of “noise” peptides were generated by random selection from a list of 20 commonly occurring amino acid residues. The distributions of “noise” peptide sequences by length and number of missed cleavages were similar to the ones from in silico data set. The neutral mass and the RT for each peptide from the in silico generated data set, including “real” and “noise” peptides, were calculated using Pyteomics. Peptide retention times were calculated using retention coefficients determined for the experimental data set from 60 min gradient HPLC−MS/MS analysis. To emulate the real experimental data set, the masses and RTs of peptides from in silico data set were normally distributed with the standard errors shown in Table 2. The default values

MS1 and MS/MS Searches

Raw files were converted to MGF and mzML formats using msConvert from ProteoWizard.28 MS1 spectra in mzML format were processed for deisotoping and peak picking using Dinosaur software.29 The algorithm of the proposed MS1-only search method is described in the Results and Discussion section below. For validation and evaluation of the efficiency of the proposed MS1 method, we performed standard MS/MS-based protein identification. Specifically, database search against Human SwissProt database was performed using X!Tandem, version 2012.10.01 Cyclone.30 The following parameters were used: precursor mass tolerance of 10 ppm, fragment mass tolerance of 0.02 Da, a maximum of 2 missed cleavages, fixed carbamidomethylation of cysteine and potential oxidation of methionine as residue modifications. The Pepxmltk31 utility was used to convert X!Tandem output files to standard pepXML format. This utility allows setting an arbitrary sequence cleavage specificity, contrary to the Tandem2XML converter typically employed for X!Tandem output conversion. The identifications were filtered to 1% FDR at the protein level and validated using MP score.32 Further processing and data analysis were performed in Python using Pyteomics.33 Retention time (RT) calculation was performed using ELUDE.34
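The 1% protein-level FDR filtering mentioned above can be illustrated with a minimal target-decoy sketch. This is not the procedure implemented in the cited tools: the function name `filter_fdr`, the `(name, score, is_decoy)` tuples, and the FDR estimate (decoys divided by targets above the score cutoff) are assumptions made for illustration only.

```python
def filter_fdr(proteins, fdr=0.01):
    """Return names of target proteins that pass the FDR threshold.

    proteins: iterable of (name, score, is_decoy); a higher score is better.
    The FDR at each cutoff is estimated as (# decoys) / (# targets) above it
    (an assumed, commonly used estimate).
    """
    ranked = sorted(proteins, key=lambda p: p[1], reverse=True)
    targets = decoys = 0
    best_cut = 0  # number of top-ranked entries to keep
    for i, (_, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest cutoff that still satisfies the threshold.
        if targets and decoys / targets <= fdr:
            best_cut = i
    return [name for name, _, is_decoy in ranked[:best_cut] if not is_decoy]
```

Keeping the deepest cutoff that satisfies the threshold, rather than stopping at the first decoy, lets a single high-scoring decoy be tolerated once enough targets have accumulated above it.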

In Silico MS1 Data Generation

Proteins randomly selected from SwissProt human database were digested in silico into peptides using a number of cleavage rules listed in Table 1. Theoretical peptides generated for a

Table 1. Enzymes Used for in Silico Experiment

enzyme          cleavage rule
trypsin         C-term K or R, but not before P
GluC            C-term E
LysC            C-term K
trypsin+GluC    C-term K, R, or E
LysC+GluC       C-term K or E
ArgC            C-term R
AspN            N-term D

Table 2. Parameters Used for Generation of in Silico Data

parameter                       default    tested values
mass accuracy, ppm              0.33       0.1, 0.33, 1.0, 3.0
retention time accuracy, min    3          1, 3, 6, 9
no. of noise peaks              3          0, 1, 2, 3, 4, 5, 10
number of proteins              2000       100, 500, 1000, 2000, 3000, 4000, 5000
enzymes                         trypsin    trypsin, GluC, LysC, LysC+GluC, trypsin+GluC, ArgC, AspN

were 0.33 ppm for mass accuracy, 3 min for retention time accuracy, three noise peaks, trypsin cleavage, and 2000 proteins in the sample. We varied these parameters one by one while fixing the others to evaluate their effect on MS1 search efficiency. Additionally, to reveal the dependence between the best enzyme and the sample complexity, different enzymes were tested separately for 100 and 2000 proteins in the sample.
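The cleavage rules of Table 1 can be written as regular expressions and applied to a protein sequence. The sketch below is a plain-Python illustration, not the authors' code, which relied on Pyteomics. The names `CLEAVAGE_RULES` and `digest` are hypothetical, and the regex encodings are an assumed reading of the rules exactly as the table states them (for example, the "not before P" exception is applied to trypsin only).

```python
import re

# Zero-width patterns marking cleavage positions (assumed encoding of Table 1).
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",   # C-term K or R, but not before P
    "GluC": r"(?<=E)",              # C-term E
    "LysC": r"(?<=K)",              # C-term K
    "trypsin+GluC": r"(?<=[KRE])",  # C-term K, R, or E
    "LysC+GluC": r"(?<=[KE])",      # C-term K or E
    "ArgC": r"(?<=R)",              # C-term R
    "AspN": r"(?=D)",               # N-term D
}

def digest(sequence, enzyme, missed_cleavages=0, min_length=6):
    """Return the set of peptides produced by in silico digestion."""
    sites = [m.start() for m in re.finditer(CLEAVAGE_RULES[enzyme], sequence)]
    # Peptide boundaries: protein termini plus all cleavage positions.
    bounds = sorted({0, len(sequence), *sites})
    peptides = set()
    for i in range(len(bounds) - 1):
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptide = sequence[bounds[i]:bounds[j]]
            if len(peptide) >= min_length:  # six-residue minimum, as in the text
                peptides.add(peptide)
    return peptides
```

For instance, `digest("GASGASKPGASGARGLLLLLEDVVVVVK", "trypsin")` yields two peptides, because the K followed by P is not cleaved, whereas `"trypsin+GluC"` cuts the same sequence into four seven-residue peptides under the rule as tabulated.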



RESULTS AND DISCUSSION

MS1-Based Search and Protein Scoring Algorithms

The proposed workflow for MS1-only database search and protein scoring is shown in Figure 1. The specific feature of the approach is employing complementary information about peptide properties extracted from chromatographic and mass spectrometry data as well as the residue specificity of enzymes used for digestion. Peaks measured in the MS1 spectra are filtered by charge state and number of observed isotopes. After identifying the peptide-like features in the acquired data, a search against the protein database containing target and decoy protein sequences is performed to match experimental values of masses to the ones calculated for theoretical peptides (peptide-feature match, PFM). Importantly, contrary to the MS/MS-based search, which is typically independent of the way the decoy database was generated,35 the proposed MS1 strategy cannot work with reversed databases and requires shuffled sequences. This is due to the fact that a significant number of reversed decoy peptides will have the same m/z ratios as the target ones. Initial peptide-feature matching was performed with the user-defined mass tolerances. The optimal values for these tolerances (standard deviation and systematic shift) were also calculated on-the-fly during the initial analysis. Note that standard deviation and systematic shift for the differences between experimental and predicted retention times depend on the model used for RT prediction. Additive model with length

given cleavage specificity were further filtered to model the experimental data. First, the peptides generated in silico were restricted by minimal sequence length of six residues and m/z range of 300−1500 Th. We also assumed that a peptide contains one charge per each six residues in the sequence. Further, we removed 90% of peptides randomly selected in the generated data set with one missed cleavage and all peptides with more than one missed cleavage. Then we simulated a diversity of protein abundances in real samples by normal distribution of the percentage of observable peptides per protein. These percentages followed the normal distribution with mean value of 70% and standard deviation of 10%. This means that 68% of proteins have from 60−80% of detectable peptides (within ±1 standard deviation range) and only 4% of the protein population have less than 50% or higher than 90% of detectable peptides. Finally, we removed 50% of randomly selected remaining peptides to better match the in silico generated data set with the properties of experimental data sets obtained in a typical LC−MS/MS-based proteome analysis. Note that the real MS data exhibit significant presence of “noise” from unidentified peptide-like spectra originating from in-source fragmentation, artifact modifications, etc.14 Therefore,


correction was implemented into the proposed MS1 search algorithm by default. This is the most simple and straightforward model allowing retraining of the retention coefficients for the residues for each analyzed data set. However, the accuracy of this model is limited.23 A more advanced self-trained model, ELUDE, was also integrated in the workflow to improve the accuracy of RT prediction. Another advantage of ELUDE is the smaller number of reliably identified peptides needed for the model’s training to attain the maximum RT prediction accuracy compared with the additive model.36 In this study, we employed ELUDE for all data sets. Generally speaking, the RT prediction for the MS1-only search can be performed using any existing peptide retention models or the databases of experimental peptide retention times. While retrainable models can be used for arbitrary experimental parameters, they require reliable peptide identifications for the training set obtained for these parameters using MS/MS. Among the nonretrainable models, one can mention BioLCCC, which allows calculation of the absolute retention times for the given separation conditions,37 as well as one of the most accurate RT prediction algorithms, SSRCalc.38 However, the former is suffering from low RT prediction accuracy, which is crucial for MS1-only search efficiency, while the latter does not have a standalone version and is available through the web interface only, hindering its integration into the MS1 search software.

Figure 1. General workflow for MS1-only protein search implemented in this study.

In the next step of the search, all PFMs were filtered using a threshold determined by the standard deviation of predicted retention times from experimental RTs. 1.3σ and 3.0σ were found optimal for complex (HeLa) and simple (UPS standard) data, respectively (Supplementary Table S1). PFMs with predicted RTs beyond these thresholds were discarded. After filtering, the remaining PFMs were assembled into proteins. Protein probabilities were calculated using a binomial model. In this model, the number of trials, n, was equal to the total number of theoretical peptides, the number of successes, k, was set to that of identified peptides and the success probability in each trial, p, corresponded to the probability of a theoretical peptide to be randomly matched. Thus, the probability, P, of a protein to be randomly matched in a search was calculated as follows:

$$P(k) = \frac{n!}{k!\,(n-k)!}\, p^{k}\, (1-p)^{n-k} \tag{1}$$

This calculation does not make any distinction between unique and shared peptides, using only the protein’s peptide identifications to calculate its probability. The success probability in each trial, p, in eq 1 was calculated as the fraction of matched decoy peptides in the search space of all decoy peptides:

$$p = \frac{\text{number of unique decoy PFMs}}{\text{number of theoretical decoy peptides}} \tag{2}$$

For each protein, the survival function was calculated as follows:

$$Sf(k) = 1 - \sum_{i=0}^{k} P(i) \tag{3}$$

When data are generated for a number of sample aliquots digested using different enzymes or chemical reagents, the final probability of a protein to be randomly matched in the searches performed for all sample aliquots, m, can be calculated as follows:

$$Sf_{\text{final}} = Sf_{1} \times Sf_{2} \times \cdots \times Sf_{m} \tag{4}$$

For more convenient representation, we report the final protein score as

$$\text{Protein Score} = -\log_{10}\left(Sf_{\text{final}}\right) \tag{5}$$
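Equations 1−5 translate almost directly into code. The sketch below uses only the Python standard library; the function names and the `floor` guard against taking the logarithm of zero (which can arise from floating-point round-off when the survival value is extremely small) are assumptions for illustration and not part of the published algorithm.

```python
import math

def random_match_probability(n_decoy_pfms, n_decoy_peptides):
    """Eq 2: fraction of theoretical decoy peptides that were matched."""
    return n_decoy_pfms / n_decoy_peptides

def binom_pmf(i, n, p):
    """Eq 1: probability of exactly i random matches among n theoretical peptides."""
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)

def survival(k, n, p):
    """Eq 3: Sf(k) = 1 - sum_{i=0..k} P(i)."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k + 1))

def protein_score(digests, floor=1e-300):
    """Eqs 4-5: multiply per-enzyme survival values and take -log10.

    digests: list of (k, n, p) tuples, one per enzyme digest.
    floor: assumed numerical guard, not part of the published method.
    """
    sf_final = 1.0
    for k, n, p in digests:
        sf_final *= survival(k, n, p)
    return -math.log10(max(sf_final, floor))
```

As a quick check, two digests with `(k, n, p) = (0, 1, 0.5)` each give a survival value of 0.5, so the combined score is −log10(0.25) ≈ 0.602. For proteins with many theoretical peptides, a library routine for the binomial survival function would likely be more accurate than the direct summation used here.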

Then the scoring algorithm drops the protein with the lower Protein Score from each pair of target and its shuffled decoy sequence as suggested by Savitski et al.39 Finally, the remaining proteins are filtered to a given FDR threshold (typically, 1%) using the target-decoy approach.40

MS1 Search Results for in Silico Data

First, the MS1-only searches were performed for the in silico generated data sets described above. For the proposed search strategy, we evaluated the expected number of protein identifications for different parameters. The results of these evaluations are summarized in Figure 2 for all in silico data sets with three replicates for each set of parameters. For the experiments with varying RT accuracy, RT prediction training was turned off in the MS1 algorithm and the predefined set of retention coefficients used above for RT data generation was employed. Also, for the sake of clarity in the interpretation, Figure 2 shows the results obtained for the proteins from in silico generated database only, rather than all proteins identified in the searches. The number of identified proteins can decrease dramatically almost to zero when the number of “noise” peaks per “real” peptide becomes higher than 4 (Figure 2a) or if the standard deviation of mass accuracy is higher than 3 ppm


Figure 2. Theoretical dependence of the number of proteins identified at 1% protein FDR on (a) the number of noise peaks, (b) mass accuracy, (c) retention time accuracy, (d) sample complexity, and the enzyme used for (e) 2000 and (f) 100 proteins in the sample. The default parameters were three noise peaks, 0.33 ppm mass accuracy, 3 min RT accuracy, 2000 proteins in the sample and trypsin cleavage. Enzyme labels: T, trypsin; G, GluC; L, LysC; LG, LysC/GluC; TG, trypsin/GluC; Ac, ArgC; and An, AspN. Black lines show one standard deviation range.

MS1 versus MS/MS Database Search Strategies

(Figure 2b). Retention time accuracy has a linear effect on the number of identified proteins, as shown in Figure 2c. The effect of sample complexity is further demonstrated in Figure 2d: MS1 search identifies 60% of the proteins for simple samples (