MS/MS-Free Protein Identification in Complex Mixtures Using Multiple Enzymes with Complementary Specificity

Sep 14, 2017 - MS/MS-Free Protein Identification in Complex Mixtures Using Multiple Enzymes with Complementary Specificity. Mark V. Ivanov†‡, Irin...

Article

MS/MS-free protein identification in complex mixtures using multiple enzymes with complementary specificity Mark V. Ivanov, Irina A Tarasova, Lev I. Levitsky, Elizaveta M. Solovyeva, Marina L Pridatchenko, Anna A. Lobas, Julia A. Bubis, and Mikhail V Gorshkov J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00365 • Publication Date (Web): 14 Sep 2017 Downloaded from http://pubs.acs.org on September 28, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Journal of Proteome Research

MS/MS-free protein identification in complex mixtures using multiple enzymes with complementary specificity

Mark V. Ivanov1,2, Irina A. Tarasova1, Lev I. Levitsky1,2, Elizaveta M. Solovyeva1,2, Marina L. Pridatchenko1, Anna A. Lobas1,2, Julia A. Bubis1,2, Mikhail V. Gorshkov1,2*

1 V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, 38 Leninsky Pr., Bld. 2, Moscow 119334, Russia

2 Moscow Institute of Physics and Technology (State University), 9 Institutsky Per., Dolgoprudny 141700, Moscow region, Russia

*Correspondence to: Mikhail V. Gorshkov, Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, 38 Leninsky Pr., Bld. 2, 119334 Moscow, Russia E-mail: [email protected] Tel/Fax: +7-499-137-8257

ABSTRACT

In this work, we evaluate a workflow that employs a multi-enzyme digestion strategy for MS1-based protein identification in “shotgun” proteomic applications. In the proposed strategy, several cleavage reagents of different specificity are used for parallel digestion of the protein sample, followed by a search based on MS1 data and retention time (RT). Proof of principle was obtained using experimental data for the annotated 48-protein standard. Using the developed approach, up to 90% of proteins from the standard were unambiguously identified. The approach was further applied to HeLa proteome data. For a sample of this complexity, the proposed MS1-only strategy correctly identified up to 34% of all proteins identified using a standard MS/MS-based database search. It was also found that the results of the MS1-only search are independent of the chromatographic gradient time over a wide range of gradients, from 15 to 120 min. Rapid MS1-only proteome characterization can potentially be an alternative and/or a complement to MS/MS-based “shotgun” analyses in studies in which experimental time is more important than the depth of proteome coverage.

KEYWORDS: proteomics, database search, MS1-only search, peptide mass fingerprinting, shotgun proteomics, protein identification

INTRODUCTION

Shotgun proteomics is a widely used approach for qualitative and quantitative analysis of proteins in complex biological samples. Currently, two basic strategies for identification of proteins are known. The first strategy, developed in the early 1990s, is called peptide mass fingerprinting (PMF).1–6 In PMF, protein identification is based on the comparison of measured and theoretically calculated masses of peptides generated by enzymatic digestion of the protein sample. However, this strategy was found to work only for small databases and relatively simple mixtures, typically reduced to a few proteins. When the sample complexity grows, PMF search generates increasingly unreliable identifications, especially for relatively short proteins represented by a limited number of peptides.7 The approach based on accurate mass and retention time tags (AMT) significantly improved the capabilities of MS/MS-free protein identification.8 However, even for a mass accuracy of 1.0 ppm and a retention time prediction accuracy close to 1.0 min, the utility of the AMT approach for unambiguous protein identification was limited to relatively small proteomes.9 Currently, the method of choice for protein identification is based on tandem mass spectrometry (MS/MS).10,11 The eluting peptides are sequentially isolated and fragmented using various dissociation techniques to produce sequence-specific MS/MS spectra. These spectra are compared with theoretical ones generated from the applicable database, and the successful matches are ranked according to their probability scores. However, MS/MS-based proteome analysis suffers from a number of limitations. Firstly, it significantly complicates the identification of low-abundance proteins in high dynamic range samples.
For example, the reported dynamic range of proteome characterization using a state-of-the-art Orbitrap FTMS may not exceed 10⁴.12 Moreover, a number of studies performed for high dynamic range protein standards have shown a dynamic range below 10³ for confidently identified proteins.13 Secondly, only a small fraction of all peptide-like features detected in MS1 spectra is further selected for MS/MS and subsequent identification.14 This occurs
because isolation, accumulation, and fragmentation of precursor ions require prolonged time, especially for low-abundance peptides. Finally, these low-abundance peptides, even when selected for isolation and fragmentation, typically produce low-quality tandem mass spectra. Recently, interest in peptide mass fingerprinting and similar approaches was renewed as an addition to tandem mass spectrometry for increasing confidence in MS/MS-based peptide identifications.15,16 It has been shown that peptide feature matches, obtained by comparing the mass and retention time tags generated in silico with the ones from MS1 spectra, can be further employed for calculating protein probabilities using both fragmented and non-fragmented precursors.15 In another study, an approach for determining the most probable amino acid compositions and filtering sequence candidates based on neutron-encoded (NeuCode) mass signatures was implemented.16–19 One of the unique features of this approach is the possibility of peptide identification without MS/MS.16 In this work, we propose and explore an MS/MS-free method for protein identification based on parallel digestion of the analyzed sample using different proteases and/or chemical reagents, or their combinations. The main rationale behind this work is the development of a reliable MS1-only method for rapid identification of the major components in large numbers of clinical samples using ultra-fast HPLC gradients (below 15 min). We believe that this will allow the use of simpler and less expensive MS platforms without tandem MS capabilities (e.g., multi-path high-resolution TOFs). One of the assumptions behind this proposal is that MS1 spectra obtained for proteolytic peptides generated by different proteases bear some level of sequence-specific complementarity.
Indeed, it has long been known that using multiple proteases for digesting the same protein sample increases both the number of proteins identified from tandem mass spectra and their sequence coverage.20–22 Thus, we further hypothesize that combining MS1 spectra from multiple proteases may improve the efficiency of the standard mass fingerprinting approach. In addition, we add retention time prediction to the MS/MS-free protein identification algorithm to further enhance its efficiency. The basis for this combination is the sequence specificity of peptide elution times.23–26 Further, we compare the efficiency and attainable protein sequence coverage of
the proposed MS1-only protein identification method with standard proteome analyses based on tandem mass spectrometry.
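
The mass-matching step that underlies PMF-style searches, as described above, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`peptide_mass`, `ppm_error`, `match_features`), the example sequences, and the 10 ppm default tolerance are assumptions chosen for illustration; only the monoisotopic residue masses are standard values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[a] for a in seq) + WATER


def ppm_error(measured, theoretical):
    """Relative mass deviation in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def match_features(feature_masses, peptides, tol_ppm=10.0):
    """Return every (measured mass, peptide) pair within the tolerance.

    tol_ppm is a hypothetical default, not a value taken from the paper.
    """
    matches = []
    for m in feature_masses:
        for pep in peptides:
            if abs(ppm_error(m, peptide_mass(pep))) <= tol_ppm:
                matches.append((m, pep))
    return matches


# Hypothetical example: two sequence isomers cannot be told apart by mass.
candidates = ["ELVISK", "LEVISK", "PEPTIDEK"]
hits = match_features([peptide_mass("ELVISK")], candidates)
```

In this toy example a single measured mass matches both `ELVISK` and `LEVISK`, because peptides with the same composition have identical masses. This is the ambiguity that makes mass-only matching unreliable for complex samples and motivates adding retention time and multiple cleavage specificities.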

EXPERIMENTAL SECTION

Materials. Equimolar proteome standard UPS1 (Sigma-Aldrich, St. Louis, Missouri, USA) was dissolved in 50 mM ammonium bicarbonate (ABB) at a concentration of 0.1 µg/µl and digested with either trypsin (Promega, Madison, WI, USA), Lys-C (Promega, Madison, WI, USA) or Glu-C (Promega, Madison, WI, USA) alone, or with a mixture of Lys-C/Glu-C (ratio of 1:1, v/v) or trypsin/Glu-C (ratio of 1:1, v/v). The samples were reduced with 10 mM dithiothreitol (DTT), added in a ratio of 1:1 (v/v), and then incubated for 30 min at 60 °C. The reduced samples were cooled to room temperature and alkylated with iodoacetamide (IAA), added to a final concentration of 15 mM, by incubating for 30 min in the dark. The proteases were added to the samples at a ratio of 1:20 (protease/protein, w/w). Then, the samples were digested for 18 h at 37 °C. The digestion was stopped with formic acid. The total protein amount injected was 8.64 pmol. The Pierce HeLa Protein Digest Standard (Thermo Fisher Scientific, Waltham, MA, USA) was dissolved in Milli-Q water with 0.1% formic acid to a concentration of 1 µg/µl. The total protein amount injected was 1 µg.

LC-MS/MS. Shotgun proteome analyses of the UPS proteolytic digests were performed on an Orbitrap Q Exactive HF (Thermo Fisher Scientific, Waltham, MA, USA) coupled to an Ultimate 3000 RSLCnano system (Dionex, Sunnyvale, CA, USA). The samples were loaded onto a trap column (Acclaim PepMap, 2 cm × 75 µm i.d., C18, 3 µm, 100 Å) (Thermo Fisher Scientific, Waltham, MA, USA) at 2 µl/min. Separations were performed using an analytical column Zorbax 300SB-C18, 15 cm × 75 µm i.d., 3.5 µm particles (Agilent, Santa Clara, CA, USA). The mobile phase consisted of solvents A and B: (A) 100% water with 0.1% formic acid, and (B) 80% acetonitrile, 20% water with 0.1% formic acid. A linear gradient from 5% B to 40% B over 120 min at a flow rate of 300 nl/min was used for separations.
MS1 settings were as follows: mass range m/z 400–1500, resolving power of 60k at m/z 400, maximum injection time of 100 ms, and AGC of 5·10⁵. The peptide isolation window for MS/MS
was 2.0 Th, and high-energy collisional dissociation (HCD) was used for fragmentation. The MS/MS scan range was from 200 to 2000 Th and the dynamic exclusion time was 10.0 s. A resolving power of 15k at m/z 400, a maximum injection time of 100.0 ms, and an AGC of 10⁵ were used for MS/MS spectra. Shotgun proteome analyses of the Pierce HeLa digest standard were performed on an LTQ Orbitrap Velos (Thermo Fisher Scientific, Waltham, MA, USA) coupled to an Agilent 1100 HPLC System (Agilent, Santa Clara, CA, USA). The samples were loaded onto a trap column Zorbax 300SB-C18, 5 × 0.3 mm, 5 µm particles (Agilent, Santa Clara, CA, USA) at 4 µl/min. Separations were performed using an analytical column Zorbax 300SB-C18, 15 cm × 75 µm, 3.5 µm particles (Agilent, Santa Clara, CA, USA). Mobile phase A (100% water with 0.1% formic acid) and mobile phase B (80% acetonitrile, 20% water with 0.1% formic acid) were used to establish the 70 min gradient comprised of 2 min of 2–5% B, 21 min of 5–30% B, 5 min of 30–45% B, 2 min of 45–95% B and 10 min of 95% B, followed by re-equilibration at 2% B for 20 min. A flow rate of 300 nl/min was used for separations. MS1 settings were as follows: mass range m/z 300–1500, maximum injection time of 50 ms, and AGC of 4×10⁶. The peptide isolation window for MS/MS was 4.0 Th. Resolving power was varied from 15k to 100k. For the MS/MS experiment, the 10 and 20 most intense ions above a threshold of 5000 counts were selected for fragmentation. For collision-activated dissociation, the normalized collision energy was set to 35%. The MS/MS scan range was from 200 to 2000 Th and the dynamic exclusion time was 15.0 s. A resolving power of 7.5k, a maximum injection time of 250.0 ms, and an AGC of 4×10⁴ were used for MS/MS spectra. The mass spectrometry experiments were performed at the “Human Proteome” Core Facility at the Institute of Biomedical Chemistry (IBMC).

HeLa data.
Two publicly available data sets obtained for HeLa cell lysates in earlier studies were used in this work: 1. The “confetti” data set was obtained using separate proteases such as trypsin, Lys-C, Glu-C, Asp-N, Arg-C and protease mixtures such as Lys-C/Glu-C and trypsin/Glu-C.21 Raw data
were downloaded from www.proteomexchange.org (dataset identifier PXD000900). The single-shot runs were used in the study. 2. Another data set was obtained using different HPLC gradient times, varied in a range from 15 to 120 minutes.27 Raw data were downloaded from www.proteomexchange.org (dataset identifier PXD001695).

MS1 and MS/MS searches. Raw files were converted to MGF and mzML formats using msConvert from ProteoWizard.28 MS1 spectra in mzML format were processed for deisotoping and peak picking using the Dinosaur software.29 The algorithm of the proposed MS1-only search method is described in the Results and Discussion section below. For validation and evaluation of the efficiency of the proposed MS1 method, we performed standard MS/MS-based protein identification. Specifically, a database search against the Human SwissProt database was performed using X!Tandem, version 2012.10.01 Cyclone.30 The following parameters were used: precursor mass tolerance of 10 ppm, fragment mass tolerance of 0.02 Da, a maximum of 2 missed cleavages, fixed carbamidomethylation of cysteine, and potential oxidation of methionine. The Pepxmltk31 utility was used to convert X!Tandem output files to the standard pepXML format. This utility allows setting an arbitrary sequence cleavage specificity, contrary to the Tandem2XML converter typically employed for X!Tandem output conversion. The identifications were filtered to 1% FDR at the protein level and validated using MP score.23 Further processing and data analysis were performed in Python using Pyteomics.33 RT calculation was performed using ELUDE.34

In silico MS1 data generation. Proteins randomly selected from the SwissProt human database were digested in silico into peptides using a number of cleavage rules listed in Table 1. Theoretical peptides generated for a given cleavage specificity were further filtered to model the experimental data.
Firstly, the peptides generated in silico were restricted to a minimal sequence length of 6 residues and an m/z range of 300 to 1500 Th. We also assumed that a peptide carries one charge per six residues in the sequence. Further, we removed 90% of randomly selected peptides with one missed cleavage and all peptides with more than one missed cleavage. Then, we simulated a diversity of protein
abundances in real samples by drawing the percentage of observable peptides per protein from a normal distribution with a mean value of 70% and a standard deviation of 10%. This means that 68% of proteins have from 60 to 80% of detectable peptides (within ±1 standard deviation) and only about 4.6% of the protein population have less than 50% or more than 90% of detectable peptides. Finally, we removed 50% of the randomly selected remaining peptides to better match the in silico generated data set with the properties of experimental data sets obtained in a typical LC-MS/MS-based proteome analysis. Note that real MS data exhibit a significant presence of “noise” from unidentified peptide-like spectra originating from in-source fragmentation, artifact modifications, etc.14 Therefore, we added so-called “noise” peptides to the in silico data set after applying all the filtering described above. The sequences of “noise” peptides were generated by random selection from a list of 20 commonly occurring amino acid residues. The distributions of “noise” peptide sequences by length and number of missed cleavages were similar to the ones from the in silico data set. The neutral mass and the retention time (RT) for each peptide from the in silico generated data set, including “real” and “noise” peptides, were calculated using Pyteomics. Peptide retention times were calculated using retention coefficients determined for the experimental data set from a 60 min gradient HPLC-MS/MS analysis. To emulate a real experimental data set, normally distributed errors with the standard deviations shown in Table 2 were added to the masses and RTs of peptides from the in silico data set. The default values were 0.33 ppm for mass accuracy, 3 min for retention time accuracy, 3 noise peaks, trypsin cleavage, and 2000 proteins in the sample. We varied these parameters one by one while fixing the others to evaluate their effect on MS1 search efficiency.
Additionally, to reveal the dependence between the best enzyme and the sample complexity, different enzymes were tested separately for 100 and 2000 proteins in the sample.
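
The data-generation steps described in this section can be sketched roughly as follows. This is a simplified illustration and not the authors' code, which used Pyteomics and the rules of Table 1. The trypsin-like regular expression, the reading of "one charge per six residues" as `len // 6`, the per-peptide random thinning, the interpretation of "3 noise peaks" as three noise peptides per real peptide, and all function names are assumptions. Retention times are omitted.

```python
import random
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def neutral_mass(seq):
    return sum(RESIDUE_MASS[a] for a in seq) + WATER


def mz(seq):
    z = max(1, len(seq) // 6)  # assumed reading of "one charge per six residues"
    return (neutral_mass(seq) + z * PROTON) / z


def digest(protein, rule=r"(?<=[KR])(?!P)", max_missed=1):
    """Return (peptide, missed_cleavages) pairs; rule is an assumed trypsin regex.

    Splitting on a zero-width pattern requires Python 3.7 or newer.
    """
    parts = [p for p in re.split(rule, protein) if p]
    return [("".join(parts[i:i + mc + 1]), mc)
            for i in range(len(parts))
            for mc in range(max_missed + 1)
            if i + mc < len(parts)]


def simulate(proteins, rng, noise_per_real=3, ppm_sd=0.33):
    """Return (sequence, jittered neutral mass) pairs for real and noise peptides."""
    observed = []
    for prot in proteins:
        peps = [(s, mc) for s, mc in digest(prot)
                if len(s) >= 6 and 300.0 <= mz(s) <= 1500.0]
        # Keep fully cleaved peptides and roughly 10% of one-missed-cleavage ones.
        peps = [s for s, mc in peps if mc == 0 or rng.random() < 0.10]
        # Per-protein observable fraction drawn from N(0.70, 0.10), clipped to [0, 1].
        frac = min(1.0, max(0.0, rng.gauss(0.70, 0.10)))
        peps = [s for s in peps if rng.random() < frac]
        # Drop about half of what remains.
        observed.extend(s for s in peps if rng.random() < 0.5)
    # Noise peptides: random sequences with lengths borrowed from real peptides.
    alphabet = sorted(RESIDUE_MASS)
    noise = ["".join(rng.choice(alphabet) for _ in range(len(rng.choice(observed))))
             for _ in range(noise_per_real * len(observed))] if observed else []
    # Add a normally distributed mass error with standard deviation ppm_sd.
    return [(s, neutral_mass(s) * (1 + rng.gauss(0.0, ppm_sd) * 1e-6))
            for s in observed + noise]
```

A call such as `simulate(list_of_sequences, random.Random(0))` returns a flat list in which the real peptides come first and the noise peptides follow. Passing the random generator explicitly keeps replicate data sets reproducible.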

RESULTS AND DISCUSSION

MS1-based search and protein scoring algorithms. The proposed workflow for MS1-only database search and protein scoring is shown in Figure 1. A specific feature of the approach is the use of complementary information about peptide properties extracted from chromatographic and mass spectrometry data, as well as the residue specificity of the enzymes used for digestion. Peaks measured in the MS1 spectra are filtered by charge state and number of observed isotopes. After identifying the peptide-like features in the acquired data, a search against the protein database containing target and decoy protein sequences is performed to match experimental masses to the ones calculated for theoretical peptides (peptide-feature match, PFM). Importantly, contrary to the MS/MS-based search, which is typically independent of the way the decoy database was generated,35 the proposed MS1 strategy cannot work with reversed databases and requires shuffled sequences. This is because a significant number of reversed decoy peptides have the same m/z ratios as the target ones. Initial peptide-feature matching was performed with user-defined mass tolerances. The optimal values for these tolerances (standard deviation and systematic shift) were also calculated on the fly during the initial analysis. Note that the standard deviation and systematic shift of the differences between experimental and predicted retention times depend on the model used for RT prediction. An additive model with length correction was implemented in the proposed MS1 search algorithm by default. This is the simplest and most straightforward model, allowing re-training of the residue retention coefficients for each analyzed data set. However, the accuracy of this model is limited.23 A more advanced self-trained model, ELUDE, was also integrated in the workflow to improve the accuracy of RT prediction.
Another advantage of ELUDE is that it needs fewer reliably identified peptides for training to attain the maximum RT prediction accuracy than the additive model.36 In this study, we employed ELUDE for all data sets. Generally speaking, RT prediction for the MS1-only search can be performed using any existing peptide retention model or a database of experimental peptide retention times. While re-trainable models
can be used for arbitrary experimental parameters, they require reliable peptide identifications for a training set obtained for these parameters using MS/MS. Among the non-retrainable models, one can mention BioLCCC, which allows calculation of absolute retention times for the given separation conditions,37 as well as one of the most accurate RT prediction algorithms, SSRCalc.38 However, the former suffers from low RT prediction accuracy, which is crucial for MS1-only search efficiency, while the latter does not have a standalone version and is available through a web interface only, hindering its integration into the MS1 search software. In the next step of the search, all PFMs were filtered using a threshold determined by the standard deviation, σ, of the differences between predicted and experimental RTs. Thresholds of 1.3σ and 3.0σ were found optimal for complex (HeLa) and simple (UPS standard) data, respectively (Supplementary Table S1). PFMs with predicted RTs beyond these thresholds were discarded. After filtering, the remaining PFMs were assembled into proteins. Protein probabilities were calculated using a binomial model. In this model, the number of trials, n, was equal to the total number of theoretical peptides of the protein, the number of successes, k, was set to the number of identified peptides, and the success probability in each trial, p, corresponded to the probability of a theoretical peptide being randomly matched. Thus, the probability, P, of a protein being randomly matched in a search was calculated as follows:

$$P(k) = \frac{n!}{k!\,(n-k)!}\, p^{k}\, (1-p)^{n-k} \tag{1}$$

This calculation does not make any distinction between unique and shared peptides, using only the protein’s peptide identifications to calculate its probability. The success probability in each trial, p, in Eq.1 was calculated as the fraction of matched decoy peptides in the search space of all decoy peptides:

$$p = \frac{\text{number of unique decoy PFMs}}{\text{number of theoretical decoy peptides}} \tag{2}$$

For each protein, the survival function was calculated as follows:

$$Sf(k) = 1 - \sum_{i=0}^{k} P(i) \tag{3}$$

When data are generated for a number of sample aliquots digested using different enzymes or chemical reagents, the final probability of a protein to be randomly matched in the searches performed for all sample aliquots, m, can be calculated as follows:

$$Sf_{\mathrm{final}} = Sf_{1} \cdot Sf_{2} \cdot \ldots \cdot Sf_{m} \tag{4}$$

For more convenient representation we report the final protein score as:

$$\mathrm{ProteinScore} = -\log_{10}\left(Sf_{\mathrm{final}}\right) \tag{5}$$
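
As an illustration of how Eqs. 1–5 fit together, the sketch below computes a protein score from per-digest peptide counts. It is a minimal reading of the formulas under stated assumptions, not the published implementation: the function names and the `(n, k)` input layout are hypothetical, and the survival function is evaluated as the upper-tail sum, which is mathematically equal to Eq. 3 but avoids subtracting nearly equal numbers.

```python
from math import comb, log10  # math.comb requires Python 3.8 or newer


def binom_pmf(k, n, p):
    """Eq. 1: probability of exactly k random matches among n theoretical peptides."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def random_match_probability(unique_decoy_pfms, theoretical_decoy_peptides):
    """Eq. 2: fraction of decoy peptides that received a match."""
    return unique_decoy_pfms / theoretical_decoy_peptides


def survival(k, n, p):
    """Eq. 3: Sf(k) = 1 - sum_{i=0..k} P(i), computed as sum_{i=k+1..n} P(i)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1, n + 1))


def protein_score(digests, p_by_digest):
    """Eqs. 4 and 5 for one protein.

    digests is a list of (n, k) pairs, one per enzyme or reagent (hypothetical layout);
    p_by_digest holds the matching random-match probabilities from Eq. 2.
    """
    sf_final = 1.0
    for (n, k), p in zip(digests, p_by_digest):
        sf_final *= survival(k, n, p)
    # Sf is exactly zero when k == n under Eq. 3; report an infinite score then.
    return -log10(sf_final) if sf_final > 0 else float("inf")


# Hypothetical numbers: 40 theoretical peptides, 12 matched, p = 0.1 in each digest.
one_enzyme = protein_score([(40, 12)], [0.1])
two_enzymes = protein_score([(40, 12), (35, 10)], [0.1, 0.1])
```

Because each additional digest multiplies Sf_final by a value no larger than one, `two_enzymes` cannot be smaller than `one_enzyme`. For proteins with more than about a thousand theoretical peptides, the integer binomial coefficient may overflow when converted to a float, so a log-space evaluation or a library routine such as `scipy.stats.binom.sf` would likely be preferable.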

Then, the scoring algorithm drops the protein with the lower ProteinScore from each pair of a target and its shuffled decoy sequence, as suggested by Savitski et al.39 Finally, the remaining proteins are filtered to a given FDR threshold (typically, 1%) using the target-decoy approach.40

MS1 search results for in silico data. First, the MS1-only searches were performed for the in silico generated data sets described above. For the proposed search strategy, we evaluated the expected number of protein identifications for different parameters. The results of these evaluations are summarized in Figure 2 for all in silico data sets, with 3 replicates for each set of parameters. For the experiments with varying RT accuracy, RT prediction training was turned off in the MS1 algorithm and the predefined set of retention coefficients used above for RT data generation was employed. Also, for clarity of interpretation, Fig. 2 shows only the results obtained for the proteins from the in silico generated database, rather than all proteins identified in the searches. The number of identified proteins can decrease dramatically, almost to zero, when the number of “noise” peaks per “real” peptide becomes higher than 4 (Fig. 2a), or if the standard deviation of mass accuracy is higher than 3 ppm (Fig. 2b). Retention time accuracy has a linear effect on the number of identified proteins, as shown in Fig. 2c. The effect of sample complexity is further demonstrated in Fig. 2d: MS1 search identifies 60% of the proteins for simple samples (< 1000 proteins), 40% of the proteins for samples of moderate complexity (2000–3000 proteins), and below 20% for
complex samples containing more than 4000 proteins. The enzyme selection for the most efficient MS1-only analysis also depends on the sample complexity (Fig. 2e and Fig. 2f). Interestingly, the enzymes most efficient for complex mixtures (ArgC, AspN, and LysC) are predicted to be less efficient for simple mixtures. This can be explained by the fact that proteases which produce short peptides result in a large number of peptide peaks in MS1 spectra. This significantly increases the probability of random peptide matches for complex samples, which has an adverse effect on the respective protein score. Note also that, on average, the MS1 search algorithm reported ~10% of proteins which were not used for generation of the in silico digest. However, most of these proteins are not false identifications, but homologous proteins sharing common peptides with the proteins from the list. For example, 957 identified proteins were reported for one of the data sets and only 868 of them came from the database of digested proteins. However, among the 89 "false" identifications, only 20 do not share peptides with digested proteins.

MS1 vs. MS/MS database search strategies. To evaluate the utility of the proposed MS1-only method for protein identification, we compared its efficiency with the standard MS/MS-based proteome analysis of the UPS1 protein digests generated using several proteases. The comparison of MS1-only and MS/MS-based strategies was further extended by using publicly available data for HeLa digests.21 In this comparison, the precursor mass tolerance for MS/MS was set to 10 ppm, while accuracies of 0.75 to 3 ppm were used for MS1 data after mass recalibration using the Dinosaur software. In fact, the optimal values for precursor mass tolerance are different in MS1-only and MS/MS searches, especially after recalibration with Dinosaur.
Also note that using higher precursor mass accuracies (by up to 3 orders of magnitude) for MS/MS searches only marginally changes the number of identifications (by a few percent).42 In the case of MS1-only searches, an excessively loose precursor mass tolerance window negatively affected the search results. Therefore, we used the optimal values for precursor tolerance to achieve better identification for both strategies. Protein inference was not performed for either MS/MS or MS1-only analyses. For comparison, the number of individually identified proteins was used instead of protein groups.
The main reason for this is that the MS/MS results have almost no false identifications at the peptide level. Thus, counting protein groups is straightforward and fair. On the contrary, the MS1-only results are characterized by a large number of false identifications at the peptide level. This leads to the number of protein groups being close to the number of individually identified proteins. Therefore, a direct comparison of the numbers of individual proteins is fairer.

UPS1 results. A comparison of MS1 and MS/MS strategies for the UPS1 samples is shown in Figure 3. Depending on the protease type, the MS1 searches reported from 40% to 80% of the UPS1 proteins identified using the MS/MS approach. The best MS1 search results (80% of the proteins identified by MS/MS searches) were obtained for the LysC endoprotease used alone. Combining the results for all proteases and all samples yielded 43 and 48 UPS1 proteins for the MS1-only and MS/MS-based searches, respectively. Thus, the proposed MS1-only method based on parallel digestion of relatively small protein mixtures by several enzymes provides a protein identification efficiency comparable with that of MS/MS-based proteome analysis. The results obtained for the experimental UPS1 data set show a significant dependence on the choice of the protease. This is not consistent with the results of in silico tests for a simple mixture (Fig. 2f), where similar efficiency is predicted for all proteases. Possible explanations for this discrepancy are differences in protease specificity, efficiency of proteolysis, peptide ionization, and/or the presence of chemical modifications occurring during sample preparation. This is marginally supported by the lower number of matched PFMs for the UPS1/GluC data (Table 3) and, consequently, of identified peptides (Table 4).
Results obtained for MS1-only searches in the simulations have also revealed that the expected number of MS1-only protein identifications depends strongly on the peptide-like “noise” in the spectra. We further validated this observation using experimental data collected for the UPS1 standard. In particular, we estimated this effect by calculating the ratio of the number of peptide-feature matches for non-UPS1 theoretical peptides to the number of PFMs for UPS1 theoretical peptides (Table 3). The number of peptide-like “noise” peaks per UPS1 peptide was found to be in the range of 4.4 to 6.4, further supporting the approximations used for generation of the in silico data sets. However, we assume that this ratio can be higher by an order of magnitude for more complex protein mixtures.

HeLa results. Figure 4 shows the comparison between the MS/MS-based and MS1-only strategies for protein identification using the data obtained previously for multiple-protease mapping of the HeLa proteome. Surprisingly, the lowest number of proteins was found for the Trypsin/GluC data, and the best results were obtained for the GluC and AspN proteases (22% and 26% of MS/MS-based protein identifications, respectively). Combining the results for all individually used proteases, the MS1 search produced 30% of the proteins identified with MS/MS (1268 and 4329 human proteins identified in MS1-only and MS/MS-based searches, respectively).

Complementarity of MS/MS and MS1-only results. The MS1-only approach produces a number of protein identifications additional to those obtained using MS/MS analysis. The Venn diagrams in Figure 5 show that the MS1-only results provide 0.5% to 12% additional proteins for the combined strategy.

We further evaluated the protein sequence coverage by peptides identified using the proposed MS1-only and MS/MS-based methods. First, the number of one-hit wonders among the MS/MS-based identifications was analyzed for all data sets. The proportion of these one-hit wonders varied from 14% to 74% of all MS/MS-based protein identifications for the HeLa data sets. For example, almost 50% of the MS/MS-based protein identifications obtained for the tryptic HeLa digest have one peptide match per protein, as shown in Fig. 6a. In contrast, the minimal number of peptide matches per protein obtained with the MS1-only method was 8 for the same data. On average, the MS1-only searches resulted in five times more peptide matches per protein than the MS/MS-based method. Indeed, the sequence specificity of a fragmentation spectrum allows reliable protein identification from a single successfully fragmented peptide of at least 7-10 amino acids in length. Importantly, the proposed MS1-only method guarantees 1% FDR at the protein level only, while the FDR may be significantly higher at the peptide level. For the data presented in Figure 6a, a peptide-level FDR of 33% was estimated by calculating the ratio of decoy to target peptides. Thus, the MS1-only results provide only 17 true peptide matches per protein on average. Because of the high peptide-level FDR, the MS1-only search requires at least 10 peptide matches to obtain a protein score sufficient for successful identification. Further advances in the accuracy of m/z measurement and, especially, retention time prediction should reduce the peptide-level FDR. Significant improvements are also expected from the addition of orthogonal peptide descriptors, such as pI (43) or N-terminal amino acid information (44).

To test the MS1-only method for the intuitively expected biases towards high-abundance proteins and proteins with long sequences, the relative abundances of proteins identified using both MS1-only and MS/MS strategies were estimated using the NSAF label-free quantitation algorithm (45). Note that while NSAF provides only a rough estimate of relative protein concentrations based on the fragment intensities in MS/MS spectra, it clearly shows the bias of the MS1-only method towards abundant proteins (Fig. 6b). This result is expected because the fraction of detectable peptides among the theoretical ones increases with protein abundance and strongly affects the scoring determined by Eq. 1. On the other hand, the absolute number of matched peptides, while dependent on the protein length, does not affect the scoring. This counterintuitive result is supported by the distribution of identified proteins by the number of theoretical peptides shown in Fig. 6c. In summary, the above analysis shows that the bias towards long proteins for the MS1-only approach is similar to that of the MS/MS-based approach.

Shuffled decoy database bias. The numbers of target and decoy matches were calculated for the Confetti data set obtained for the trypsin digestion. These calculations were performed to reveal a possible bias towards either target or decoy sequences. Mass measurement accuracies of 0.75, 10, and 100 ppm were used for the evaluation.
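The target-decoy bookkeeping described above can be illustrated with a minimal Python sketch. It is not the ms1searchpy implementation: the function names, the shuffling procedure, and the toy masses are assumptions made for illustration only. The sketch builds a shuffled decoy, matches theoretical masses to observed ones within a ppm tolerance, and estimates the error rate as the ratio of decoy to target matches.

```python
import random


def shuffled_decoy(sequence, seed=0):
    """Return a decoy made by shuffling the residues of a target sequence."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)  # seeded so the decoy is reproducible
    return "".join(residues)


def within_ppm(observed, theoretical, tolerance_ppm):
    """Check whether two masses agree within a relative tolerance in ppm."""
    return abs(observed - theoretical) / theoretical * 1e6 <= tolerance_ppm


def count_matches(observed_masses, theoretical_masses, tolerance_ppm):
    """Count theoretical masses matched by at least one observed mass."""
    return sum(
        any(within_ppm(obs, theo, tolerance_ppm) for obs in observed_masses)
        for theo in theoretical_masses
    )


def decoy_to_target_ratio(n_decoy, n_target):
    """Estimate the error rate as the ratio of decoy to target matches."""
    if n_target == 0:
        raise ValueError("no target matches")
    return n_decoy / n_target


# Toy masses in Da (hypothetical values, not taken from the study).
OBSERVED = [1000.0000, 1500.0000]
TARGETS = [1000.0005, 1500.0100, 2000.0000]
DECOYS = [1000.0500, 1800.0000]

for tol in (0.75, 10, 100):  # the three tolerances named in the text
    n_target = count_matches(OBSERVED, TARGETS, tol)
    n_decoy = count_matches(OBSERVED, DECOYS, tol)
    print(tol, n_target, n_decoy, decoy_to_target_ratio(n_decoy, n_target))
```

In this toy example, widening the tolerance admits more random matches for both targets and decoys, which is the effect the comparison at 0.75, 10, and 100 ppm is designed to expose.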
We found that the average number of decoy matches is higher than the average number of target matches (Table 5) for the 10 and 100 ppm mass tolerances. This observation means that the proposed method provides a more conservative error rate estimate than expected. This happens because some of the target proteins (homologs, isoforms, etc.) share parts of their sequences, while their shuffled decoys have no shared parts and are thus represented exclusively by unique peptides. The results obtained for the 0.75 ppm mass accuracy exhibit fewer decoy than target matches because of the high proportion of true target matches among all targets (true and false ones) within this small mass window.

Effect of gradient time. It is known that an increase in gradient time improves the sensitivity of MS/MS-based proteome analyses and results in more protein identifications (23, 46). We studied the effect of gradient time on the performance of the MS1-only search strategy. HeLa data obtained for different LC gradient times in the range from 15 to 120 minutes (27) were used, and the results of the MS/MS-based and MS1-only searches are shown in Figure 7. While the number of proteins identified using the standard MS/MS-based search increases with gradient time, one of the interesting features of the MS1-only search is its independence of the gradient time. Consequently, the efficiencies of the two methods become closer for shorter gradients. The slight decrease in the number of identified proteins for the 15-min gradient can be explained by a drop in mass measurement accuracy for MS1 spectra. The mass measurement error increased from 0.8 ppm to 1.4 ppm, most probably due to the space charge effect in the Orbitrap ion trap caused by a higher density of trapped ions across the measured mass range (47). Alternatively, the decrease in mass measurement accuracy can be explained by an insufficient number of MS1 spectra for calculation of the exact masses of peptide features by the Dinosaur software (29). As an example, consider the peptide features of the peptide GVVPLAGTNGETTTQGLDGLSER reliably identified in MS1 spectra (Supplementary Table S2). These peptide features were detectable in only 18 scans for the 15-min gradient data set, while for the other data sets this number varied from 27 to 63 for the same peptide.
Note also that the data were acquired using typical LC-MS/MS runs with experimental settings optimized for the MS/MS-based workflow. However, the MS1-only analysis can be performed without running the instrument in MS/MS mode. In this case, the number of MS1 spectra acquired for the same sample will increase significantly, which should solve the problem of lower mass calibration accuracy for short gradients.
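The link between the number of MS1 scans and the mass accuracy of a peptide feature can be illustrated with a simple statistical sketch. It assumes, as a simplification that is not stated in the text and does not describe the Dinosaur algorithm, that the feature mass is the mean of per-scan measurements with independent random errors, so that its uncertainty scales as the inverse square root of the number of scans. The per-scan error of 3 ppm and the function names are hypothetical; the scan counts of 18 and 63 are those quoted for the example peptide.

```python
import math


def ppm_error(measured_mz, true_mz):
    """Signed relative mass error in parts per million."""
    return (measured_mz - true_mz) / true_mz * 1e6


def feature_uncertainty_ppm(per_scan_sd_ppm, n_scans):
    """Standard error of the mean mass over n_scans measurements.

    Assumes independent, identically distributed per-scan errors.
    """
    if n_scans < 1:
        raise ValueError("at least one scan is required")
    return per_scan_sd_ppm / math.sqrt(n_scans)


PER_SCAN_SD_PPM = 3.0  # hypothetical single-scan random error

for n_scans in (18, 63):  # scan counts quoted for the example peptide
    print(n_scans, round(feature_uncertainty_ppm(PER_SCAN_SD_PPM, n_scans), 3))
```

Under these assumptions, going from 18 to 63 scans reduces the random component of the mass error by a factor of sqrt(63/18), roughly 1.9. Systematic effects such as space charge are not captured by this model.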


MS1-only experimental data. Employing tandem mass spectrometry data of the kind typically used in shotgun proteomics to evaluate the MS1-only approach may not be optimal, since the acquisitions are tuned for collection of MS/MS rather than MS1 spectra. Therefore, a series of experiments was performed on the Orbitrap Velos FTMS instrument for the HeLa protein standard using both standard MS/MS and MS1-only runs. The MS1-only data were obtained with mass resolutions varying from 15,000 to 100,000. MS/MS analysis used a top-10 acquisition method with 60,000 resolution at the MS1 level. The average number of identified proteins is shown in Figure 8a. The best results for the MS1-only experiments were obtained at a mass resolution of 60,000. Higher resolutions were less efficient because of the insufficient number of MS1 scans for calculating the exact masses of peptide features. However, we believe that the optimal mass resolution settings are specific to the mass analyzer and/or chromatographic system. Fig. 8b shows the Venn diagram for proteins identified in MS/MS and MS1-only runs at 60,000 resolution. As demonstrated in the figure, the MS1-only analysis provides 149 additional proteins to the 505 proteins identified using MS/MS. In general, increasing the number of MS1 scans at the expense of MS/MS events increased the number of protein identifications for the MS1-only approach by a factor of 2.75. In this case, the number of MS1-only identifications becomes comparable with the MS/MS results if the one-hit wonders are excluded.

Software. The MS1-only search workflow used in this study was implemented as command-line, open-source software called ms1searchpy, available at https://bitbucket.org/markmipt/ms1searchpy. Typical processing time for a single mzML file (Trypsin+GluC UPS data set, 43,000 MS1 spectra) is 4 minutes on a 6-core Intel(R) Core(TM) i7 CPU.

CONCLUSIONS

An MS/MS-free method for protein identification in complex mixtures has been proposed and evaluated. It combines parallel digestion of the analyzed mixture using multiple proteases, prediction of peptide retention times, and accurate peptide mass measurements. Combining multiple proteases significantly increases the number of protein identifications, but the approach can also be used in single-protease experiments when only the abundant components of the proteome are targeted. We found that the method has no bias towards large proteins, yet, as expected, it favors proteins of higher abundance compared with the standard MS/MS-based analyses. The proposed method is still inferior to MS/MS-based proteome characterization for complex mixtures with typically employed experimental settings that include long chromatographic separations. However, its relative efficiency becomes comparable for rapid analyses of the abundant components of proteomes using ultra-short gradients. The ability to use the method for short gradients with similar or better efficiency compared with MS/MS can potentially be an interesting feature for applications requiring high-throughput analyses of clinical samples using relatively simple and inexpensive LC-MS setups. The efficiency of the method can be further improved if the analyzed mixture is digested with cleavage agents of high specificity, generating so-called “middle-down” peptides. We also believe that the method will benefit significantly from progress in retention time prediction, as well as from integration of pI values and/or other experimentally determined peptide features into the MS1 search algorithm.

Supporting information

The following files are available free of charge at the ACS website http://pubs.acs.org:

SupplementaryTable1S.xls: Detailed search parameters and results for all used data sets.

SupplementaryTable2S.xls: Spectral information for peptide GVVPLAGTNGETTTQGLDGLSER in HeLa data for different LC gradient times.

ACKNOWLEDGEMENTS

This study was supported by the Russian Science Foundation (project #14-1400971). The mass-spectrometry experiments were performed at the “Human Proteome” Core Facility, Institute of Biomedical Chemistry (IBMC). The authors thank Prof. Victor G. Zgoda for help with experiments and useful discussions of the results.

REFERENCES

(1) Clauser, K. R.; Baker, P.; Burlingame, A. L. Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem. 1999, 71 (14), 2871–2882.

(2) Henzel, W. J.; Billeci, T. M.; Stults, J. T.; Wong, S. C.; Grimley, C.; Watanabe, C. Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases. Proc. Natl. Acad. Sci. U. S. A. 1993, 90 (11), 5011–5015.

(3) James, P.; Quadroni, M.; Carafoli, E.; Gonnet, G. Protein identification by mass profile fingerprinting. Biochem. Biophys. Res. Commun. 1993, 195 (1), 58–64.

(4) Mann, M.; Højrup, P.; Roepstorff, P. Use of mass spectrometric molecular weight information to identify proteins in sequence databases. Biol. Mass Spectrom. 1993, 22 (6), 338–345.

(5) Pappin, D. J.; Højrup, P.; Bleasby, A. J. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 1993, 3 (6), 327–332.

(6) Yates, J. R.; Speicher, S.; Griffin, P. R.; Hunkapiller, T. Peptide mass maps: a highly informative approach to protein identification. Anal. Biochem. 1993, 214 (2), 397–408.

(7) Cottrell, J. S. Protein identification by peptide mass fingerprinting. Pept. Res. 1994, 7 (3), 115–124.

(8) Conrads, T. P.; Anderson, G. A.; Veenstra, T. D.; Pasa-Tolić, L.; Smith, R. D. Utility of accurate mass tags for proteome-wide protein identification. Anal. Chem. 2000, 72 (14), 3349–3354.

(9) Smith, R. D.; Anderson, G. A.; Lipton, M. S.; Pasa-Tolić, L.; Shen, Y.; Conrads, T. P.; Veenstra, T. D.; Udseth, H. R. An accurate mass tag strategy for quantitative and high-throughput proteome measurements. Proteomics 2002, 2 (5), 513–523.

(10) Edwards, N. J. Protein identification from tandem mass spectra by database searching. Methods Mol. Biol. 2011, 694, 119–138.


(11) Nesvizhskii, A. I. Protein identification by tandem mass spectrometry and sequence database searching. Methods Mol. Biol. 2007, 367, 87–119.

(12) Makarov, A.; Denisov, E.; Lange, O.; Horning, S. Dynamic range of mass accuracy in LTQ Orbitrap hybrid mass spectrometer. J. Am. Soc. Mass Spectrom. 2006, 17 (7), 977–982.

(13) Geiger, T.; Cox, J.; Mann, M. Proteomics on an Orbitrap benchtop mass spectrometer using all-ion fragmentation. Mol. Cell Proteomics 2010, 9 (10), 2252–2261.

(14) Michalski, A.; Cox, J.; Mann, M. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS. J. Proteome Res. 2011, 10 (4), 1785–1793.

(15) Moruz, L.; Hoopmann, M. R.; Rosenlund, M.; Granholm, V.; Moritz, R. L.; Käll, L. Mass Fingerprinting of Complex Mixtures: Protein Inference from High-Resolution Peptide Masses and Predicted Retention Times. J. Proteome Res. 2013, 12 (12), 5730–5741.

(16) Rose, C. M.; Merrill, A. E.; Bailey, D. J.; Hebert, A. S.; Westphall, M. S.; Coon, J. J. Neutron encoded labeling for peptide identification. Anal. Chem. 2013, 85 (10), 5129–5137.

(17) Hebert, A. S.; Merrill, A. E.; Bailey, D. J.; Still, A. J.; Westphall, M. S.; Strieter, E. R.; Pagliarini, D. J.; Coon, J. J. Neutron-encoded mass signatures for multiplexed proteome quantification. Nat. Methods 2013, 10 (4), 332–334.

(18) Hebert, A. S.; Merrill, A. E.; Stefely, J. A.; Bailey, D. J.; Wenger, C. D.; Westphall, M. S.; Pagliarini, D. J.; Coon, J. J. Amine-reactive neutron-encoded labels for highly plexed proteomic quantitation. Mol. Cell Proteomics 2013, 12 (11), 3360–3369.

(19) Merrill, A. E.; Hebert, A. S.; MacGilvray, M. E.; Rose, C. M.; Bailey, D. J.; Bradley, J. C.; Wood, W. W.; El Masri, M.; Westphall, M. S.; Gasch, A. P.; et al. NeuCode labels for relative protein quantification. Mol. Cell Proteomics 2014, 13 (9), 2503–2512.

(20) Tsiatsiani, L.; Heck, A. J. R. Proteomics beyond trypsin. FEBS J. 2015, 282 (14), 2612–2626.


(21) Guo, X.; Trudgian, D. C.; Lemoff, A.; Yadavalli, S.; Mirzaei, H. Confetti: A Multiprotease Map of the HeLa Proteome for Comprehensive Proteomics. Mol. Cell Proteomics 2014, 13 (6), 1573–1584.

(22) Leitner, A.; Reischl, R.; Walzthoeni, T.; Herzog, F.; Bohn, S.; Förster, F.; Aebersold, R. Expanding the chemical cross-linking toolbox by the use of multiple proteases and enrichment by size exclusion chromatography. Mol. Cell Proteomics 2012, 11 (3), M111.014126.

(23) Tarasova, I. A.; Masselon, C. D.; Gorshkov, A. V.; Gorshkov, M. V. Predictive chromatography of peptides and proteins as a complementary tool for proteomics. Analyst 2016, 141 (16), 4816–4832.

(24) Moruz, L.; Staes, A.; Foster, J. M.; Hatzou, M.; Timmerman, E.; Martens, L.; Käll, L. Chromatographic retention time prediction for posttranslationally modified peptides. Proteomics 2012, 12 (8), 1151–1159.

(25) Baczek, T.; Kaliszan, R. Predictions of peptides’ retention times in reversed-phase liquid chromatography as a new supportive tool to improve protein identification in proteomics. Proteomics 2009, 9 (4), 835–847.

(26) Shao, C. Applications of peptide retention time in proteomic data analysis. Adv. Exp. Med. Biol. 2015, 845, 67–75.

(27) Hosp, F.; Scheltema, R. A.; Eberl, H. C.; Kulak, N. A.; Keilhauer, E. C.; Mayr, K.; Mann, M. A Double-Barrel Liquid Chromatography-Tandem Mass Spectrometry (LC-MS/MS) System to Quantify 96 Interactomes per Day. Mol. Cell Proteomics 2015, 14 (7), 2030–2041.

(28) Chambers, M. C.; Maclean, B.; Burke, R.; Amodei, D.; Ruderman, D. L.; Neumann, S.; Gatto, L.; Fischer, B.; Pratt, B.; Egertson, J.; et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 2012, 30 (10), 918–920.

(29) Teleman, J.; Chawade, A.; Sandin, M.; Levander, F.; Malmström, J. Dinosaur: A Refined Open-Source Peptide MS Feature Detector. J. Proteome Res. 2016, 15 (7), 2143–2151.


(30) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–1467.

(31) Ivanov, M. V.; Levitsky, L. I.; Tarasova, I. A.; Gorshkov, M. V. Pepxmltk--a format converter for peptide identification results obtained from tandem mass spectrometry data using X!Tandem search engine. J. Anal. Chem. 2015, 70 (13), 1598–1599.

(32) Ivanov, M. V.; Levitsky, L. I.; Lobas, A. A.; Panic, T.; Laskay, Ü. A.; Mitulovic, G.; Schmid, R.; Pridatchenko, M. L.; Tsybin, Y. O.; Gorshkov, M. V. Empirical multidimensional space for scoring peptide spectrum matches in shotgun proteomics. J. Proteome Res. 2014, 13 (4), 1911–1920.

(33) Goloborodko, A. A.; Levitsky, L. I.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics--a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J. Am. Soc. Mass Spectrom. 2013, 24 (2), 301–304.

(34) Moruz, L.; Tomazela, D.; Käll, L. Training, selection, and robust calibration of retention time models for targeted proteomics. J. Proteome Res. 2010, 9 (10), 5209–5216.

(35) Gupta, N.; Bandeira, N.; Keich, U.; Pevzner, P. A. Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 2011, 22 (7), 1111–1120.

(36) Baczek, T.; Wiczling, P.; Marszałł, M.; Heyden, Y. V.; Kaliszan, R. Prediction of peptide retention at different HPLC conditions from multiple linear regression models. J. Proteome Res. 2005, 4 (2), 555–563.

(37) Perlova, T. Y.; Goloborodko, A. A.; Margolin, Y.; Pridatchenko, M. L.; Tarasova, I. A.; Gorshkov, A. V.; Moskovets, E.; Ivanov, A. R.; Gorshkov, M. V. Retention time prediction using the model of liquid chromatography of biomacromolecules at critical conditions in LC-MS phosphopeptide analysis. Proteomics 2010, 10 (19), 3458–3468.

(38) Krokhin, O. V.; Spicer, V. Predicting peptide retention times for proteomics. Curr. Protoc. Bioinformatics 2010, Chapter 13, Unit 13.14.


(39) Savitski, M. M.; Wilhelm, M.; Hahne, H.; Kuster, B.; Bantscheff, M. A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets. Mol. Cell Proteomics 2015, 14 (9), 2394–2404.

(40) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207–214.

(41) Laskay, Ü. A.; Lobas, A. A.; Srzentić, K.; Gorshkov, M. V.; Tsybin, Y. O. Proteome digestion specificity analysis for rational design of extended bottom-up and middle-down proteomics experiments. J. Proteome Res. 2013, 12 (12), 5558–5569.

(42) Ivanov, M. V.; Levitsky, L. I.; Lobas, A. A.; Tarasova, I. A.; Pridatchenko, M. L.; Zgoda, V. G.; Moshkovskii, S. A.; Mitulovic, G.; Gorshkov, M. V. Peptide identification in “shotgun” proteomics using tandem mass spectrometry: Comparison of search engine algorithms. J. Anal. Chem. 2015, 70 (14), 1614–1619.

(43) Chingin, K.; Astorga-Wells, J.; Pirmoradian Najafabadi, M.; Lavold, T.; Zubarev, R. A. Separation of polypeptides by isoelectric point focusing in electrospray-friendly solution using a multiple-junction capillary fractionator. Anal. Chem. 2012, 84 (15), 6856–6862.

(44) Lobas, A. A.; Verenchikov, A. N.; Goloborodko, A. A.; Levitsky, L. I.; Gorshkov, M. V. Combination of Edman degradation of peptides with liquid chromatography/mass spectrometry workflow for peptide identification in bottom-up proteomics. Rapid Commun. Mass Spectrom. 2013, 27 (3), 391–400.

(45) Zybailov, B.; Mosley, A. L.; Sardiu, M. E.; Coleman, M. K.; Florens, L.; Washburn, M. P. Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae. J. Proteome Res. 2006, 5 (9), 2339–2347.

(46) Köcher, T.; Pichler, P.; Swart, R.; Mechtler, K. Analysis of protein mixtures from whole-cell extracts by single-run nanoLC-MS/MS using ultralong gradients. Nat. Protoc. 2012, 7 (5), 882–890.

(47) Gorshkov, M. V.; Good, D. M.; Lyutvinskiy, Y.; Yang, H.; Zubarev, R. A. Calibration function for the Orbitrap FTMS accounting for the space charge effect. J. Am. Soc. Mass Spectrom. 2010, 21 (11), 1846–1851.


Figure captions

Figure 1. General workflow for MS1-only protein search implemented in this study.

Figure 2. Theoretical dependence of the number of proteins identified at 1% protein FDR on the number of noise peaks (a), mass accuracy (b), retention time accuracy (c), sample complexity (d), and enzyme for 2000 (e) and 100 (f) proteins in the sample. The default parameters were 3 noise peaks, 0.33 ppm mass accuracy, 3 min RT accuracy, 2000 proteins in the sample, and trypsin cleavage. Enzyme labels: T - trypsin, G - Glu-C, L - Lys-C, LG - Lys-C/Glu-C, TG - trypsin/Glu-C, Ac - ArgC, and An - AspN. Black lines show the range of 1 standard deviation.

Figure 3. Number of proteins identified in experimental data for UPS1 samples at 1% protein FDR. Red and blue bars correspond to MS1 and MS/MS search results, respectively. Label legends: T - trypsin, G - GluC, L - LysC, L+G - LysC/GluC mixture, T+G - trypsin/GluC mixture, union - combination of all search results together.

Figure 4. Number of proteins identified for “Confetti” HeLa data (1% protein FDR). Red and blue bars correspond to MS1-only and MS/MS-based searches, respectively. Label legends: T - trypsin, G - GluC, L - LysC, L+G - LysC/GluC mixture, T+G - trypsin/GluC mixture, Ac - ArgC, An - AspN, union - combination of all search results for all proteases and their mixtures.

Figure 5. Venn diagrams for proteins identified in MS1-only and MS/MS methods for “Confetti” HeLa data (1% protein FDR).


Figure 6. Number of matched peptides per protein (a), logarithm of the Normalized Spectral Abundance Factor (b), and number of theoretical peptides for identified proteins (c) for MS/MS-based (blue) and MS1-only (red) searches. For panels a and c, the total number of proteins is normalized to 1. Panels a and b show the proteins identified in both MS/MS and MS1-only analyses. For panel b, a higher value of LOG10(NSAF) (closer to -1) means a higher protein concentration. Experimental data were obtained for the HeLa tryptic digest.

Figure 7. Number of proteins identified in HeLa data obtained in previous studies using different LC gradients. Protein identifications were filtered to 1% FDR using the target-decoy approach. Red and blue bars correspond to MS1-only and MS/MS-based searches, respectively.

Figure 8. Results of identification for HeLa data acquired using the Orbitrap Velos in MS1-only and MS/MS modes. (a) Number of proteins identified using different experimental parameters; 15k, 30k, 60k, and 100k correspond to the resolution in MS1 scans. (b) Venn diagram for the proteins identified in MS1-only and MS/MS experiments.

Table 1. Enzymes used for in silico experiment.

Table 2. Parameters used for generation of in silico data.

Table 3. Number of PFM and “noise” peaks for UPS MS1 data.

Table 4. Fraction of matched UPS peptides in experimental MS1 data.

Table 5. Number of target and decoy peptide matches for different mass accuracies used for MS1-only search of Confetti trypsin data.

25 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

For TOC only.

26 ACS Paragon Plus Environment

Page 26 of 40

Page 27 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 1. General workflow for MS1-only protein search implemented in this study.

27 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2. Theoretical dependence of the number of proteins identified at 1% protein FDR for different number of noise peaks on (a), mass accuracies (b), retention time accuracy (c), sample complexity (d), and enzymes for 2000 (e) and 100 (f) proteins in the sample. The default parameters were 3 noise peaks, 0.33 ppm mass accuracy, 3 min RT accuracy, 2000 proteins in the sample and trypsin cleavage. Enzyme labels: T - trypsin, G - Glu-C, L - Lys-C, LG - Lys-C/Glu-C, TG - trypsin/Glu-C, Ac - ArgC and An - AspN. Black lines show 1 standard deviation range.

28 ACS Paragon Plus Environment

Page 28 of 40

Page 29 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 3. Number of proteins identified in experimental data for UPS1 samples at 1% protein FDR. Red and blue bars correspond to MS1 and MS/MS search results, respectively. Label legends: T - trypsin, G - GluC, L - LysC, L+G - LysC/GluC mixture, T+G - trypsin/GluC mixture, union - combination of all search results together.

29 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 4. Number of proteins identified for “Confetti” HeLa data (1% protein FDR). Red and blue bars correspond to MS1-only and MS/MS-based searches, respectively. Label legends: T - trypsin, G - GluC, L - LysC, L+G - LysC/GluC mixture, T+G - trypsin/GluC mixture, Ac ArgC, An - AspN, union - combination of all search results for all proteases and their mixtures.

30 ACS Paragon Plus Environment

Page 30 of 40

Page 31 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 5. Venn diagrams for proteins identified in MS1-only and MS/MS methods for “Confetti” HeLa data (1% protein FDR).

31 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 6. Number of matched peptides per protein (a), Logarithm of Normilized Spectral Abundace Factor (b) and number of theoretical peptides for identified proteins (c) for MS/MS-based (blue) and MS1only (red) searches. For figures a and c, the total number of proteins is normalized to 1. Figures a and b show the proteins identified in both MS/MS and MS1-only analyses. For figure b the higher value (closer to -1) of LOG10(NSAF) means higher concentration of proteins. Experimental data were obtained for HeLa tryptic digest.

32 ACS Paragon Plus Environment

Page 32 of 40

Page 33 of 40

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 7. Number of proteins identified in HeLa data obtained from the previous studies using different LC gradients. Protein identifications were filtered to 1% FDR using targetdecoy approach. Red and blue bars correspond to MS1-only and MS/MS-based searches, respectively.


Figure 8. Identification results for HeLa data acquired on an Orbitrap Velos in MS1-only and MS/MS modes. (a) Number of proteins identified using different experimental parameters; 15k, 30k, 60k and 100k denote the resolution of MS1 scans. (b) Venn diagram of the proteins identified in MS1-only and MS/MS experiments.


Table 1. Enzymes used for in silico experiment.

| Enzyme | Cleavage rule |
|---|---|
| Trypsin | C-term K or R, but not before P |
| Glu-C | C-term E |
| Lys-C | C-term K |
| Trypsin+Glu-C | C-term K, R, or E |
| Lys-C + Glu-C | C-term K or E |
| ArgC | C-term R |
| AspN | N-term D |
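The cleavage rules in Table 1 can be written as zero-width regular expressions, which makes an in silico digestion easy to sketch. The code below is an illustration under stated assumptions, not the software used in the study: the `CLEAVAGE_RULES` dictionary and `digest` function are hypothetical names, missed cleavages are not modeled, and the rule for Trypsin+Glu-C follows the table as printed (no proline exception).

```python
import re

# Cleavage rules from Table 1 as zero-width patterns.
# A lookbehind marks a cut after the residue (C-term);
# a lookahead marks a cut before the residue (N-term).
CLEAVAGE_RULES = {
    "Trypsin": r"(?<=[KR])(?!P)",
    "Glu-C": r"(?<=E)",
    "Lys-C": r"(?<=K)",
    "Trypsin+Glu-C": r"(?<=[KRE])",
    "Lys-C+Glu-C": r"(?<=[KE])",
    "ArgC": r"(?<=R)",
    "AspN": r"(?=D)",
}

def digest(sequence, enzyme):
    """Split a protein sequence at every cleavage site of the given enzyme."""
    sites = {m.start() for m in re.finditer(CLEAVAGE_RULES[enzyme], sequence)}
    # Always include both termini so the first and last peptides are kept.
    cuts = sorted(sites | {0, len(sequence)})
    return [sequence[a:b] for a, b in zip(cuts, cuts[1:])]

# Hypothetical toy sequence.
tryptic = digest("AKRPDEK", "Trypsin")        # ['AK', 'RPDEK']
mixture = digest("AKRPDEK", "Trypsin+Glu-C")  # ['AK', 'R', 'PDE', 'K']
```

The toy sequence shows why enzyme mixtures give complementary peptide sets: the trypsin/Glu-C rule cuts the long tryptic peptide `RPDEK` into shorter pieces.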


Table 2. Parameters used for generation of in silico data.

| Parameter | Default | Tested values |
|---|---|---|
| Mass accuracy, ppm | 0.33 | 0.1, 0.33, 1.0, 3.0 |
| Retention time accuracy, min | 3 | 1, 3, 6, 9 |
| Noise peaks | 3 | 0, 1, 2, 3, 4, 5, 10 |
| Number of proteins | 2000 | 100, 500, 1000, 2000, 3000, 4000, 5000 |
| Enzymes | Trypsin | Trypsin, GluC, LysC, LysC+GluC, Trypsin+GluC, ArgC, AspN |


Table 3. Number of PFM and “noise” peaks for UPS MS1 data.

| Protease | # UPS1 PFMs | # total PFMs = UPS1 PFMs + nonUPS1 PFMs | “noise” peaks = nonUPS1 PFMs / UPS1 PFMs |
|---|---|---|---|
| Trypsin | 672 | 5233 | 6.8 |
| GluC | 457 | 3835 | 7.4 |
| LysC | 8426 | 42237 | 4.0 |
| LysC+GluC | 7406 | 50599 | 5.8 |
| Trypsin+GluC | 7296 | 59831 | 7.2 |
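The “noise” column follows directly from the other two columns of Table 3. As a short worked check, the sketch below recomputes it from the tabulated counts; the variable and function names are illustrative only.

```python
# (UPS1 PFMs, total PFMs) per protease, copied from Table 3.
pfm_counts = {
    "Trypsin": (672, 5233),
    "GluC": (457, 3835),
    "LysC": (8426, 42237),
    "LysC+GluC": (7406, 50599),
    "Trypsin+GluC": (7296, 59831),
}

def noise_ratio(ups1, total):
    """nonUPS1 PFMs per UPS1 PFM, where nonUPS1 = total - UPS1."""
    return (total - ups1) / ups1

noise = {p: round(noise_ratio(u, t), 1) for p, (u, t) in pfm_counts.items()}
```

For trypsin, for example, (5233 - 672) / 672 ≈ 6.8, which matches the tabulated value.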


Table 4. Fraction of matched UPS peptides in experimental MS1 data.

| Protease | # UPS1 peptides, experimental | # UPS1 peptides, theoretical | Fraction, % |
|---|---|---|---|
| Trypsin | 423 | 2054 | 21 |
| GluC | 212 | 1566 | 14 |
| LysC | 434 | 1290 | 34 |
| LysC+GluC | 573 | 2245 | 26 |
| Trypsin+GluC | 678 | 2549 | 27 |


Table 5. Number of target and decoy peptide matches for different mass accuracies used for MS1-only search of “Confetti” trypsin data.

| Mass accuracy | # peptide features | # targets | # decoys |
|---|---|---|---|
| 0.75 ppm | 28619 | 72464 | 65689 |
| 10 ppm | 44255 | 566530 | 620662 |
| 100 ppm | 45666 | 3916188 | 4412820 |
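The growth of target and decoy counts with mass tolerance in Table 5 reflects how many candidate peptides fall inside each feature's mass window. The sketch below shows one simple way to count matches within a ppm tolerance. It is an illustration only: the function names and masses are hypothetical, the window is computed relative to the observed mass, and retention time and any other filters used in an actual search are ignored.

```python
from bisect import bisect_left, bisect_right

def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def count_matches(observed_masses, theoretical_masses, tol_ppm):
    """Count (feature, candidate) pairs whose masses agree within tol_ppm."""
    theo = sorted(theoretical_masses)
    total = 0
    for m in observed_masses:
        delta = m * tol_ppm * 1e-6  # half-width of the mass window in Da
        lo = bisect_left(theo, m - delta)
        hi = bisect_right(theo, m + delta)
        total += hi - lo
    return total

# Hypothetical masses (Da): one observed feature, four candidate peptides.
theoretical = [1000.0000, 1000.0050, 1000.0500, 1500.0000]
observed = [1000.0005]
matches = {tol: count_matches(observed, theoretical, tol) for tol in (0.75, 10, 100)}
```

In this toy case the single feature matches one candidate at 0.75 ppm, two at 10 ppm and three at 100 ppm, mirroring the trend in the table where the number of matches rises much faster than the number of peptide features.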

