MS Spectra of Peptides with Deep Learning


Article

pDeep: Predicting MS/MS Spectra of Peptides with Deep Learning Xie-Xuan Zhou, Wen-Feng Zeng, Hao Chi, Chunjie Luo, Chao Liu, Jianfeng Zhan, Si-Min He, and Zhifei Zhang Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.7b02566 • Publication Date (Web): 10 Nov 2017 Downloaded from http://pubs.acs.org on November 11, 2017


Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Analytical Chemistry

pDeep: Predicting MS/MS Spectra of Peptides with Deep Learning

Xie-Xuan Zhou,†,‡,∥ Wen-Feng Zeng,¶,‡,∥ Hao Chi,¶,‡ Chunjie Luo,†,‡ Chao Liu,¶,‡ Jianfeng Zhan,∗,†,‡ Si-Min He,∗,¶,‡ and Zhifei Zhang∗,§

†State Key Laboratory of Computer Architecture, Institute of Computing Technology (ICT), 100190, Chinese Academy of Sciences (CAS), Beijing, China.
‡University of Chinese Academy of Sciences, Beijing, China.
¶Key Laboratory of Intelligent Information Processing of CAS, ICT, 100190, CAS, Beijing, China.
§Capital Medical University, 100069, Beijing, China.
∥These authors contributed equally to this work.

E-mail: [email protected]; [email protected]; [email protected]

Abstract

In tandem mass spectrometry (MS/MS)-based proteomics, search engines rely on comparison between an experimental MS/MS spectrum and the theoretical spectra of the candidate peptides. Accurate prediction of the theoretical spectra of peptides is therefore particularly important. Here we present pDeep, a deep neural network-based model for the spectrum prediction of peptides. Using the bidirectional long short-term memory (BiLSTM), pDeep can predict HCD, ETD and EThcD MS/MS spectra of peptides with > 0.9 median Pearson correlation coefficients. In addition, we show that the intermediate layer of the neural network can reveal physicochemical properties of amino acids, for example the similarities of fragmentation behaviors between amino acids. We also show the potential of pDeep to distinguish extremely similar peptides (peptides that contain isobaric amino acids, for example, “GG = N”, “AG = Q” or even “I = L”), which are very difficult for traditional search engines to distinguish.

Over the last decade, tandem mass spectrometry (MS/MS)-based shotgun proteomics has become a routine technique for protein identification. MS/MS-based peptide identification relies on the understanding of peptide fragmentation behaviors. An MS/MS spectrum carries two dimensions of information: the mass-to-charge ratio (m/z) and the intensity. For the m/z dimension, it is well known that HCD mainly produces b/y ions and ETD mainly produces c/z ions. Protein search engines, such as pFind, 1 Sequest 2 and Mascot, 3 identify peptides by matching the experimental spectra with the calculated ion masses of peptides within a given mass tolerance. For the intensity dimension, however, the peak intensity distribution in the spectra is not fully understood, so many search engines simply assign fixed theoretical intensities to ions while matching. Other search engines, such as SQID, 4 showed that the identification rates could be increased if theoretical b/y ion probabilities between amino acid pairs were considered. Investigating peptide fragmentation is therefore valuable both in theory and in practice.

Several groups have worked on predicting theoretical MS/MS spectra of peptides, using either kinetic model-based or machine learning-based methods. MassAnalyzer 5,6 and MS-Simulator 7,8 are two major kinetic model-based tools; both build on the mobile proton hypothesis with some basic assumptions, and their key parameters are fitted to the data statistically. The disadvantage of a kinetic model is that it cannot be applied consistently to peptide fragmentation under HCD, ETD and EThcD. PeptideART is a pure machine learning-based tool that treats theoretical spectrum prediction as a classification problem and learns the probability of occurrence of each peak with a shallow feed-forward neural network. 9,10 Other previous work 11 predicts intensity ranks instead of relative intensities using learning-to-rank algorithms. It has been shown that a good prediction method can boost the identification of peptides. 4,8


However, peptide fragmentation is very complex to predict. Li et al. pointed out that the cross-experiment correlations of PeptideART based on CID (collision-induced dissociation) spectra were significantly lower than those of within-experiment analyses. 10 To handle the complexity of peptide fragmentation, more powerful algorithms such as deep learning should be considered.

Deep learning, or the deep neural network, can learn a complex nonlinear function relating the input x to the prediction y, i.e. y = f(x), by adding multiple hidden layers between x and y. LeCun et al. have published a comprehensive review of deep learning. 12 With traditional machine learning algorithms, appropriate features have to be designed manually, whereas different layers of a deep neural network have been shown to learn different representations of objects automatically. 13 To analyze time series or sequential patterns, RNN (recurrent neural network) and LSTM (long short-term memory, a very popular implementation of RNN) are often used. For a sequential pattern at position t with input x_t and output y_t, y_t relies not only on x_t but also on the internal states of previous patterns. RNN or LSTM can remember the states of sequential patterns from position 0 to position t − 1 in the hidden neurons, and then predict y_t by combining x_t and the current states of the sequential patterns. An introduction to RNN/LSTM can be found in the review of deep learning. 12

In this work, we developed pDeep, a deep learning-based method to predict the intensity distribution of product ions of a peptide. pDeep works well not only in predicting HCD spectra but also in predicting ETD and EThcD spectra. To train and test pDeep, we collected ∼4,000,000 high-quality, high-resolution MS/MS spectra from ProteomeTools 14 and other reliable proteomic data sets, including HCD, ETD and EThcD spectra. For deep learning, BiLSTM (bidirectional LSTM), which has been successfully used to capture the bidirectional dependencies of sequential patterns in speech and natural languages, 15,16 was selected to model the influences of both N- and C-terminal amino acids of each cleavage position on the site-specific peptide fragmentation. pDeep achieved >0.9 median PCCs (Pearson correlation coefficients) in predicting HCD, ETD and EThcD spectra, which are significantly higher than those of the kinetic model-based MassAnalyzer and MS-Simulator, as well as the machine learning-based PeptideART. Analysis of the intermediate layer of the neural network in pDeep showed that physicochemical knowledge, for example similarities of fragmentation behaviors between amino acids, can be learned automatically by deep learning. With accurate spectrum prediction, pDeep shows potential for distinguishing extremely similar peptides with isobaric amino acids or isobaric amino acid combinations by considering the intensity information. For example, pDeep can distinguish “GG” from “N” and “AG” from “Q” in peptides with >0.93 accuracy. Furthermore, pDeep can distinguish “I” from “L” with ∼0.67 accuracy in HCD spectra and ∼0.76 in EThcD spectra.
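The recurrent idea described in this introduction (the output at position t depends on the input at t and on remembered state, read from both directions) can be illustrated with a deliberately tiny sketch. This is not the pDeep network: it uses a one-unit tanh recurrence instead of LSTM gates, and the weights `w_in` and `w_rec` and the input values are arbitrary assumptions chosen only for illustration.

```python
import math

def rnn_pass(inputs, w_in, w_rec):
    """Run a one-unit tanh recurrence and return the state after each position."""
    state = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the remembered state.
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def bidirectional_pass(inputs, w_in=0.8, w_rec=0.5):
    """Pair a left-to-right state with a right-to-left state for every position."""
    forward = rnn_pass(inputs, w_in, w_rec)
    # Reverse the sequence, run the same recurrence, then restore the original order.
    backward = rnn_pass(inputs[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))

# Toy per-position features (made up); a real model would use learned vectors.
features = [0.2, -1.0, 0.5, 0.9]
states = bidirectional_pass(features)
```

Each entry of `states` holds what the forward pass has seen up to that position and what the backward pass has seen after it, which is the property that lets a bidirectional model use both N- and C-terminal context at a cleavage site.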

Results and discussion

Data preparation

In the development of pDeep, we used 8 data sets from 3 different labs and 3 different fragmentation types. Table 1 summarizes the data sets used for training and testing. All of these HCD, ETD and EThcD data are high-resolution data (the detection type is OT, i.e. the Orbitrap). Raw data of QE-M-M, QE-Y-M and QE-H-G were analyzed by pFind3 1 followed by the Percolator algorithm 17 at 0.1% FDR, with carbamidomethylation on “C” (Cys) as a fixed modification. For the data sets from ProteomeTools, the MaxQuant search results recorded at PXD004732 in PRIDE were used, and only PSMs with Andromeda score ≥ 100 and PIF ≥ 0.7 were kept, as suggested by Zolg et al. 14 Furthermore, to ensure the matching quality of all spectra in these data sets, a PSM was removed if its number of matched peaks was less than its peptide length. After removing the spectra of peptides longer than 20 residues (∼90% of the peptides identified from all raw data sets in Table 1 are no longer than 20), we obtained ∼4,000,000 high-quality spectra corresponding to 310,215 peptide sequences. When charge states are considered, we collected 46 1+ peptides, 241,894 2+ peptides, 131,635 3+ peptides, 18,139 4+ peptides, 704 5+ peptides and 9 6+ peptides. The 1+ and 6+ peptides were excluded because there are too few of them for training and testing.
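The filtering rules above can be sketched as a single predicate. The `PSM` record and its field names are hypothetical (they are not the MaxQuant or pFind output format), the example values are made up, and the sketch assumes ProteomeTools-style PSMs whose peptide string carries no modification annotations, so that `len(peptide)` is the peptide length.

```python
from dataclasses import dataclass

@dataclass
class PSM:
    peptide: str            # plain amino acid sequence
    charge: int             # precursor charge state
    andromeda_score: float  # MaxQuant Andromeda score
    pif: float              # precursor intensity fraction
    matched_peaks: int      # number of matched fragment peaks

def keep_psm(psm, min_score=100.0, min_pif=0.7, max_length=20, charges=(2, 3, 4, 5)):
    """Return True if the PSM passes the thresholds described in the text."""
    if psm.andromeda_score < min_score or psm.pif < min_pif:
        return False
    # Remove PSMs with fewer matched peaks than residues.
    if psm.matched_peaks < len(psm.peptide):
        return False
    # Remove peptides longer than 20 residues.
    if len(psm.peptide) > max_length:
        return False
    # 1+ and 6+ precursors are excluded.
    return psm.charge in charges

psms = [
    PSM("LQDAYGGWANR", 2, 152.3, 0.91, 14),  # passes every filter
    PSM("LQDAYGGWANR", 2, 88.0, 0.95, 14),   # Andromeda score below 100
    PSM("GGFFSFGDLTK", 2, 130.0, 0.80, 9),   # 9 matched peaks < 11 residues
    PSM("NFFSFGDLTK", 1, 140.0, 0.85, 12),   # 1+ precursor
]
kept = [p for p in psms if keep_psm(p)]
```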

Table 1: Data set information

Data set              Instrument     Species    Lab     Frag@NCE (%) a  # Spectra b  Publication
QE-M-M                Q-Exactive HF  Mouse      Mann    HCD@27          306,810      Sharma 18
QE-Y-M                Q-Exactive     Yeast      Mann    HCD@25          119,779      Kulak 19
QE-H-G                Q-Exactive     Human      Gygi    HCD@25          425,424      Chick 20
ProteomeTools-HCD25   Lumos          Synthesis  Kuster  HCD@25          919,866      Zolg 14
ProteomeTools-HCD30   Lumos          Synthesis  Kuster  HCD@30          895,697      Zolg 14
ProteomeTools-HCD35   Lumos          Synthesis  Kuster  HCD@35          755,417      Zolg 14
ProteomeTools-ETD     Lumos          Synthesis  Kuster  ETD             184,111      Zolg 14
ProteomeTools-EThcD   Lumos          Synthesis  Kuster  EThcD           369,798      Zolg 14

a “Frag@NCE”: Frag is the fragmentation type, NCE is the normalized collisional energy.
b “# Spectra” is the number of spectra used in training or testing of pDeep.

Training and testing of the HCD model

The HCD model of pDeep was trained on the QE-M-M data set and then tested on the QE-Y-M data set for cross-species validation (only 0.8% of the spectra in QE-Y-M share peptides with QE-M-M) and on the QE-H-G data set for cross-lab validation (35.7% of the spectra in QE-H-G share peptides with QE-M-M). Comparing the predicted spectra with the real ones, pDeep achieved median PCCs of 0.940 and 0.977, respectively (Figure 1). We also measured other similarities, namely cosine similarities (COS) and Spearman’s rank correlation coefficients (SPC), on the QE-Y-M and QE-H-G data sets; both reached similarities > 0.91 (Fig. S1). Although Li et al. pointed out that cross-experiment PCCs would decrease by 0.15 or more compared with within-experiment PCCs when analyzing low-resolution spectra of identical peptides, 10 our analyses showed that the cross-lab and cross-species PCCs did not decrease on high-resolution MS/MS data.
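For readers who want to reproduce the three similarity measures named above, here is a minimal standard-library sketch. It assumes that the predicted and experimental intensities have already been aligned ion by ion into two equal-length lists with nonzero variance; how pDeep aligns peaks is not shown here, and the function names are ours.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cosine(x, y):
    """Cosine similarity (COS) of two intensity vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation (SPC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

PCC and COS reward agreement in relative intensity, while SPC only looks at the ordering of peaks, which is why a method can score well on SPC and less well on PCC.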

[Figure 1: two histograms of PCC (x axis: PCC; y axis: # Spectra). Panel a: median = 0.977; percentages of spectra with PCC > 0.70, 0.75, 0.80, 0.85 and 0.90 are 98.4%, 97.5%, 96.2%, 94.0% and 89.2%. Panel b: median = 0.940; the corresponding percentages are 96.4%, 94.0%, 89.6%, 82.3% and 70.2%.]

Figure 1: Test results of cross-lab and cross-species analyses of HCD data. (a) Cross-lab validation. The model is trained on QE-M-M and tested on QE-H-G. The median PCC is as high as 0.977, and 97.5% of PCCs are higher than 0.75. The ultra-high PCCs show that pDeep works very well on high-resolution spectra even if the data are generated in different labs. (b) Cross-species validation. The model is trained on QE-M-M and tested on QE-Y-M. The median PCC is 0.940, and 94.0% of PCCs are higher than 0.75.

Comparison with MassAnalyzer, MS-Simulator and PeptideART. We compared pDeep with the kinetic model-based MassAnalyzer 5,6 (version 3.01, built 2017-04-17) and MS-Simulator 7,8 and with the machine learning-based PeptideART 10 on the QE-Y-M data set; the results are shown in Figure 2. Parameters of MassAnalyzer were set according to the raw data of QE-Y-M: instrument = Q-Exactive, resolution = 17,500, isolation width = 2.2, collision energy (%) = 25, mass range = 100 - 2000; the activation time was left at its default (0.1 ms), and all “C” (Cys) were converted to “U” (carbamidomethylated Cys in MassAnalyzer). The median PCC of MassAnalyzer is 0.789, and 61.8% of PCCs are higher than 0.75. The performance of MassAnalyzer was quite good when the rank-based SPC similarity was considered (see Fig. S2). MS-Simulator was recently retrained to predict 1+ y ions of 2+ peptides for Q-Exactive MS spectra. The median PCC of MS-Simulator is 0.850, and 66.0% of PCCs are higher than 0.75. Since the original version of PeptideART was trained on CID spectra, we built a two-layer feed-forward neural network based on the PeptideART algorithm, called “PeptideART-like”, to predict HCD spectra. PeptideART-like was trained on the QE-M-M data set; it achieved a median PCC of 0.894 on QE-Y-M, and 78.1% of PCCs are higher than 0.75 (Figure 2). pDeep achieves a median PCC of 0.940, and 94.0% of PCCs are higher than 0.75. PeptideART had already shown that neural networks can predict spectra accurately; the accuracy of the predicted spectra increases considerably when the BiLSTM model is used. We also analyzed the upper limit PCCs, which were calculated by comparing different spectra of the same peptides with the same charge states on the QE-Y-M data set; the median upper limit PCC is 0.981 (Figure 2). The comparison with the upper limit PCCs shows that there is still some room for improvement.
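The upper limit described above compares replicate spectra of the same peptide and charge state. A minimal sketch of that idea follows; it assumes each spectrum has already been reduced to an intensity list aligned to the same ion order, and it compares all replicate pairs, which is our assumption since the text does not say how pairs were chosen. The intensity values are made up.

```python
import math
from collections import defaultdict
from itertools import combinations
from statistics import median

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def upper_limit_pccs(spectra):
    """spectra: iterable of (peptide, charge, intensities) tuples."""
    groups = defaultdict(list)
    for peptide, charge, intensities in spectra:
        groups[(peptide, charge)].append(intensities)
    pccs = []
    for replicates in groups.values():
        # Every pair of replicate spectra contributes one experimental PCC.
        for a, b in combinations(replicates, 2):
            pccs.append(pearson(a, b))
    return pccs

spectra = [
    ("LQDAYGGWANR", 2, [1.0, 0.5, 0.2, 0.7]),
    ("LQDAYGGWANR", 2, [2.0, 1.0, 0.4, 1.4]),  # same peptide and charge: a replicate
    ("LQDAYGGWANR", 3, [0.3, 0.9, 0.1, 0.2]),  # different charge: no partner
]
pccs = upper_limit_pccs(spectra)
upper_limit = median(pccs)
```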

[Figure 2: PCC distributions for MassAnalyzer, MS-Simulator, PeptideART-like, pDeep and the upper limit, with the median of each distribution and the PCC = 0.75 threshold marked.]

Figure 2: Comparing pDeep with MassAnalyzer, MS-Simulator and PeptideART-like on the QE-Y-M data set. MS-Simulator currently predicts the intensities of 1+ y ions of 2+ peptides. PeptideART-like is a two-layer feed-forward neural network model, based on the PeptideART algorithm, trained on the QE-M-M data set to predict HCD spectra. Upper limit PCCs were calculated by comparing different spectra of the same peptides with the same charge states on the QE-Y-M data set.

Testing on ProteomeTools data sets. To validate the performance of pDeep, we further tested the model on HCD data sets from ProteomeTools with three different NCEs, i.e., NCE = 25%, 30% and 35%. pDeep achieved median PCCs of 0.920, 0.963 and 0.950, respectively, as shown in Figure 3. pDeep was trained on QE-M-M at NCE = 27% on a Q-Exactive, but the test PCCs on ProteomeTools-HCD30 and ProteomeTools-HCD35 were higher than those on ProteomeTools-HCD25. We suspected that the NCE may work differently on different instruments, so we compared the experimental spectra of QE-H-G (NCE = 25%) with those of ProteomeTools at the three NCEs. The results showed that the spectra from Q-Exactive at NCE = 25% were more similar to the spectra from Lumos at NCE = 30% than to those from Lumos at NCE = 25% (Figure 3d). Hence, in the current version of pDeep, we do not consider the NCE as a feature.

[Figure 3: panels a to c are histograms of PCC (x axis: PCC; y axis: # Spectra). Panel a: median = 0.920; percentages of spectra with PCC > 0.70, 0.75, 0.80, 0.85 and 0.90 are 86.1%, 83.7%, 79.7%, 72.4% and 58.4%. Panel b: median = 0.963; the corresponding percentages are 97.0%, 95.3%, 93.0%, 89.2% and 81.8%. Panel c: median = 0.950; the corresponding percentages are 98.6%, 97.7%, 95.7%, 90.3% and 78.0%. Panel d: percentage of PCC < x (%) plotted against PCC = x for QE25 vs Lumos25, QE25 vs Lumos30 and QE25 vs Lumos35, with medians marked.]

Figure 3: Testing on HCD data sets from ProteomeTools at different NCEs. (a) Test results on ProteomeTools-HCD25. The median PCC is 0.920, and 83.7% of PCCs are higher than 0.75. (b) Test results on ProteomeTools-HCD30. The median PCC is as high as 0.963, and 95.3% of PCCs are higher than 0.75. (c) Test results on ProteomeTools-HCD35. The median PCC is 0.950, and 97.7% of PCCs are higher than 0.75. (d) Comparing the experimental spectra of human proteomes from Q-Exactive and Lumos. Here “QE25” refers to the QE-H-G data set. “Lumos25”, “Lumos30” and “Lumos35” refer to the ProteomeTools data sets with NCE = 25%, 30% and 35%, respectively.


Training and testing of ETD and EThcD models

ProteomeTools has generated large-scale, reliable ETD and EThcD data sets. For the ETD (c/z ions) analysis, the 184,111 ETD spectra in ProteomeTools-ETD were split into two parts according to the different pools of ProteomeTools: 157,231 spectra were used to train the ETD model of pDeep, and the remaining 26,880 were used for testing. The 369,798 EThcD (b/y/c/z ions) spectra in ProteomeTools-EThcD were likewise split by pool into 316,796 spectra for training and 53,002 for testing. The test results are shown in Figure 4, with median PCCs of 0.908 and 0.930 for ETD and EThcD, respectively. The EThcD results show that pDeep can predict not only the c/z ions produced by the ETD part of EThcD but also the b/y ions produced by the supplemental HCD, demonstrating the high extensibility of pDeep.
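Splitting by pool rather than by individual spectrum keeps all spectra of a held-out pool out of training. A minimal sketch under stated assumptions: the `pool` field, the pool names and the toy records are hypothetical, and the text does not say which pools were held out.

```python
def split_by_pool(spectra, test_pools):
    """Send every spectrum of a held-out pool to the test set, all others to training."""
    train, test = [], []
    for spectrum in spectra:
        target = test if spectrum["pool"] in test_pools else train
        target.append(spectrum)
    return train, test

# Toy records; a real entry would also carry the peaks and the charge state.
spectra = [
    {"pool": "pool_1", "peptide": "LQDAYGGWANR"},
    {"pool": "pool_1", "peptide": "NFFSFGDLTK"},
    {"pool": "pool_2", "peptide": "GGFFSFGDLTK"},
]
train, test = split_by_pool(spectra, test_pools={"pool_2"})
```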

[Figure 4: panels a and b are histograms of PCC (x axis: PCC; y axis: # Spectra). Panel a, ProteomeTools-ETD: median = 0.908; percentages of spectra with PCC > 0.70, 0.75, 0.80, 0.85 and 0.90 are 93.8%, 90.4%, 84.1%, 73.1% and 54.1%. Panel b, ProteomeTools-EThcD: median = 0.930; the corresponding percentages are 96.8%, 94.8%, 90.7%, 82.1% and 64.7%. Panel c: real and predicted EThcD spectra with annotated b, y, c and z ions and the precursor peaks M(1+) and M(2+).]

Figure 4: Test results of ETD and EThcD data sets from ProteomeTools. (a) Test results of ETD data. The median PCC is 0.908, and 90.4% of PCCs are higher than 0.75. (b) Test results of EThcD data. The median PCC is as high as 0.930, and 94.8% of PCCs are higher than 0.75. (c) An example of the real EThcD spectrum and the predicted EThcD spectrum of 2+ peptide “LQDAYGGWANR” with 0.95 PCC.

Analysis of learned knowledge about amino acids

Since pDeep predicts MS/MS spectra with high accuracy, we believe that some informative knowledge about amino acids must have been learned and encoded in the neurons of the neural network. We therefore analyzed the intermediate neurons of amino acids learned by the deep neural network. We used each single amino acid to activate the neural network of pDeep, to avoid the influences of other amino acids, and then extracted the output of the intermediate layer, which is the representation vector of that amino acid. This gives 20 representation vectors for the 20 amino acids. The pairwise Euclidean distances between representation vectors were calculated and plotted as a heat map in Figure 5. An amino acid may be located at the N-terminal or the C-terminal side of a peptide bond, so the distances for the two sides are plotted separately in Figure 5a and 5b.

Interestingly, as shown in Figure 5a, the distances between the representation vectors of N-terminal amino acids reflect similar or dissimilar properties of amino acids. For example, “I” and “L” are twin amino acids, so their distance is the shortest among all amino acid pairs. “Y” and “F” show quite a short distance, possibly because both carry a benzene ring on their side chains. “K” is quite different from every other amino acid, as are “R” and “H”, which indicates that these three amino acids behave differently when fragmenting under HCD. Figure 5b shows the corresponding heat map of amino acids at the C-terminal side. Although we do not give the neural network any information about the amino acids except their name indicators, it can learn extra information about them from the MS/MS data by itself. This ability of the neural network will be very helpful in the theoretical study of the complex process of peptide fragmentation.
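The distance analysis above needs only the representation vectors and a pairwise Euclidean distance. A minimal sketch follows; the three-dimensional vectors are invented for illustration and are not values extracted from pDeep, whose intermediate layer is not reproduced here.

```python
import math

def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of named vectors."""
    names = sorted(vectors)
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

def closest_pair(vectors):
    """Return the pair of names with the smallest distance."""
    distances = pairwise_distances(vectors)
    return min(distances, key=distances.get)

# Made-up representation vectors, chosen so that I and L lie close together.
toy_vectors = {
    "I": [0.90, 0.10, 0.30],
    "L": [0.88, 0.12, 0.31],
    "K": [-0.70, 0.90, 0.00],
    "F": [0.20, -0.50, 0.60],
}
distances = pairwise_distances(toy_vectors)
nearest = closest_pair(toy_vectors)
```

Plotting `distances` as a symmetric matrix gives a heat map of the kind shown in Figure 5.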


Figure 5: Heat map based on the distances between representation vectors of amino acids learned by pDeep. (a) Heat map of amino acids at the N-terminal of the peptide bond. (b) Heat map of amino acids at the C-terminal of the peptide bond.

Distinguishing extremely similar peptides with pDeep

If only the m/z dimension is considered, some extremely similar peptides that arise from mass ambiguities are hard to distinguish. There are several sources of mass ambiguity: (1) “I” (Ile) and “L” (Leu) have exactly the same mass and are generally considered indistinguishable. (2) Some combinations of multiple amino acids also share the same mass, for example, “GG = N” and “AG = Q”. (3) The same amino acids can occur in different permutations, for example, “AF = FA” and “KR = RK”. We call all of these isobaric amino acids in this manuscript. Muth et al. pointed out that most of the errors of de novo sequencing algorithms originate from isobaric amino acids. 21 This source of error is critical not only in de novo sequencing but also in database-driven proteomics. If de novo-best peptides and true peptides coexist in the protein database, the true peptides have little chance of being identified because the de novo-best peptides have higher matching scores. Although similar peptides with isobaric amino acids do not frequently coexist in a protein database, this is still a potential risk in proteomics studies. The accurate spectrum prediction of pDeep shows potential for distinguishing isobaric amino acids by considering the intensity information.

To test the performance of pDeep on the isobaric problem, all PSMs containing “GG” or “AG” were extracted from the ProteomeTools-HCD30 data set. For each PSM, a PCC was calculated by comparing the intensities of the matched peaks in the experimental spectrum with the predicted intensities; this PCC is called the “true PCC”. Meanwhile, the fake peptide of this PSM was generated by substituting “GG” with “N” (“GG to N”) or “AG” with “Q” (“AG to Q”). The “fake PCC” was then calculated by comparing the intensities of the matched peaks with the predicted intensities of the fake peptide. Finally, we treated DeltaPCC (true PCC − fake PCC) as a discrimination function to distinguish the “true” from the “fake”; the results are shown in Figure 6a and 6b. For “GG to N” and “AG to Q”, 95.1% and 93.6% of “true PCCs” are higher than “fake PCCs”, respectively, which corresponds to error rates < 7% in distinguishing “GG” from “N” and “AG” from “Q”. Although the mass of “GG” (or “AG”) is equal to that of “N” (or “Q”), the physicochemical properties of these amino acids differ slightly, which results in slightly different fragmentation patterns and hence enables us to distinguish the “true” from the “fake”. We do not expect the fragmentation patterns of true and fake peptides to be very different, because all other amino acids are identical. An example of “GG to N” is shown in Figure 6c and 6d. The intensities of b3+ and y8+ of the fake peptide “NFFSFGDLTK” are slightly different from those of the corresponding b4+ and y8+ ions of the true peptide “GGFFSFGDLTK”, resulting in a PCC that is 0.09 lower.
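The discrimination procedure above can be sketched in a few lines: confirm that the two motifs are isobaric, build the fake peptide, and compare the two PCCs. The residue masses are standard monoisotopic values rounded to five decimals. The intensity lists are made up and stand in for pDeep predictions, and substituting only the first occurrence of a motif is our assumption, since the text does not specify how repeated motifs were handled.

```python
import math

# Monoisotopic residue masses in Da, rounded to five decimals.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "N": 114.04293, "Q": 128.05858}

def residue_mass(sequence):
    """Sum of residue masses for a short motif such as GG or N."""
    return sum(RESIDUE_MASS[aa] for aa in sequence)

def make_fake_peptide(peptide, true_motif, fake_motif):
    """Replace the first occurrence of the true motif with its isobaric counterpart."""
    return peptide.replace(true_motif, fake_motif, 1)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def delta_pcc(experimental, predicted_true, predicted_fake):
    """True PCC minus fake PCC; a positive value favors the true peptide."""
    return pearson(experimental, predicted_true) - pearson(experimental, predicted_fake)

fake = make_fake_peptide("GGFFSFGDLTK", "GG", "N")

# Made-up matched-peak intensities, aligned ion by ion.
experimental = [0.10, 0.80, 1.00, 0.30]
predicted_true = [0.12, 0.75, 1.00, 0.35]
predicted_fake = [0.40, 0.50, 1.00, 0.60]
delta = delta_pcc(experimental, predicted_true, predicted_fake)
```

In the paper's evaluation a PSM counts as correctly discriminated when this difference is positive.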

[Figure 6: panels a and b are histograms (y axis: Count) separating PSMs that favor the fake PCC from those that favor the true PCC. Panel a, “GG to N”: 95.1% (39,452 / 41,490) of true PCCs > fake PCCs. Panel b, “AG to Q”: 93.6% (48,402 / 51,720) of true PCCs > fake PCCs. Panels c and d: real and predicted spectra (x axis: m/z; y axis: Intensity) with annotated b and y ions, from ProteomeTools-HCD30, Raw = 01625b_GA1-TUM_first_pool_1_01_01-3xHCD-1h-R1, Scan = 43232. Panel c: GGFFSFGDLTK(2+), PCC = 0.98. Panel d: NFFSFGDLTK(2+), PCC = 0.89, annotated “Different intensities of b3+ and y8+ from GGFFSFGDLTK (b4+ and y8+)”.]

Figure 6: Performance of pDeep in distinguishing “GG” from “N” (“GG to N”) and “AG” from “Q” (“AG to Q”) based on the ProteomeTools-HCD30 data set. (a) Histogram of the ∆PCC (true PCC − fake PCC) of “GG to N”. (b) Histogram of the ∆PCC (true PCC − fake PCC) of “AG to Q”. (c) The match of the real spectrum and the predicted spectrum of the true peptide “GGFFSFGDLTK”. (d) The match of the real spectrum and the predicted spectrum of the fake peptide “NFFSFGDLTK”, derived from the true peptide in (c).

In MS-based proteomics, “I” and “L” are twin amino acids that are generally considered indistinguishable. We therefore also tested the performance of pDeep in distinguishing these twin amino acids under HCD and EThcD. From the ProteomeTools-HCD30 and ProteomeTools-EThcD data sets, all PSMs containing “I” were extracted, and the corresponding fake peptides were generated by substituting “I” with “L” (“I to L”). The comparisons between “true PCCs” and “fake PCCs” under HCD and EThcD are shown in Figure 7a and 7b. The true positive rate of distinguishing “I” from “L” is 67.7% under HCD. When the “I”-form and “L”-form peptides are both present in the protein database, search engines will choose between them at random, resulting in an error rate of 50%. pDeep enables us to distinguish “I” from “L” with an error rate of ∼33%, and the error rate is ∼25% using EThcD. More precise discrimination of “I” and “L” requires consideration of more informative product ion types. 22
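The comparison against the 50% random-choice baseline can be checked from the counts reported in Figure 7. The sketch below uses a normal approximation to the binomial distribution, which is our substitution for the exact test reported in the figure, so the confidence limits may differ from the published ones in the last digit.

```python
import math

def binomial_summary(successes, trials, null_prob=0.5, z_crit=1.96):
    """Observed rate, z statistic, two-sided p-value and approximate 95% CI."""
    rate = successes / trials
    # Standard error under the null hypothesis (I and L indistinguishable).
    se_null = math.sqrt(null_prob * (1 - null_prob) / trials)
    z = (rate - null_prob) / se_null
    # Two-sided p-value of a standard normal statistic.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Wald confidence interval around the observed rate.
    se = math.sqrt(rate * (1 - rate) / trials)
    return rate, z, p_value, (rate - z_crit * se, rate + z_crit * se)

# Counts of true PCC > fake PCC taken from Figure 7.
hcd = binomial_summary(279_877, 413_204)
ethcd = binomial_summary(9_138, 11_973)
```

With counts this large the z statistic is far out in the tail, so the p-value underflows to zero in double precision, consistent with the reported p-value < 2.2e-16.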

[Figure 7: two histograms (y axis: Count) separating PSMs that favor the fake PCC from those that favor the true PCC. Panel a, “I to L” of HCD: 67.7% (279,877 / 413,204) of true PCCs > fake PCCs; null hypothesis: I and L are indistinguishable, prob = 0.5; Bernoulli test: p-value < 2.2e-16; 95% CI = [67.5%, 67.9%]. Panel b, “I to L” of EThcD: 76.3% (9,138 / 11,973) of true PCCs > fake PCCs; null hypothesis: I and L are indistinguishable, prob = 0.5; Bernoulli test: p-value < 2.2e-16; 95% CI = [75.5%, 77.1%].]

Figure 7: Performance of pDeep in distinguishing “I” from “L” (“I to L”) under HCD and EThcD. (a) Histogram of the ∆PCC (true PCC − fake PCC) of “I to L” on the ProteomeTools-HCD30 data set. (b) Histogram of the ∆PCC (true PCC − fake PCC) of “I to L” on the ProteomeTools-EThcD data set. “CI” refers to the confidence interval.

Other isobaric problems, such as “N to GG” and different permutations of amino acids, were analyzed in Fig. S3 and Fig. S4; the results show > 80% accuracies. As a scoring function for peptide identification, PCC is too simple to distinguish the isobaric amino acids with higher accuracy, so a better-designed scoring scheme based on the predicted intensity distribution is needed. We also showed that the identification rate of search engines could be increased using pDeep; see Fig. S5.

Conclusion

Previous research on MS/MS spectrum prediction had made some progress, but there was still considerable room for improvement. Traditional classification or regression algorithms, such as support vector machines, random forests or feed-forward neural networks, consider each input independently in the model. However, some studies have shown that the other amino acids of a peptide may have different influences on a specific cleavage site. BiLSTM can handle these influences by capturing the long- and short-term dependencies among all amino acids of a given peptide, which enables pDeep to reach higher prediction accuracies. pDeep can predict not only HCD spectra but also ETD and EThcD spectra. The more precise prediction of theoretical spectra allows us to distinguish extremely similar peptides and will increase the identification rates of protein search engines. pDeep can also be used as a predicted library, in addition to the spectral library, for data-independent acquisition mass spectrometry. Since we do not fully understand the physicochemical properties of peptides that govern fragmentation, data-driven deep learning can accelerate our understanding of the principles behind peptide fragmentation. We also look forward to more non-trivial applications of deep learning in proteomic studies.

Methods

The bidirectional LSTM model

During fragmentation, all amino acids of a peptide may have different influences on a specific cleavage site. 7 Therefore, we use bidirectional LSTM (BiLSTM), which has been successfully used to capture the bidirectional dependencies of sequential patterns in speech recognition 15 and natural language processing, 23 to model these influences. The other reason for using BiLSTM is that the b/c ions depend on the N-terminal amino acids, whereas the y/z ions depend on the C-terminal ones. To predict b/y/c/z ions together, both directions must be considered simultaneously. The BiLSTM-based pDeep takes the whole peptide as input, converts the different cleavage sites into feature vectors at different time-steps, and outputs the corresponding intensity of each peak, as shown in Figure 8.

The maximum number of time-steps of the BiLSTM, defined as max_step, has to be fixed, but peptides vary in length. If the number of cleavage sites of an input peptide is less than max_step, zero vectors are padded for the empty time-steps and then masked by a Masking layer; if the number of cleavage sites is greater than max_step, the PSM is ignored. In the current version of pDeep, we considered 1+ product ions for 1+ and 2+ precursors, and 1+ and 2+ product ions for >2+ precursors.
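The input handling described above (one time-step per cleavage site, zero padding up to max_step, skipping peptides that are too long, and the charge-state rule) can be sketched in plain Python. This is only an illustration under stated assumptions: the one-hot encoding of the two residues flanking each cleavage site is a hypothetical stand-in for pDeep's actual feature vectors, which are not specified in this section, and all function names are introduced here for the example.

```python
MAX_STEP = 19  # value reported in the Methods (peptides of up to 20 residues)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cleavage_sites(peptide):
    """Return (N-terminal prefix, C-terminal suffix) for every cleavage site."""
    return [(peptide[:i], peptide[i:]) for i in range(1, len(peptide))]


def one_hot(residue):
    """Hypothetical per-residue encoding: a 20-dimensional one-hot vector."""
    vec = [0.0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1.0
    return vec


def encode_peptide(peptide, max_step=MAX_STEP):
    """Build a max_step x 40 feature matrix, or None if the peptide is too long.

    Each time-step is one cleavage site, encoded (as an assumption made for
    this sketch) by the residues immediately before and after the cleaved bond.
    """
    sites = cleavage_sites(peptide)
    if len(sites) > max_step:
        return None  # corresponds to "the PSM is ignored"
    rows = [one_hot(prefix[-1]) + one_hot(suffix[0]) for prefix, suffix in sites]
    # Zero vectors fill the empty time-steps; a Masking layer would skip them.
    zero_row = [0.0] * (2 * len(AMINO_ACIDS))
    rows += [list(zero_row) for _ in range(max_step - len(sites))]
    return rows


def product_ion_charges(precursor_charge):
    """Product-ion charge states considered for a given precursor charge."""
    return [1] if precursor_charge <= 2 else [1, 2]


# The example peptide of Figure 8 has four cleavage sites (x1 to x4).
features = encode_peptide("FGSIK")
print(cleavage_sites("FGSIK"))
print(len(features), "time-steps of", len(features[0]), "features each")
```

The prefix of each pair is the part that would form a b/c ion and the suffix the part that would form a y/z ion, which is why both reading directions carry information for every time-step.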

Figure 8: BiLSTM-based pDeep. [Diagram labels: input peptide F-G-S-I-K; numeric features x1, x2, x3, x4; BiLSTM consisting of a forward LSTM and a backward LSTM (cells LSTM1, LSTM2, LSTM3); pDeep output peaks1, peaks2, peaks3.]

We used Keras (version 1.2.1) [https://keras.io/] with the TensorFlow 24 (version 0.12.1) backend to train the BiLSTM model. The max_step was set to 19, which means that the longest peptide allowed has 20 amino acids. The model was built with the ReLU activation function, the MAE (mean absolute error) loss function, and the Adam (adaptive moment estimation) optimizer. The dropout probability was set to 0.3 for each BiLSTM layer, and the number of epochs was set to 100. Training the model would take >5 days without a GPU (Intel(R) Xeon(R) CPU E5-2620 v3, 6 cores, CentOS 7, 64-bit, 64 GB memory); hence, we used a Tesla K80 graphics card with 24 GB of video memory to accelerate the training of pDeep (∼10 hours). After training, the model can be used on a PC without GPU acceleration; pDeep can predict a spectrum in 0.025 seconds with a single CPU core.

We tested 1-layer, 2-layer, and 3-layer BiLSTM models on QE-Y-M. The 2-layer and 3-layer models have almost the same test PCC distributions, both better than that of the 1-layer model, as shown in Fig. S6. In the end, we chose the 2-layer BiLSTM to build the pDeep model.
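For orientation, the configuration described above could look roughly as follows in a current Keras release. This is a hedged sketch, not the pDeep source code: it uses the modern `tensorflow.keras` API rather than Keras 1.2.1, and `N_FEATURES`, `HIDDEN_UNITS` and `N_ION_TYPES` are hypothetical values that the text does not specify. Placing the ReLU on the output layer and applying dropout through the `dropout` argument of each LSTM are likewise assumptions about how "activation function of ReLU" and "dropout for each BiLSTM layer" were realized.

```python
from tensorflow import keras

MAX_STEP = 19        # from the text
DROPOUT = 0.3        # from the text
N_FEATURES = 40      # hypothetical feature size per cleavage site
HIDDEN_UNITS = 128   # hypothetical LSTM width
N_ION_TYPES = 4      # hypothetical outputs per site (e.g. b+, b2+, y+, y2+)


def build_bilstm_sketch():
    """Two stacked BiLSTM layers with masking, ReLU output, MAE loss, Adam."""
    inputs = keras.Input(shape=(MAX_STEP, N_FEATURES))
    # Time-steps that are entirely zero (padding) are masked out.
    x = keras.layers.Masking(mask_value=0.0)(inputs)
    for _ in range(2):  # the 2-layer variant chosen in the text
        x = keras.layers.Bidirectional(
            keras.layers.LSTM(HIDDEN_UNITS, return_sequences=True, dropout=DROPOUT)
        )(x)
    # Dense acts on the last axis, so every time-step gets its own
    # non-negative intensity vector.
    outputs = keras.layers.Dense(N_ION_TYPES, activation="relu")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mean_absolute_error")
    return model


model = build_bilstm_sketch()
# Training would then be a call such as
#   model.fit(x_train, y_train, epochs=100)
# where x_train and y_train are hypothetical arrays of shape
# (n_spectra, MAX_STEP, N_FEATURES) and (n_spectra, MAX_STEP, N_ION_TYPES).
```

Returning the full sequence from both BiLSTM layers keeps one output vector per cleavage site, matching the per-time-step peaks drawn in Figure 8.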


The source code of pDeep is available at http://pfind.ict.ac.cn/download/pDeep.zip.

Acknowledgement

The authors thank all pfinders (http://pfind.ict.ac.cn/members.html) for downloading the raw data sets from PRIDE or other websites. We also thank Dr. Zhongqi Zhang for his help with MassAnalyzer 3.01, and thank Dr. Shiwei Sun for his help with MS-Simulator. This work was supported in part by the National Key Research and Development Program of China (No. 2016YFA0501301 to S.-M. H.), the National Key Research Program of China (No. 2016YFB1000605 to Z.-F. Z.), the National Natural Science Foundation of China (No. 31470805 to H.C.), and the CAS Interdisciplinary Innovation Team Program.

Supporting Information Available

• Fig. S1: COS and SPC similarities between real and predicted spectra using pDeep.
• Fig. S2: COS and SPC similarities between real and predicted spectra using MassAnalyzer.
• Fig. S3: Distinguishing “N” from “GG”.
• Fig. S4: Distinguishing different permutations of amino acids.
• Fig. S5: Increasing the identification rate of pFind using pDeep.
• Fig. S6: Performance of different BiLSTM layers of pDeep.

This material is available free of charge via the Internet at http://pubs.acs.org/.


References

(1) Chi, H.; He, K.; Yang, B.; Chen, Z.; Sun, R.-X.; Fan, S.-B.; Zhang, K.; Liu, C.; Yuan, Z.-F.; Wang, Q.-H. et al. J. Proteomics 2015, 125, 89–97.
(2) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass. Spectrom. 1994, 5, 976–989.
(3) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551–3567.
(4) Li, W.; Ji, L.; Goya, J.; Tan, G.; Wysocki, V. H. J. Proteome Res. 2011, 10, 1593–1602.
(5) Zhang, Z. Anal. Chem. 2004, 76, 3908–3922.
(6) Zhang, Z. Anal. Chem. 2005, 77, 6364–6373.
(7) Sun, S.; Yang, F.; Yang, Q.; Zhang, H.; Wang, Y.; Bu, D.; Ma, B. J. Proteome Res. 2012, 11, 4509–4516.
(8) Wang, Y.; Yang, F.; Wu, P.; Bu, D.; Sun, S. BMC Bioinf. 2015, 16, 110.
(9) Arnold, R. J.; Jayasankar, N.; Aggarwal, D.; Tang, H.; Radivojac, P. Pac. Symp. Biocomput. 2006, 11, 219–230.
(10) Li, S.; Arnold, R. J.; Tang, H.; Radivojac, P. Anal. Chem. 2010, 83, 790–796.
(11) Frank, A. M. J. Proteome Res. 2009, 8, 2226–2240.
(12) LeCun, Y.; Bengio, Y.; Hinton, G. Nature 2015, 521, 436–444.
(13) Zeiler, M. D.; Fergus, R. ECCV 2014, 13, 818–833.
(14) Zolg, D. P.; Wilhelm, M.; Schnatbaum, K.; Zerweck, J.; Knaute, T.; Delanghe, B.; Bailey, D. J.; Gessulat, S.; Ehrlich, H.-C.; Weininger, M. et al. Nat. Methods 2017, 14, 259–262.


(15) Graves, A.; Jaitly, N.; Mohamed, A.-r. ASRU 2013, 273–278.
(16) Sundermeyer, M.; Alkhouli, T.; Wuebker, J.; Ney, H. EMNLP 2014, 14–25.
(17) Kall, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Nat. Methods 2007, 4, 923–925.
(18) Sharma, K.; Schmitt, S.; Bergner, C. G.; Tyanova, S.; Kannaiyan, N.; Manrique-Hoyos, N.; Kongi, K.; Cantuti, L.; Hanisch, U.-K.; Philips, M.-A. et al. Nat. Neurosci. 2015, 18, 1819–1831.
(19) Kulak, N. A.; Pichler, G.; Paron, I.; Nagaraj, N.; Mann, M. Nat. Methods 2014, 11, 319–324.
(20) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. Nat. Biotechnol. 2015, 33, 743–749.
(21) Muth, T.; Renard, B. Y. Briefings Bioinf. 2017, 1–17.
(22) Zhokhov, S. S.; Kovalyov, S. V.; Samgina, T. Y.; Lebedev, A. T. J. Am. Soc. Mass Spectrom. 2017, 1–12.
(23) Sutskever, I.; Vinyals, O.; Le, Q. V. NIPS 2014, 27, 3104–3112.
(24) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M. et al. OSDI 2016, 12, 265–283.


For TOC only: [TOC graphic comparing the real MS/MS spectrum with the spectrum predicted by pDeep for the peptide QAWVWAAVR, with annotated singly charged b and y ions (b1+ to b5+, y1+ to y8+); x-axis: m/z, ticks from 200 to 1000.]