
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b02386 • Publication Date (Web): August 16, 2018


Improved Peptide Retention Time Prediction in Liquid Chromatography through Deep Learning

Chunwei Ma,†‡ Yan Ren,†‡ Jiarui Yang,†‡ Zhe Ren,†‡ Huanming Yang,†§ and Siqi Liu*†‡

† BGI-Shenzhen, Beishan Industrial Zone 11th Building, Yantian District, Shenzhen, Guangdong 518083, China
‡ China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
§ James D. Watson Institute of Genome Sciences, Hangzhou 310008, China

ABSTRACT: The accuracy of peptide retention time (RT) prediction models in liquid chromatography (LC) is still not sufficient for wider implementation in proteomics practice. Herein, we propose deep learning as an ideal tool to considerably improve this prediction. A new peptide RT prediction tool, DeepRT, was designed using a capsule network model, and public datasets containing peptides separated by reversed-phase liquid chromatography (RPLC) were used to evaluate its performance. Compared with other prevailing RT predictors, DeepRT attained an overall improvement in the prediction of peptide RTs, with an R2 of ~0.994. Moreover, DeepRT accommodated peptides separated by other types of LC, such as strong cation exchange (SCX) and hydrophilic interaction liquid chromatography (HILIC), reaching RT predictions with R2 values of ~0.996 for SCX and ~0.993 for HILIC. If a large peptide dataset is available for one type of LC, DeepRT can be extended to DeepRT(+) using transfer learning. Based on a large peptide dataset obtained from SWATH, DeepRT(+) further improved the accuracy of RT prediction for peptides in a small dataset and enabled satisfactory prediction from as few as a few hundred peptides. Furthermore, DeepRT automatically learns retention-related properties of amino acids under different separation mechanisms that are consistent with the retention coefficients (Rc) of the amino acids. DeepRT was thus proven to be an improved RT predictor with high flexibility and efficiency. DeepRT is available at https://github.com/horsepurve/DeepRTplus.

Liquid chromatography coupled with mass spectrometry (LC-MS) is the main technique of choice for proteomic analysis. Given the high complexity of MS data, several kinds of information in addition to the mass spectra themselves are used in data analysis, such as RNA expression1 and retention time (RT).2 Prior to being injected into a mass spectrometer, peptides are normally separated by liquid chromatography in order to reduce the complexity of peptide mixtures. Several LC modes are in common use, such as reversed-phase (RPLC), strong cation exchange (SCX) and hydrophilic interaction liquid chromatography (HILIC). In proteomics analysis, RT information is valuable in assisting peptide identification from MS/MS signals.2 Computationally predicted peptide RTs, coupled with the corresponding mass spectra, can also be used to construct ion libraries in silico for data-independent acquisition (DIA)-based proteomics.3 Efforts to date to predict peptide RT are mainly based on retention coefficients (Rc) of amino acids,4 and SSRCalc is the most popular Rc-based predictor.5 Rc is a parameter that appraises the contribution of an individual amino acid to peptide RT, and the sum of the Rcs of all amino acids in a peptide can serve as an RT estimate. Additional factors such as peptide length, charge, and helicity are also considered during peptide RT prediction.6 Several predictors based on Rc and other measurable factors have been proposed.7,8,9,10,11 For example, Elude7,8 and GPTime,9 built on support vector machines (SVM) and Gaussian process regression, respectively, employ Rcs learned from datasets and can also predict RTs for post-translationally modified (PTM) peptides. All these tools produced RT predictions with R2 values below ~0.965 on various datasets.12 On the other hand, it is well recognized that we still lack the knowledge to fully understand the physicochemical properties of peptides and the complex interactions between peptides and the stationary phase, which leads to less-than-optimal prediction of peptide RT.13 In terms of algorithms, traditional models are limited in capturing the many subtle factors that affect peptide behavior on LC. Hence, in the field of peptide RT prediction, there is still large room for improvement.
Deep learning, an advanced machine learning method, has shown extraordinary capability to learn complex relationships from large-scale data. Many tools have successfully applied deep learning in proteomics, such as pDeep, an MS/MS spectrum predictor,14 DeepNovo, a software for peptide de novo sequencing,15 DeepPep for protein inference,16 and DNN-MDA, a tool for biomarker identification in metabolomics.17 Although Moruz et al. have suggested the usefulness of deep learning for RT prediction,13 it is not yet well established in this field.
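As a concrete illustration of the additive Rc scheme mentioned above, a minimal Python sketch is shown below; the coefficient values and intercept are purely illustrative placeholders and are not the fitted values used by SSRCalc or the other predictors cited here.

```python
# Minimal sketch of additive Rc-based RT estimation; the coefficients below are
# illustrative placeholders, not fitted retention coefficients.
RC = {"A": 2.0, "G": 0.0, "F": 10.0, "K": -3.5, "L": 9.5}

def additive_rt(peptide, intercept=0.0):
    """Estimate RT as an intercept plus the sum of per-residue retention coefficients."""
    return intercept + sum(RC.get(aa, 0.0) for aa in peptide)

print(additive_rt("ALKGF"))  # 18.0 with the toy coefficients above
```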


The neural network models implemented so far for RT prediction are typically single-layer models and thus lack accuracy. Deep learning is characterized by stacking multiple hidden layers and ingesting raw data without manually designed features, and the capsule network (CapsNet) proposed by Sabour et al. is a state-of-the-art deep learning model.18 Herein, we constructed a model in which the twenty amino acids and their post-translationally modified counterparts were encoded as vectors, and the concatenated amino acid vectors of a peptide were then fed into a CapsNet that maps them to the corresponding RT. DeepRT, the new peptide RT predictor built on CapsNet in this study, proved to be an improved framework for peptide RT prediction, accurately predicting the RTs of peptides across different modification states and LC conditions. Although Rc was not used by DeepRT, we show that the amino acid vectors automatically learned by DeepRT are consistent with the Rcs and can be used to unravel amino acid properties on different analytical platforms.

MATERIALS AND METHODS
Datasets used for RT prediction. A total of eight datasets containing peptide identifications and RTs were collected for this study, covering different species, modification states and LC platforms (Table 1). Two datasets, termed yeast and HeLa and obtained under RPLC conditions, were used for methodology comparison. The yeast dataset with 14361 peptides was obtained from one file of a yeast proteomic study (PXD000409 in PRIDE),19 which was performed under RPLC separation conditions. The HeLa dataset with 3413 peptides, of which 66% (2243) contained modified amino acids, including oxidized methionine (m), phosphorylated serine (s), phosphorylated threonine (t) and phosphorylated tyrosine (y), was obtained from one file of a HeLa cell study (PXD000612 in PRIDE),20 also using RPLC. The Misc dataset used for pre-training, derived from a combination of 24 different cell lines and tissues (including the HeLa cell line, muscle, lung, etc.) and generated with the SWATH approach under RPLC, consisted of 146587 unmodified peptides (PXD000954 in SWATHAtlas).21 The LC parameters of the three RPLC datasets are listed in Table S1. Additionally, five datasets were acquired under non-RPLC conditions, one from SCX22 and four from HILIC,23 each containing between roughly 30 thousand and 40 thousand peptides.
Amino Acid Embedding and Capsule Network Model. The overall pipeline for peptide RT prediction followed the framework shown in Figure 1. To capture the contribution of each amino acid to peptide RT, the embedding technique was adopted for amino acid encoding; this technique is widely used for semantic analysis in natural language processing and works here by learning a distributed vector representation of each amino acid.24 For a peptide p, all amino acids within it were encoded into 20D embedding vectors, and these vectors were stacked to form a matrix P representing the peptide p. The modified amino acids m, s, t, and y had embedding vectors different from those of their unmodified counterparts, because modification may change their physicochemical properties. After the embedding of amino acids, a capsule neural network (CapsNet) computed the RT for p through a complex function (f) consisting of multiple layers.
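A minimal sketch of this encoding step in Python/PyTorch is shown below; the vocabulary layout and helper names are hypothetical, and the actual DeepRT implementation in the repository above may organize this differently.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: the 20 standard residues plus lowercase symbols for the
# modified forms (m, s, t, y), each of which receives its own learnable 20-D vector.
VOCAB = list("ACDEFGHIKLMNPQRSTVWY") + list("msty")
AA_TO_IDX = {aa: i for i, aa in enumerate(VOCAB)}

embedding = nn.Embedding(num_embeddings=len(VOCAB), embedding_dim=20)

def encode_peptide(peptide):
    """Stack the embedding vectors of a peptide's residues into a matrix P."""
    idx = torch.tensor([AA_TO_IDX[aa] for aa in peptide])
    return embedding(idx)          # shape: (peptide length, 20)

P = encode_peptide("NIGGMSF")      # 7 x 20 matrix that the network maps to an RT
```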

$f(p) = \mathrm{CapsNet}(P)$

Two convolutional layers (Conv) served as the first two layers of CapsNet; such layers have proven very suitable for detecting local interactions of amino acids in a polypeptide chain.25 The feature map, i.e., the output of the Conv layers, was then fed into a capsule layer. In contrast to Conv, each neuron in a capsule layer receives a vector (8D) as input instead of a single value, which is exponentially more efficient. Subsequently, another capsule layer was appended, and the parameters between these two capsule layers were optimized using 'dynamic routing'. Finally, the root sum square (RSS) of the 16D vector in the second capsule layer was computed as the predicted peptide RT, f(p). In each training iteration, DeepRT randomly selected a mini-batch of N peptide sequences {p_1, p_2, ..., p_N} with corresponding retention times {t_1, t_2, ..., t_N} and improved the predictive performance of CapsNet by approximately minimizing the following loss function using the Adam (adaptive moment estimation) optimizer:

$loss = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(t_i - f(p_i)\right)^2}$
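A schematic Python/PyTorch rendering of this optimization step is given below as a sketch only; "model" and "loader" are hypothetical stand-ins for the CapsNet and the mini-batch iterator, not names taken from the DeepRT code.

```python
import torch

def train_epoch(model, loader, optimizer):
    """One pass over mini-batches (P, t); 'model' stands in for the CapsNet above."""
    model.train()
    for P, t in loader:
        optimizer.zero_grad()
        pred = model(P).squeeze(-1)                      # predicted RTs f(p_i)
        loss = torch.sqrt(torch.mean((t - pred) ** 2))   # the RMSE loss defined above
        loss.backward()
        optimizer.step()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, as stated in the text
```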

The hyperparameters of CapsNet, such as the numbers of filters and capsules, were chosen to match the original CapsNet. The filter sizes were changed to 8, 10 and 12, and the three resulting predictions were averaged to give the final predicted peptide RT. More detailed algorithmic information is provided in the Supporting Information, and the source code of DeepRT is available at https://github.com/horsepurve/DeepRTplus.
Transfer Learning for Automatic Model Calibration. Generally, a trained DeepRT model cannot be directly applied to new datasets because there are often linear or non-linear shifts between the optimization and testing datasets generated from different LC-MS runs26 (see Figure S1 for an illustration), and an inappropriate choice of parameters such as the ion-pairing modifier, temperature or column configuration may lead to false predictions.27 Meanwhile, if the training peptides are limited, it is impossible to build a high-quality RT model; thus, calibration of a trained model is necessary in this situation. Recently, a transfer learning strategy was used by Esteva et al.,28 in which a deep learning model was pre-trained on a larger dataset and then fine-tuned on a smaller dataset, so that the information extracted from the large dataset could compensate for the limited knowledge in the small dataset. A similar idea was borrowed for DeepRT to improve its RT prediction and model


calibration for different LC conditions. Specifically, DeepRT is first trained on a larger dataset and subsequently fine-tuned on a smaller dataset to accommodate the LC conditions of the smaller dataset.
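In code, this two-stage procedure can be sketched as follows (Python/PyTorch; the function reuses the training loop sketched above, and the epoch count and learning rate are illustrative assumptions, not the values used for DeepRT(+)).

```python
import copy
import torch

def fine_tune(pretrained_model, small_loader, epochs=10, lr=1e-4):
    """Continue training a pre-trained model on a small dataset from new LC conditions."""
    model = copy.deepcopy(pretrained_model)          # keep the pre-trained weights intact
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        train_epoch(model, small_loader, optimizer)  # reuse the training loop sketched above
    return model
```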

RESULTS AND DISCUSSION
Performance of DeepRT on RPLC. To evaluate RT prediction by the different predictors, the HeLa and yeast datasets were randomly split at a 9:1 ratio, the former part for model optimization and the latter for testing. The splitting process was repeated 10 times to eliminate selection bias. Two popular RT predictors, Elude (version 2.04) and GPTime (https://github.com/statisticalbiotechnology/GPTime), were run with default parameters on the same datasets. Although CapsNet is an emerging deep model, long short-term memory (LSTM) and residual networks (ResNet) still hold the records on many tasks, so we evaluated the performance of LSTM and ResNet in a preliminary test. The architectures of LSTM and ResNet are shown in Figure S2 and the comparison results in Figure S3. As demonstrated in Figure S3, the performance of CapsNet was consistently better than that of the other two models on various datasets. Thus, CapsNet was chosen as the representative deep learning model in all subsequent analyses. To assess the performance of RT prediction by the different software, two metrics were used: R2, the squared correlation coefficient between predicted and observed RT values, and ∆t95%, the minimal time window containing the deviations between observed and predicted RTs for 95% of the peptides. The peptide RT predictions from Elude, GPTime and DeepRT are presented in Figures 2, 3 and S4. DeepRT achieved an R2 of 0.987 and a ∆t95% of 25.9 min for the unmodified peptides of the yeast dataset, whereas Elude and GPTime both reached an R2 of 0.963, with ∆t95% values of 48.1 and 47.7 min, respectively. Elude and GPTime showed similar accuracy because they are based on the same feature set and differ only in the machine learning method. With the HeLa dataset containing modified peptides, the R2 and ∆t95% values obtained by DeepRT, Elude and GPTime were 0.970, 0.955 and 0.952, and 12.6, 16.6 and 17.0 min, respectively. The comparison illustrated in Figures 2 and 3 clearly indicates that the performance of RT prediction was improved by DeepRT in terms of both R2 and ∆t95%. In addition, averaging the predictions derived from different filter widths (8, 10 and 12) gave better results than any single filter width (Figure S5). The prediction error as a function of observed RT (i.e., for increasingly hydrophobic peptides) is shown in Figure S6. The decrease in prediction accuracy on the HeLa data compared with the yeast data may be ascribed to the limited number of training peptides, especially modified peptides, but we observed no significant difference between the prediction errors of modified and unmodified peptides (data not shown). We also examined DeepRT's performance on the Misc dataset, a large SWATH library of the human proteome in which all RT information was recalibrated using the iRT Kit. An improved prediction, with an R2 of 0.994 and a ∆t95% of 13.4 min, was obtained (Figure S7); the reasons for DeepRT's good performance on these data may be the accurately determined experimental RTs and the larger size of the dataset. Importantly, it was almost infeasible to use Elude and GPTime with the Misc dataset because of the extremely large number of peptides (131928 training peptides). All the evidence described above revealed that the deep learning-based algorithm was indeed beneficial for the RT prediction of peptides.
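For reference, the two metrics can be computed as in the NumPy sketch below; R2 is taken here as the squared Pearson correlation, and ∆t95% as the width of the narrowest residual window covering 95% of the peptides, one common reading of the definition above.

```python
import numpy as np

def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted RTs."""
    return np.corrcoef(obs, pred)[0, 1] ** 2

def delta_t95(obs, pred):
    """Width of the narrowest residual window that contains 95% of the peptides."""
    resid = np.sort(np.asarray(pred) - np.asarray(obs))
    n = len(resid)
    k = int(np.ceil(0.95 * n))                  # number of residuals the window must cover
    widths = resid[k - 1:] - resid[: n - k + 1]  # all windows of k consecutive residuals
    return float(widths.min())
```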
Using parallel computing on a GPU, DeepRT ran ~450 times faster than Elude, which is designed to run on the CPU only (4 min versus 32 h on the yeast data) (Table S2); because Elude and GPTime were prohibitively slow on datasets containing more than 10 thousand peptides, we did not train them on datasets other than HeLa and yeast in this study.
Predicting peptide RT in SCX and HILIC using DeepRT. Peptide separation by LC is governed by the chemistry of the stationary phase, such as reversed-phase, ion exchange and hydrophilic interaction. The RT predictions above using DeepRT were made only on peptides separated by RPLC; SCX and HILIC are among the most widely used LC modes after RPLC. Recently, Krokhin et al. reported SSRCalc models for predicting peptide RTs in datasets from SCX and HILIC.29,22,23 To investigate DeepRT's flexibility across different LC mechanisms, the five datasets from Krokhin et al. were used for RT prediction with DeepRT following the same strategy described above. With the SCX data, DeepRT achieved an R2 of 0.996 and a ∆t95% of 1.42 min, while with the HILIC data, DeepRT achieved R2 values of 0.988–0.993 and ∆t95% values of 2.10–2.55 min (Table 1, Figures 4 and S8). The RT predictions reported for SSRCalc on the same datasets were slightly poorer than those of DeepRT (R2 of ~0.991 for SCX and 0.973–0.98 for HILIC), suggesting that the accurate RT prediction of DeepRT was independent of the LC type.
Improved RT prediction by transfer learning. Applying an RT model to conditions different from those under which it was optimized is a difficult problem because the experimental parameters differ, such as column length, elution gradient and mobile-phase composition. Traditionally, the prediction of peptide RT is restricted to datasets generated under specific LC conditions. Here, we demonstrate how the RT prediction of DeepRT was further improved using datasets from different LC conditions. In our transfer learning strategy, the Misc dataset, comprising 146587 peptides, served as the large dataset for pre-training, and the HeLa or yeast dataset was used as the smaller dataset for fine-tuning. Using the improved DeepRT, termed DeepRT(+), the R2 value on the yeast data was boosted to 0.993 and ∆t95% dropped to 15.8 min, while on the HeLa data the R2 was enhanced to 0.980 and ∆t95% was lowered to 7.7 min (Figures 2, 3 and S3). Note that the Misc data contained no modified peptides, yet the prediction accuracy on modified HeLa peptides was still largely improved. We believe this improvement was not due to overlap between the pre-training data and the calibration data. In these datasets, the rates of overlapping peptides were relatively low, 698/146587 (HeLa/Misc) and 47/146587 (yeast/Misc). We did not remove these overlapping peptides because their RTs came from different conditions with non-linear shifts (Figure S1). Notably, transferring


the RT model from the Misc dataset to the yeast data achieved an R2 as high as ~0.993, even though the Misc data came from different species and LC conditions and the two datasets shared few peptides, with only 0.03% in common between the yeast and Misc datasets (Figures 2 and 3). This result demonstrates that DeepRT(+) can effectively exploit large datasets to improve RT prediction, regardless of whether the data come from different species or LC conditions, provided the separations follow similar physicochemical principles.
The peptide threshold required for DeepRT(+) training. Because transfer learning improves RT prediction by drawing on a larger dataset, we asked how few peptides in the smaller dataset DeepRT(+) actually requires. To address this question, fractions of 10% to 100% of the peptides were randomly selected from the Misc dataset and used to pre-train DeepRT(+). Meanwhile, the peptides in the yeast dataset were randomly divided into two parts: 90% of the peptides (12924) were further subsampled at fractions of 1% to 100% for fine-tuning, while the remaining 10% (1437) were used as the testing dataset. The other three methods, Elude, GPTime and DeepRT, were evaluated with the same fractionation approach on the yeast dataset. As presented in Figure 5A and B, three different trends emerged: 1) even at the 1% level of training peptides (143), Elude and GPTime obtained modest prediction accuracy, with R2 at ~0.9 and ∆t95% at ~75 min; however, both R2 curves quickly reached saturation once ~30% of the training peptides were used; 2) DeepRT performed relatively poorly compared with Elude and GPTime when less than 10% of the peptides were used, but its R2 and ∆t95% surpassed those of Elude and GPTime once ~20% of the peptides were used and continued to improve as the fraction of peptides increased; and 3) DeepRT(+) showed globally higher R2 values than the other three methods and achieved a satisfactory R2 (0.975–0.987) and ∆t95% (30.34–40.08 min) even when only 1% of the peptides (143) were used. Therefore, DeepRT(+) may require only a limited number of peptides, around 100, for accurate RT prediction. The same fractionation experiments were also performed on the HeLa dataset, and a similar conclusion was reached (Figure S9).
Comparison of DeepRT's amino acid features with previously reported retention coefficients. Although deep learning is often criticized as lacking interpretability, here we show that DeepRT can provide information that elucidates amino acid features in liquid chromatography. During training, DeepRT encodes each amino acid into a unique vector, and after training these vectors are expected to reflect retention-related properties of the amino acids. Hence, we extracted these vectors and measured the similarity between them using Pearson product-moment correlation coefficients. On the basis of these similarities, the amino acids were hierarchically clustered, as shown in Figure 6 for RPLC and SCX and in Figure S10 for HILIC. Interestingly, the amino acids clustered into broadly two groups, hydrophobic and hydrophilic. With the HeLa dataset, almost all the hydrophobic or hydrophilic amino acids clustered with their respective groups, and the three basic amino acids exhibited particularly high correlation.
Moreover, the modified residues (m, s, t and y) were close to their unmodified counterparts (M, S, T and Y), and the two isomers I and L shared very similar vectors. With the HILIC datasets, the hydrophilic, acidic and basic amino acids formed separate clusters, while on the SCX dataset the basic amino acids clustered together. Thus, even under different LC types, the influences of the amino acids on RT were captured by deep learning. Additionally, the Rc values estimated by Krokhin et al. are shown in the rightmost column of Figure 6A and B. Comparing the cluster results with the Rc ranks, the amino acids that clustered together were well matched to those with similar Rc values (Figure 6). The Rc values generated by these different approaches thus provide additional evidence that DeepRT indeed discovered key elements for predicting peptide RT.
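The vector-similarity analysis described above can be sketched as follows (Python with NumPy/SciPy assumed available; the average-linkage choice is illustrative, since the clustering settings are not specified here).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def cluster_residues(embeddings):
    """embeddings: dict mapping a residue symbol (e.g., 'A', 'm') to its learned 20-D vector."""
    labels = list(embeddings)
    vecs = np.stack([embeddings[a] for a in labels])
    corr = np.corrcoef(vecs)                      # Pearson similarity, scaled from -1 to 1
    dist = squareform(1.0 - corr, checks=False)   # convert similarity to a condensed distance
    tree = linkage(dist, method="average")        # hierarchical (average-linkage) clustering
    return dendrogram(tree, labels=labels, no_plot=True)["ivl"]   # leaf order of residues
```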

CONCLUSIONS
By virtue of deep learning, DeepRT overcame a longstanding difficulty of peptide RT prediction through its capacity to extract, from large peptide datasets, the fundamental factors that determine peptide behavior on LC. The comparison of prediction performance between Elude, GPTime and DeepRT on the same datasets demonstrated that deep learning is capable of improving prediction accuracy. Further, using transfer learning, DeepRT(+) combined the information drawn from small and large peptide datasets and could offer accurate prediction from as few as ~100 experimental peptides. The embedding method was adopted to characterize the properties of amino acids under different LC types, and the evaluation suggested that DeepRT was able to learn amino acid behavior across LC modes. We attribute its superiority to two aspects: a modern deep learning model that can learn intricate relationships, and large-scale, high-quality RT datasets that can be used to optimize the RT model. DeepRT is thus expected to be applicable to a wide range of studies requiring peptide RT prediction. For example, the models trained here can be used to evaluate separation orthogonality in silico between different LC pairs and to aid the choice of separation mechanisms. DeepRT can also be used to discriminate structurally similar peptide isomers (Figure S11). The status of an amino acid residue within a peptide, such as its modification or steric configuration, may be a pivotal determinant of the peptide's behavior on LC. In Figure 6, the clustering results show that DeepRT automatically captured not only the similarity of different amino acids but also slight differences between similar amino acids. For example, it picked up the differences between unmodified and oxidized methionine and between normal and phosphorylated tyrosine. Krokhin et al. pointed out that oxidized methionine residues (e.g., sulfoxide methionine and sulfate methionine) have reduced hydrophobicity index units on RPLC, indicating that oxidized methionine is more hydrophilic,30 while Marx et al. reported that the


phosphorylation of serine, threonine and tyrosine changes their hydrophobicity in the order phosphorylated serine > threonine > tyrosine.31 Our results derived from the embedding clusters are in good agreement with these reports (Table S3), showing that, for a given LC type, DeepRT accurately detected such changes in amino acids without being given detailed information about their physicochemical features. Data-independent acquisition (DIA)-based proteomics, or SWATH, has drawn great attention in the proteomics community because it can provide more accurate quantification without labeling and allows individual samples to be examined at large scale. The construction of ion libraries based on data-dependent acquisition (DDA) is an important step prior to large-scale DIA analysis, and, in contrast to the common shotgun approach, RT values are critical in the ion library because both MS/MS spectra and RTs determine the outcome of DIA data searches. Owing to the limitations of LC experiments, an investigator in a DIA study often finds it difficult to evaluate whether RT errors originate from the DDA-derived ion library or from the peptide detection in the DIA data. In light of the accurate prediction offered by DeepRT, it is reasonable to expect that the approach could produce a better ion library with additional in silico RT prediction and improve the match rate of peptides between the ion library and the DIA data. Indeed, RT prediction has been proposed to benefit the annotation of DIA data, given DIA's greater detection sensitivity for peptides.14 In future studies, we plan to integrate DeepRT with pDeep to further improve quantitative proteomics from DIA experiments.

FIGURES Figure 1. Schematic illustration of the capsule network (CapsNet) for RT modeling. Every amino acid in the peptide ‘NIGGMSF’ is encoded into vectors (20D) in the Embedding layer and then goes through two convolutional layers. Filter number and capsule number of CapsNet are set at 256 and 32, respectively, as suggested by Sabour et al.18 The capsules receive and produce vectors, 8D in the first capsule layer and 16D in the second capsule layer. The root sum square (RSS) of the vector in the last capsule is computed as the predicted RT of NIGGMSF.

Figure 2. Correlation between observed and predicted RTs for the yeast data by (A) Elude, (B) GPTime, (C) DeepRT and (D) DeepRT(+). The 5% of peptides with the largest deviations are labeled in red.


Figure 3. Comparison of peptide retention time prediction accuracy for the four software tools in terms of (A) R2 and (B) ∆t95%. The error bars reflect the variance over 10 random runs.


Figure 4. Correlation between observed and predicted RTs obtained by DeepRT on the SCX dataset.

Figure 5. Improvement of peptide RT prediction using transfer learning. In this process, the Misc dataset with 146587 peptides was taken as the pre-training dataset, and the yeast dataset with 14361 peptides was used as the fine-tuning dataset. The peptides in the datasets were randomly fractionated as described in the Results, and the resulting performance is presented as (A) R2 and (B) ∆t95%.

Figure 6. Comparison of the amino acid properties determined by DeepRT with previously reported retention coefficients for RPLC and SCX. The columns and rows represent individual amino acids, and the property similarities are scaled from -1 to 1 as shown in the color bar at the top right. The far right columns represent the internal retention coefficients (Rc) reported by Krokhin et al.29,22 The comparisons were conducted for two types of LC, i.e., (A) RPLC and (B) SCX. The symbols m, s, t, and y denote oxidized methionine and phosphorylated serine, threonine, and tyrosine, respectively.


TABLES
Table 1. Dataset summary and DeepRT's prediction accuracy. The value ∆tr95% represents the relative ∆t95%, defined as ∆t95% as a proportion of the overall elution time.

dataset         | source | LC type | no. peptides | training | testing | R2*   | ∆t95% (min) | ∆tr95% (%) | reference
HeLa            | human  | RPLC    | 3413         | 3071     | 342     | 0.971 | 12.56       | 11.42      | Sharma et al.20
Yeast           | yeast  | RPLC    | 14361        | 12924    | 1437    | 0.987 | 25.88       | 9.80       | Nagaraj et al.19
Misc            | human  | RPLC    | 146587       | 131928   | 14659   | 0.994 | 13.40       | 5.49       | Rosenberger et al.21
SCX             | yeast  | SCX     | 30482        | 27433    | 3049    | 0.996 | 1.42        | 3.09       | Gussakovsky et al.22
Luna HILIC      | yeast  | HILIC   | 36271        | 32643    | 3628    | 0.989 | 2.55        | 6.07       | Spicer et al.23
Xbridge Amide   | yeast  | HILIC   | 40290        | 36261    | 4029    | 0.993 | 2.36        | 6.38       | Spicer et al.23
Atlantis Silica | yeast  | HILIC   | 39091        | 35181    | 3910    | 0.990 | 2.10        | 5.68       | Spicer et al.23
Luna Silica     | yeast  | HILIC   | 37110        | 33399    | 3711    | 0.988 | 2.30        | 6.05       | Spicer et al.23

*Note: Accuracy reported in the table was based on the prediction from DeepRT trained from scratch without pre-training.

ASSOCIATED CONTENT The Supporting Information is available free of charge on the ACS Publications website. Figure S1: Non-linear shifts between different experimental runs; Figure S2: The network architectures of LSTM and ResNet for RT prediction; Figure S3: Comparison of different deep learning models including LSTM, ResNet and CapsNet on two datasets, HeLa and yeast; Figure S4: Correlation of the observed and predicted RT from HeLa data by Elude, GPTime, DeepRT and DeepRT(+); Figure S5: Prediction result of HeLa before and after averaging multiple versions of different filter widths; Figure S6: Prediction error (∆t95%) with increasing observed RT on yeast data; Figure S7: Correlation of the observed and predicted RT from the Misc data by DeepRT; Figure S8: Correlation of observed and predicted RT by DeepRT using four HILIC datasets; Figure S9: R2 and ∆t95% of RT prediction of HeLa data (342 peptides); Figure S10: Similarity of amino acid properties learned by DeepRT and relation with previously reported retention coefficients (Rc) in Luna HILIC, Xbridge Amide, Atlantis Silica and Luna Silica; Figure S11: Prediction result on isomers of the Misc data; Table S1: LC parameters of three RPLC datasets; Table S2: hardware platforms for testing the three software tools and their running times; Table S3: Similarities of amino acid vectors of m, s, t, y with their unmodified counterparts (M, S, T, Y); and capsule network and transfer learning for peptide retention prediction (PDF)

AUTHOR INFORMATION Corresponding Author *E-mail: [email protected].

ORCID Chunwei Ma: 0000-0001-9410-1264 Notes The authors declare no competing financial interest.

ACKNOWLEDGMENT This study was supported by the National Key R&D Program of China (2017YFC0908400), the National Natural Science Foundation of China (31500670) and the National Key Basic Research Program of China (2014CBA02002, 2014CBA02005).


REFERENCES
(1) Ma, C.; Xu, S.; Liu, G.; Liu, X.; Xu, X.; Wen, B.; Liu, S. Improvement of Peptide Identification with Considering the Abundance of mRNA and Peptide. BMC Bioinformatics 2017, 18 (1), 109.
(2) Klammer, A. A.; Yi, X.; MacCoss, M. J.; Noble, W. S. Improving Tandem Mass Spectrum Identification Using Peptide Retention Time Prediction across Diverse Chromatography Conditions. Anal. Chem. 2007, 79 (16), 6111–6118.
(3) Ting, Y. S.; Egertson, J. D.; Bollinger, J. G.; Searle, B. C.; Payne, S. H.; Noble, W. S.; MacCoss, M. J. PECAN: Library-Free Peptide Detection for Data-Independent Acquisition Tandem Mass Spectrometry Data. Nat. Methods 2017, 14 (9), 903–908.
(4) Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, J. M. R.; Hodges, R. S. Prediction of Peptide Retention Times in Reversed-Phase High-Performance Liquid Chromatography I. Determination of Retention Coefficients of Amino Acid Residues of Model Synthetic Peptides. J. Chromatogr. A 1986, 359, 499–518.
(5) Krokhin, O. V. Sequence-Specific Retention Calculator. Algorithm for Peptide Retention Prediction in Ion-Pair RP-HPLC: Application to 300- and 100-Å Pore Size C18 Sorbents. Anal. Chem. 2006, 78 (22), 7785–7795.
(6) Petritis, K.; Kangas, L. J.; Yan, B.; Monroe, M. E.; Strittmatter, E. F.; Qian, W.-J.; Adkins, J. N.; Moore, R. J.; Xu, Y.; Lipton, M. S.; et al. Improved Peptide Elution Time Prediction for Reversed-Phase Liquid Chromatography-MS by Incorporating Peptide Sequence Information. Anal. Chem. 2006, 78 (14), 5026–5039.
(7) Moruz, L.; Tomazela, D.; Käll, L. Training, Selection, and Robust Calibration of Retention Time Models for Targeted Proteomics. J. Proteome Res. 2010, 9 (10), 5209–5216.
(8) Moruz, L.; Staes, A.; Foster, J. M.; Hatzou, M.; Timmerman, E.; Martens, L.; Käll, L. Chromatographic Retention Time Prediction for Post-translationally Modified Peptides. Proteomics 2012, 12 (8), 1151–1159.
(9) Maboudi Afkham, H.; Qiu, X.; The, M.; Käll, L. Uncertainty Estimation of Predictions of Peptides' Chromatographic Retention Times in Shotgun Proteomics. Bioinformatics 2017, 33 (4), 508–513.
(10) Pfeifer, N.; Leinenbach, A.; Huber, C. G.; Kohlbacher, O. Statistical Learning of Peptide Retention Behavior in Chromatographic Separations: A New Kernel-Based Approach for Computational Proteomics. BMC Bioinformatics 2007, 8 (1), 468.
(11) Lu, W.; Liu, X.; Liu, S.; Cao, W.; Zhang, Y.; Yang, P. Locus-Specific Retention Predictor (LsRP): A Peptide Retention Time Predictor Developed for Precision Proteomics. Sci. Rep. 2017, 7, 43959.
(12) Tarasova, I. A.; Masselon, C. D.; Gorshkov, A. V.; Gorshkov, M. V. Predictive Chromatography of Peptides and Proteins as a Complementary Tool for Proteomics. Analyst 2016, 141, 4816–4832.
(13) Moruz, L.; Käll, L. Peptide Retention Time Prediction. Mass Spectrom. Rev. 2017, 36 (5), 615–623.
(14) Zhou, X. X.; Zeng, W. F.; Chi, H.; Luo, C.; Liu, C.; Zhan, J.; He, S. M.; Zhang, Z. pDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Anal. Chem. 2017, 89 (23), 12690–12697.
(15) Tran, N. H.; Zhang, X.; Xin, L.; Shan, B.; Li, M. De Novo Peptide Sequencing by Deep Learning. Proc. Natl. Acad. Sci. 2017, 114 (31), 8247–8252.
(16) Kim, M.; Eetemadi, A.; Tagkopoulos, I. DeepPep: Deep Proteome Inference from Peptide Profiles. PLOS Comput. Biol. 2017, 13 (9), 1–17.
(17) Date, Y.; Kikuchi, J. Application of a Deep Neural Network to Metabolomics Studies and Its Performance in Determining Important Variables. Anal. Chem. 2018, 90 (3), 1805–1810.
(18) Sabour, S.; Frosst, N.; Hinton, G. E. Dynamic Routing Between Capsules. In Advances in Neural Information Processing Systems 30; 2017; pp 3859–3869.
(19) Nagaraj, N.; Kulak, N. A.; Cox, J.; Neuhauser, N.; Mayr, K.; Hoerning, O.; Vorm, O.; Mann, M. System-Wide Perturbation Analysis with Nearly Complete Coverage of the Yeast Proteome by Single-Shot Ultra HPLC Runs on a Bench Top Orbitrap. Mol. Cell. Proteomics 2012, 11 (3), M111.013722.
(20) Sharma, K.; D'Souza, R. C. J.; Tyanova, S.; Schaab, C.; Wiśniewski, J. R.; Cox, J.; Mann, M. Ultradeep Human Phosphoproteome Reveals a Distinct Regulatory Nature of Tyr and Ser/Thr-Based Signaling. Cell Rep. 2014, 8 (5), 1583–1594.
(21) Rosenberger, G.; Koh, C. C.; Guo, T.; Röst, H. L.; Kouvonen, P.; Collins, B. C.; Heusel, M.; Liu, Y.; Caron, E.; Vichalkovski, A. A Repository of Assays to Quantify 10,000 Human Proteins by SWATH-MS. Sci. Data 2014, 1, 140031.
(22) Gussakovsky, D.; Neustaeter, H.; Spicer, V.; Krokhin, O. V. Sequence-Specific Model for Peptide Retention Time Prediction in Strong Cation Exchange Chromatography. Anal. Chem. 2017, 89 (21), 11795–11802.
(23) Spicer, V.; Krokhin, O. V. Peptide Retention Time Prediction in Hydrophilic Interaction Liquid Chromatography. Comparison of Separation Selectivity between Bare Silica and Bonded Stationary Phases. J. Chromatogr. A 2018, 1534, 75–84.
(24) LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521 (7553), 436–444.
(25) Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLOS Comput. Biol. 2017, 13 (1), 1–34.
(26) Podwojski, K.; Fritsch, A.; Chamrad, D. C.; Paul, W.; Sitek, B.; Stühler, K.; Mutzel, P.; Stephan, C.; Meyer, H. E.; Urfer, W.; et al. Retention Time Alignment Algorithms for LC/MS Data Must Consider Non-Linear Shifts. Bioinformatics 2009, 25 (6), 758–764.
(27) Krokhin, O. V. Comparison of Peptide Retention Prediction Algorithm in Reversed-Phase Chromatography. Comment on "Predictive Chromatography of Peptides and Proteins as a Complementary Tool for Proteomics", by I. A. Tarasova, C. D. Masselon, A. V. Gorshkov and M. V. Gorshkov, Analyst, 2016, 141, 4816. Analyst 2017, 142 (11), 2050–2051.
(28) Esteva, A.; Kuprel, B.; Novoa, R. A.; Ko, J.; Swetter, S. M.; Blau, H. M.; Thrun, S. Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks. Nature 2017, 542 (7639), 115–118.
(29) Krokhin, O. V.; Ezzati, P.; Spicer, V. Peptide Retention Time Prediction in Hydrophilic Interaction Liquid Chromatography: Data Collection Methods and Features of Additive and Sequence-Specific Models. Anal. Chem. 2017, 89 (10), 5526–5533.
(30) Lao, Y. W.; Gungormusler-Yilmaz, M.; Shuvo, S.; Verbeke, T.; Spicer, V.; Krokhin, O. V. Chromatographic Behavior of Peptides Containing Oxidized Methionine Residues in Proteomic LC–MS Experiments: Complex Tale of a Simple Modification. J. Proteomics 2015, 125, 131–139.
(31) Marx, H.; Lemeer, S.; Schliep, J. E.; Matheron, L.; Mohammed, S.; Cox, J.; Mann, M.; Heck, A. J. R.; Kuster, B. A Large Synthetic Peptide and Phosphopeptide Reference Library for Mass Spectrometry-Based Proteomics. Nat. Biotechnol. 2013, 31 (6), 557–564.
