
Article

AP3: an advanced proteotypic peptide predictor for targeted proteomics by incorporating peptide digestibility

Zhiqiang Gao, Cheng Chang, Jinghan Yang, Yunping Zhu, and Yan Fu

Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.9b02520 • Publication Date (Web): 06 Jun 2019


Analytical Chemistry

AP3: an advanced proteotypic peptide predictor for targeted proteomics by incorporating peptide digestibility

Zhiqiang Gao1,3||, Cheng Chang2||, Jinghan Yang1,3, Yunping Zhu2,4*, Yan Fu1,3*

1. National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.

2. State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China.

3. School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.

4. Anhui Medical University, Hefei 230032, China.

|| Contributed equally to this work

* To whom correspondence should be addressed:

Yan Fu, Email: [email protected]

Yunping Zhu, Email: [email protected]

ABSTRACT

The selection of proteotypic peptides, i.e., detectable unique representatives of proteins of interest, is a key step in targeted proteomics. To date, much effort has been made to understand the mechanisms underlying peptide detection in liquid chromatography-tandem mass spectrometry (LC-MS/MS) based shotgun proteomics and to predict proteotypic peptides in the absence of experimental LC-MS/MS data. However, the prediction accuracy of existing tools is still unsatisfactory. We find that one crucial reason is their neglect of the significant influence of protein proteolytic digestion on peptide detectability in shotgun proteomics. Here, we present an Advanced Proteotypic Peptide Predictor (AP3), which explicitly takes peptide digestibility into account for the prediction of proteotypic peptides. Specifically, peptide digestibility is first predicted for each peptide and then incorporated as a feature into the peptide detectability prediction model. Our results demonstrated that peptide digestibility is the most important feature for the accurate prediction of proteotypic peptides in our model. Compared with the existing available algorithms, AP3 showed 10.3%-34.7% higher prediction accuracy. On a targeted proteomics data set, AP3 accurately predicted the proteotypic peptides for proteins of interest, showing great potential for assisting the design of targeted proteomics experiments.

Introduction

In shotgun proteomics, proteins are first digested by an enzyme into peptide mixtures, which are then analyzed by liquid chromatography-tandem mass spectrometry (LC-MS/MS). In recent years, MS-based targeted proteomics methods, such as multiple reaction monitoring (MRM), have enabled the sensitive identification and quantification of proteins of interest and have become a promising tool for biological and clinical studies, such as the verification of candidate biomarkers1,2. The first key step in developing an MRM assay is the selection of proteotypic peptides, i.e., the peptides that are unique representatives of their corresponding proteins and are detectable in proteomic experiments3. There are two major approaches to proteotypic peptide selection: the experimental approach and the computational approach. The former selects previously identified peptides as proteotypic peptides, while the latter predicts proteotypic peptides using empirical or machine learning models. Although the experimental approach has been successfully applied in MRM assays, it has some limitations. For example, not all target proteins have experimental evidence, especially proteins identified by literature mining. Thus, researchers are paying more attention to the computational approach. However, the mechanisms underlying peptide detection are still unclear, which hinders the development of accurate proteotypic peptide prediction algorithms.

To date, much effort has been devoted to understanding the mechanisms of peptide detection and predicting proteotypic peptides. In early studies, empirical score functions were constructed based on the physicochemical properties of peptides, such as hydrophobicity, peptide length and isoelectric point4,5. In recent years, several machine learning-based algorithms6–12 have been developed to predict proteotypic or high-responding peptides. In these studies, designing the features that characterize the peptides is one of the key steps. The likelihood of observing a peptide in a shotgun proteomics experiment is governed by many factors, such as the physicochemical properties of the peptide, the LC-MS/MS behavior of the peptide, and the abundance of its corresponding protein13,14. Tang et al.13 first proposed the concept of peptide detectability, which is defined as the probability that a peptide would be observed in a standard sample analyzed by a standard proteomics routine. They also developed a machine learning model that used 175 features derived from the peptide sequence to predict the peptide detectability. Later, Sander et al.9, Mallick et al.8 and Eyers et al.11 developed their models using 596, 1010, and 1186 features, respectively. These features mainly include AAindex-derived features15 and sequence-derived features. Recently, Muntel et al.16 considered the protein abundance as an additional feature and obtained improved performance. However, protein abundance information is generally unavailable in the absence of experimental LC-MS/MS data.

A common limitation of the above algorithms is that they considered only the features of the peptides themselves and not the influence of protein proteolytic digestion on peptide detectability. A typical shotgun proteomics experiment can be divided into two successive processes: protein proteolytic digestion and peptide detection by LC-MS/MS. Peptide detectability is the probability of observing a peptide in the whole shotgun proteomics experiment, not only in the peptide detection process by LC-MS/MS. It is therefore important to know which peptides, and in what proportions, are released by digestion and subjected to the subsequent LC-MS/MS analysis. Previous studies have demonstrated that protein digestion by the commonly used enzyme trypsin is always incomplete17. The cleavage probability of a tryptic site is mainly determined by the amino acids surrounding the site and is also affected by other factors, such as the local conformation and tertiary structure of its protein as well as the experimental conditions of digestion17. Several algorithms have been demonstrated to be successful in predicting the cleavage probabilities of tryptic sites exclusively from the adjacent amino acids18,19. Note that half of the amino acids surrounding a cleavage site are outside the peptide sequence and therefore are not characterized by previous peptide detectability prediction algorithms. For more accurate prediction of peptide detectability, the peptide digestibility must be explicitly taken into account.

Here, we present a novel algorithm named AP3 (short for Advanced Proteotypic Peptide Predictor) which, for the first time, integrates the protein digestion and peptide LC-MS/MS detection processes to predict proteotypic peptides. Specifically, it first predicts the peptide digestibility and then incorporates it as a feature into the peptide detectability prediction model. Our results showed that peptide digestibility was the most important feature for predicting proteotypic peptides, increasing the 10-fold cross-validation AUC (area under the receiver operating characteristic (ROC) curve) from 0.8891 to 0.9269. Furthermore, we demonstrated that the peptide detectability prediction model trained on a yeast data set retained good predictive power over three independent comprehensive data sets from E. coli (AUC 0.9405), mouse (AUC 0.9313) and human (AUC 0.9213) samples. Compared with existing tools, including PeptideSieve8, CONSeQuence11, ESP Predictor6 and PPA16, AP3 exhibited a 10.3%-34.7% higher AUC. Finally, we showed that AP3 can effectively predict proteotypic peptides for targeted proteomics in the absence of experimental LC-MS/MS data.


Experimental Section

Data sets

A public large-scale yeast data set20 was used to train the AP3 model. To test the generalization performance of AP3, we used three independent public data sets from other organisms, i.e., E. coli21, mouse22 and human23. In all of these data sets, the proteolytic digestion was performed by incubation with trypsin overnight. For the human data set, the lymph node and salivary gland data from the original publication were used. We reanalyzed the raw MS files as follows.

Mass spectra in each data set were searched against the corresponding organism sequences from the UniProt database using the Andromeda search engine24 in the MaxQuant software25 (version 1.6.0.1). Carbamidomethylation on cysteine was set as a fixed modification. Methionine oxidation and protein N-terminal acetylation were set as variable modifications. Protein sequences were theoretically digested using fully tryptic cleavage constraints, and up to two missed cleavage sites were allowed. The precursor mass tolerance was set to 20 ppm for the first search and 4.5 ppm for the main search. The mass tolerance for fragment ions was set to 0.05 Da for Q Exactive Plus and 0.5 Da for other instruments. False discovery rates at the protein and peptide levels were both set to 1%.

Filtration of identified proteins

To construct more confident training and test data sets for AP3, the identified proteins were further filtered by their spectral counts (SCs) and sequence coverages. The SC of a protein was calculated as the sum of the SCs of its identified peptides. It is reasonable to assume that a protein is confidently identified if it has a large SC and a high sequence coverage. Therefore, we ranked the proteins separately by SC and by sequence coverage, and kept those proteins that appeared in the top 50% of both rankings. Table 1 provides a summary of the four data sets.

Table 1. Summary of the four public data sets used for training and testing AP3

Data set name   Usage      Instrument           Proteins identified   Peptides identified   Proteins after filtration   Peptides after filtration
Yeast           training   Orbitrap Fusion      3959                  43088                 1556                        29172
E. coli         test       LTQ Orbitrap Elite   2898                  38035                 1077                        23674
Mouse           test       Q Exactive Plus      5304                  58184                 2120                        39095
Human           test       LTQ Orbitrap         6422                  80490                 2606                        55289
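The filtration rule described before Table 1 (keep only proteins ranked in the top 50% by spectral count and also in the top 50% by sequence coverage) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `filter_proteins`, the input layout, the toy values, and the handling of ties and odd protein counts are all assumptions.

```python
from typing import Dict, List, Tuple


def filter_proteins(stats: Dict[str, Tuple[int, float]],
                    top_fraction: float = 0.5) -> List[str]:
    """Keep proteins in the top fraction of BOTH rankings.

    stats maps a protein ID to (spectral_count, sequence_coverage).
    Rounding down and arbitrary tie order are assumptions of this sketch.
    """
    n_keep = int(len(stats) * top_fraction)
    by_sc = sorted(stats, key=lambda p: stats[p][0], reverse=True)[:n_keep]
    by_cov = sorted(stats, key=lambda p: stats[p][1], reverse=True)[:n_keep]
    top_cov = set(by_cov)
    # Return the intersection, ordered by spectral count.
    return [p for p in by_sc if p in top_cov]


# Hypothetical toy input: protein ID -> (spectral count, sequence coverage).
proteins = {
    "P1": (120, 0.80),
    "P2": (95, 0.15),
    "P3": (10, 0.70),
    "P4": (3, 0.05),
}
print(filter_proteins(proteins))  # ['P1']
```

In the toy input, P2 has a high spectral count but low coverage and P3 the reverse, so only P1 passes both criteria.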

Workflow of AP3 algorithm

As Figure 1 illustrates, the development workflow of AP3 has two main components: a peptide digestibility predictor and a peptide detectability predictor. To avoid over-fitting and confounding the peptide digestibility and detectability in the training stage, the filtered identified proteins are randomly divided into two halves. The first half is used to train the peptide digestibility predictor and the other half is used to train the peptide detectability predictor. A random forest classifier is first trained to predict the cleavage probabilities of tryptic sites. Then, the digestibility of each peptide is calculated from the cleavage probabilities of its tryptic sites and integrated into the peptide detectability predictor as one of the features that characterize the peptide. Next, the other 587 physicochemical features of the peptides are calculated and feature selection is performed. Finally, another random forest classifier is trained for predicting the peptide detectability.


Figure 1. The development workflow of AP3. The identified proteins are divided into two halves to completely separate the training sets of peptide digestibility predictor and peptide detectability predictor. The first half is used to train a random forest classifier for predicting cleavage probabilities of tryptic sites. The other half is used to train another random forest classifier for predicting peptide detectability. The predicted digestibility is integrated as a feature into the peptide detectability predictor.
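The random split of the filtered proteins into two disjoint training halves can be sketched as below. The helper name `split_in_halves`, the fixed seed, and the placeholder protein IDs are assumptions made for illustration; the text specifies only that the division is random and that the two training sets are completely separated.

```python
import random
from typing import List, Sequence, Tuple


def split_in_halves(protein_ids: Sequence[str],
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the protein IDs and cut the list in the middle."""
    shuffled = list(protein_ids)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# 1556 filtered yeast proteins (Table 1); the IDs here are placeholders.
ids = [f"protein_{i}" for i in range(1556)]
digestibility_half, detectability_half = split_in_halves(ids)
print(len(digestibility_half), len(detectability_half))  # 778 778
```

Splitting at the protein level, rather than the peptide level, means that no protein contributes to both predictors.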

Peptide digestibility predictor

Constructing the training set. Identified peptides were first mapped to their corresponding protein sequences. Then, the cleavage information of potential tryptic sites, i.e., the arginine (R) and lysine (K) residues, in the first half of the identified proteins was collected, including the SCs of the peptides observed on the N-terminal (SCN) and C-terminal (SCC) sides of the tryptic site, and the SCs of the peptides containing this tryptic site as a missed cleavage site (SCM). Tryptic sites were labeled as positive sites if (1) SCN was at least 1, (2) SCC was at least 1 and (3) SCM was zero, and as negative sites if both SCN and SCC were zero and SCM was at least 2. Previous studies have demonstrated that the cleavage probability of a tryptic site is influenced mainly by the adjacent amino acids18,19,26. Therefore, for each tryptic site, a 9-mer consisting of the tryptic site and four adjacent residues on each side was extracted. If the tryptic site was located near the N or C terminus of a protein, leaving too few amino acids to form a 9-mer, the character Z was added as padding. Each character in the 9-mer, except for the tryptic site itself (arginine or lysine for trypsin), was converted into a 21-dimensional binary vector that indicated which of the 20 amino acids or the character Z appeared: the position corresponding to the observed character was set to 1, and the other positions were set to 0. Thus, each 9-mer was converted into a 168-dimensional (8 × 21) binary vector that retained both position and residue-specific information.
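The labeling rule and the 9-mer encoding described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the function names, the 0-based site index, the ordering of the 21-letter alphabet, and the example sequence are assumptions, and residues outside the 20 standard amino acids are not handled.

```python
from typing import List, Optional

ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"  # 20 amino acids plus padding character Z
FLANK = 4                           # residues taken on each side of the site


def label_site(sc_n: int, sc_c: int, sc_m: int) -> Optional[int]:
    """Return 1 (positive), 0 (negative) or None (not used for training)."""
    if sc_n >= 1 and sc_c >= 1 and sc_m == 0:
        return 1
    if sc_n == 0 and sc_c == 0 and sc_m >= 2:
        return 0
    return None


def extract_9mer(protein: str, site: int) -> str:
    """Tryptic site (0-based index) plus four residues per side, Z-padded."""
    padded = "Z" * FLANK + protein + "Z" * FLANK
    return padded[site:site + 2 * FLANK + 1]


def encode_9mer(window: str) -> List[int]:
    """One-hot encode the eight flanking residues; the central K/R is skipped."""
    vector: List[int] = []
    for pos, residue in enumerate(window):
        if pos == FLANK:
            continue
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(residue)] = 1
        vector.extend(one_hot)
    return vector


window = extract_9mer("MKTAYIAKQR", 1)  # lysine at index 1, near the N terminus
print(window)                           # ZZZMKTAYI
print(len(encode_9mer(window)))         # 168
```

Sites that satisfy neither rule (for example, a site with evidence for both cleavage and missed cleavage) return `None` and would be left out of the training set under this reading.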

Random forest classifier for cleavage probability prediction. We used the random forest algorithm to predict the cleavage probabilities of tryptic sites. A random forest is a nonlinear ensemble classifier consisting of a collection of independent unpruned trees27. Its randomness is reflected in two aspects27: each tree is trained on a bootstrap sample of the training set, and the best split at each node is chosen from a randomly selected subset of the features. To implement the random forest algorithm, we used TreeBagger in MATLAB, which outputs for each site a probability of being cleaved. The number of trees was set to 200. The number of randomly selected features for each node was set to the square root of the total number of features.

Peptide digestibility calculation. The peptide digestibility, which is defined as the probability of the peptide being produced by the protein digestion process, is calculated from the predicted cleavage probabilities of tryptic sites using the following formula:

$$\text{peptide digestibility} = e_N \cdot e_C \cdot \prod_{i=1}^{n} (1 - e_i)$$

where $e_N$ and $e_C$ are the predicted cleavage probabilities of the N- and C-terminal tryptic sites of the peptide, respectively, $e_i$ is the predicted cleavage probability of the $i$-th missed cleavage site in the peptide, and $n$ is the total number of missed cleavage sites in the peptide.
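As a worked example of the digestibility formula, the sketch below multiplies the cleavage probabilities of the two terminal sites by the probability that each internal site stays uncleaved. The function name and the numeric inputs are hypothetical. How a peptide at a protein terminus is treated is not stated in the text; this sketch assumes the caller passes 1.0 for a terminus that needs no cleavage.

```python
from typing import Sequence


def peptide_digestibility(e_n: float, e_c: float,
                          missed: Sequence[float] = ()) -> float:
    """e_N * e_C * prod(1 - e_i) over the missed cleavage sites."""
    result = e_n * e_c
    for e_i in missed:
        result *= 1.0 - e_i
    return result


# Hypothetical probabilities: both ends cleave readily, and the single
# internal site is unlikely to be cut (e_1 = 0.25).
print(round(peptide_digestibility(0.9, 0.8, [0.25]), 4))  # 0.54
```

With no missed cleavage sites the product is empty and the digestibility reduces to `e_n * e_c`.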

Peptide detectability predictor

Constructing the training set. For each protein in the second half of the training data set, in silico digestion was performed with up to 2 missed cleavage sites, and the length range of the digested peptides was set to that of the identified peptides. The peptides with more than one SC were taken as positive peptides, and the unidentified digested peptides were taken as negative peptides. Because the mechanisms underlying peptide detectability are still not clear, we collected as many computable features of peptides as possible. The first was the peptide digestibility, which was calculated for each peptide using the method described above. To our knowledge, peptide digestibility has never been used as a feature before. The other features are as follows. Peptide length, number of missed cleavage sites, peptide molecular weight and the frequencies of the 20 amino acids were calculated from the peptide sequence. In addition, 544 amino acid-related physicochemical features from AAindex15 were used. For each of the 544 AAindex features, the numerical values of the constituent amino acids in a peptide were averaged to produce a single value. Furthermore, 20 other features collected from previous studies7,11,13,28,29 on peptide detectability were added. In total, 588 features were used to characterize each peptide (Table S1).


Random forest classifier for peptide detectability prediction. AP3 employed another random forest classifier to predict the peptide detectability. The random forest toolbox and parameter settings were the same as those used for predicting the cleavage probability. In the construction of the training set, the number of identified peptides was usually far smaller than the number of unidentified digested peptides. To overcome this imbalance problem, we applied the down-sampling technique30 to the negative peptides (i.e., the majority class). In brief, as many negative peptides as there were positive peptides were randomly selected from the whole set of negative peptides to form a balanced training set. Next, since the scales of the features were different, z-score normalization31 was applied to each feature to obtain zero mean and unit variance. To reduce the redundancy of features and increase the computational efficiency, feature selection was performed using the minimum redundancy maximum relevance (mRMR) method32 (Supplementary Note 1).
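The two preprocessing steps above, down-sampling the negative class and z-score normalizing each feature, can be sketched with the Python standard library. The function names, the fixed seed, and the use of the population standard deviation are assumptions of this sketch; the text does not say which variance estimator was used.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence


def down_sample(negatives: Sequence, n_positive: int, seed: int = 0) -> List:
    """Randomly draw as many negatives as there are positives (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(list(negatives), n_positive)


def z_score(column: Sequence[float]) -> List[float]:
    """Rescale one feature column to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical toy data: 100 negative peptides, 10 positive peptides.
balanced_negatives = down_sample(range(100), n_positive=10)
print(len(balanced_negatives))     # 10
print(z_score([1.0, 2.0, 3.0])[1])  # 0.0
```

Because the draw is random, repeating the down-sampling with different seeds gives different balanced training sets, which is how the stability of a model can be checked.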

After peptide detectability prediction, the peptides for each protein of interest were sorted by their predicted detectabilities in descending order and the top peptides were selected as the proteotypic peptides of this protein.

Results and Discussion

We trained the AP3 algorithm using the Yeast data set, tested its generalization ability on the E. coli, Mouse and Human data sets, and compared it with existing tools. In addition, we evaluated the contribution of the feature peptide digestibility to the performance of AP3.

Performance of tryptic-site cleavage probability predictor

There were 3959 proteins with 43088 peptides identified in the Yeast data set. After filtering by the sequence coverages and SCs of the proteins, 1556 proteins with 29171 peptides remained. These proteins were randomly divided into two halves. The first half, consisting of 778 proteins, was used to train the cleavage probability predictor. Following the construction rules of the cleavage probability training set described in the “Experimental Section”, 3867 positive tryptic sites and 2370 negative tryptic sites were obtained. The trained cleavage probability predictor had a 10-fold cross-validation AUC of 0.9708 (Figure 2), and the average AUC for the three test data sets was 0.9684 (Figure S3). These results demonstrated the accuracy of the cleavage probability predictor.
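For readers less familiar with the AUC values reported here, the sketch below computes the AUC through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one positive (0.4) is ranked below one negative (0.5).
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.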


Figure 2. The 10-fold cross-validation ROC curve demonstrating the performance of cleavage probability predictor on the Yeast data set.

Performance of peptide detectability predictor

In the second half of the Yeast data set used for training the peptide detectability predictor, there were 12850 identified peptides with SC > 1, which were taken as positive peptides, and the same number of negative peptides were randomly selected from the unidentified digested peptides. For each peptide, 588 features were calculated, including the peptide digestibility. Then, we selected 29 features for the AP3 algorithm using the mRMR method (Supplementary Note 1).

The peptide detectability prediction model trained with the 29 selected features obtained a 10-fold cross-validation AUC of 0.9269 on the training set from the Yeast data set (Figure 3A). The AUCs for the three test data sets (E. coli, Mouse and Human) were 0.9405, 0.9313, and 0.9213, respectively (Figure 3B). These results illustrated that our model retained good predictive power on independent data sets of other organisms. To further demonstrate the generalization ability of AP3, we also tested the trained model on three additional published data sets33–35 with different LC columns and elution conditions (Table S3). The results indicated that AP3 retained good performance across different LC columns and elution conditions (Figure S4). During the analysis, we trained the peptide detectability model 100 times independently, each time using the down-sampling technique30 to randomly select the same number of negative peptides as positive peptides. The results showed that our model has good stability (Figure S5).

Notably, our proposed feature, peptide digestibility, showed an excellent ability to predict the peptide detectability. First, peptide digestibility ranked first among the selected features, and this conclusion was also validated on the three test data sets (Table S2). Second, we compared the accuracy and generalization ability of the peptide detectability predictors with and without the peptide digestibility included in the selected features on the four data sets. Note that the model without the peptide digestibility feature was trained using all the proteins after filtration, rather than half of these proteins. By including peptide digestibility, the 10-fold cross-validation AUC increased from 0.8891 to 0.9269 on the training data set (Figure 3A), and the AUCs increased from 0.9126 to 0.9405, from 0.9063 to 0.9313 and from 0.8854 to 0.9213 on the three test data sets (E. coli, Mouse, Human), respectively (Figure 3B). This conclusion based on the selected features was also confirmed by comparing the accuracy and generalization ability of the peptide detectability predictors with and without the peptide digestibility feature included in the whole set of features (without feature selection) on the four data sets (Figure S6). Finally, we investigated the relationship between the peptide digestibility and the number of missed cleavage sites, and found that they were indeed correlated with each other but that the former was much more powerful than the latter in predicting the peptide detectability (Supplementary Note 2, Figure S7). Using the peptide digestibility alone, the random forest classifier exhibited a 10-fold cross-validation AUC as high as 0.8266 (Figure S7B). All these results demonstrated that peptide digestibility is the most important feature for predicting proteotypic peptides.


Figure 3. Performance comparisons of the AP3 algorithm with/without the peptide digestibility feature included in the peptide detectability model. (A) The 10-fold cross-validation ROC curves on the training data set (Yeast). (B) The AUCs on the three test data sets.

Comparison with existing tools

To evaluate the performance of our algorithm, we compared AP3 with existing available tools, including PeptideSieve8, CONSeQuence11, ESP Predictor6, and PPA16. For all these tools, we used the trained models provided by the authors of the respective publications. Note that PeptideSieve, CONSeQuence and ESP Predictor were all trained on Yeast data sets, while PPA was trained on a human data set. PeptideSieve took protein sequences in FASTA format as input. The maximum number of missed cleavage sites was set to 2, and the maximum peptide mass was set to 6000 Da. We tested all types of PeptideSieve (PAGE_ESI, PAGE_MALDI, MUDPIT_ESI, MUDPIT_ICAT). CONSeQuence was run online (http://king.smith.man.ac.uk/CONSeQuence/) with the number of missed cleavage sites set to 2 and the prediction type set to ANN only. The ESP Predictor was also run online (https://genepattern.broadinstitute.org/) using the default parameters. The Perl script PPA.pl was downloaded from the website http://software.steenlab.org/rc4/PPA.php. The ROC curves of the different tools on the three test data sets are shown in Figure 4. AP3 outperformed the other tools in terms of true positive rate at any given false positive rate. On all three test data sets, AP3 exhibited superior performance to the other four tools (PeptideSieve, CONSeQuence, ESP Predictor and PPA), increasing the AUC by 10.3%, 16.3%, 34.7%, and 25.4% on average, respectively. Further, among these tools, only PPA provided a retraining solution. Thus, we also retrained PPA on our Yeast data set and compared its performance with AP3 on the three test data sets. The result still supported that AP3 is significantly more accurate than PPA (Figure S8).


Figure 4. Performance comparisons between AP3 and other tools on the three test data sets (A) E. coli, (B) Mouse and (C) Human, respectively. Abbreviations: PeptideSieve1: PeptideSieve_ICAT_ESI; PeptideSieve2: PeptideSieve_MUDPIT_ESI; PeptideSieve3: PeptideSieve_PAGE_ESI; and PeptideSieve4: PeptideSieve_PAGE_MALDI. 17

ACS Paragon Plus Environment

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1

Page 18 of 25

MRM assay validation

One direct application of a peptide detectability predictor is selecting proteotypic peptides for targeted proteomics assays. We applied AP3 and the four other available tools to an MRM data set published by Fusaro et al.6. Briefly, this data set consists of 14 proteins, each of which has several experimentally validated proteotypic peptides. We first predicted the detectabilities of all in silico digested peptides of these proteins and then measured the performance of the different algorithms in terms of protein sensitivity, defined as the percentage of proteins for which at least one validated proteotypic peptide is included among the five peptides with the highest predicted detectabilities6. The protein sensitivity of AP3 was 100% (14/14), while the protein sensitivities of PeptideSieve, CONSeQuence, ESP Predictor, and PPA were all 93% (13/14) (Table S4). These results indicate that AP3 is capable of accurately selecting proteotypic peptides for MRM-MS assays.
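The protein sensitivity defined above can be sketched in a few lines. This is only an illustration of the definition, not the authors' evaluation script: the function name, the data layout, and the protein and peptide identifiers below are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def protein_sensitivity(
    predictions: Dict[str, List[Tuple[str, float]]],
    validated: Dict[str, Set[str]],
    top_n: int = 5,
) -> float:
    """Percentage of proteins with at least one validated proteotypic peptide
    among the top_n peptides ranked by predicted detectability."""
    if not predictions:
        raise ValueError("no proteins supplied")
    hits = 0
    for protein, scored_peptides in predictions.items():
        # Rank this protein's peptides from highest to lowest predicted score.
        ranked = sorted(scored_peptides, key=lambda item: item[1], reverse=True)
        top = {peptide for peptide, _ in ranked[:top_n]}
        if top & validated.get(protein, set()):
            hits += 1
    return 100.0 * hits / len(predictions)


# Toy input (hypothetical identifiers and scores).
toy_predictions = {
    "protA": [("PEP1", 0.91), ("PEP2", 0.85), ("PEP3", 0.40),
              ("PEP4", 0.33), ("PEP5", 0.20), ("PEP6", 0.05)],
    "protB": [("PEP7", 0.88), ("PEP8", 0.70), ("PEP9", 0.65),
              ("PEP10", 0.50), ("PEP11", 0.45), ("PEP12", 0.10)],
}
toy_validated = {"protA": {"PEP2"}, "protB": {"PEP12"}}

sens_top5 = protein_sensitivity(toy_predictions, toy_validated)
sens_top6 = protein_sensitivity(toy_predictions, toy_validated, top_n=6)
print(sens_top5, sens_top6)
```

In the toy input, the validated peptide of protA is ranked second and that of protB is ranked sixth, so only protA counts as a hit with the default cutoff of five.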

Software availability

The AP3 software and the user guide are freely available at http://fugroup.amss.ac.cn/software/AP3/AP3.html. For ease of use, we provide a graphical user interface for AP3.

Conclusions

In this study, we presented an algorithm, named AP3, for predicting peptide detectability. For the first time, we incorporated peptide digestibility into the peptide detectability prediction model as a novel and powerful feature. We demonstrated that peptide digestibility can greatly increase the accuracy of peptide detectability prediction. AP3 enables the selection of candidate proteotypic peptides for proteins of interest in the absence of high-quality MS-based experimental evidence, especially for proteins identified by methods other than proteomics, such as genomic experiments or literature mining. This study may also have a significant effect on improving protein quantification, designing targeted proteomics assays, and discovering biomarkers for early diagnosis and therapy.

Supporting Information

Supplementary Notes: 1. Feature selection. 2. Relationship between peptide digestibility and the number of missed cleavage sites. The incremental feature selection curve, plotted by 10-fold cross-validation as the top 50 features selected by mRMR were added successively to the random forest classifier (Figure S1); validation of feature selection (Figure S2); ROC curves demonstrating the performance of the cleavage probability model applied to the three test data sets (Figure S3); performance of AP3 on three published data sets with different LC columns and elution conditions (Figure S4); demonstration of the stability of the peptide detectability model (Figure S5); performance comparisons of the AP3 algorithm with and without the peptide digestibility feature included in all 588 features (Figure S6); comparative analysis of peptide digestibility and the number of missed cleavage sites (Figure S7); performance comparison between AP3 and PPA on the three independent published data sets, (A) E. coli, (B) Mouse, and (C) Human (Figure S8); the 588 features used to characterize the peptides in AP3 (Table S1); lists of the features selected by mRMR in four data sets (Table S2); summary of the three additional published data sets with different LC columns and elution conditions (Table S3); validation results on the MRM assay data set (Table S4).

Acknowledgments

This work was supported by the Strategic Priority Research Program of CAS (XDB13040600), the National Basic Research Program of China (2017YFA0505002 and 2015CB554406), the National Natural Science Foundation of China (21605159), the Innovation Program (16CXZ027), the International S&T Cooperation Program of China (2014DFB30010), and the NCMIS CAS.

Author contributions

Y.F., C.C., and Y.Z. designed the algorithms and experiments. Z.G., C.C., and Y.J. implemented the algorithms and performed the data analysis. Z.G. and C.C. wrote the initial manuscript. All authors edited and approved the final manuscript.

Author information

Corresponding Authors:

Yan Fu, Email: [email protected]

Yunping Zhu, Email: [email protected]

ORCID

Zhiqiang Gao: 0000-0001-8644-1541

Cheng Chang: 0000-0002-0361-2438

Yan Fu: 0000-0001-6896-5931

Competing financial interests

The authors declare no competing financial interest.

References

(1) Parker, C. E.; Borchers, C. H. Mass Spectrometry Based Biomarker Discovery, Verification, and Validation: Quality Assurance and Control of Protein Biomarker Assays. Mol. Oncol. 2014, 8 (4), 840–858.

(2) Rifai, N.; Gillette, M. A.; Carr, S. A. Protein Biomarker Discovery and Validation: The Long and Uncertain Path to Clinical Utility. Nat. Biotechnol. 2006, 24 (8), 971–983.

(3) Demeure, K.; Duriez, E.; Domon, B.; Niclou, S. P. PeptideManager: A Peptide Selection Tool for Targeted Proteomic Studies Involving Mixed Samples from Different Species. Front. Genet. 2014, 5, 305.

(4) Le Bihan, T.; Robinson, M. D.; Stewart, I. I.; Figeys, D. Definition and Characterization of a “Trypsinosome” from Specific Peptide Characteristics by Nano-HPLC−MS/MS and in Silico Analysis of Complex Protein Mixtures. J. Proteome Res. 2004, 3 (6), 1138–1148.

(5) Ethier, M.; Figeys, D. Strategy to Design Improved Proteomic Experiments Based on Statistical Analyses of the Chemical Properties of Identified Peptides. J. Proteome Res. 2005, 4 (6), 2201–2206.

(6) Fusaro, V. A.; Mani, D. R.; Mesirov, J. P.; Carr, S. A. Prediction of High-Responding Peptides for Targeted Protein Assays by Mass Spectrometry. Nat. Biotechnol. 2009, 27 (2), 190–198.

(7) Webb-Robertson, B. J.; Cannon, W. R.; Oehmen, C. S.; Shah, A. R.; Gurumoorthi, V.; Lipton, M. S.; Waters, K. M. A Support Vector Machine Model for the Prediction of Proteotypic Peptides for Accurate Mass and Time Proteomics. Bioinformatics 2008, 24 (13), 1503–1509.

(8) Mallick, P.; Schirle, M.; Chen, S. S.; Flory, M. R.; Lee, H.; Martin, D.; Ranish, J.; Raught, B.; Schmitt, R.; Werner, T.; et al. Computational Prediction of Proteotypic Peptides for Quantitative Proteomics. Nat. Biotechnol. 2007, 25 (1), 125–131.

(9) Sanders, W. S.; Bridges, S. M.; McCarthy, F. M.; Nanduri, B.; Burgess, S. C. Prediction of Peptides Observable by Mass Spectrometry Applied at the Experimental Set Level. BMC Bioinformatics 2007, 8 Suppl 7, S23.

(10) Wedge, D. C.; Kell, D. B.; Gaskell, S. J.; Lau, K. W.; Hubbard, S. J.; Eyers, C. Peptide Detectability Following ESI Mass Spectrometry: Prediction Using Genetic Programming. Gecco 2007 Genet. Evol. Comput. Conf. Vol 1 2 2007, 2219–2225.

(11) Eyers, C. E.; Lawless, C.; Wedge, D. C.; Lau, K. W.; Gaskell, S. J.; Hubbard, S. J. CONSeQuence: Prediction of Reference Peptides for Absolute Quantitative Proteomics Using Consensus Machine Learning Approaches. Mol. Cell. Proteomics 2011, 10 (11), M110.003384-M110.003384.

(12) Qeli, E.; Omasits, U.; Goetze, S.; Stekhoven, D. J.; Frey, J. E.; Basler, K.; Wollscheid, B.; Brunner, E.; Ahrens, C. H. Improved Prediction of Peptide Detectability for Targeted Proteomics Using a Rank-Based Algorithm and Organism-Specific Data. J. Proteomics 2014, 108, 269–283.

(13) Tang, H.; Arnold, R. J.; Alves, P.; Xun, Z.; Clemmer, D. E.; Novotny, M. V.; Reilly, J. P.; Radivojac, P. A Computational Approach toward Label-Free Protein Quantification Using Predicted Peptide Detectability. Bioinformatics 2006, 22 (14), e481–e488.

(14) Jarnuczak, A. F.; Lee, D. C. H.; Lawless, C.; Holman, S. W.; Eyers, C. E.; Hubbard, S. J. Analysis of Intrinsic Peptide Detectability via Integrated Label-Free and SRM-Based Absolute Quantitative Proteomics. J. Proteome Res. 2016, 15 (9), 2945–2959.

(15) Kawashima, S.; Ogata, H.; Kanehisa, M. AAindex: Amino Acid Index Database. Nucleic Acids Res. 1999, 27 (1), 368–369.

(16) Muntel, J.; Boswell, S. A.; Tang, S.; Ahmed, S.; Wapinski, I.; Foley, G.; Steen, H.; Springer, M. Abundance-Based Classifier for the Prediction of Mass Spectrometric Peptide Detectability upon Enrichment (PPA). Mol. Cell. Proteomics 2015, 14 (2), 430–440.

(17) Siepen, J. A.; Keevil, E.-J.; Knight, D.; Hubbard, S. J. Prediction of Missed Cleavage Sites in Tryptic Peptides Aids Protein Identification in Proteomics. J. Proteome Res. 2007, 6 (1), 399–408.

(18) Lawless, C.; Hubbard, S. J. Prediction of Missed Proteolytic Cleavages for the Selection of Surrogate Peptides for Quantitative Proteomics. OMICS 2012, 16 (9), 449–456.

(19) Fannes, T.; Vandermarliere, E.; Schietgat, L.; Degroeve, S.; Martens, L.; Ramon, J. Predicting Tryptic Cleavage from Proteomics Data Using Decision Tree Ensembles. J. Proteome Res. 2013, 12 (5), 2253–2259.

(20) Hebert, A. S.; Richards, A. L.; Bailey, D. J.; Ulbrich, A.; Coughlin, E. E.; Westphall, M. S.; Coon, J. J. The One Hour Yeast Proteome. Mol. Cell. Proteomics 2014, 13 (1), 339–347.

(21) Schmidt, A.; Kochanowski, K.; Vedelaar, S.; Ahrné, E.; Volkmer, B.; Callipo, L.; Knoops, K.; Bauer, M.; Aebersold, R.; Heinemann, M. The Quantitative and Condition-Dependent Escherichia Coli Proteome. Nat. Biotechnol. 2016, 34 (1), 104–110.

(22) Malmström, E.; Kilsgård, O.; Hauri, S.; Smeds, E.; Herwald, H.; Malmström, L.; Malmström, J. Large-Scale Inference of Protein Tissue Origin in Gram-Positive Sepsis Plasma Using Quantitative Targeted Proteomics. Nat. Commun. 2016, 7, 10261.

(23) Wilhelm, M.; Schlegl, J.; Hahne, H.; Moghaddas Gholami, A.; Lieberenz, M.; Savitski, M. M.; Ziegler, E.; Butzmann, L.; Gessulat, S.; Marx, H.; et al. Mass-Spectrometry-Based Draft of the Human Proteome. Nature 2014, 509 (7502), 582–587.

(24) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment. J. Proteome Res. 2011, 10 (4), 1794–1805.

(25) Cox, J.; Mann, M. MaxQuant Enables High Peptide Identification Rates, Individualized p.p.b.-Range Mass Accuracies and Proteome-Wide Protein Quantification. Nat. Biotechnol. 2008, 26 (12), 1367–1372.

(26) Hubbard, S. J. The Structural Aspects of Limited Proteolysis of Native Proteins. Biochim. Biophys. Acta - Protein Struct. Mol. Enzymol. 1998, 1382 (2), 191–206.

(27) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.

(28) Braisted, J. C.; Kuntumalla, S.; Vogel, C.; Marcotte, E. M.; Rodrigues, A. R.; Wang, R.; Huang, S.-T.; Ferlanti, E. S.; Saeed, A. I.; Fleischmann, R. D.; et al. The APEX Quantitative Proteomics Tool: Generating Protein Quantitation Estimates from LC-MS/MS Proteomics Results. BMC Bioinformatics 2008, 9, 529.

(29) Vucetic, S.; Brown, C. J.; Dunker, A. K.; Obradovic, Z. Flavors of Protein Disorder. Proteins Struct. Funct. Genet. 2003, 52 (4), 573–584.

(30) Chen, C.; Liaw, A.; Breiman, L. Using Random Forest to Learn Imbalanced Data. Univ. California, Berkeley 2004, 110, 1–12.

(31) Jain, A.; Nandakumar, K.; Ross, A. Score Normalization in Multimodal Biometric Systems. Pattern Recognit. 2005, 38 (12), 2270–2285.

(32) Ding, C.; Peng, H. Minimum Redundancy Feature Selection from Microarray Gene Expression Data. J. Bioinform. Comput. Biol. 2005, 3 (2), 185–205.

(33) Ding, C.; Jiang, J.; Wei, J.; Liu, W.; Zhang, W.; Liu, M.; Fu, T.; Lu, T.; Song, L.; Ying, W.; et al. A Fast Workflow for Identification and Quantification of Proteomes. Mol. Cell. Proteomics 2013, 12 (8), 2370–2380.

(34) Ge, S.; Xia, X.; Ding, C.; Zhen, B.; Zhou, Q.; Feng, J.; Yuan, J.; Chen, R.; Li, Y.; Ge, Z.; et al. A Proteomic Landscape of Diffuse-Type Gastric Cancer. Nat. Commun. 2018, 9 (1), 1–16.

(35) Tyanova, S.; Albrechtsen, R.; Kronqvist, P.; Cox, J.; Mann, M.; Geiger, T. Proteomic Maps of Breast Cancer Subtypes. Nat. Commun. 2016, 7, 10259.

For TOC Only