Subscriber access provided by Bethel University
Article
AP3: an advanced proteotypic peptide predictor for targeted proteomics by incorporating peptide digestibility ZhiQiang Gao, Cheng Chang, Jinghan Yang, Yunping Zhu, and Yan Fu Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.9b02520 • Publication Date (Web): 06 Jun 2019 Downloaded from http://pubs.acs.org on June 7, 2019
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
AP3: an advanced proteotypic peptide predictor for
2
targeted
3
digestibility
4
Zhiqiang Gao1,3||, Cheng Chang2||, Jinghan Yang1,3,
5
Yunping Zhu2,4*, Yan Fu1,3*
6
1. National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of
7
Random Complex Structures and Data Science, Academy of Mathematics and
8
Systems Science, Chinese Academy of Sciences, Beijing 100190, China.
9
2. State Key Laboratory of Proteomics, Beijing Proteome Research Center, National
10
Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206,
11
China.
12
3. School of Mathematical Sciences, University of Chinese Academy of Sciences,
13
Beijing 100049, China.
14
4. Anhui Medical University, Hefei 230032, China.
15
|| Contributed equally to this work
16
* To whom correspondence should be addressed:
17
Yan Fu, Email:
[email protected] 18
Yunping Zhu, Email:
[email protected] proteomics
by
incorporating
peptide
19
1
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
Page 2 of 25
ABSTRACT
2
The selection of proteotypic peptides, i.e., detectable unique representatives of
3
proteins of interest, is a key step in targeted proteomics. To date, much effort has been
4
made to understand the mechanisms underlying peptide detection in liquid
5
chromatography-tandem mass spectrometry (LC-MS/MS) based shotgun proteomics
6
and to predict proteotypic peptides in the absence of experimental LC-MS/MS data.
7
However, the prediction accuracy of existing tools is still unsatisfactory. We find that
8
one crucial reason is their neglect of the significant influence of protein proteolytic
9
digestion on peptide detectability in shotgun proteomics. Here, we present an Advanced
10
Proteotypic Peptide Predictor (AP3), which explicitly takes peptide digestibility into
11
account for the prediction of proteotypic peptides. Specifically, peptide digestibility is
12
first predicted for each peptide and then incorporated as a feature into the peptide
13
detectability prediction model. Our results demonstrated that peptide digestibility is the
14
most important feature for the accurate prediction of proteotypic peptides in our model.
15
Compared with the existing available algorithms, AP3 showed 10.3%-34.7% higher
16
prediction accuracy. On a targeted proteomics data set, AP3 accurately predicted the
17
proteotypic peptides for proteins of interest, showing great potential for assisting the
18
design of targeted proteomics experiments.
19
2
ACS Paragon Plus Environment
Page 3 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
Introduction
2
In shotgun proteomics, proteins are first digested by some enzyme into peptide
3
mixtures which are then analyzed by liquid chromatography-tandem mass spectrometry
4
(LC-MS/MS). In recent years, MS-based targeted proteomics, such as multiple reaction
5
monitoring (MRM) experiments, are capable of the sensitive identification and
6
quantification of proteins of interest and have become a promising powerful tool for
7
biological or clinical studies, such as the verification of candidate biomarkers1,2. The
8
first key step in developing an MRM assay is the selection of proteotypic peptides, i.e.,
9
the peptides that are unique representatives of their corresponding proteins and are
10
detectable in proteomic experiments3. There are two major approaches to proteotypic
11
peptide selection, i.e., the experimental approach and the computational approach. The
12
former selects previously identified peptides as proteotypic peptides, while the latter
13
predicts proteotypic peptides using empirical or machine learning models. Although the
14
experimental approach has been successfully applied in MRM assays, it has some
15
limitations. For example, not all target proteins have experimental evidences, especially
16
the proteins identified by literature mining. Thus, researchers are paying more attention
17
to the computational approach. However, the mechanisms underlying peptide detection
18
are still unclear, which hinders the development of accurate proteotypic peptide
19
prediction algorithms.
20
To date, much effort has been devoted to understanding the mechanisms of peptide
21
detection and predicting proteotypic peptides. In early researches, empirical score
22
functions were constructed based on the physiochemical properties of peptides, such as
23
hydrophobicity, peptide length and isoelectric point4,5. In recent years, several machine
24
learning-based algorithms6–12 have been developed to predict proteotypic or high-
25
responding peptides. In these studies, designing the features characterizing the peptides 3
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 4 of 25
1
is one of the key steps. The likelihood of observing a peptide in a shotgun proteomics
2
experiment is governed by many factors, such as the physicochemical properties of the
3
peptide, the LC-MS/MS behavior of the peptide, and the abundance of its
4
corresponding protein13,14. Tang et al.13 first proposed the concept of peptide
5
detectability, which is defined as the probability that a peptide would be observed in a
6
standard sample analyzed by a standard proteomics routine. They also invented a
7
machine learning model which used 175 features derived from the peptide sequence to
8
predict the peptide detectability. Later, Sander et al.9, Mallick et al.8 and Eyers et al.11
9
developed their models using 596, 1010, and 1186 features, respectively. These features
10
mainly include AAindex-derived features15 and sequence-derived features. Recently,
11
Muntel et al.16 considered the protein abundance as an additional feature and obtained
12
improved performance. However, the protein abundance information is generally
13
unavailable in the absence of experimental LC-MS/MS data.
14
A common limitation of the above algorithms is that they only considered the
15
features of peptides themselves but did not consider the influence of protein proteolytic
16
digestion on peptide detectability. As we know, a typical shotgun proteomics
17
experiment can be divided into two successive processes: protein proteolytic digestion
18
and peptide detection by LC-MS/MS, and the peptide detectability is the probability of
19
observing a peptide in the whole shotgun proteomics experiment, not only in the peptide
20
detection process by LC-MS/MS. It is important to know which peptides and what
21
proportions of them could be digested out and subjected to subsequent LC-MS/MS
22
analysis. Previous studies have demonstrated that the process of protein digestion by
23
the commonly used enzyme trypsin is always incomplete17. The cleavage probability
24
of a tryptic site is mainly determined by the amino acids surrounding the site and
25
affected by other factors such as the local conformation and tertiary structure of its 4
ACS Paragon Plus Environment
Page 5 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
protein as well as experimental conditions of digestion17. Several algorithms have been
2
demonstrated to be successful in predicting the cleavage probabilities of tryptic sites
3
exclusively from the adjacent amino acids18,19. Note that half of the amino acids
4
surrounding a cleavage site are outside the peptide sequence and therefore are not
5
characterized by previous peptide detectability prediction algorithms. For more
6
accurate prediction of peptide detectability, the peptide digestibility must be explicitly
7
taken into account.
8
Here, we present a novel algorithm named AP3 (short for Advanced Proteotypic
9
Peptide Predictor) which, for the first time, integrates the protein digestion and peptide
10
LC-MS/MS detection processes together to predict proteotypic peptides. Specifically,
11
it first predicts the peptide digestibility and then incorporates it as a feature into the
12
peptide detectability prediction model. Our results showed that peptide digestibility was
13
the most important feature for predicting proteotypic peptides, increasing the 10-fold
14
cross-validation AUC (area under the receiver operating characteristic (ROC) curve)
15
from 0.8891 to 0.9269. Furthermore, we demonstrated that the peptide detectability
16
prediction model trained on a yeast data set retained good predictive power over three
17
independent comprehensive data sets from E. coli (AUC 0.9405), mouse (AUC 0.9313)
18
and human (AUC 0.9213) samples. Compared with existing tools, including
19
PeptideSieve8, CONSeQuence11, ESP Predictor6 and PPA16, AP3 exhibited 10.3%-34.7%
20
higher accuracy in AUC. At last, we showed that AP3 can effectively predict
21
proteotypic peptides for targeted proteomics in the absence of experimental LC-MS/MS
22
data.
5
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
Experimental Section
2
Data sets
Page 6 of 25
3
A public large-scale yeast data set20 was used to train the AP3 model. To test the
4
generalization performance of AP3, we used three independent public data sets from
5
other organisms, i.e., E. coli21, mouse22 and human23. In all of these data sets, the
6
proteolytic digestion was performed by incubation with trypsin overnight. For the
7
human data set, the lymph node and salivary gland data in their original publication
8
were used. We reanalyzed the raw MS files as follows.
9
Mass spectra in each data set were searched against the corresponding organism
10
sequences from the UniProt database using the Andromeda search engine24 in the
11
MaxQuant software25 (version 1.6.0.1). Carbamidomethylation on cysteine was set as
12
a fixed modification. Methionine oxidation and protein N-terminal acetylation were set
13
as variable modifications. Protein sequences were theoretically digested using fully
14
tryptic cleavage constraints, and up to two missed cleavage sites were allowed. The
15
precursor mass tolerance was set to 20 ppm for the first search and 4.5 ppm for the main
16
search. The mass tolerance for fragment ions was set to 0.05 Da for Q Exactive Plus
17
and 0.5 Da for other instruments. False discovery rates at the protein and peptide levels
18
were both set to 1%.
19
Filtration of identified proteins
20
To construct more confident training and test data sets for AP3, the identified
21
proteins were further filtered by their spectral counts (SCs) and sequence coverages.
22
The SC of a protein was calculated as the sum of SCs of its identified peptides. It is
23
reasonable to assume that a protein is confidently identified if it has a large SC and high
24
sequence coverage. Therefore, we sorted the proteins by both their SCs and sequence 6
ACS Paragon Plus Environment
Page 7 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
coverages, and kept those proteins that appeared in the top 50% of both ranks. Table 1
2
provides a summary of the four data sets.
3
Table 1. Summary of the four public data sets used for training and testing AP3 Data set
Proteins
Peptides
Proteins after
Peptides after
identified
identified
filtration
filtration
3959
43088
1556
29172
2898
38035
1077
23674
Q Exactive Plus
5304
58184
2120
39095
LTQ Orbitrap
6422
80490
2606
55289
Usage
Instrument
Yeast
training
Orbitrap Fusion
E. coli
test
Mouse
test
Human
test
name
LTQ Orbitrap Elite
4 5
Workflow of AP3 algorithm
6
As Figure 1 illustrates, the development workflow of AP3 has two main
7
components: a peptide digestibility predictor and a peptide detectability predictor. To
8
avoid over-fitting and confounding the peptide digestibility and detectability in the
9
training stage, the identified filtered proteins are randomly divided into two halves. The
10
first half is used to train the peptide digestibility predictor and the other half is used to
11
train the peptide detectability predictor. A random forest classifier is first trained to
12
predict the cleavage probabilities of tryptic sites. Then, the digestibility of each peptide
13
is calculated from the cleavage probabilities of its tryptic sites and integrated into the
14
peptide detectability predictor as one of the features that characterize the peptide. Next,
15
other 587 physicochemical features of peptides are calculated and feature selection is
16
performed. Finally, another random forest classifier is trained for predicting the peptide
17
detectability.
18 19 7
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 8 of 25
1 2 3 4 5 6 7 8 9
Figure 1. The development workflow of AP3. The identified proteins are divided into two halves to completely separate the training sets of peptide digestibility predictor and peptide detectability predictor. The first half is used to train a random forest classifier for predicting cleavage probabilities of tryptic sites. The other half is used to train another random forest classifier for predicting peptide detectability. The predicted digestibility is integrated as a feature into the peptide detectability predictor.
10
Constructing the training set. Identified peptides were first mapped to their
11
corresponding protein sequences. Then, the cleavage information of potential tryptic
12
sites, e.g. the arginine (R) and lysine (K) residues, in the first half of identified proteins
13
was collected, including the SCs of the peptides observed on the N-terminal (SCN) and
14
C-terminal (SCC) of the tryptic site, and the SCs of the peptides containing this tryptic
15
site as a missed cleavage site (SCM). Then, tryptic sites were labeled as positive sites if
16
(1) SCN was at least 1, (2) SCC was at least 1 and (3) SCM was zero, and as negative
Peptide digestibility predictor
8
ACS Paragon Plus Environment
Page 9 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
sites if both SCN and SCC were zero and SCM was at least 2. Previous studies have
2
demonstrated that the cleavage probability of a tryptic site is influenced mainly by the
3
adjacent amino acids18,19,26. Therefore, for each tryptic site, a 9-mer consisting of the
4
tryptic site and four adjacent residues on each side was extracted. If the tryptic site was
5
located on the N or C terminus of a protein, resulting in insufficient amino acids to form
6
a 9-mer, the character Z was added to make up a 9-mer. Each character in the 9-mer,
7
except for the tryptic site (arginine or lysine for trypsin enzyme), was converted into a
8
21-dimensional binary vector that indicated whether one of the 20 amino acids or the
9
character Z appears. In the binary vector, if one amino acid appeared, the corresponding
10
position was set to 1, and other positions were set to 0. Thus, each 9-mer was converted
11
into a 168-dimensional binary vector that retained both the position and residue-specific
12
information.
13
Random forest classifier for cleavage probability prediction. We used the
14
random forest algorithm to predict the cleavage probabilities of tryptic sites. A random
15
forest is a nonlinear ensemble classifier consisting of a collection of independent
16
unpruned trees27. Its randomness is reflected in two aspects27: random selection of a
17
training subset for each tree by bootstrap and random selection of the feature for the
18
best split at each node. To implement the random forest algorithm, we used the
19
TreeBagger toolbox in MATLAB, which output a probability for each site to measure
20
the probability of being cleaved. The number of trees was set to 200. The number of
21
randomly selected features for each node was set to the square root of the number of all
22
features.
23
Peptide digestibility calculation. The peptide digestibility, which is defined as
24
the probability of the peptide being produced by the protein digestion process, is
9
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 10 of 25
1
calculated from the predicted cleavage probabilities of tryptic sites using the following
2
formula: 𝑛
3
peptide digestibility = 𝑒𝑁 ∗ 𝑒𝐶 ∗
∏(1 ― 𝑒 ) 𝑖
𝑖=1
4
where 𝑒𝑁 and 𝑒𝐶 are the predicted cleavage probabilities of the N- and C-terminal
5
tryptic sites of the peptide, respectively, 𝑒𝑖 is the predicted cleavage probability of the
6
𝑖-th missed cleavage site in the peptide, and 𝑛 is the total number of missed cleavage
7
sites in the peptide.
8
Peptide detectability predictor
9
Constructing the training set. For each protein in the second half of training data
10
set, in silico digestion was performed with up to 2 missed cleavage sites and the length
11
range of the digested peptides was set as the same as that of those identified peptides.
12
The peptides with more than one SC were taken as positive peptides, and the
13
unidentified digested peptides were taken as negative peptides. To date, the
14
mechanisms underlying peptide detectability are still not clear, so we collected as many
15
computable features of peptides as possible. The first was the peptide digestibility
16
which was calculated for each peptide using the method described above. To our
17
knowledge, the new feature peptide digestibility has never been used before. Other
18
features are as follows. Peptide length, number of missed cleavage sites, peptide
19
molecular weight and frequencies of 20 amino acids were calculated from the peptide
20
sequence. 544 amino acid-related physicochemical features from AAindex15 were used.
21
For each of the 544 AAindex features, the numerical values of the constituent amino
22
acids in a peptide were averaged to produce a single value. In addition, 20 other features
23
collected from previous studies7,11,13,28,29 about peptide detectability were added.
24
Finally, a total of 588 features were used to characterize each peptide (Table S1). 10
ACS Paragon Plus Environment
Page 11 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
Random forest classifier for peptide detectability prediction. AP3 employed
2
another random forest classifier to predict the peptide detectability. The random forest
3
toolbox and parameter settings were the same as used for predicting cleavage
4
probability. In construction of the training set, the number of identified peptides was
5
usually far less than the number of unidentified digested peptides. To overcome this
6
imbalance problem, we adopted the down-sampling technique30 to the negative
7
peptides (i.e., the majority class). In brief, the same number of negative peptides as that
8
of positive peptides were randomly selected from the whole set of negative peptides to
9
form a balanced training set. Next, since the scales of the selected features were
10
different, z-score normalization31 was employed for each feature to obtain a zero mean
11
and unity variance. To reduce the redundancy of features and increase the computation
12
efficiency, feature selection was performed using the minimum redundancy maximum
13
relevance (mRMR) method32 (Supplementary Note 1).
14
After peptide detectability prediction, the peptides for each protein of interest were
15
sorted by their predicted detectabilities in descending order and the top peptides were
16
selected as the proteotypic peptides of this protein.
17
Results and Discussion
18
We trained the AP3 algorithm using the Yeast data set, tested its generalization
19
ability on the E. coli, Mouse and Human data sets, and compared it with existing tools.
20
In addition, we evaluated the contribution of the feature peptide digestibility to the
21
performance of AP3.
22
Performance of tryptic-site cleavage probability predictor
23
There were 3959 proteins with 43088 peptides identified in the Yeast data set.
24
After filtering by the sequence coverages and SCs of the proteins, 1556 proteins with
25
29171 peptides remained. These proteins were randomly divided into two halves. The 11
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 12 of 25
1
first half consisting of 778 proteins were used to train the cleavage probability predictor.
2
Following the construction rules of the cleavage probability training set described in
3
the “Experimental Section”, 3867 positive tryptic sites and 2370 negative tryptic sites
4
were obtained. The trained cleavage probability predictor had a 10-fold cross-validation
5
AUC of 0.9708 (Figure 2), and the average AUC for the three test data sets was 0.9684
6
(Figure S3). These results demonstrated the accuracy of the cleavage probability
7
predictor.
8 9 10
Figure 2. The 10-fold cross-validation ROC curve demonstrating the performance of cleavage probability predictor on the Yeast data set.
11 12
Performance of peptide detectability predictor
13
In the second half of the Yeast data set used for training the peptide detectability
14
predictor, there were 12850 identified peptides with SC > 1, which were taken as
15
positive peptides, and the same number of negative peptides were randomly selected
16
from the unidentified digested peptides with SC > 1. For each peptide, 588 features
12
ACS Paragon Plus Environment
Page 13 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
were calculated, including the peptide digestibility. Then, we selected 29 features in
2
AP3 algorithm using the mRMR method (Supplementary Note 1).
3
The peptide detectability prediction model trained with the 29 selected features
4
obtained a 10-fold cross-validation AUC of 0.9269 on the training set from the Yeast
5
data set (Figure 3A). The AUCs for the three test data sets (E. coli, Mouse and Human)
6
were 0.9405, 0.9313, and 0.9213, respectively (Figure 3B). These results illustrated
7
that our model retained good predictive power on independent data sets of other
8
organisms. To demonstrate the generalization ability of AP3, we also tested the trained
9
model on the three additive published data sets33–35 with different LC columns and
10
elution conditions (Table S3). The result demonstrated that LC columns and elution
11
condition do not affect the good performance of AP3 (Figure S4). During the analysis,
12
we trained the peptide detectability model 100 times independently and in each time
13
used the down-sampling technique30 to randomly select the same number of negative
14
peptides as positive peptides. The results showed that our model has a good stability
15
(Figure S5).
16
Notably, our proposed feature, peptide digestibility, showed an excellent ability in
17
predicting the peptide detectability. First, peptide digestibility was the top one among
18
the selected features and this conclusion had also been validated on the three test data
19
sets (Table S2). Second, we compared the accuracy and generalization ability of the
20
peptide detectability predictors with/without the peptide digestibility included in the
21
selected features on the four data sets. Note that the model without the peptide
22
digestibility feature was trained using all the proteins after filtration, rather than half of
23
these proteins. By including peptide digestibility, the 10-fold cross-validation AUC
24
increased from 0.8891 to 0.9269 on the training data set (Figure 3A), and the AUCs
25
increased from 0.9126 to 0.9405, from 0.9063 to 0.9313 and from 0.8854 to 0.9213 on 13
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 14 of 25
1
the three test data sets (E. coli, Mouse, Human), respectively (Figure 3B). This
2
conclusion based on the selected features was also confirmed by comparing the
3
accuracy and generalization ability of the peptide detectability predictors with/without
4
the peptide digestibility feature included in the whole set of features (without feature
5
selection) on the four data sets (Figure S6). At last, we investigated the relationship
6
between the peptide digestibility and the number of missed cleavage sites, and found
7
that they were indeed correlated with each other but the former was much more
8
powerful than the latter in predicting the peptide detectability (Supplementary Note 2,
9
Figure S7). By using the peptide digestibility alone, the random forest classifier
10
exhibited a 10-fold cross-validation AUC of as high as 0.8266 (Figure S7B). All these
11
results demonstrated that peptide digestibility is the most important feature for
12
predicting proteotypic peptides.
13
14
ACS Paragon Plus Environment
Page 15 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1 2 3 4 5 6 7
Figure 3. Performance comparisons of the AP3 algorithm with/without the peptide digestibility feature included in the peptide detectability model. (A) The 10-fold crossvalidation ROC curves on the training data set (Yeast). (B) The AUCs on the three test data sets.
Comparison with existing tools
8
To evaluate the performance of our algorithm, we compared AP3 with existing
9
available tools, including PeptideSieve8, CONSeQuence11, ESP Predictor6, and PPA16.
10
For all these tools, we used the available trained models provided by the authors of their
11
publications. Note that PeptideSieve, CONSeQuence and ESP Predictor were all 15
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 16 of 25
1
trained on Yeast data sets, while PPA was trained on a human data set. PeptideSieve
2
took protein sequences as the input in the FASTA format. The maximum number of
3
missed cleavage sites was set to 2, and the maximum peptide mass was set to 6000 Da.
4
We tested all types of PeptideSieve (PAGE_ESI, PAGE_MALDI, MUDPIT_ESI,
5
MUDPIT_ICAT).
6
(http://king.smith.man.ac.uk/CONSeQuence/) with the number of missed cleavage
7
sites set to 2 and the prediction type set to ANN only. The ESP Predictor was also run
8
online (https://genepattern.broadinstitute.org/) using the default parameters. The Perl
9
script
PPA.pl
CONSeQuence
was
was
downloaded
from
run
the
online
website
10
http://software.steenlab.org/rc4/PPA.php. The ROCs of different tools on the three test
11
data sets are shown in Figure 4. The results showed that AP3 outperformed other tools
12
in terms of true positive rate at arbitrary given false positive rate. On all the three test
13
data sets, AP3 exhibited superior performance to other four tools (PeptideSieve,
14
ConSequence, ESP Predictor and PPA): increasing the AUC by 10.3%, 16.3%, 34.7%,
15
and 25.4% on average on the three test data sets, respectively. Further, among these
16
tools, only PPA provided retraining solution. Thus, we also retrained PPA on our Yeast
17
data set and compared its performance with AP3 on the three test data sets. The result
18
still supported that AP3 is significantly more accurate than PPA (Figure S8).
19
16
ACS Paragon Plus Environment
Page 17 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1 2 3 4 5 6 7
Figure 4. Performance comparisons between AP3 and other tools on the three test data sets (A) E. coli, (B) Mouse and (C) Human, respectively. Abbreviations: PeptideSieve1: PeptideSieve_ICAT_ESI; PeptideSieve2: PeptideSieve_MUDPIT_ESI; PeptideSieve3: PeptideSieve_PAGE_ESI; and PeptideSieve4: PeptideSieve_PAGE_MALDI. 17
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
Page 18 of 25
MRM assay validation
2
One direct application of a peptide detectability predictor is selecting proteotypic
3
peptides for targeted proteomics assays. We applied AP3 and other four available tools
4
to an MRM data set published by Fusaro et al.6. Briefly, this data set consists of 14
5
proteins, each of which has several experimentally validated proteotypic peptides. We
6
first predicted the peptide detectabilities of all in silico digested peptides of these
7
proteins, and then measured the performances of different algorithms in terms of the
8
protein sensitivity, which is defined as the percentage of the proteins which have at
9
least one validated proteotypic peptides included in the top five peptides with the
10
highest predicted detectabilities for each protein6. The protein sensitivity of AP3 was
11
100% (14/14), while the protein sensitivities of PeptideSieve, CONSeQuence, ESP
12
Predictor, PPA, were all 93% (13/14) (Table S4). The results indicated that AP3 was
13
capable of accurately selecting proteotypic peptides for MRM-MS assays.
14
Software availability
15
The AP3 software and the user guide document are freely available at
16
http://fugroup.amss.ac.cn/software/AP3/AP3.html. For ease of use, we provide a
17
graphical user interface for AP3.
18
Conclusions
19
In this study, we presented an algorithm, named AP3, for predicting peptide
20
detectability. For the first time, we incorporated peptide digestibility into the peptide
21
detectability prediction model as a novel and powerful feature. We demonstrated that
22
peptide digestibility can greatly increase the accuracy of the peptide detectability
23
prediction. AP3 enables the selection of candidate proteotypic peptides for the proteins
24
of interest in the absence of high-quality MS-based experimental evidence, especially 18
ACS Paragon Plus Environment
Page 19 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
for the proteins identified by methods other than proteomics, such as genomic
2
experiments or literature mining. This study may also have a significant effect on
3
improving protein quantification, designing targeted proteomics assays, and
4
discovering biological biomarkers for early diagnosis and therapy.
5
Supporting Information
6
Supplementary Notes: 1. Feature selection. 2. Relationship between peptide
7
digestibility and the number of missed cleavage sites. The incremental feature selection
8
curve was plotted by 10-fold cross-validation, as the top 50 features selected by mRMR
9
were added successively to the random forest classifier (Figure S1); Validation of
10
feature selection (Figure S2); ROC curves demonstrating the performance of cleavage
11
probability model applied to the three test data sets (Figure S3); The performance of
12
AP3 on three published data sets with different LC columns and elution conditions
13
(Figure S4); Demonstration of the stability of peptide detectability model (Figure S5);
14
Performance comparisons of the AP3 algorithm with/without the peptide digestibility
15
feature included in all the 588 features (Figure S6); Comparative analysis of peptide
16
digestibility and the number of missed cleavage sites (Figure S7); Performance
17
comparison between AP3 and PPA on the three independent published data sets (A) E.
18
coli, (B) Mouse, and (C) Human, respectively (Figure S8); The 588 features used to
19
characterize the peptides in AP3 (Table S1); The lists of selected features in four data
20
sets by mRMR (Table S2); Summary of the three additive published data sets with
21
different LC columns and elution conditions (Table S3); Validation result on MRM
22
assay data set (Table S4).
23
19
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1
Page 20 of 25
Acknowledgments
2
This work was supported by the Strategic Priority Research Program of CAS
3
(XDB13040600), the National Basic Research Program of China (2017YFA0505002
4
and 2015CB554406), the National Natural Science Foundation of China (21605159),
5
the Innovation Program (16CXZ027), the International S&T Cooperation Program of
6
China (2014DFB30010), and the NCMIS CAS.
7
Author contributions
8
Y.F, C.C. and Y.Z. designed the algorithms and experiments. Z.G., C.C. and Y.J.
9
implemented the algorithms and performed the data analysis. Z.G. and C.C. wrote the
10
initial manuscript. All authors edited and approved the final manuscript.
11
Author information
12
Corresponding Authors:
13
Yan Fu, Email:
[email protected] 14
Yunping Zhu, Email:
[email protected] 15 16
ORCID
17
Zhiqiang Gao: 0000-0001-8644-1541
18
Cheng Chang: 0000-0002-0361-2438
19
Yan Fu: 0000-0001-6896-5931
20
Competing financial interests
21
The authors declare no competing financial interest.
20
ACS Paragon Plus Environment
Page 21 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
References
2
(1) E., P. C.; H., B. C. Mass Spectrometry Based Biomarker Discovery, Verification,
3
and Validation — Quality Assurance and Control of Protein Biomarker Assays. Mol.
4
Oncol. 2014, 8 (4), 840–858.
5
(2) Rifai, N.; Gillette, M. A.; Carr, S. A. Protein Biomarker Discovery and Validation:
6
The Long and Uncertain Path to Clinical Utility. Nat. Biotechnol. 2006, 24 (8), 971–
7
983.
8
(3) Demeure, K.; Duriez, E.; Domon, B.; Niclou, S. P. PeptideManager: A Peptide
9
Selection Tool for Targeted Proteomic Studies Involving Mixed Samples from
10
Different Species. Front. Genet. 2014, 5, 305.
11
(4) Le Bihan, T.; Robinson, M. D.; Stewart, I. I.; Figeys, D. Definition and
12
Characterization of a “Trypsinosome” from Specific Peptide Characteristics by Nano-
13
HPLC−MS/MS and in Silico Analysis of Complex Protein Mixtures. J. Proteome Res.
14
2004, 3 (6), 1138–1148.
15
(5) Ethier, M.; Figeys, D. Strategy to Design Improved Proteomic Experiments Based
16
on Statistical Analyses of the Chemical Properties of Identified Peptides. J. Proteome
17
Res. 2005, 4 (6), 2201–2206.
18
(6) Fusaro, V. A.; Mani, D. R.; Mesirov, J. P.; Carr, S. A. Prediction of High-
19
Responding Peptides for Targeted Protein Assays by Mass Spectrometry. Nat Biotech
20
2009, 27 (2), 190–198.
21
(7) Webb-Robertson, B. J.; Cannon, W. R.; Oehmen, C. S.; Shah, A. R.; Gurumoorthi,
22
V.; Lipton, M. S.; Waters, K. M. A Support Vector Machine Model for the Prediction
23
of Proteotypic Peptides for Accurate Mass and Time Proteomics. Bioinformatics 2008,
24
24 (13), 1503–1509.
21
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 22 of 25
1
(8) Mallick, P.; Schirle, M.; Chen, S. S.; Flory, M. R.; Lee, H.; Martin, D.; Ranish, J.;
2
Raught, B.; Schmitt, R.; Werner, T.; et al. Computational Prediction of Proteotypic
3
Peptides for Quantitative Proteomics. Nat. Biotechnol. 2007, 25 (1), 125–131.
4
(9) Sanders, W. S.; Bridges, S. M.; McCarthy, F. M.; Nanduri, B.; Burgess, S. C.
5
Prediction of Peptides Observable by Mass Spectrometry Applied at the Experimental
6
Set Level. BMC Bioinformatics 2007, 8 Suppl 7, S23.
7
(10) Wedge, D. C.; Kell, D. B.; Gaskell, S. J.; Lau, K. W.; Hubbard, S. J.; Eyers, C.
8
Peptide Detectability Following ESI Mass Spectrometry: Prediction Using Genetic
9
Programming. Gecco 2007 Genet. Evol. Comput. Conf. Vol 1 2 2007, 2219–2225.
10
(11) Eyers, C. E.; Lawless, C.; Wedge, D. C.; Lau, K. W.; Gaskell, S. J.; Hubbard, S.
11
J. CONSeQuence: Prediction of Reference Peptides for Absolute Quantitative
12
Proteomics Using Consensus Machine Learning Approaches. Mol. Cell. Proteomics
13
2011, 10 (11), M110.003384-M110.003384.
14
(12) Qeli, E.; Omasits, U.; Goetze, S.; Stekhoven, D. J.; Frey, J. E.; Basler, K.;
15
Wollscheid, B.; Brunner, E.; Ahrens, C. H. Improved Prediction of Peptide
16
Detectability for Targeted Proteomics Using a Rank-Based Algorithm and Organism-
17
Specific Data. J. Proteomics 2014, 108, 269–283.
18
(13) Tang, H.; Arnold, R. J.; Alves, P.; Xun, Z.; Clemmer, D. E.; Novotny, M. V;
19
Reilly, J. P.; Radivojac, P. A Computational Approach toward Label-Free Protein
20
Quantification Using Predicted Peptide Detectability. Bioinformatics 2006, 22 (14),
21
e481–e488.
22
(14) Jarnuczak, A. F.; Lee, D. C. H.; Lawless, C.; Holman, S. W.; Eyers, C. E.;
23
Hubbard, S. J. Analysis of Intrinsic Peptide Detectability via Integrated Label-Free and
24
SRM-Based Absolute Quantitative Proteomics. J. Proteome Res. 2016, 15 (9), 2945–
25
2959. 22
ACS Paragon Plus Environment
Page 23 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
(15) Kawashima, S.; Ogata, H.; Kanehisa, M. AAindex: Amino Acid Index Database.
2
Nucleic Acids Res. 1999, 27 (1), 368–369.
3
(16) Muntel, J.; Boswell, S. A.; Tang, S.; Ahmed, S.; Wapinski, I.; Foley, G.; Steen,
4
H.; Springer, M. Abundance-Based Classifier for the Prediction of Mass Spectrometric
5
Peptide Detectability upon Enrichment (PPA). Mol Cell Proteomics 2015, 14 (2), 430–
6
440.
7
(17) Siepen, J. A.; Keevil, E.-J.; Knight, D.; Hubbard, S. J. Prediction of Missed
8
Cleavage Sites in Tryptic Peptides Aids Protein Identification in Proteomics. J.
9
Proteome Res. 2007, 6 (1), 399–408.
10
(18) Lawless, C.; Hubbard, S. J. Prediction of Missed Proteolytic Cleavages for the
11
Selection of Surrogate Peptides for Quantitative Proteomics. OMICS 2012, 16 (9), 449–
12
456.
13
(19) Fannes, T.; Vandermarliere, E.; Schietgat, L.; Degroeve, S.; Martens, L.; Ramon,
14
J. Predicting Tryptic Cleavage from Proteomics Data Using Decision Tree Ensembles.
15
J. Proteome Res. 2013, 12 (5), 2253–2259.
16
(20) Hebert, A. S.; Richards, A. L.; Bailey, D. J.; Ulbrich, A.; Coughlin, E. E.;
17
Westphall, M. S.; Coon, J. J. The One Hour Yeast Proteome. Mol. Cell. Proteomics
18
2014, 13 (1), 339–347.
19
(21) Schmidt, A.; Kochanowski, K.; Vedelaar, S.; Ahrné, E.; Volkmer, B.; Callipo, L.;
20
Knoops, K.; Bauer, M.; Aebersold, R.; Heinemann, M. The Quantitative and Condition-
21
Dependent Escherichia Coli Proteome. Nat. Biotechnol. 2016, 34 (1), 104–110.
22
(22) Malmström, E.; Kilsgård, O.; Hauri, S.; Smeds, E.; Herwald, H.; Malmström, L.;
23
Malmström, J. Large-Scale Inference of Protein Tissue Origin in Gram-Positive Sepsis
24
Plasma Using Quantitative Targeted Proteomics. Nat. Commun. 2016, 7, 10261.
23
ACS Paragon Plus Environment
Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 24 of 25
1
(23) Wilhelm, M.; Schlegl, J.; Hahne, H.; Moghaddas Gholami, A.; Lieberenz, M.;
2
Savitski, M. M.; Ziegler, E.; Butzmann, L.; Gessulat, S.; Marx, H.; et al. Mass-
3
Spectrometry-Based Draft of the Human Proteome. Nature 2014, 509 (7502), 582–587.
4
(24) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V; Mann, M.
5
Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment. J.
6
Proteome Res. 2011, 10 (4), 1794–1805.
7
(25) Cox, J.; Mann, M. MaxQuant Enables High Peptide Identification Rates,
8
Individualized
9
Quantification. Nat. Biotechnol. 2008, 26 (12), 1367–1372.
p.p.b.-Range
Mass
Accuracies
and
Proteome-Wide
Protein
10
(26) Hubbard, S. J. The Structural Aspects of Limited Proteolysis of Native Proteins.
11
Biochim. Biophys. Acta - Protein Struct. Mol. Enzymol. 1998, 1382 (2), 191–206.
12
(27) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
13
(28) Braisted, J. C.; Kuntumalla, S.; Vogel, C.; Marcotte, E. M.; Rodrigues, A. R.;
14
Wang, R.; Huang, S.-T.; Ferlanti, E. S.; Saeed, A. I.; Fleischmann, R. D.; et al. The
15
APEX Quantitative Proteomics Tool: Generating Protein Quantitation Estimates from
16
LC-MS/MS Proteomics Results. BMC Bioinformatics 2008, 9, 529.
17
(29) Vucetic, S.; Brown, C. J.; Dunker, A. K.; Obradovic, Z. Flavors of Protein
18
Disorder. Proteins Struct. Funct. Genet. 2003, 52 (4), 573–584.
19
(30) Chen, C.; Liaw, A.; Breiman, L. Using Random Forest to Learn Imbalanced Data.
20
Univ. California, Berkeley 2004, 110, 1–12.
21
(31) Jain, A.; Nandakumar, K.; Ross, A. Score Normalization in Multimodal Biometric
22
Systems. Pattern Recognit. 2005, 38 (12), 2270–2285.
23
(32) Ding, C.; Peng, H. Minimum Redundancy Feature Selection from Microarray
24
Gene Expression Data. J. Bioinform. Comput. Biol. 2005, 3 (2), 185–205.
24
ACS Paragon Plus Environment
Page 25 of 25 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
1
(33) Ding, C.; Jiang, J.; Wei, J.; Liu, W.; Zhang, W.; Liu, M.; Fu, T.; Lu, T.; Song, L.;
2
Ying, W.; et al. A Fast Workflow for Identification and Quantification of Proteomes.
3
Mol. Cell. proteomics 2013, 12 (8), 2370–2380.
4
(34) Ge, S.; Xia, X.; Ding, C.; Zhen, B.; Zhou, Q.; Feng, J.; Yuan, J.; Chen, R.; Li, Y.;
5
Ge, Z.; et al. A Proteomic Landscape of Diffuse-Type Gastric Cancer. Nat. Commun.
6
2018, 9 (1), 1–16.
7
(35) Tyanova, S.; Albrechtsen, R.; Kronqvist, P.; Cox, J.; Mann, M.; Geiger, T.
8
Proteomic Maps of Breast Cancer Subtypes. Nat. Commun. 2016, 7, 10259.
9 10
For TOC Only
11
25
ACS Paragon Plus Environment