Machine-learning-based prediction of cell-penetrating peptides and their uptake efficiency with improved accuracy Balachandran Manavalan, Sathiyamoorthy Subramaniyam, Tae Hwan Shin, Myeong Ok Kim, and Gwang Lee J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00148 • Publication Date (Web): 12 Jun 2018 Downloaded from http://pubs.acs.org on June 12, 2018



Machine-learning-based prediction of cell-penetrating peptides and their uptake efficiency with improved accuracy

Balachandran Manavalan1, Sathiyamoorthy Subramaniyam2, Tae Hwan Shin1,3, Myeong Ok Kim4, Gwang Lee1,3*

1 Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
2 Research and Development Center, Insilicogen Inc., Yongin-si, Suwon, Republic of Korea
3 Institute of Molecular Science and Technology, Ajou University, Suwon, Republic of Korea
4 Division of Life Science and Applied Life Science (BK21 Plus), College of Natural Sciences, Gyeongsang National University, Jinju, Republic of Korea

* To whom correspondence should be addressed: Department of Physiology, Ajou University School of Medicine, 164, World cup-ro, Yeongtong-gu, Suwon 16499, Republic of Korea. Tel: +82-31-219-4554 Fax: +82-31-219-5049 E-mail: [email protected]


Abstract

Cell-penetrating peptides (CPPs) can enter cells as a variety of biologically active conjugates and have various biomedical applications. To offset the cost and effort of designing novel CPPs in laboratories, computational methods are needed to identify candidate CPPs before in vitro experimental studies. In this study, we developed a two-layer prediction framework called machine-learning-based prediction of cell-penetrating peptides (MLCPP). The first layer predicts whether a given peptide is a CPP or non-CPP, whereas the second layer predicts the uptake efficiency of the predicted CPPs. To construct the two-layer prediction framework, we employed four different machine-learning methods and five different compositions: amino acid composition (AAC), dipeptide composition, amino acid index, composition-transition-distribution, and physicochemical properties (PCP). In the first layer, hybrid features (a combination of AAC and PCP) with an extremely randomized tree outperformed state-of-the-art predictors in CPP prediction, with an accuracy of 0.896 when tested on independent datasets. In the second layer, hybrid features obtained through the feature selection protocol with random forest produced an accuracy of 0.725, which is better than that of state-of-the-art predictors. We anticipate that our method MLCPP will become a valuable tool for predicting CPPs and their uptake efficiency and might facilitate hypothesis-driven experimental design. The MLCPP server interface along with the benchmarking and independent datasets are freely accessible at: www.thegleelab.org/MLCPP

Keywords: cell-penetrating peptides, feature selection, machine learning, extremely randomized tree, random forest, uptake efficiency


Introduction

Cell-penetrating peptides (CPPs) are typically five to 30 amino acids in length and can pass through cell membranes via energy-dependent and energy-independent mechanisms without specific receptor interaction1. CPPs can enter cells while carrying a wide variety of covalently or non-covalently linked cargo, including nanoparticles, peptides, proteins, antisense oligonucleotides, small-interfering RNA, double-stranded DNA, and liposomes2, 3. Preclinical evaluations of CPP-derived therapeutics showed promising results in various disease models, which subsequently translated into clinical trials4. Therefore, CPPs represent an effective approach for delivering bioactive molecules into cells for various biomedical applications1, 4-10.

In recent years, the number of experimentally determined CPPs has grown steadily according to CPPsite 2.0 (http://crdd.osdd.net/raghava/cppsite/)11, 12. Interestingly, 90% of the peptides in the CPPsite 2.0 database are derived from naturally occurring proteins. With the wide application of next-generation sequencing techniques, numerous novel protein sequences can now be generated rapidly and at low cost; however, identifying novel CPPs from these proteins using traditional experimental methods is expensive and often laborious. Therefore, the development of computational methods is essential to promote the rapid identification of potential CPP candidates.

To this end, computational methods have been developed to facilitate high-throughput screening of peptides. An early method focused on using z-scales of chemical properties (molecular weight, molecular orbital calculations, and protein nuclear magnetic resonance shift) to encode a set of 87 CPPs and non-CPPs13.
Subsequently, other approaches, including identifying quantitative structure-activity relationships14 and machine-learning (ML)-based methods, such as support vector machine (SVM)15-17, random forest (RF) methods18, and neural networks19, were developed. Among these, four are publicly available methods (C2Pred, CellPPD, CPPpred and CPPred-RF), with three (C2Pred, CellPPD and CPPpred) classifying a provided peptide as CPP or non-CPP15, 17, 19, whereas CPPred-RF can predict not only CPPs but also their uptake efficiencies18.

Although these bioinformatics tools yielded encouraging results and stimulated further development in this area, further studies are needed. The number of features used by existing methods is limited, suggesting that other potentially useful features remain to be characterized. Additionally, none of the methods has been validated using a common independent dataset to verify reliability. Moreover, biologically significant features are intrinsically heterogeneous and multidimensional; however, most of the existing methods do not employ systematic feature selection to quantify the contributions and importance of the features used to measure model performance, leading to only a partial understanding of sequence-CPP relationships. Given these shortcomings, bioinformatics tools with higher degrees of accuracy need to be developed to facilitate systematic prediction of CPPs and their uptake efficiencies.

In the current study, we report a novel bioinformatics algorithm called machine-learning-based prediction of cell-penetrating peptides (MLCPP) that not only classifies peptides into CPP or non-CPP classes but also predicts the uptake efficiency of the predicted CPPs using information calculated from the amino acid sequence, including amino acid composition (AAC), amino acid index (AAI), dipeptide composition (DPC), physicochemical properties (PCP), and composition-transition-distribution (CTD). Because the performance of machine learning (ML) algorithms is problem specific, we explored four different algorithms, namely SVM, RF, extremely randomized tree (ERT), and k-nearest neighbour (k-NN), in both prediction layers. The ERT-based model showed consistent performance on both benchmarking and independent datasets in CPP prediction (first layer), whereas the RF-based model performed better in uptake efficiency prediction (second layer). To the best of our knowledge, this study represents the first application of the ERT method in CPP prediction, which is potentially useful for assisting CPP research.


Methods

The overall framework of MLCPP is shown in Figure 1; MLCPP development consisted of four stages. The first stage involved construction of non-redundant (nr) benchmarking and independent datasets based on the existing datasets and our own data collection. In the second stage, different features were extracted from peptide primary sequences, including AAC, AAI, CTD, DPC, and PCP. In the third stage, various feature sets were generated and input to four different ML-based classifiers to develop the respective prediction models. All these models were compared in terms of Matthews correlation coefficient (MCC), and the model with the highest MCC was selected as the best one. Note that stages 1 to 3 were carried out independently for CPP and uptake efficiency prediction. Finally, in the fourth stage, we constructed the two-layer prediction framework using the best models obtained from stage 3.
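The selection rule in stage 3 and the two-layer flow in stage 4 can be sketched as follows. This is a minimal illustration, not the MLCPP implementation: the `Model` type, the function names, the 0.5 decision threshold, and the MCC values in the usage note are all assumptions introduced here.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical model type: maps a feature vector to a probability in [0, 1].
Model = Callable[[Sequence[float]], float]


def select_best(mcc_by_model: Dict[str, float]) -> str:
    """Return the name of the model with the highest MCC."""
    return max(mcc_by_model, key=mcc_by_model.get)


def two_layer_predict(
    features_layer1: Sequence[float],
    features_layer2: Sequence[float],
    layer1: Model,
    layer2: Model,
    threshold: float = 0.5,  # assumed cut-off, not stated in the text
) -> Dict[str, Optional[object]]:
    """Layer 1 decides CPP vs non-CPP; layer 2 runs only for predicted CPPs."""
    p_cpp = layer1(features_layer1)
    if p_cpp < threshold:
        return {"class": "non-CPP", "p_cpp": p_cpp, "uptake": None}
    p_high = layer2(features_layer2)
    uptake = "high" if p_high >= threshold else "low"
    return {"class": "CPP", "p_cpp": p_cpp, "uptake": uptake}
```

For example, `select_best({"RF": 0.70, "ERT": 0.75})` returns `"ERT"` for these made-up scores, and a peptide that the first layer scores below the threshold never reaches the second layer.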

CPP benchmarking dataset

We utilized the CPPsite1 dataset15 and the C2Pred dataset17, which contain 1416 peptides (708 CPPs and 708 non-CPPs) and 822 peptides (411 CPPs and 411 non-CPPs), respectively. We combined these two datasets into a single dataset and excluded peptides containing non-natural amino acid residues. To generate an nr dataset, we excluded redundant peptides using the CD-HIT program20 with a sequence-identity cut-off of 0.8, meaning that any sequence sharing more than 80% identity with another sequence was discarded. Using a more stringent criterion, such as 30 or 40% as imposed in previous studies21, 22, could improve the reliability of the model. However, we did not use such a criterion because the number of samples would be insufficient for showing statistical significance. This procedure yielded an nr dataset consisting of 427 CPPs and 1038 non-CPPs. To obtain a balanced dataset, we used an equal number of CPPs (427) and non-CPPs (427 randomly selected from the 1038 non-CPPs) as the final dataset for prediction-model development.

CPP independent dataset

To generate an independent dataset, we extracted experimentally validated CPPs (positive examples) from CPPsite 2.0 (http://crdd.osdd.net/raghava/cppsite/)11, 12. Since only a few experimentally determined non-CPPs are available, we supplemented them with random peptides generated from Swiss-Prot (http://web.expasy.org/docs/swiss-prot_guideline.html). While generating the random peptides, peptides similar to CPPs were removed and the remaining ones were considered as non-CPPs. This approach for creating a negative-control dataset has been used in previous studies15, 17, 19. Subsequently, we applied CD-HIT (using a sequence-identity cut-off of 0.8) to the above dataset and excluded redundant sequences. Finally, we obtained 622 peptides (311 CPPs and 311 non-CPPs). The peptides in the independent dataset were unique [i.e., they were present neither in our CPP benchmarking dataset nor in the training data of the prediction models used by the other three methods (C2Pred, CellPPD, and CPPpred)].

Uptake efficiency benchmarking dataset

For uptake-efficiency prediction, we employed the CPPsite3 dataset proposed by Gautam et al.15, which contains 187 high-uptake-efficiency and 187 low-uptake-efficiency CPPs (374 peptides in total). We utilized this dataset to build a prediction model for uptake efficiency. All these datasets can be downloaded from our MLCPP server interface.

Feature extraction

We formulated the CPP and uptake efficiency prediction tasks as binary classification problems (CPP or non-CPP; low or high) and solved them using ERT, k-NN, RF, and SVM. One of the most important aspects of this process is the extraction of relevant features. Here, we used AAC, AAI, DPC, CTD, and PCP, whose definitions are briefly discussed below:

(i) AAC

AAC is the percentage of an individual amino acid in the given sequence, which could be computed using the following equation:

$$\mathrm{AAC}(i) = \frac{\text{Frequency of amino acid } (i)}{\text{Length of the peptide}} \tag{1}$$

where i can be any natural amino acid. AAC has a fixed length of 20 features.

(ii) AAI

AAI database consists of numerical indices representing various physicochemical and biochemical properties of amino acid23. Recently, Saha et al.24 applied Fuzzy c-means clustering and classified these amino acid indices into eight clusters, where the central indices of each cluster were named as high-quality amino acid indices (BLAM930101, BIOV880101, MAXF760101, TSAJ990101, NAKH920108, CEDJ970104, LIFS790101, and MIYS990104). We averaged these eight high-quality amino acid indices (20-dimensional vector) and utilized it as an input feature.
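The AAC definition in Eq. 1 can be sketched in a few lines. This is an illustrative sketch only: the function name `aac` and the alphabetical ordering of the 20 residues are assumptions made here, and the values are returned as fractions (multiply by 100 for percentages).

```python
from typing import List

# The 20 natural amino acids, one-letter codes (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(peptide: str) -> List[float]:
    """Amino acid composition: count of each residue divided by peptide length."""
    seq = peptide.upper()
    if not seq:
        raise ValueError("peptide must not be empty")
    return [seq.count(residue) / len(seq) for residue in AMINO_ACIDS]
```

For a toy sequence such as `"RRKK"`, the entries for R and K are both 0.5 and every other entry is 0.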


(iii) DPC

One of the limitations of AAC is that it does not consider local-order information of amino acids. DPC encapsulates both global and local information for a given sequence25 and represents the total number of each dipeptide normalized against all possible dipeptides in a given peptide sequence. DPC has a fixed length of 400 (20 × 20) features, which can be computed using the following equation:

$$\mathrm{DPC}(j) = \frac{\text{Total number of dipeptides } (j)}{\text{Total number of all possible dipeptides}} \tag{2}$$

where j is one of the 400 possible dipeptides.
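A small sketch of Eq. 2 follows. It assumes that "all possible dipeptides" in a peptide of length L means the L − 1 overlapping adjacent pairs, which is a common convention but is not spelled out in the text; the names `DIPEPTIDES` and `dpc` are introduced here for illustration.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of natural amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dpc(peptide: str) -> List[float]:
    """Dipeptide composition over overlapping adjacent pairs (assumed L - 1 pairs)."""
    seq = peptide.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("peptide must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-natural residues are skipped
            counts[pair] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]
```

For the toy sequence `"RRKR"` the three pairs are RR, RK, and KR, so each of those entries is 1/3.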

(iv) PCP

The following ten properties are computed directly from the sequence: (i) hydrophobic (F, I, W, L, V, M, Y, C, A); (ii) hydrophilic (R, K, N, D, E, P); (iii) neutral (T, H, G, S, Q); (iv) positively charged (K, H, R); (v) negatively charged (D, E); (vi) sequence length (n); (vii) turn-forming residues fraction [(N+G+P+S)/n]; (viii) absolute charge per residue [|(R+K-D-E)/n - 0.03|]; (ix) molecular weight; and (x) aliphatic index [(A+2.9V+3.9I+3.9L)/n]. For properties (i) to (v), the frequency of residues belonging to the listed group is used.
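The PCP features can be sketched as below. Two assumptions are made and should be read as such: group frequencies are normalized by peptide length (the text says only "frequencies"), and the absolute-charge term is read as an absolute value. Molecular weight is left out because it needs a residue-mass table that the text does not provide, so this sketch returns nine of the ten properties.

```python
from typing import Dict

# Residue groups exactly as listed in the text.
GROUPS = {
    "hydrophobic": "FIWLVMYCA",
    "hydrophilic": "RKNDEP",
    "neutral": "THGSQ",
    "positive": "KHR",
    "negative": "DE",
}


def pcp(peptide: str) -> Dict[str, float]:
    """Nine of the ten PCP features (molecular weight intentionally omitted)."""
    seq = peptide.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("peptide must not be empty")
    c = seq.count
    # Assumed normalization: group count divided by peptide length.
    feats = {name: sum(c(a) for a in members) / n for name, members in GROUPS.items()}
    feats["length"] = float(n)
    feats["turn_fraction"] = (c("N") + c("G") + c("P") + c("S")) / n
    feats["abs_charge_per_residue"] = abs((c("R") + c("K") - c("D") - c("E")) / n - 0.03)
    feats["aliphatic_index"] = (c("A") + 2.9 * c("V") + 3.9 * c("I") + 3.9 * c("L")) / n
    return feats
```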

(v) CTD

The CTD feature was introduced to predict protein-folding classes26 and has since been applied in various sequence-based classification algorithms22, 27, 28. CTD represents the distribution of amino acid patterns along the primary peptide sequence based on their physicochemical or structural properties, which include hydrophobicity, polarizability, normalized van der Waals volume, secondary structure, polarity, charge and solvent accessibility. For each property, the residues of a peptide primary sequence are divided into three groups, for example polar, neutral and hydrophobic. C is the frequency of amino acids of a particular group (hydrophobic, neutral or polar) normalized by peptide length. T describes the percentage frequency with which an amino acid of one group is followed by an amino acid of a different group (i.e., a polar followed by a neutral or a neutral followed by a polar; a polar followed by a hydrophobic or a hydrophobic followed by a polar; a neutral followed by a hydrophobic or a hydrophobic followed by a neutral). D consists of five values for each of the three groups. It measures the chain length within which the first, 25, 50, 75, and 100% of the amino acids of a specific group are contained. There are three descriptors and 21 descriptor values (3(C) + 3(T) + 5×3(D) = 21) for a single amino acid attribute. Consequently, seven different amino acid attributes produce a total of 147 features (7×21 = 147).
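To make the 3 + 3 + 15 = 21 count concrete, the sketch below computes C, T, and D for a single attribute. The residue grouping used for hydrophobicity (polar RKEDQN, neutral GASTPHY, hydrophobic CLVIMFW) is the grouping commonly used with CTD descriptors and is an assumption here, as is the rounding rule used to pick the residue marking each quantile in D; the text specifies neither.

```python
import math
from typing import List

# Assumed three-way grouping for the hydrophobicity attribute.
GROUPS = {"polar": "RKEDQN", "neutral": "GASTPHY", "hydrophobic": "CLVIMFW"}
NAMES = list(GROUPS)


def ctd_one_attribute(peptide: str) -> List[float]:
    """Return 21 values: 3 composition, 3 transition, 15 distribution."""
    seq = peptide.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("peptide must contain at least two residues")
    label = {aa: g for g, members in GROUPS.items() for aa in members}
    classes = [label.get(aa) for aa in seq]  # None for non-natural residues

    # C: fraction of residues in each group.
    comp = [classes.count(g) / n for g in NAMES]

    # T: fraction of adjacent pairs that switch between two groups, in either order.
    pairs = list(zip(classes, classes[1:]))
    trans = []
    for i in range(3):
        for j in range(i + 1, 3):
            a, b = NAMES[i], NAMES[j]
            hits = sum(1 for p in pairs if p in ((a, b), (b, a)))
            trans.append(hits / (n - 1))

    # D: sequence position (as % of length) of the first residue of a group and of
    # the residues closing 25, 50, 75, and 100% of that group's occurrences.
    dist = []
    for g in NAMES:
        pos = [i + 1 for i, c in enumerate(classes) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                dist.append(0.0)
            else:
                k = max(1, math.ceil(frac * len(pos)))  # assumed rounding rule
                dist.append(100.0 * pos[k - 1] / n)
    return comp + trans + dist
```

Repeating this for seven attributes, each with its own three-way grouping, would give the 147 values described above.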

Machine learning methods

We employed four different ML methods, namely RF, ERT, SVM, and k-NN, to develop the respective prediction models using the benchmarking dataset. All of these ML methods have been widely used in bioinformatics22, 28-43 and are described below.

(i) RF

RF is an ensemble technique that utilizes hundreds or thousands of independent decision trees to perform classification and regression44. The three most influential parameters of the RF algorithm are the number of trees (ntree), the number of variables randomly chosen at each node split (mtry), and the minimum number of samples required to split an internal node (nsplit). These parameters were optimized using a grid search within the following ranges: ntree from 50 to 1000, with a step size of 20; mtry from one to seven, with a step size of one; and nsplit from two to 10, with a step size of one.
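The search space described above can be enumerated as in the following sketch. The helper names `rf_grid` and `grid_search` are hypothetical, and the scoring function is left to the caller because the text does not name the software used. In scikit-learn, the closest counterparts of ntree, mtry, and nsplit would be `n_estimators`, `max_features`, and `min_samples_split`; that mapping is an assumption, not something stated in the text.

```python
from itertools import product
from typing import Callable, Dict, List


def rf_grid() -> List[Dict[str, int]]:
    """Enumerate the (ntree, mtry, nsplit) combinations given in the text."""
    ntree = range(50, 1001, 20)  # 50, 70, ..., 990; a step of 20 from 50 never lands on 1000
    mtry = range(1, 8)           # 1..7
    nsplit = range(2, 11)        # 2..10
    return [
        {"ntree": t, "mtry": m, "nsplit": s}
        for t, m, s in product(ntree, mtry, nsplit)
    ]


def grid_search(
    grid: List[Dict[str, int]],
    score: Callable[[Dict[str, int]], float],
) -> Dict[str, int]:
    """Return the parameter set with the highest score (e.g., cross-validated MCC)."""
    return max(grid, key=score)
```

With these ranges the grid holds 48 × 7 × 9 = 3024 combinations, which gives a sense of the cost of evaluating each one by cross-validation.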

(ii) ERT

ERT belongs to another class of ensemble methods widely used for developing classification and regression models45. The major difference between ERT and RF is that ERT uses all the training samples to grow trees, whereas RF uses only a bootstrap sample. Additionally, the splitting criterion in ERT is random, whereas that in RF is based on the information gain measured by Gini impurity. The parameter-optimization procedure in ERT is the same as that in the RF method.

(iii) SVM

SVM is used to develop both classification and regression models and is based on statistical-learning theory46. SVMs focus on the boundary between classes and map the input space created by independent variables using a nonlinear transformation according to a kernel function. We experimented with commonly used kernel types, including a linear kernel, a Gaussian radial-basis function (RBF), and a polynomial kernel. Of these, the RBF kernel was the most suitable for our problem. An RBF-SVM requires the optimization of two critical parameters: C (controls the trade-off between training error and margin) and γ (controls how peaked the Gaussians centered on the support vectors are). These two parameters were optimized using a grid search within the following ranges: C from 2^-15 to 2^10 and γ from 2^-10 to 2^10 in log2 steps.
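The role of γ and the size of the search grid can be illustrated with a short sketch. The kernel below is the standard Gaussian RBF form exp(−γ‖x − y‖²); the assumption made here is that "log2 steps" means the exponent increases by one at each step.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Gaussian RBF kernel: larger gamma gives a more sharply peaked similarity."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Assumed grids: exponents advance by 1 between the stated bounds.
C_GRID: List[float] = [2.0 ** e for e in range(-15, 11)]      # 2^-15 ... 2^10
GAMMA_GRID: List[float] = [2.0 ** e for e in range(-10, 11)]  # 2^-10 ... 2^10
```

Under that assumption the grid has 26 values of C and 21 values of γ, i.e. 546 (C, γ) pairs to evaluate.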

(iv) k-NN k-NN is among the simplest and most popular data-mining algorithms used for developing classification and regression models. Here, we used Euclidean distance for the distance function, with k requiring optimization, which was performed using grid search in the range of one to 300.
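A minimal k-NN classifier with Euclidean distance, as described above, might look like the following. The function name and the toy data in the usage note are illustrative, and majority voting with ties resolved in favour of the nearer neighbours is an assumption about details the text leaves open.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(
    train_x: List[Sequence[float]],
    train_y: List[str],
    query: Sequence[float],
    k: int,
) -> str:
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training points by Euclidean distance to the query.
    ranked = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With toy points `[[0, 0], [0, 1], [5, 5], [6, 5]]` labelled CPP, CPP, non-CPP, non-CPP, a query at `[0.2, 0.1]` with k = 3 is assigned to the CPP class. An odd k avoids vote ties in a two-class problem.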

Feature selection protocol

Recently, a systematic feature selection protocol was applied to develop a novel protein model quality assessment method called SVMQA47, which was the best method in the CASP12 blind prediction48, 49. This protocol was also applied in our recent studies, including DNase I hypersensitivity predictions31, anti-inflammatory peptide predictions28 and phage virion protein predictions22, where it improved the performance of our methods. Therefore, we applied this approach to the current problem.

First, we applied the RF algorithm and estimated the feature importance scores (FISs) independently for the first layer (CPP benchmarking dataset) and the second layer (uptake-efficiency benchmarking dataset). The linear combination of the five compositions (597 features) was input to RF and 10-fold cross-validation (CV) was performed. For each round of CV, we built 1,000 trees, and the number of variables at each node was randomly chosen from 1 to 50. The average FIS from all the trees is shown in Supplementary Figure S-1. For the first layer, we excluded features with an FIS less than 0.0003 and generated 32 feature sets from the remaining features based on the FIS cut-off (0.0003 ≤ FIS ≤ 0.0035, with a step size of 0.0001), whereas for the second layer, we excluded features with an FIS less than 0.0005 and generated 30 feature sets from the remaining features based on the FIS cut-off (0.0005 ≤ FIS ≤ 0.0035, with a step size of 0.0001).

Each feature set was provided as an input to the four ML algorithms (RF, ERT, SVM and k-NN), and 10-fold CV was performed. During 10-fold CV, the corresponding ML parameters (SVM parameters: C and γ; RF and ERT parameters: ntree, mtry, and nsplit; and k-NN parameter: k) were optimized using a grid-search approach. Here, the 10-fold CV procedure was repeated five times with random partitioning of the benchmarking dataset, which yielded five sets of ML parameters and performances (see Evaluation metrics) that could be similar or different. We took the average performance and the median ML parameters as the final values.
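The two bookkeeping steps of this protocol, building nested feature sets from FIS cut-offs and summarizing repeated CV runs, can be sketched as follows. The reading that each cut-off keeps the features whose FIS is at least that cut-off is an interpretation, and the text does not state how the end points of the cut-off range are handled, so the number of sets this sketch produces need not match the counts reported above. All names and the toy scores are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Tuple


def feature_sets_by_cutoff(
    fis: Dict[str, float], lo_1e4: int, hi_1e4: int
) -> Dict[float, List[str]]:
    """One feature set per cut-off; cut-offs are given in units of 1e-4.

    Integer units avoid accumulating floating-point error when stepping by 0.0001.
    Both end points are included (an assumption).
    """
    sets = {}
    for k in range(lo_1e4, hi_1e4 + 1):
        cutoff = k / 10000
        sets[cutoff] = sorted(name for name, score in fis.items() if score >= cutoff)
    return sets


def summarize_repeats(runs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Given (best parameter, performance) per repeated CV run,
    return (median parameter, mean performance)."""
    params = [p for p, _ in runs]
    scores = [s for _, s in runs]
    return median(params), mean(scores)
```

For example, five repeats that select k = 5, 7, 5, 9, 7 would be summarized by the median k = 7, with the reported performance being the mean over the five runs.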

Evaluation metrics To compare prediction methods, we used threshold- and rank-based performance measures. For threshold-based parameters, we used sensitivity, specificity, accuracy, and the MCC using the following equations:

$$
\begin{cases}
\text{Sensitivity} = \dfrac{TP}{TP+FN} \\[2ex]
\text{Specificity} = \dfrac{TN}{TN+FP} \\[2ex]
\text{Accuracy} = \dfrac{TP+TN}{TP+FP+TN+FN} \\[2ex]
\text{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{cases}
\tag{3}
$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. For rank-based comparison, we used receiver-operating characteristic (ROC) curves, which plot sensitivity as a function of (1 − specificity) for different decision thresholds. To quantitatively compare two ROC curves, we computed the area under the curve (AUC). The statistical significance of the difference between two ROC areas (P-value) was assessed using a two-tailed t-test50.
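The threshold-based measures in Eq. 3 and a rank-based AUC can be computed as in the sketch below. The confusion-matrix counts in the usage note are made up for illustration, and the pairwise-comparison form of the AUC (the fraction of positive/negative pairs ranked correctly, ties counted as one half) is a standard equivalent of the area under the ROC curve rather than the authors' stated procedure.

```python
import math
from typing import Dict, Sequence


def threshold_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # MCC is conventionally set to 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

With hypothetical counts TP = 90, TN = 80, FP = 20, FN = 10, the sensitivity is 0.90, the specificity 0.80, the accuracy 0.85, and the MCC roughly 0.70.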


Results

Compositional analysis

To understand residue preferences, we performed compositional analyses, including evaluation of AAC and DPC, between CPPs and non-CPPs using the benchmarking dataset. AAC analysis revealed that certain residues, including Lys, Arg, His, Leu, and Trp, were dominant in CPPs, whereas Gly, Asp, Glu, Ile, Val, and Tyr were dominant in non-CPPs (Welch’s t-test; P ≤ 0.01) (Figure 2A). Interestingly, the composition of Arg was 3-fold higher in CPPs than in non-CPPs, suggesting that it might play a significant role. Consistent with this observation, previous experimental studies showed that Arg plays an important role in determining CPP-transduction properties, as the guanidine head group of Arg forms a hydrogen bond with negatively charged groups present in cell membranes, leading to cellular internalization of CPPs at physiological pH51. Furthermore, DPC analysis revealed that 125 of 400 possible dipeptides differed significantly between CPPs and non-CPPs (Welch’s t-test; P ≤ 0.01) (Supplementary Table S-1). Of these, the top 10 most abundant dipeptides in CPPs were RR, KK, RK, KR, LR, KL, LK, LL, RL, and WR, and those in non-CPPs were LE, GA, EE, GL, LV, SE, EG, LG, VV, and VA (Figure 2B). Composition-based analysis showed that in CPPs, the most preferred residues included positively charged and hydrophobic amino acids. Similarly, the dipeptides mostly included pairs of positively charged–positively charged, positively charged–hydrophobic, and hydrophobic–hydrophobic amino acids in different local orders. This result indicated that information on amino acid and dipeptide preference would be useful for differentiating CPPs and non-CPPs.

We also performed compositional analysis of low- and high-uptake-efficiency CPPs using the uptake efficiency benchmarking dataset.
AAC analysis revealed that three residues (Cys, Met, and Trp) were dominant in high-uptake-efficiency CPPs, whereas Ser was dominant in low-uptake-efficiency CPPs (Welch’s t-test; P ≤ 0.01) (Figure 2C). Furthermore, DPC analysis revealed that only 3% of the dipeptides differed significantly between high- and low-uptake-efficiency CPPs (Welch’s t-test; P ≤ 0.01) (Figure 2D). Whereas the CPP versus non-CPP compositional analysis showed significant differences in certain amino acids and dipeptides, few differences were observed in the uptake-efficiency analysis. This result indicated that amino acid and dipeptide preference information would be less effective at discriminating high- from low-uptake-efficiency CPPs.
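For readers unfamiliar with Welch's t-test, which is used throughout the compositional analysis, the statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed as in this sketch. The input values in the test are toy numbers, not data from this study. Converting the statistic to a P-value requires the t-distribution, which the standard library does not provide; a statistics package would be needed for that step.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's unequal-variance t statistic and Welch-Satterthwaite degrees of freedom.

    Each sample needs at least two values, and the two samples must not
    both have zero variance.
    """
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

In the setting above, `a` and `b` would be the per-peptide compositions of one residue (or dipeptide) in the two classes being compared.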


Development of first-layer prediction models and the selection of an optimal model

To understand the contribution of each property to the prediction of CPPs or their uptake efficiencies, we developed individual composition-based prediction models and, to maximize the advantage of ML methods, combined different compositions (hybrid features) to generate the corresponding models. Hybrid features can be divided into two groups: 1) a combination of different composition-based features and 2) a set of important features obtained through the feature selection protocol (see Methods). Initially, we generated 53 prediction models for each ML-based method using different feature sets, with detailed performance results shown in Figure 3A and 4. RF and ERT performed consistently better than the other two methods (SVM and k-NN) regardless of input features, indicating that decision-tree-based algorithms are better suited for CPP prediction. For RF and ERT, three models, namely H7, H12 and H16, based on the hybrid features AAC+PCP, AAI+PCP, and CTD+PCP, respectively, produced similar performances that were better than those of the remaining models. Therefore, we selected the H7-based model as the final one for the RF and ERT methods. We then checked whether applying the feature selection protocol to H7 (30 features) improved the prediction performance significantly. To this end, we generated 15 additional models with varying numbers of features (subsets of H7 ranging from 8 to 29 features) and compared their performance with that of the H7 model. Figure 3A shows that the additional models performed similarly to the H7 model; hence, the feature selection protocol did not improve the prediction performance of RF and ERT as we had expected. However, this protocol improved the prediction performance of SVM (F15).
Finally, we compared the performances of the model selected above for each ML method; the results are shown in Table 1, where the methods are ranked according to MCC. RF and ERT produced similar performances, with an accuracy and MCC of 88.4% and 0.769, respectively, which are higher than those of the other methods by 2–8% and 4–11%. At a P-value significance threshold of 0.05, there were no significant differences between these methods in terms of AUC. To check the transferability of these methods, we evaluated them on the independent dataset and compared them with the state-of-the-art methods.

Performance of various methods on the independent dataset

We evaluated the performance of our four ML-based methods and the state-of-the-art methods, including C2Pred, CPPpred, CPPred-RF, and CellPPD, on the independent dataset. Table 2 shows that the ERT-based method achieved the highest MCC, accuracy and AUC values of 0.793, 0.896 and 0.959, respectively. The corresponding metrics were 4–80%, 2–41% and 1–23% higher than those achieved by the other methods. Although ERT and RF showed similar performance on the benchmarking dataset, ERT performed better than RF on the independent dataset. Using a P-value cut-off of 0.01, ERT significantly outperformed two of the methods developed in this study (SVM and k-NN) and the state-of-the-art methods C2Pred, CPPpred, CellPPD and CPPred-RF, thus showing improved performance over existing methods for predicting CPPs. Among the existing methods, all except CellPPD produced a reasonable performance.

Overfitting is considered a major problem in ML, with previous studies reporting that peptide- and protein-sequence-based ML methods are often associated with high risks of overfitting52-54. However, the ERT method performed consistently well on both the benchmarking and independent datasets (Figure 5), suggesting that ERT did not suffer from overfitting and generalizes better to unseen peptides than the other methods employed in this study. Hence, we selected the ERT-based model for the first-layer prediction of MLCPP.

Development of second-layer models and the selection of an optimal model

As for the CPP-prediction model, we first generated 51 uptake-efficiency prediction models for each ML-based method using different feature sets, with the performance of these models shown in Figure 4B and 6. As in the first-layer prediction, RF and ERT performed consistently better than the other two ML-based methods regardless of input features. RF and ERT produced similar performance using the hybrid features (H7: AAC+PCP); hence, we applied the feature selection protocol to H7 and generated 14 additional models to verify whether it improves the prediction performance. Figure 4B shows that the feature selection protocol improved the prediction performance of all four methods, RF (F17), ERT (F29), SVM (F19) and k-NN (F13), compared with the control and the other models. The final selected model for each method is shown in Table 3. Using a P-value cut-off of 0.01, RF performed slightly better than the ERT and SVM methods and significantly outperformed the k-NN method. Hence, we selected the RF-based model for the second-layer prediction of MLCPP.

Performance of various methods on the uptake efficiency benchmarking dataset

Because no independent datasets were available, we compared our proposed RF-based method with the state-of-the-art methods (CellPPD, SkipCPP-Pred, Diener’s method, and CPPred-RF), each of which developed its prediction model using the same dataset. Among these methods, only CPPred-RF is publicly available. As shown in Table 4, our RF method showed better performance than the other methods in terms of MCC (by 2.2–17.5%) and accuracy (by 1.4–9.4%). This result suggests that our predictor is better than the existing predictors at classifying CPP-uptake efficiency. Although our method uses the same RF algorithm as the recently published CPPred-RF method, the input features are different. Notably, the 17 features used by our RF method are roughly ninefold fewer than those used by CPPred-RF, indicating that the improved performance of our method relative to CPPred-RF could be attributed to the more efficient use of important features.

Comparison of MLCPP methodology with state-of-the-art methods

We compared our method with state-of-the-art methods in terms of algorithm characteristics (Table 5). CellPPD and C2Pred use the same ML method (SVM), whereas CPPpred and CPPred-RF use a neural network and RF, respectively. In contrast, MLCPP uses ERT and RF to predict CPPs and uptake efficiencies, respectively, making this the first application of an ERT-based method in CPP prediction. We generated a training dataset exhibiting lower levels of redundancy than those of CellPPD and C2Pred; moreover, our training dataset ranked third lowest relative to those utilized for CellPPD and CPPred-RF validation. The MLCPP parameter-optimization procedure was also unique, involving five independent 10-fold CVs to finalize the ML parameters. It should also be highlighted that the feature dimension used in ERT is lower than those of the previous methods, which can reduce the computational complexity.
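The repeated cross-validation scheme mentioned above can be illustrated with a minimal sketch. It assumes that "five independent 10-fold CVs" means five differently shuffled 10-fold partitions whose scores are averaged for each parameter setting; the function names (`kfold_indices`, `repeated_cv_score`) and the `evaluate` callback are hypothetical and are not taken from the MLCPP implementation.

```python
import random
from statistics import mean


def kfold_indices(n_samples, n_folds, seed):
    """Shuffle sample indices with the given seed and deal them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def repeated_cv_score(n_samples, evaluate, n_repeats=5, n_folds=10):
    """Average a score over n_repeats independent n_folds-fold cross-validations.

    `evaluate(train_idx, test_idx)` is a user-supplied (hypothetical) callback
    that trains a model on train_idx and returns its score on test_idx.
    """
    scores = []
    for repeat in range(n_repeats):
        # A different seed per repeat gives an independent partition.
        folds = kfold_indices(n_samples, n_folds, seed=repeat)
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            scores.append(evaluate(train_idx, test_idx))
    return mean(scores)


# Toy usage: the "score" is simply the fraction of samples held out for testing.
score = repeated_cv_score(100, lambda train, test: len(test) / (len(train) + len(test)))
```

In a parameter search, one would call `repeated_cv_score` once per candidate setting and keep the setting with the best averaged score; averaging over several shuffles reduces the dependence of the chosen parameters on any single partition.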

Web server implementation

Several examples of bioinformatics tools utilized for protein function prediction have been reported55-57. Following this practice, we developed an online prediction server called MLCPP based on the method proposed in this work, which is freely available at: www.thegleelab.org/MLCPP. Users can paste or upload query protein sequences in FASTA format. After the input sequences are submitted, the results can be retrieved in a separate interface. All the curated datasets used in this study can be downloaded from the web server.
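For readers preparing input in FASTA format, a minimal parsing sketch is shown below. It assumes that queries consist of the 20 standard amino acids and borrows the 5–50 residue range of the CPP dataset as a length filter; whether the MLCPP server enforces such checks is not stated here, and `parse_fasta` and `is_valid_peptide` are hypothetical helper names.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_valid_peptide(sequence, min_len=5, max_len=50):
    """Check the (assumed) length bounds and that only standard residues occur."""
    return min_len <= len(sequence) <= max_len and set(sequence) <= STANDARD_AMINO_ACIDS


example = """>pep1 hypothetical arginine-rich peptide
RRRRRRRRR
>pep2 contains a non-standard symbol
ACDX
"""
records = parse_fasta(example)
valid = [(name, seq) for name, seq in records if is_valid_peptide(seq)]
```

Here only `pep1` passes the filter, because `pep2` is shorter than five residues and contains the non-standard symbol `X`.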


Discussion

CPPs represent an effective approach to the intracellular delivery of bioactive molecules for various biomedical applications1, 4-6. However, the identification and development of novel CPPs using experimental methods can be tedious and expensive. Technically, it is necessary to scan the full protein sequence with overlapping windows and to test each resulting peptide segment for potential cell-penetrating activity. This step is expensive and laborious; therefore, the development of sequence-based computational methods can allow rapid screening of CPP candidates prior to synthesis, thereby accelerating CPP-based research. We made a systematic attempt to understand the nature of CPPs prior to developing the prediction model. Compositional analysis revealed that charged residues (Lys, Arg and His) are highly abundant in CPPs compared with non-CPPs. A previous study reported that the majority of functional CPPs harbour stretches of Arg residues (5–9 residues)58. Therefore, we calculated stretches or the total number of Arg residues (cut-off: ≥5 Arg residues) for each peptide in our dataset (738 CPPs with lengths ranging from 5–50 residues). Our results showed that only 40.5% of CPPs contained ≥5 Arg residues, 21.3% of CPPs did not contain any Arg residues, and 38.2% of CPPs contained one to four Arg residues, indicating that factors other than Arg distribution play important roles in cell penetration. In this study, we proposed a two-layer prediction framework called MLCPP for predicting CPPs and their uptake efficiency. Such two-layer prediction frameworks are quite common in the field of bioinformatics59, 60. We explored four different ML algorithms (ERT, RF, SVM and k-NN) and five different feature encodings (AAC, DPC, AAI, CTD, and PCP) for discriminating CPPs from non-CPPs and low from high uptake efficiency.
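The arginine tally and the amino acid composition (AAC) encoding mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it bins peptides by their total Arg count (the text leaves open whether stretches or totals were used), and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def arg_category(sequence):
    """Bin a peptide by its total number of Arg (R) residues: '0', '1-4' or '>=5'."""
    n_arg = sequence.upper().count("R")
    if n_arg == 0:
        return "0"
    return "1-4" if n_arg <= 4 else ">=5"


def arg_distribution(peptides):
    """Return the percentage of peptides falling in each Arg-count bin."""
    counts = Counter(arg_category(p) for p in peptides)
    return {label: 100.0 * counts[label] / len(peptides) for label in ("0", "1-4", ">=5")}


def amino_acid_composition(sequence):
    """AAC: fraction of each of the 20 standard residues in the sequence."""
    sequence = sequence.upper()
    return {aa: sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS}


# Illustrative input sequences chosen for this sketch.
peptides = ["RRRRRRRRR", "GRKKRRQRRRPPQ", "KLALKLALKALKAALKLA", "RQIKIWFQNRRMKWKK"]
distribution = arg_distribution(peptides)
aac = amino_acid_composition(peptides[0])
```

Applied to a full dataset, `arg_distribution` would produce the kind of three-way breakdown (no Arg, one to four Arg, five or more Arg) reported in the text, while `amino_acid_composition` yields a 20-dimensional feature vector per peptide.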
It is worth noting that all of these features and ML algorithms have previously been applied in various sequence-based classification tasks22, 28-43, 61, 62. Based on the CV performances, ERT and RF were selected to construct the first- and second-layer predictors, respectively, and the resulting framework was named MLCPP. Importantly, the nr independent dataset constructed in this study has the lowest sequence identity among the existing datasets, which makes it one of the most stringent standard datasets available in the literature. Using this dataset, we evaluated our method against the state-of-the-art predictors. Indeed, this is the first instance in which all CPP predictors were evaluated using a standard independent dataset. The outcome of this evaluation will help experimentalists select the best-performing method or consensus predictions, and will help method developers address the shortcomings of the existing methods. Evaluation results showed that MLCPP significantly outperformed the existing CPP-prediction methods according to the P-value cut-off of 0.01. The superior performance of MLCPP may be attributed to the following factors: (i) the choice of the ERT algorithm and hybrid features (H7), which collectively make a significant contribution to the first-layer performance; (ii) the application of a feature selection protocol to the hybrid features (H7) to select the optimal feature set, in conjunction with the choice of the RF algorithm, which further improves the second-layer performance; and (iii) the number of features used in the first-layer (ERT) and second-layer predictions, which is lower than in the state-of-the-art methods and may reduce the computational complexity of the learning and prediction algorithms. However, this method cannot reveal the mechanism associated with CPPs crossing the lipid bilayer of endosomes. Despite the availability of experimental methods capable of investigating this activity63, 64, the mechanism by which CPPs enter cells remains to be elucidated. Therefore, further studies, such as structure-based membrane-peptide molecular dynamics simulations, are necessary. Interestingly, the top two methods for predicting CPPs and their uptake efficiencies are RF and ERT, indicating that decision-tree-based algorithms are better suited for sequence-based classification of CPPs and uptake efficiency (low or high). Recently, Gabere and Noble65 evaluated publicly available antimicrobial-peptide-prediction methods using a standard independent dataset and showed that the RF-based method outperformed other ML-based methods. Furthermore, our recent study evaluating anticancer-peptide-prediction methods using a standard dataset also showed that an RF-based method was superior to other ML-based methods54. Overall, these studies indicate that decision-tree-based algorithms are better suited for sequence-based peptide classification.
In future work, our major focus will be to explore more powerful ML algorithms such as deep learning, to incorporate novel features into the optimal feature set reported in this study, and to explore different feature selection techniques such as ANOVA39, F-score36 and binomial distribution40. The predictor proposed in this study is quite promising for CPP prediction and is available as a web server at www.thegleelab.org/MLCPP. This is the second method developed to identify CPPs and their uptake efficiency simultaneously, and it achieves higher accuracy than the existing method. Compared with experimental approaches, bioinformatics tools such as MLCPP represent a powerful and cost-effective approach for proteome-wide prediction of CPPs. Therefore, MLCPP might be useful for large-scale CPP prediction and for facilitating hypothesis-driven experimental design.
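The two-layer design discussed in this section can be summarized in a short sketch. It assumes that each layer exposes a probability for a given sequence and that a 0.5 decision threshold is used; both are assumptions for illustration, and the toy scoring functions merely stand in for trained models (in MLCPP these roles are played by ERT and RF).

```python
def two_layer_predict(sequence, first_layer, second_layer, threshold=0.5):
    """Route a peptide through a CPP classifier, then an uptake-efficiency classifier.

    `first_layer` and `second_layer` are hypothetical callables that map a
    sequence to a probability in [0, 1]; the 0.5 threshold is an assumption.
    """
    p_cpp = first_layer(sequence)
    if p_cpp < threshold:
        # Predicted non-CPPs never reach the second layer.
        return {"class": "non-CPP", "p_cpp": p_cpp, "uptake": None}
    p_high = second_layer(sequence)
    uptake = "high" if p_high >= threshold else "low"
    return {"class": "CPP", "p_cpp": p_cpp, "uptake": uptake, "p_high": p_high}


# Toy stand-ins for trained models, used only to make the sketch runnable.
def toy_first_layer(seq):
    return sum(seq.count(aa) for aa in "KR") / len(seq)  # fraction of Lys/Arg


def toy_second_layer(seq):
    return seq.count("R") / len(seq)  # fraction of Arg


result_a = two_layer_predict("RRRRRRRRR", toy_first_layer, toy_second_layer)
result_b = two_layer_predict("GASGASGAS", toy_first_layer, toy_second_layer)
```

The key design point is the gating: uptake efficiency is only predicted for peptides that the first layer classifies as CPPs, so errors in the first layer propagate to the second.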


Acknowledgements

The authors would like to thank Da Yeon Lee for assistance in manuscript preparation. This work was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Education, Science, and Technology [2015R1D1A1A09060192 and 2009-0093826] and the Brain Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning [2016M3C7A1904392].

Conflict of interest

None declared.

Supporting Information

The following supporting information is available free of charge at ACS website http://pubs.acs.org
Figure S-1 - Input features along with their importance score
Table S1 - Dipeptide composition (DPC) analysis of CPPs and non-CPPs


References

(1) Raucher, D.; Ryu, J. S., Cell-penetrating peptides: strategies for anticancer treatment. Trends Mol Med 2015, 21, 560-70.
(2) Bechara, C.; Sagan, S., Cell-penetrating peptides: 20 years later, where do we stand? FEBS Lett 2013, 587, 1693-702.
(3) Brasseur, R.; Divita, G., Happy birthday cell penetrating peptides: already 20 years. Biochim Biophys Acta 2010, 1798, 2177-81.
(4) Guidotti, G.; Brambilla, L.; Rossi, D., Cell-Penetrating Peptides: From Basic Research to Clinics. Trends Pharmacol Sci 2017, 38, 406-424.
(5) Fominaya, J.; Bravo, J.; Rebollo, A., Strategies to stabilize cell penetrating peptides for in vivo applications. Ther Deliv 2015, 6, 1171-94.
(6) Li, H.; Tsui, T. Y.; Ma, W., Intracellular Delivery of Molecular Cargo Using Cell-Penetrating Peptides and the Combination Strategies. Int J Mol Sci 2015, 16, 19518-36.
(7) Dash-Wagh, S.; Jacob, S.; Lindberg, S.; Fridberger, A.; Langel, U.; Ulfendahl, M., Intracellular Delivery of Short Interfering RNA in Rat Organ of Corti Using a Cell-penetrating Peptide PepFect6. Mol Ther Nucleic Acids 2012, 1, e61.
(8) Parnaste, L.; Arukuusk, P.; Langel, K.; Tenson, T.; Langel, U., The Formation of Nanoparticles between Small Interfering RNA and Amphipathic Cell-Penetrating Peptides. Mol Ther Nucleic Acids 2017, 7, 1-10.
(9) Tuttolomondo, M.; Casella, C.; Hansen, P. L.; Polo, E.; Herda, L. M.; Dawson, K. A.; Ditzel, H. J.; Mollenhauer, J., Human DMBT1-Derived Cell-Penetrating Peptides for Intracellular siRNA Delivery. Mol Ther Nucleic Acids 2017, 8, 264-276.
(10) Ndeboko, B.; Narayan, R.; Lemamy, G. J.; Jamard, C.; Nielsen, P. E.; Cova, L., Role of cell-penetrating peptides in intracellular delivery of peptide nucleic acids targeting hepadnaviral replication. Molecular Therapy-Nucleic Acids 2017, 9, 162-169.
(11) Agrawal, P.; Bhalla, S.; Usmani, S. S.; Singh, S.; Chaudhary, K.; Raghava, G. P.; Gautam, A., CPPsite 2.0: a repository of experimentally validated cell-penetrating peptides. Nucleic Acids Res 2016, 44, D1098-103.
(12) Gautam, A.; Singh, H.; Tyagi, A.; Chaudhary, K.; Kumar, R.; Kapoor, P.; Raghava, G. P., CPPsite: a curated database of cell penetrating peptides. Database (Oxford) 2012, 2012, bas015, https://doi.org/10.1093/database/bas015.
(13) Sandberg, M.; Eriksson, L.; Jonsson, J.; Sjostrom, M.; Wold, S., New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem 1998, 41, 2481-91.
(14) Dobchev, D. A.; Mager, I.; Tulp, I.; Karelson, G.; Tamm, T.; Tamm, K.; Janes, J.; Langel, U.; Karelson, M., Prediction of Cell-Penetrating Peptides Using Artificial Neural Networks. Curr Comput Aided Drug Des 2010, 6, 79-89.
(15) Gautam, A.; Chaudhary, K.; Kumar, R.; Sharma, A.; Kapoor, P.; Tyagi, A.; Open Source Drug Discovery Consortium; Raghava, G. P., In silico approaches for designing highly effective cell penetrating peptides. J Transl Med 2013, 11, 74.
(16) Sanders, W. S.; Johnston, C. I.; Bridges, S. M.; Burgess, S. C.; Willeford, K. O., Prediction of cell penetrating peptides by support vector machines. PLoS Comput Biol 2011, 7, e1002101.
(17) Tang, H.; Su, Z. D.; Wei, H. H.; Chen, W.; Lin, H., Prediction of cell-penetrating peptides with feature selection techniques. Biochem Biophys Res Commun 2016, 477, 150-4.
(18) Wei, L.; Xing, P.; Su, R.; Shi, G.; Ma, Z. S.; Zou, Q., CPPred-RF: A Sequence-based Predictor for Identifying Cell-Penetrating Peptides and Their Uptake Efficiency. J Proteome Res 2017, 16, 2044-2053.


(19) Holton, T. A.; Pollastri, G.; Shields, D. C.; Mooney, C., CPPpred: prediction of cell penetrating peptides. Bioinformatics 2013, 29, 3094-6.
(20) Li, W.; Godzik, A., Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658-9.
(21) Ding, H.; Feng, P. M.; Chen, W.; Lin, H., Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis. Mol Biosyst 2014, 10, 2229-35.
(22) Manavalan, B.; Shin, T. H.; Lee, G., PVP-SVM: Sequence-Based Prediction of Phage Virion Proteins Using a Support Vector Machine. Front Microbiol 2018, 9, 476.
(23) Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M., AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 2008, 36, D202-5.
(24) Saha, I.; Maulik, U.; Bandyopadhyay, S.; Plewczynski, D., Fuzzy clustering of physicochemical and biochemical properties of amino acids. Amino Acids 2012, 43, 583-94.
(25) Ding, Y.; Cai, Y.; Zhang, G.; Xu, W., The influence of dipeptide composition on protein thermostability. FEBS Lett 2004, 569, 284-8.
(26) Dubchak, I.; Muchnik, I.; Holbrook, S. R.; Kim, S. H., Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A 1995, 92, 8700-4.
(27) Cai, C. Z.; Han, L. Y.; Ji, Z. L.; Chen, X.; Chen, Y. Z., SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 2003, 31, 3692-7.
(28) Manavalan, B.; Shin, T. H.; Kim, M. O.; Lee, G., AIPpred: Sequence-Based Prediction of Anti-inflammatory Peptides Using Random Forest. Front Pharmacol 2018, 9, 276.
(29) Manavalan, B.; Basith, S.; Shin, T. H.; Choi, S.; Kim, M. O.; Lee, G., MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget 2017, 8, 77121-77136.
(30) Manavalan, B.; Lee, J.; Lee, J., Random forest-based protein model quality assessment (RFMQA) using structural features and potential energy terms. PLoS One 2014, 9, e106542.
(31) Manavalan, B.; Shin, T. H.; Lee, G., DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest. Oncotarget 2018, 9, 1944-1956.
(32) Yao, B.; Zhang, L.; Liang, S.; Zhang, C., SVMTriP: a method to predict antigenic epitopes using support vector machine to integrate tri-peptide similarity and propensity. PLoS One 2012, 7, e45152.
(33) Maier, O.; Schroder, C.; Forkert, N. D.; Martinetz, T.; Handels, H., Classifiers for Ischemic Stroke Lesion Segmentation: A Comparison Study. PLoS One 2015, 10, e0145118.
(34) Cao, R.; Wang, Z.; Wang, Y.; Cheng, J., SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines. BMC Bioinformatics 2014, 15, 120.
(35) Feng, P.; Yang, H.; Ding, H.; Lin, H.; Chen, W.; Chou, K. C., iDNA6mA-PseKNC: Identifying DNA N(6)-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics 2018.
(36) Lin, H.; Liang, Z. Y.; Tang, H.; Chen, W., Identifying sigma70 promoters with novel pseudo nucleotide composition. IEEE/ACM Trans Comput Biol Bioinform 2017.
(37) Chen, W.; Yang, H.; Feng, P.; Ding, H.; Lin, H., iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 2017, 33, 3518-3523.
(38) Lin, H.; Deng, E. Z.; Ding, H.; Chen, W.; Chou, K. C., iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res 2014, 42, 12961-72.


(39) Zhao, Y. W.; Su, Z. D.; Yang, W.; Lin, H.; Chen, W.; Tang, H., IonchanPred 2.0: A Tool to Predict Ion Channels and Their Types. Int J Mol Sci 2017, 18.
(40) Lai, H. Y.; Chen, X. X.; Chen, W.; Tang, H.; Lin, H., Sequence-based predictive modeling to identify cancerlectins. Oncotarget 2017, 8, 28169-28175.
(41) Yang, H.; Tang, H.; Chen, X. X.; Zhang, C. J.; Zhu, P. P.; Ding, H.; Chen, W.; Lin, H., Identification of Secretory Proteins in Mycobacterium tuberculosis Using Pseudo Amino Acid Composition. Biomed Res Int 2016, 2016, 5413903.
(42) Chen, X. X.; Tang, H.; Li, W. C.; Wu, H.; Chen, W.; Ding, H.; Lin, H., Identification of Bacterial Cell Wall Lyases via Pseudo Amino Acid Composition. Biomed Res Int 2016, 2016, 1654623.
(43) Chen, W.; Feng, P.; Yang, H.; Ding, H.; Lin, H.; Chou, K.-C., iRNA-3typeA: Identifying Three Types of Modification at RNA’s Adenosine Sites. Molecular Therapy-Nucleic Acids 2018, 11, 468-474.
(44) Breiman, L., Random forests. Machine learning 2001, 45, 5-32.
(45) Geurts, P.; Ernst, D.; Wehenkel, L., Extremely randomized trees. Machine learning 2006, 63, 3-42.
(46) Vapnik, V. N., An overview of statistical learning theory. IEEE Trans Neural Netw 1999, 10, 988-99.
(47) Manavalan, B.; Lee, J., SVMQA: Support-vector-machine-based protein single-model quality assessment. Bioinformatics 2017, 16, 2496-2503.
(48) Elofsson, A.; Joo, K.; Keasar, C.; Lee, J.; Maghrabi, A. H. A.; Manavalan, B.; McGuffin, L. J.; Menendez Hurtado, D.; Mirabello, C.; Pilstal, R.; Sidi, T.; Uziela, K.; Wallner, B., Methods for estimation of model accuracy in CASP12. Proteins 2018, 86 Suppl 1, 361-373.
(49) Kryshtafovych, A.; Monastyrskyy, B.; Fidelis, K.; Schwede, T.; Tramontano, A., Assessment of model accuracy estimations in CASP12. Proteins 2018, 86 Suppl 1, 345-360.
(50) Hanley, J. A.; McNeil, B. J., The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29-36.
(51) Rothbard, J. B.; Jessop, T. C.; Lewis, R. S.; Murray, B. A.; Wender, P. A., Role of membrane potential and hydrogen bonding in the mechanism of translocation of guanidinium-rich peptides into cells. J Am Chem Soc 2004, 126, 9506-7.
(52) Cheng, F.; Li, W.; Liu, G.; Tang, Y., In silico ADMET prediction: recent advances, current challenges and future trends. Curr Top Med Chem 2013, 13, 1273-89.
(53) Chen, W.; Ding, H.; Feng, P.; Lin, H.; Chou, K. C., iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2016, 7, 16895-16909.
(54) Manavalan, B.; Basith, S.; Shin, T. H.; Choi, S.; Kim, M. O.; Lee, G., MLACP: Machine-learning-based Prediction of Anticancer Peptides. Oncotarget 2017, 8, 16.
(55) Basith, S.; Manavalan, B.; Gosu, V.; Choi, S., Evolutionary, structural and functional interplay of the IkappaB family members. PLoS One 2013, 8, e54178.
(56) Basith, S.; Manavalan, B.; Govindaraj, R. G.; Choi, S., In silico approach to inhibition of signaling pathways of Toll-like receptors 2 and 4 by ST2L. PLoS One 2011, 6, e23989.
(57) Govindaraj, R. G.; Manavalan, B.; Lee, G.; Choi, S., Molecular modeling-based evaluation of hTLR10 and identification of potential ligands in Toll-like receptor signaling. PLoS One 2010, 5, e12713.
(58) Futaki, S.; Suzuki, T.; Ohashi, W.; Yagami, T.; Tanaka, S.; Ueda, K.; Sugiura, Y., Arginine-rich peptides. An abundant source of membrane-permeable peptides having potential as carriers for intracellular protein delivery. J Biol Chem 2001, 276, 5836-40.
(59) Liu, B.; Fang, L.; Long, R.; Lan, X.; Chou, K. C., iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics 2016, 32, 362-9.


(60) Liu, B.; Yang, F.; Chou, K. C., 2L-piRNA: A Two-Layer Ensemble Classifier for Identifying Piwi-Interacting RNAs and Their Function. Mol Ther Nucleic Acids 2017, 7, 267-277.
(61) Cao, R.; Cheng, J., Protein single-model quality assessment by feature-based probability density functions. Sci Rep 2016, 6, 23990.
(62) Cao, R.; Freitas, C.; Chan, L.; Sun, M.; Jiang, H.; Chen, Z., ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules 2017, 22.
(63) Szeto, H. H.; Schiller, P. W.; Zhao, K.; Luo, G., Fluorescent dyes alter intracellular targeting and function of cell-penetrating tetrapeptides. FASEB J 2005, 19, 118-20.
(64) Wang, F.; Wang, Y.; Zhang, X.; Zhang, W.; Guo, S.; Jin, F., Recent progress of cell-penetrating peptides as new carriers for intracellular cargo delivery. J Control Release 2014, 174, 126-36.
(65) Gabere, M. N.; Noble, W. S., Empirical comparison of web-based antimicrobial peptide prediction tools. Bioinformatics 2017, 13, 1921-1929.


Table 1. Performance of first-layer prediction models on the benchmarking dataset

Method   MCC     Accuracy   Sensitivity   Specificity   AUC     P-value (AUC)
RF       0.769   0.884      0.905         0.864         0.934   —
ERT      0.768   0.883      0.919         0.845         0.938   0.746
SVM      0.724   0.861      0.895         0.828         0.920   0.289
k-NN     0.658   0.816      0.956         0.675         0.918   0.228

The first column lists the methods developed in this study. The second, third, fourth and fifth columns give the MCC, accuracy, sensitivity and specificity, respectively. The sixth and seventh columns give the AUC and the P-value from the pairwise comparison of the areas under the ROC curves (AUCs) between RF and each of the other methods using a two-tailed t-test. RF: random forest; ERT: extra tree classifier; SVM: support vector machine; k-NN: k-nearest neighbour.
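As a reading aid for the tabulated metrics, the sketch below computes sensitivity, specificity, accuracy and MCC from confusion-matrix counts using their standard definitions. The counts in the example are hypothetical: they assume a balanced set of 1000 positives and 1000 negatives and are chosen only to reproduce the RF sensitivity and specificity listed in Table 1.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Hypothetical balanced counts matching sensitivity 0.905 and specificity 0.864.
rf = classification_metrics(tp=905, tn=864, fp=136, fn=95)
```

Under this balanced-class assumption the sketch gives an accuracy of 0.8845 and an MCC of about 0.770, close to the 0.884 and 0.769 reported for RF.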


Table 2. Performance of various methods on the independent dataset

Method      MCC     Accuracy   Sensitivity   Specificity   AUC     P-value (AUC)
ERT         0.793   0.896      0.933         0.858         0.959   —
RF          0.758   0.878      0.916         0.839         0.950   0.461
SVM         0.715   0.857      0.884         0.830         0.927   0.020

CPPred-RF

0.672

0.825

0.955

0.695

0.822