Computing Prediction and Functional Analysis of ... - ACS Publications

Subscriber access provided by QUEENS UNIV BELFAST

Article

Computing Prediction and Functional Analysis of Prokaryotic Propionylation Lina Wang, Shao-Ping Shi, Ping-Ping Wen, Zhi-You Zhou, and Jian-Ding Qiu J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00482 • Publication Date (Web): 23 Oct 2017 Downloaded from http://pubs.acs.org on October 24, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Computing

Prediction

and

Functional

Analysis

of

Prokaryotic Propionylation Li-Na Wang,†,§ Shao-Ping Shi,† Ping-Ping Wen,† Zhi-You Zhou,† Jian-Ding Qiu†,‡,* †

College of Chemistry and Institute for Advanced Study, Nanchang University, Nanchang

330031, China. ‡

Department of Materials and Chemical Engineering, Pingxiang University, Pingxiang 337055,

China §

Department of Sciences, Nanchang Institute of Technology, Nanchang 330099, China

ABSTRACT: Identification and systematic analysis of candidates for protein propionylation are crucial steps for understanding its molecular mechanisms and biological functions. Although several proteome-scale methods have been performed to delineate potential propionylated proteins, the majority of lysine propionylated substrates and its role in pathological physiology still remain largely unknown. By gathering various databases and literatures, an experimental prokaryotic propionylation data was collated to be trained in support vector machine with various features via three steps feature selection method. A novel online tool for seeking potential lysine propionylated sites (PropSeek) (http://bioinfo.ncu.edu.cn/PropSeek.aspx) was build. Independent test results of leave-one-out and n-fold cross-validation were similar to each other showed that PropSeek was a stable and robust predictor with a satisfying performance. Meanwhile, analysis of Gene Ontology, Kyoto Encyclopedia of Genes and Genomes analyses together with protein and protein interaction implied a potential role of prokaryotic propionylation in protein synthesis and metabolism.

ACS Paragon Plus Environment

1


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 29

1. INTRODUCTION Propionylation at protein lysine residue is a novel type of acylation modification, which could be found from prokaryotic to eukaryotic cell in histone as well as non-histone proteins. Following the first identification in histones in 2007,1 lots of modification sites have been identified in human and bacteria proteins and pointed to distinct functional consequence of propionylation proteins. Due to the similarity to acetyl group, propionyl group could plausibly be conjugated by acetyltransferases that can accommodate larger acyl-CoA molecules.2 Indeed, in vitro assays have demonstrated that some acetyltransferases such as PCAF, p300 and CBP could catalyze propionylation, while Sirt1 and Sirt2 could remove it.1,3-5 Particularly, Sun et al. showed that bacterial lysine deacetylase CobB and acetyltransferase PatZ could also have regulatory activities for lysine propionylation in E. coli.6 Bioinformatics analyses demonstrated that lysine propionylation played an important role in metabolic pathways.7,8 Systematic analysis and identification of novel candidates for modification are the most reliable manners to study post-translational modification (PTM) of proteins.9 Although lots of propionylation sites have been identified in the past ten years, the majority of lysine propionylation substrates and its role in pathological physiology still remain largely unknown. There are lots of traditional experimental approaches for large-scale identification of propionylated proteins including high-throughput mass spectrometry (MS),1,3,6-10 MS in combination with protein sequence alignment using PTMap and hybrid quadrupole TOF-MS combined with nano-scale liquid chromatography.5,7 Computational identification of PTMs has emerged to be an alternative approach in contrast to experimental approach with its convenience. Lots of computational approaches have been established to predict potential modification sites in query protein based on different machine learning algorithms. For example, Blom et al.


2

Page 3 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


incorporated sequence and structure information to predict potential phosphorylation sites based on artificial neural networks.11 Radivojac et al. designed a random forest-based predictor of ubiquitination sites with various physicochemical prosperities, intrinsic disorder and evolutionary information.12 The group-based phosphorylation scoring (GPS) algorithm which was first designed for the prediction of phosphorylation has been used for various types of PTMs.13-17 Meanwhile, Nanni et al. proposed an ensemble of descriptors/classifiers for sequencebased protein classification which could be considered a baseline system for a given problem.18 There were also several tools which were carried out based on support vector machine (SVM) classifiers method for virulent protein 19 and putative sites of phosphorylation,20 methylation,21,22 acetylation,23-25 ubiquitination,26 succinylation and malonylation.27,28 Although the availability of various PTMs’ site prediction tools, to the best of our knowledge, there is still no online tool to predict propionylation sites. In this work, we first proposed the prediction of prokaryotic propionylation site by integrating 1953 propionylation sites in 856 proteins by combining evolutionary information, sequencedderived information, and predicted structural information based on SVM method. And then, functional analyses such as Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, together with protein and protein interaction (PPI) were investigated (Figure 1). Besides, the crosstalk between propionylation, malonylation and acetylation were analyzed, and the prediction results for five types of prokaryotic proteins in Uniprot database were supplied for further experimental validation and investigation.29

2. METHOD 2.1 Data Collection and Preprocessing


3


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 29

The experimentally determined propionylation sites were gathered by searching ‘propionylation’, ‘lysine propionylation’ and ‘propiony’ in the PubMed and integrated with publically available datasets.1,3,6-8 1997 propionylation sites, distributed across 860 proteins (Table S1) were collated and disposed to obtain a non-redundant dataset by CD-Hit with a threshold of 30%.30 1782 positive and 13634 negative sequence in 820 prokaryotic proteins with the length of 21 (-10~10, with the central of annotated propionylation lysine or non-propionylation lysine) were obtained to be training by eliminating 4 eukaryotic proteins with 44 propionylation sites (Supporting Information and Figure S1), while 32 proteins with 97 propionylation sites were chosen to be independent test (Table S1).

2.2 Computing Prediction of Propionylated Sites via Three Steps Feature Selection Method 2.2.1 Model Training Based on 10-fold cross-validation, three steps method including feature extraction, optimization and combination was used for selecting the optimal features to build an efficient and fast online tool for predicting prokaryotic propionylation (Figure 1). 2.2.1.1 Feature Extraction Four types of original features including evolutionary information, sequenced-derived information, predicted structural information and feature annotations were investigated in the first step for the construction of the predicting model. •

Evolutionary Information

(i) K nearest neighbor (KNN) rule, which could catch the local sequence similarity, had been successfully used in various classification systems;20,31 (ii) Position-specific scoring matrices


4

Page 5 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


(PSSM), which is generated by performing PSI-BLAST searches against the NCBI noredundant, could evaluate the residues conservation.32 •

Sequence-derived Information

(i) Amino acid compositions (AAC) and binary (Bina) represent the basic information of type and position of the amino acid residues surrounding the modification sites; (ii) Composition of K-spaced amino acid pairs (CKAAP), reflects the composition of k-spaced amino acid pairs with this sequence fragment; (iii) Z-scale encodes each amino acid in the sequence by five physicochemical descriptor variables;33 (iv) AAindex1, one part of the database AAindex,34 contains 566 (release 9.1) amino acid properties collected from the literature; (v) Encoding method based on grouped weight (EBGW), is an encoding scheme of the amino acid sequence considering the hydrophobicity and charged character of amino acid residues.35 •

Predicted Disorder Information

The disorder values for surrounding residues of each possible propionylation site in the query protein are extracted and integrated by means of a widely used disorder prediction tool, VSL2B.36 •

Feature Annotation

Five types of feature annotations for propionylation proteins are derived from UniProt (http://www.uniprot.org/) including binding, metal, region, transmem and turn. More detailed descriptions for all the extracted features are afforded in Supporting Information and Table 1. 2.2.1.2 Feature Optimization In this step, the important and valid columns are selected for every original feature based on information gain (IG) or F-score to reduce possible bias of using a single method.24,37 Firstly, all


5


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 29

the column of the feature are descended based on IG or F-score, separately; Secondly, the first column (which has the highest value of IG/ F-score) is used as model training, then the following columns will be kept as feature if the average accuracy value has been improved by its’ addition as feature, the SVM will be executed n times to decide the optimizing result if the dimension of the feature is n. In particularly, if the sorting order for IG and F-score are same, this procedure will be carried out one time, otherwise, the best consequence of the selection method based on IG and F-score will be chosen as the final result. 2.2.1.3 Feature Combination In the last step, the optimal single features are ranked according to their’ accuracies, if the addition of the feature can improve the accuracy of the model, the feature will be added to be the final feature, otherwise, it will be wiped out (Supporting Information). 2.2.2 Model Evaluation 10-fold cross-validation was executed ten times for optimizing the parameters in the model training, the leave-one-out (LOO) validation and n-fold cross-validation (n=2, 4, 6, 8) were performed with the optimal features. Five parameters such as accuracy (Acc), sensitivity (Sn), specificity (Sp), Mathews correlation coefficient (MCC) and area under the receiver operating characteristic curves (AUC) were chosen to evaluate the prediction performance (Supporting Information).

2.3 Functional Analysis of Propionylated Protein 2.3.1 Sequence Property Two sample logo and Motif-X were chosen to analysis the sequence preference of positive and negative samples.38,39 The Motif X algorithm was applied to 1843 propionylated peptides (coming from the 820 training propionylated protein by eliminating the same sequence), where


6

Page 7 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


the parameters were for the most part default values, with the sequence width of 21 characters, occurrences of at least 20, and significance value of 0.00001. The 14329 non-propionylated peptides were used as the background. 2.3.2 GO and KEGG Analysis Functional annotation of biological significance was performed with the Database for Annotation, Visualization and Integrated Discovery (DAVID) resource v6.8.40 Most of the propionylated proteins in this study were Escherichia coli (E.coli) (70%, 605 of 860), so the 605 propionylated proteins were further analyzed for GO enrichment and KEGG using DAVID as foreground data set, while Escherichia coli K12 MG1655 was used as background data set. Three functional annotation term categories of biological process (BP), molecular function (MF) and cellular component (CC) were evaluated with the two-tailed Fisher exact test. The protein counts for each term were sorted and top terms were taken with statistical significance was assessed at p < 0.01. 2.3.3 PPI Network Analysis by Cytoscape Propionylation proteins in E.coli from a very complex and highly organized protein-protein interaction network were visualized using a search tool for the retrieval of interacting genes/proteins (STRING v10) via the Cytoscape platform (v3.4.0).41,42 The data was searched against the Escherichia coli K12 MG1655 network data with the high confidence score of 0.7. Molecular complex detection (MCODE), a plugin toolkit of Cytoscape, was further utilized to analyze highly enriched clusters.43

3. RESULT 3.1 Development of PropSeek for Predicting Prokaryotic Propionylation


7


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 29

Comparing with the time-consuming and labor-intensive experimental identification of propionylation substrate, the prediction method of PropSeek from protein primary sequences can offer an alternative solution. Based on the prediction result of 10-cross validation, the fragment length was chosen to be 21(Table S2), and the non-redundant benchmark data was established with 1782 positive sites and 13634 negative ones (Table S1). The under-sampling method was chosen to build the balance dataset. 44 Four types of information were investigated and three steps feature optimizing methodology was selected to conform the optimal model with the dimension decreasing from 1781 to 187 while the accuracy increasing from 75.06 to 79.92 (Table S3). Specifically, based on 10-fold cross-validation, 1781 dimension of 10 types of features were selected by first step selection. And then, every feature of ten types was optimized by IG or F-score according to the accuracy and the dimension was decreased to 386. In the last step, the final model was decided by combining the ten types features one by one based on the accuracy of the prediction model. Among the final 187 dimension features, KNN has the best feature performance followed by isoelectric point. The 10-fold cross-validation was done for ten times with the 187 dimension features and the receiver operating characteristic (ROC) curves were shown in Figure 2a and the average AUC score was 0.8148. The similarity of independent test results for LOO and n-fold cross-validation showed that PropSeek was a stable and robust predictor with a satisfying performance (Figure 2b). Meanwhile, the prediction results for five types of prokaryotic proteins in Uniprot database (Only considering the reviewed proteins) were supplied for further experimental validation and investigation in Table S4.

3.1.1 Evolution Information KNN scores were extracted from its similar sequences in both positive and negative datasets as feature according to the theory that substrate sites of same enzyme usually share similar patterns.


8

Page 9 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


The ratios of positive in seven types of nearest neighbors (2, 4, 8, 16, 32, 64 and 128) were extracted to analyze the difference between positive and negative. Figure 3a shows that the average positive KNN scores are higher than that of negative while the difference is decreasing with the increasing of the number of nearest neighbors (Figure S2). With the biggest difference between positive and negative, KNN scores with 3 dimensions got the best performance by using of IG method (Table 1 and Table S3). PSSM which is constructed by performing PSI-BLAST search against the NCBI nonredundant database can evaluate the residues conservation. The accuracy increases from 59.06% to 61.16% while the dimension decreases from 400 to 17. This PSSM feature is not included in the final model because the addition has no improvement for the accuracy of the model (Table 1 and Table S3).

3.1.2 Sequence-derived Information AAC and CKAAP represent the information of type and position of the amino acid residues between the surrounding of propionylation sites and non-propionylation ones. Figure 3b shows that lysine, arginine and histidine which have positive charge are enriched, while aliphatic amino acid such as methionine, proline, leucine, valine and polar amino acid including glutamine, serine and asparagine are depleted in the positive fragment (Figure S3). This result is certified with the results of two-sample logo (Figure S4). With the feature selection method based on Fscore, the dimension of CKAAP (K=0) is decreased from 441 to 40 with the accuracy increasing from 58.58 to 62.43 (Table 1 and Table S3). Figure 3c demonstrates the most important 40 type’s amino acids pairs which have the higher values and the best accuracy when k equals to 0, the higher frequency of KE and KK is in accordance with the motif results (Table S5, Figure 3d and S5).


9


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 29

Isoelectric point has the best performance among the top 15 optimal physical chemistry properties which are chosen to be features based on the accuracies of 551 types in AAindex1 (Figure 4a, 4b, Table S6). It can be seen that the average isoelectric point scores of propionylation are higher than that of non-propionylation especially at the position of 1~10 and 8~-5 with the exception of -1 and -2. Among the most important five positions, 1, 3, 5, 7, 9 have the higher proportion in 15 physical chemistry properties (Table S3), which is identical with the results of two-sample logo and motif-x ( Figure 3d and S4). Based on the hydrophobicity and charge character of amino acid residues, EBGW feature can display the different surrounding between propionylation and non-propionylation (Figure 4c and S6). The average ratio of EBGW scores between positive and negative is below zero with the exception of H2 which means that there are more positively charged surrounding propionylation sites (Figure 4c).

3.1.3 Predicted Disorder Information Phosphorylation sites have been observed to have a strong tendency to be located in disordered region and disordered scores have been chosen as features in the prediction of PTMs’ sites. 20,45,46

The disorder scores of propionylation protein were investigated by VSL2B and the

fragment with length of 21 were extracted for positive and negative dataset in Figure 4d. In all, the average disorder scores for positive fragment are higher than that of negative, but all the average scores for both propinylation and non-propinylation are below 0.5 which means that the fragment centered by lysine in these propionylation prokaryotic proteins most belongs to order region (Figure S7).

3.2 Landscape of Prokaryotic Propionylation


10

Page 11 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


The distribution of lysine prokaryotic propionylated sites per substrate was investigated in Figure S8. Among all 856 propionylated proteins, 458 proteins (53.5%) were propionylated at just one site while 174 (20.3%), 78 (9.1%) and 48 (5.6%) proteins were propionylated at two, there and four sites, respectively. It could be seen that 20 proteins (2.3%) were propionylated at more than ten sites (Table S7), the most heavily propinoylated proteins were 60 kDa chaperonin in Thermus thermophiles (20 sites) and E.coli (16 sites).The most interesting thing is that of all 17 proteins which were propionylated at more than ten sites in E.coli, 10 proteins are included in the most enriched clusters (Table S7).

3.2.1 Crosstalk with Malonylation and Acetylation Various PTMs fine-tune the functions of almost all proteins, and co-regulation of different types of PTMs has been shown within and between a numbers of proteins.

47,48

Because of its’

electron-rich and nucleophilic properties, lysine has been identified to suffer multiple covalent PTMs, including methylation and various acylation.

49,50

The in-situ crosstalk was analyzed

between propionylation, malonylation and acetylation in E.coli. We compared the 1471 propionylation sites in 605 proteins with the dataset of 1745 malonylation sites in 595 proteins from the study of malonylation and 9188 acetylated sites in 1860 proteins from the database of PLMD. 51,52 In total, 600 (40.8%) and 1265 (86.0%) of 1471 propionylation sites are overlapped with malonylation and acetylation sites, respectively, while only 179 (12.2%) are single propionylated sites (Figure 5a). At the protein level, 278 (46%) and 522 (86.3%) of 605 propionylated proteins also overlap with malonylation and acetylation, separately, while 68 (11.2%) proteins are not known for malonylation or acetylation (Figure 5b). Moreover, 573 (39.0%) sites in 263 (43.5%) propionylation proteins were also both acetylated and malonylated,


11


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 29

which suggested that these three acylation reactions could be strongly in-situ crosstalk with each other and regulated protein function all together.

3.2.2 GO and KEGG Analysis To further investigate putative function of the propionylated protein, DAVID online bioinformatics tools were performed against GO and KEGG database for 605 propionylated proteins in E.coli. The result of GO enrichment analysis revealed that translation (p=1.59×10-25), glycolytic process (p=1.58×10-13) and tricarboxylic acid cycle (p=4.69×10-11) were among the most enriched biological processes (Figure 5c, Table S8). Molecular function annotation of GO database showed that propionylated protein were significantly enriched in structural constituent of ribosome (p=4.54×10-27) and multiple bindings, such as identical protein binding, rRNA binding, protein binding, RNA binding and so on. These top-ranked cellular processes were mostly cytoplasm events, especially in cytosol (p=2.15×10-93) and cytosolic large and small ribosomal subunit (p=6.55×10-16), among which there were large numbers of overlaps with a subset of the malonylated proteome that was also cytoplasm proteins.28 The result of KEGG pathway analysis showed that propionylated substrates were highly enriched in multiple metabolic pathways including carbon metabolism, ribosome, glycolysis, microbial metabolism as well as TCA cycle (Figure S9, Table S9). Carbon metabolism is the most significantly enriched pathway, followed by Biosynthesis of antibiotics and Ribosome which is in concordance with the results of GO and PPI (Figure 5c and 6). Moreover, 20 propionylated substrates, including isocitrate dehydrogenase (icd) and citrate synthase (gltA), were associated with tricarboxylic acid (TCA) cycle which is the key metabolic pathway for carbohydrate, fat and protein metabolism. Further investigation showed that highly propionylated enzymes such as icd (12 sites), gltA (7 sites) and Succinate dehydrogenase flavoprotein subunit


12

Page 13 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


(sdhA) (9 sites) were belongs to the fourth highly enriched cluster (Table S10). In addition, 8 propionylated sites of icd (K12, K55, K142, K174, K186, K235, K242 and K265) and 5 propionylated sites of gltA (K283, K295, K310, K328 and K422) were also reported to be acetylation and malonylation. 51,52 Together, the results of GO and KEGG imply a potential role of propionyalted protein in protein synthesis and energy metabolism.

3.2.3 PPI Analysis Protein-protein interaction for 605 and 190 propionylated proteins in E.coli and Thermus thermophilus were analyzed by Cytoscape software and showed by MCODE for highly enriched clusters based on the STRING database, separately (Figure 6 and S10). Twenty one PPI clusters were formed in E.coli with 465 proteins as nodes connected by 4897 identified direct physical interactions with the confidence score of more than 0.7 (Table S10). Figure 6 uncovers the complete network and the top four interaction clusters of propionylation protein in E.coli. Unsurprisingly, the most highly enriched cluster consisted of 58 ribosome-associated proteins with 1565 edges (Figure 6b), which were mainly 50S (29 of 58, 50%) and 30S ribosome protein subunit components, RNA polymerases, elongation factor proteins (Table S10). Twelve propionylated sites were identified in elongation factor G (fusA) and RNA polymerases (rpoB), which represented the two most highly propionylated proteins in this top cluster and suggested the potential role of propionylation in protein biosynthesis. Transfer RNA ligase proteins constituted the second highest enriched cluster (Figure 6c). Fourteen and 12 propionylated sites which were found in the phosphate dehydrogenase (gapA) and alcohol dehydrogenase (adhE) demonstrated the potential role of propionylation in glycometabolic process. The next two highly interconnected clusters were involved in protein


13


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 29

binding and ATPase activity (Figure 6d and 6e), suggesting lysine propionylation could participate in the regulation of energy metabolism.

4. SUMMARY In summary, our study provides a critical framework to systematic analysis and prediction of prokaryotic propionylation. Based on the experimental dataset, an online tool called PropSeek for seeking lysine proionylation sites was developed (http://bioinfo.ncu.edu.cn/PropSeek.aspx) via three steps feature selection and optimization method, and the prediction results for five types of prokaryotic proteins in Uniprot database were supplied for further experimental validation and investigation. Meanwhile, sequence analysis result suggested that most of propionylated sites followed five canonical consensus motifs, which was identical with the sequence analysis of amino acid composition and K-space amino acid pairs. High in-situ crosstalk with malonylation and acetylation suggested that propionylation could regulate protein function together with other acylations. Systematic studies of GO, KEGG and PPI showed that propionylated proteins were involved in protein synthesis and metabolism process. In all, our online predicting tool and systematic analysis results can afford meaningful information for lysine propionylation in prokaryotic organisms.

ASSOCIATED CONTENT AUTHOR INFORMATION Corresponding Author * To whom Correspondence should be addressed: [email protected]

Author Contributions The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.


14

Page 15 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Notes The authors declare no competing financial interest.

ACKNOWLEDGEMENTS This work was supported by the National Natural Science Foundation of China (21175064 and 21665016), and the graduate student innovation special funds of Nanchang university (cx2016011). Supporting Information Supplemental tables, figures, and methods are available free of charge via the Internet at http://pubs.acs.org. Table S1A-E, the propionylation datasets; Table S2, selection of fragment length; Table S3, feature optimization designed by IG/F-score; Table S4A-F, predicted results for five prokaryotic organisms; Table S5, motif information; Table S6, top fifteen optimal physicochemical properties; Table S7, detailed information for proteins which have more than 10 propionylated sites; Table S8, GO-based enrichment analysis; Table S9, KEGG-based enrichment analysis; Table S10A-B, PPI clusters information (.xlsx). Figure S1, distribution for propionylated proteins and sites in different organisms; Figure S2, KNN scores between propionylation sites and non-propionylation; Figure S3, average AAC value for propinylation and non-propionylation fragment; Figure S4, Two-sample Logos; Figure S5, the Motif logo; Figure S6, comparison of different sizes of group for EBGW; Figure S7, the percentage of disorder and order; Figure S8, distribution of propionylated sites per protein; Figure S9, enrichment analysis of KEGG pathways; Figure S10, PPI networks in thermus thermophiles (PDF).

ABBREVIATIONS


15


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 29

AAC, amino acid compositions; Acc, accuracy; adhE, alcohol dehydrogenase; AUC, area under the receiver operating characteristic curves; Bina, binary; BP, biological process; CC, cellular component; CKAAP, composition of K-spaced amino acid pairs; DAVID, the database for annotation, visualization and integrated discovery; EBGW, encoding method based on grouped weight; E.coli, Escherichia coli; fusA, elongation factor G; gapA, phosphate dehydrogenase; GO, gene ontology; GPS, group-based phosphorylation scoring; IG, information gain; KEGG, kyoto encyclopedia of genes and genomes; KNN, K nearest neighbor; LOO, leave-one-out; MCC, Mathews correlation coefficient; MCODE, molecular complex detection; MF, molecular function; MS, mass spectrometry; PPI, protein and protein interaction; PropSeek, seeking potential lysine propionylated sites; PSSM, position-specific scoring matrices; PTM, posttranslational modification; ROC, receive operating characteristic; rpoB, RNA polymerases; Sn, sensitivity; Sp, specificity; STRING, search tool for the retrieval of interacting genes/proteins; SVM, support vector machine

REFERENCES (1)

Chen, Y.; Sprung, R.; Tang, Y.; Ball, H.; Sangras, B.; Kim, S. C.; Falck, J. R.; Peng, J.;

Gu, W.; Zhao, Y. Lysine propionylation and butyrylation are novel post-translational modifications in histones. Mol. Cell. Proteomics 2007, 6, 812-819. (2)

Choudhary, C.; Weinert, B. T.; Nishida, Y.; Verdin, E.; Mann, M. The growing landscape

of lysine acetylation links metabolism and cell signalling. Nat. Rev. Mol. Cell Biol. 2014, 15, 536-550. (3)

Cheng, Z.; Tang, Y.; Chen, Y.; Kim, S.; Liu, H.; Li, S. S. C.; Gu, W.; Zhao, Y. Molecular

Characterization of Propionyllysines in Non-histone Proteins. Mol. Cell. Proteomics 2009, 8, 4552.


16

Page 17 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


(4)

Leemhuis, H.; Packman, L. C.; Nightingale, K. P.; Hollfelder, F. The human histone

acetyltransferase P/CAF is a promiscuous histone propionyltransferase. ChemBioChem 2008, 9, 499-503. (5)

Zhang, K.; Chen, Y.; Zhang, Z.; Zhao, Y. Identification and Verification of Lysine

Propionylation and Butyrylation in Yeast Core Histones Using PTMap Software. J. Proteome Res. 2009, 8, 900-906. (6)

Sun, M.; Xu, J.; Wu, Z.; Zhai, L.; Liu, C.; Cheng, Z.; Xu, G.; Tao, S.; Ye, B. C.; Zhao, Y.;

Tan, M. Characterization of Protein Lysine Propionylation in Escherichia coli: Global Profiling, Dynamic Change, and Enzymatic Regulation. J. Proteome Res. 2016, 15, 4696-4708. (7)

Okanishi, H.; Kim, K.; Masui, R.; Kuramitsu, S. Lysine Propionylation Is a Prevalent

Post-translational Modification in Thermus thermophilus. Mol. Cell. Proteomics 2014, 13, 23822398. (8)

Okanishi, H.; Kim, K.; Masui, R.; Kuramitsu, S. Proteome-wide identification of lysine

propionylation in thermophilic and mesophilic bacteria: Geobacillus kaustophilus, Thermus thermophilus, Escherichia coli, Bacillus subtilis , and Rhodothermus marinus. Extremophiles 2017, 21, 283-296. (9)

Vertegaal, A. C. Uncovering ubiquitin and ubiquitin-like signaling networks. Chem. Rev.

2011, 111, 7923-7940. (10) Liu, B.; Lin, Y.; Darwanto, A.; Song, X.; Xu, G.; Zhang, K. Identification and characterization of propionylation at histone H3 lysine 23 in mammalian cells. J. Biol. Chem. 2009, 284, 32288-32295. (11) Blom, N.; Gammeltoft, S., S Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol. 1999, 294, 1351-1362.


17


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 29

(12) Radivojac, P.; Vacic, V.; Haynes, C.; Cocklin, R. R.; Mohan, A.; Heyen, J. W.; Goebl, M. G.; Iakoucheva, L. M. Identification, analysis, and prediction of protein ubiquitination sites. Proteins: Struct., Funct., Bioinf. 2010, 78, 365-380. (13) Xue, Y.; Zhou, F.; Zhu, M.; Ahmed, K.; Chen, G.; Yao, X. GPS: a comprehensive www server for phosphorylation sites prediction. Nucleic Acids Res. 2005, 33, 184-187. (14) Deng, W.; Wang, C.; Ying, Z.; Yang, X.; Shuang, Z.; Liu, Z.; Yu, X. GPS-PAIL: prediction of lysine acetyltransferase-specific modification sites from protein sequences. Sci. Rep. 2016, 6, 39787. (15) Liu, Z.; Cao, J.; Ma, Q.; Gao, X.; Ren, J.; Xue, Y. GPS-YNO2: computational prediction of tyrosine nitration sites in proteins. Mol. BioSyst. 2011, 7, 1197-1204. (16) Pan, Z.; Liu, Z.; Cheng, H.; Wang, Y.; Gao, T.; Ullah, S.; Ren, J.; Xue, Y. Systematic analysis of the in-situ crosstalk of tyrosine modifications reveals no additional natural selection on multiply modified residues. Sci. Rep. 2013, 4, 7331. (17) Zhao, Q.; Xie, Y.; Zheng, Y.; Jiang, S.; Liu, W.; Mu, W.; Liu, Z.; Zhao, Y.; Xue, Y.; Ren, J. GPS-SUMO: a tool for the prediction of sumoylation sites and SUMO-interaction motifs. Nucleic Acids Res. 2014, 42, W325-W330. (18) Nanni, L.; Lumini, A.; Brahnam, S. An empirical study of different approaches for protein classification. The Scientific World Journal 2014, 62, 236717. (19) Nanni, L.; Lumini, A. An ensemble of support vector machines for predicting virulent proteins. Expert Syst. Appl. 2009, 36, 7458-7462. (20) Gao, J.; Thelen, J. J.; Dunker, A. K.; Xu, D. Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites. Mol. Cell. Proteomics 2010, 9, 2586-2600. (21) Shi, S. P.; Qiu, J. D.; Sun, X. Y.; Suo, S. B.; Huang, S. Y.; Liang, R. P. PMeS: Prediction of


18

Page 19 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Methylation Sites Based on Enhanced Feature Encoding Scheme. PloS one 2012, 7, e38772. (22) Wen, P. P.; Shi, S. P.; Xu, H. D.; Wang, L. N.; Qiu, J. D. Accurate in silico prediction of species-specific methylation sites based on information gain feature optimization. Bioinformatics 2016, 32, 3107-3155. (23) Gnad, F.; Ren, S.; Choudhary, C.; Cox, J.; rgen; Mann, M. Predicting post-translational lysine acetylation using support vector machines. Bioinformatics 2010, 26, 1666-1668. (24) Suo, S. B.; Qiu, J. D.; Shi, S. P.; Sun, X. Y.; Huang, S. Y.; Chen, X.; Liang, R. P. Positionspecific analysis and prediction for protein lysine acetylation based on multiple features. PloS one 2012, 7, e49108. (25) Wang, L.; Du, Y.; Lu, M.; Li, T. ASEB: a web server for KAT-specific acetylation site prediction. Nucleic Acids Res. 2012, 40, W376-W379. (26) Chen, X.; Qiu, J. D.; Shi, S. P.; Suo, S. B.; Huang, S. Y.; Liang, R. P. Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites. Bioinformatics 2013, 29, 1614-1622. (27) Xu, H. D.; Shi, S. P.; Wen, P. P.; Qiu, J. D. SuccFind: a novel succinylation sites online prediction tool via enhanced characteristic strategy. Bioinformatics 2015, 31, 3748-3750. (28) Wang, L. N.; Shi, S. P.; Xu, H. D.; Wen, P. P.; Qiu, J. D. Computational prediction of species-specific malonylation sites via enhanced characteristic strategy. Bioinformatics 2017, 33, 1457-1463. (29) Consortium, U. P. UniProt: a hub for protein information. Nucleic Acids Res. 2015, 43, D204-D212. (30) Huang, Y.; Niu, B.; Gao, Y.; Fu, L.; Li, W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26, 680-682.


19


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 29

(31) Qiu, J. D.; Suo, S. B.; Sun, X. Y.; Shi, S. P.; Liang, R. P. OligoPred: a web-server for predicting homo-oligomeric proteins by incorporating discrete wavelet transform into Chou's pseudo amino acid composition. J. Mol. Graphics Modell. 2011, 30, 129-134. (32) Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389-3402. (33) Sandberg, M.; Eriksson, L.; Jonsson, J.; Sjöström, M.; Wold, S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J. Med. Chem. 1998, 41, 2481-2491. (34) Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36, D202205. (35) Zhang, Z. H.; Wang, Z. H.; Zhang, Z. R.; Wang, Y. X. A novel method for apoptosis protein subcellular localization prediction combining encoding based on grouped weight and support vector machine. FEBS letters 2006, 580, 6169-6174. (36) Peng, K.; Radivojac, P.; Vucetic, S.; Dunker, A. K.; Obradovic, Z. Length-dependent prediction of protein intrinsic disorder. BMC bioinformatics 2006, 7, 1-17. (37) Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation. Springer Berlin Heidelberg 2006, 1015-1021. (38) Vacic, V.; Iakoucheva, L. M.; Radivojac, P. Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics 2006, 22, 1536-1537. (39) Chou, M. F.; Schwartz, D. Biological sequence motif discovery using motif-x. Curr. Protoc.


20

Page 21 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Bioinformatics 2013, Chapter 13, Unit 13.15-24. (40) Huang, D. W.; Sherman, B. T.; Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009, 4, 44-57. (41) Szklarczyk, D.; Franceschini, A.; Wyder, S.; Forslund, K.; Heller, D.; Huertacepas, J.; Simonovic, M.; Roth, A.; Santos, A.; Tsafou, K. P. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015, 43, 447-452. (42) Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N. S.; Wang, J. T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13, 2498-2504. (43) Bader, G. D.; Hogue, C. W. An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics 2003, 4, 2. (44) Nanni, L.; Fantozzi, C.; Lazzarini, N. Coupling different methods for overcoming the class imbalance problem. Neurocomputing 2015, 158, 48-61. (45) Landry, C. R.; Levy, E. D.; Michnick, S. W. Weak functional constraints on phosphoproteomes. Trends Genet. 2009, 25, 193-197. (46) Iakoucheva, L. M.; Radivojac, P.; Brown, C. J.; O'Connor, T. R.; Sikes, J. G.; Obradovic, Z.; Dunker, A. K. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 2004, 32, 1037-1049. (47) Huang, Y.; Xu, B.; Zhou, X.; Li, Y.; Lu, M.; Jiang, R.; Li, T. Systematic characterization and prediction of post-translational modification cross-talk. Mol. Cell. Proteomics 2015, 14, 761770. (48) Lamoliatte, F.; Mcmanus, F. P.; Maarifi, G.; Chelbialix, M. K.; Thibault, P. Uncovering the SUMOylation and ubiquitylation crosstalk in human cells using sequential peptide


21


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 29

immunopurification. Nat. Commun. 2017, 8, 14109. (49) Rea, S. Methylation of histone H3 lysine 9 creates a binding site for HP1 proteins. Nature 2001, 410, 116-120. (50) Yang, X.; Seto, E. Lysine Acetylation: Codified Crosstalk with Other Posttranslational Modifications. Mol. Cell 2008, 31, 449-461. (51) Qian, L.; Nie, L.; Chen, M.; Liu, P.; Zhu, J.; Zhai, L.; Tao, S.; Cheng, Z.; Zhao, Y.; Tan, M. Global Profiling of Protein Lysine Malonylation in Escherichia coli Reveals Its Role in Energy Metabolism. J. Proteome Res. 2016, 15, 2060-2071. (52) Xu, H.; Zhou, J.; Lin, S.; Deng, W.; Zhang, Y.; Xue, Y. PLMD: An updated data resource of protein lysine modifications. J. Genet. Genomics 2017, 44, 243-250.


22

Page 23 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Figures

Figure. 1 The overall framework of prediction and analyses for prokaryotic propionylation.

Figure 2. Performance evaluation of PropSeek. a) Ten different training sets were executed. b) The LOO and n-fold cross-validation were carried out for independent test.


23


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 29

Figure 3. Comparison between propionylation sites and non-propionylation sites on KNN scores, AAC and CKAAP. a) The horizontal axis represents numbers of nearest neighbors in all positive and the same number of negative sample. b) The vertical axis represents the log2 ratio of average of amino acid frequencies between propionylation sites and non-propionylation sites. c) Pie chart for the most important 40 pairs of CKAAP based on F-score, the primary axis refers to the number of occurrence in the positive and negative sample with the unit of 100. d) Motif analysis of propionylated lysine residues.


24

Page 25 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


Figure 4. a) The accuracies of 551 physicochemical properties based on 10-fold cross-validation. b) Difference for isoelectric point between propionylated and non-propionylated fragment together with the IG scores on different position. c) Comparison of EBGW between propionylation and non-propionylation. The vertical axis represents the log2 ration of average EBGW between propionylation and non-propionylation and the horizontal axis represents there binary sequences. d) Comparison of disorder score between propionylated and nonpropionylated fragment on different sites.


25


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 29

Figure 5. Crosstalk and function annotation of lysine propionylation in E.coli. a), b) In-situ crosstalk between propionylation, malonylation and acetylation on sites and proteins. c) Three classes of GO terms including biological processes, molecular functions and cellular components were adopted, while the statistical enrichment analysis of GO terms were performed with the binomial distribution with P

Computing Prediction and Functional Analysis of ... - ACS Publications

Recommend Documents