Prediction of Peptidase Category Based on Functional Domain Composition

XiaoChun Xu,‡,# Dong Yu,‡,# Wei Fang,‡,# Yushao Cheng,§ Ziliang Qian,| WenCong Lu,⊥ Yudong Cai,*,† and Kaiyan Feng∇

CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
Department of Life Science and Technology, HuaZhong University of Science and Technology, 1037 Luoyu Road, Wuhan 430074, P. R. China
Department of Mathematics, Shanghai University, No 99 Shang Da Road, Shanghai 200444, China
Graduate School of the Chinese Academy of Sciences, 19 Yuquan Road, Beijing 100039
Bioinformatics Center, Key Lab of Molecular Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
Laboratory of Chemical Data Mining, Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China
Division of Imaging Science & Biomedical Engineering, Room G424 Stopford Building, The University of Manchester, Manchester M13 9PT, United Kingdom

Received April 17, 2008

Peptidases play pivotal regulatory roles in conception, birth, digestion, growth, maturation, aging, and death of all organisms. These regulatory roles include activation, synthesis and turnover of proteins. In the proteomics era, computational methods that identify peptidases and catalog them into six major classes (aspartic peptidases, cysteine peptidases, glutamic peptidases, metallo peptidases, serine peptidases and threonine peptidases) can give an instant glance at the biological functions of a newly identified protein. In this contribution, by combining the nearest neighbor algorithm and the functional domain composition, we introduce both an automatic peptidase identifier and an automatic peptidase classifier. The successful identification and classification rates are 93.7% and 96.5% for our peptidase identifier and peptidase classifier, respectively. A free online peptidase identifier and peptidase classifier are provided on our Web page http://pcal.biosino.org/protease_classification.html.

Keywords: Peptidases • The Nearest Neighbor Algorithm • BLAST • Functional Domain Composition

Introduction

Peptidases are a group of enzymes whose catalytic function is to hydrolyze (break down) proteins. They can break proteins into short fragments or, by cleaving all the peptide bonds, completely down to their basic building blocks, the amino acids. Peptidases play pivotal regulatory roles in conception, birth, digestion, growth, maturation, aging, and death of all organisms. These regulatory roles include activation, synthesis and turnover of proteins. Some peptidases, such as the acid peptidases, help to break down protein molecules into amino acids, which can then be absorbed by the intestinal wall during food digestion. The peptidases in the blood assist in metabolism, in clotting and the lysing of clots, and in the breakdown of undigested protein, cellular debris, and toxins. Peptidases are also essential in viruses, bacteria and parasites for their replication and spread.

According to both their biological functions and evolutionary conservation, peptidases can be cataloged into six major classes: aspartic peptidases, cysteine peptidases, glutamic peptidases, metallo peptidases, serine peptidases and threonine peptidases (ref 1). Because of the rapid expansion of protein data, it is highly beneficial to use high-throughput computational approaches rather than costly biochemical experiments to identify and classify peptidases. As an alternative to a raw BLAST homology search (ref 2), a functional domain composition method (refs 3, 4) aided by the nearest neighbor algorithm (ref 5) was developed to identify peptidases and to classify them into the six major categories. In previous work, including enzyme classification (ref 6) and protein subcellular location prediction (ref 7), functional domain composition methods were reported to perform significantly better than BLAST homology search. Thus, it is anticipated that the prediction accuracy could be substantially improved by using functional domain information instead of the BLAST method. The jackknife test (refs 3, 4) is used to evaluate the prediction accuracy, and it demonstrates that the system achieves good prediction performance for both peptidase identification and classification. The online automatic peptidase identifier and classifier are freely available on our Web page http://pcal.biosino.org/protease_classification.html.

* To whom correspondence should be addressed. E-mail: [email protected].
‡ HuaZhong University of Science and Technology.
# These authors contributed equally to this research.
§ Department of Mathematics, Shanghai University.
| Graduate School of the Chinese Academy of Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences.
⊥ Department of Chemistry, College of Sciences, Shanghai University.
† CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences.
∇ The University of Manchester.

10.1021/pr800292w CCC: $40.75. © 2008 American Chemical Society. Journal of Proteome Research 2008, 7, 4521–4524. Published on Web 09/03/2008.

research articles

Xu et al.

Materials and Methods

Positive Data Set (Peptidases). The original peptidase data set came from the peptidase database release 7.80 in MEROPS on 23 Apr 2007 [http://merops.sanger.ac.uk/] (ref 1), which contains 64 463 peptidases belonging to the six major peptidase categories. We selected the 38 476 peptidases that also appear in the UniProtKB/Swiss-Prot release 54.0 on 24 Jul 2007 and the UniProtKB/TrEMBL release 37.0 on 24 Jul 2007 (ref 8).

Negative Data Set (Nonpeptidases). By excluding the peptidases, 228 958 nonpeptidases were collected from the UniProtKB/Swiss-Prot release 54.0 on 24 Jul 2007 [http://expasy.org/sprot/] (ref 8). We randomly selected 20 000 of them to build a negative (nonpeptidase) data set.

Filtering the Original Data Sets. The original data sets were preprocessed to construct a high-quality data set by the following procedures: (1) peptidases shorter than 50 amino acids or longer than 5000 amino acids were removed, since a very short sequence tends to be a protein fragment and a very long one tends to be a protein complex; (2) peptidases containing irregular characters such as B, J, O, U, X and Z were removed; (3) peptidases belonging to multiple categories were removed. To avoid prediction bias caused by sequence redundancy, homologous sequences were removed with two culling programs, CD-HIT (ref 9) and PISCES (ref 10). As a result, two peptidase data sets with different restrictions on sequence similarity were generated: peptidase data set A, in which any two peptidase sequences are no more than 25% identical, and peptidase data set B, in which any two are no more than 10% identical. Following the same procedure, we established the negative data sets containing nonpeptidases: 3800 and 1422 nonpeptidases were retained to construct nonpeptidase data sets A and B, respectively, with the same two restrictions on sequence similarity.

Finally, we combined peptidase data set A and nonpeptidase data set A to construct data set A, and peptidase data set B and nonpeptidase data set B to construct data set B. As shown in Table 1, data set A contains 5700 protein sequences, of which 1900 are peptidases and 3800 are nonpeptidases; data set B contains 2133 sequences, of which 711 are peptidases and 1422 are nonpeptidases. Please refer to Supporting Information A and B for the detailed sequence accession numbers for data sets A and B, respectively.

Table 1. The Data Set for Peptidase/Nonpeptidase Identification

| data set | subset | peptidase | nonpeptidase | overall |
|---|---|---|---|---|
| A (identity cutoff = 25%) | PFAM | 1677 | 3354 | 5031 |
| A (identity cutoff = 25%) | Non-PFAM | 223 | 446 | 669 |
| A (identity cutoff = 25%) | Total | 1900 | 3800 | 5700 |
| B (identity cutoff = 10%) | PFAM | 532 | 1064 | 1596 |
| B (identity cutoff = 10%) | Non-PFAM | 179 | 358 | 537 |
| B (identity cutoff = 10%) | Total | 711 | 1422 | 2133 |

Numeric Representation of Proteins Based on the Functional Domain Composition. To construct a feasible predictor, each protein sample is represented by a numeric vector. In this contribution, a functional domain composition method is used to encode the protein samples. The encoding procedure is as follows. First, the functional domains of the proteins in our data set (peptidases and nonpeptidases) are obtained by querying the Pfam database (ref 11). The whole data set covers 892 domains, of which 254 occur in peptidases. Thus, each protein can be represented as an 892-dimensional vector. If the protein has a hit to the ith Pfam domain, the ith element of its vector is set to 1; otherwise it is set to 0. This can be explicitly formulated as follows:

$$P = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_i \\ \vdots \\ p_{892} \end{pmatrix} \tag{1}$$

where

$$p_i = \begin{cases} 1, & \text{hit found} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
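As a minimal sketch of the 0/1 encoding of eqs 1 and 2 (not the authors' implementation), the following Python maps the Pfam hits of one protein to a binary vector. The function name `encode_domains`, the five-entry `DOMAIN_INDEX` and the accession strings in it are illustrative assumptions; the vocabulary used in the paper has 892 domains.

```python
from typing import Iterable, List, Sequence


def encode_domains(hits: Iterable[str], domain_index: Sequence[str]) -> List[int]:
    """Return the 0/1 functional domain composition vector (eqs 1 and 2).

    Element i is 1 if the protein has a hit to the i-th domain of the
    vocabulary and 0 otherwise.
    """
    hit_set = set(hits)
    return [1 if domain in hit_set else 0 for domain in domain_index]


# Hypothetical 5-domain vocabulary (placeholder accession strings);
# the paper's vocabulary covers 892 Pfam domains.
DOMAIN_INDEX = ["PF00026", "PF00112", "PF00089", "PF01433", "PF00227"]

# A protein with hits to the third and fifth domains of the vocabulary.
vector = encode_domains(["PF00089", "PF00227"], DOMAIN_INDEX)
print(vector)  # [0, 0, 1, 0, 1]
```

A protein with no hit to any domain in the vocabulary is encoded as the all-zero vector, which is the case the domain-based predictor cannot handle.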

When this approach is used, not only sequence information but also biological annotation is taken into account in building the predictors. The functional domain method is a widely used computational approach in bioinformatics. It has been applied to protein function analysis and classification (refs 12–15) and to protein structure and subcellular location prediction (refs 16–19), with excellent results. In this paper, we apply the functional domain composition method to peptidase identification and classification.

The Nearest Neighbor Algorithm. The nearest neighbor algorithm (refs 3, 5) predicts the category of a query protein P as the category of its most similar counterpart Pi in the training data set (P1, P2, ..., Pi, ..., PN). In this contribution, the prediction procedure based on the nearest neighbor algorithm is as follows. We first define a dissimilarity measure between two proteins P and Pi based on their functional domain composition vectors:

$$\Lambda(P, P_i) = 1 - \frac{P \cdot P_i}{|P|\,|P_i|} \tag{3}$$

where P · Pi is the dot product of vectors P and Pi, and |P| and |Pi| are their moduli. When P ≡ Pi, the dot product equals the product of the moduli, so Λ(P, Pi) = 0. Because all vector elements are 0 or 1, 0 ≤ Λ(P, Pi) ≤ 1, and a smaller value means that the two proteins are more similar. Thus, the nearest neighbor of P, denoted here P_k, can be identified by the following formula:

$$P_k = \arg\min_{i} \{\Lambda(P, P_i) : i = 1, 2, \ldots, N\} \tag{4}$$
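The nearest neighbor rule of eqs 3 and 4 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the names `distance` and `nearest_neighbor_label`, the four-element toy vectors and their labels are all assumptions. Ties are broken at random among the tied training proteins, following the rule stated in the text.

```python
import math
import random
from typing import List, Sequence, Tuple


def distance(p: Sequence[int], q: Sequence[int]) -> float:
    """Eq 3: one minus the cosine of the angle between two domain vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    if norm == 0:
        # An all-zero vector means no domain hit; eq 3 is undefined there.
        raise ValueError("eq 3 is undefined for a protein without domain hits")
    return 1.0 - dot / norm


def nearest_neighbor_label(
    query: Sequence[int],
    training: List[Tuple[Sequence[int], str]],
    rng: random.Random = random.Random(0),
) -> str:
    """Eq 4: category of the training protein closest to the query."""
    scored = [(distance(query, vec), label) for vec, label in training]
    best = min(d for d, _ in scored)
    # Keep every training protein that attains the minimum, then pick one at random.
    tied = [label for d, label in scored if math.isclose(d, best, abs_tol=1e-12)]
    return rng.choice(tied)


# Toy training set (hypothetical vectors and categories).
TRAINING = [
    ([1, 0, 1, 0], "serine"),
    ([0, 1, 0, 0], "metallo"),
    ([1, 1, 0, 0], "cysteine"),
]

print(nearest_neighbor_label([1, 0, 1, 0], TRAINING))  # serine
```

For the query `[1, 0, 0, 0]` the first and third training vectors are equally close, so either "serine" or "cysteine" may be returned, depending on the random choice.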

In eq 4, the operator argmin finds the training vector Pi for which Λ(P, Pi) is the smallest over the whole training set, and the category of that Pi is assigned to P. If several Pi attain the minimum, one of them is chosen at random and its category is assigned to P.

BLAST Method. Because the nearest neighbor algorithm is based on domain information, it cannot be applied when no domain information is available for the query protein. Only 75% of the UniProt proteins have Pfam domain information. To predict the category of a protein without any domain information, BLAST (ref 2) is used instead, as follows. First, the HSP (high-scoring segment pair) score between the query sequence P and each of the training sequences (P1, P2, ..., Pi, ..., PN) is calculated. Then, the query is assigned the category of the training protein Pi whose HSP score with P is the highest. It can be formulated as

$$P_k = \arg\max_{i} \{\mathrm{HSPs\_Score}(P, P_i) : i = 1, 2, \ldots, N\} \tag{5}$$

where P_k denotes the selected training protein and the operator argmax finds the Pi whose HSP score with P is the highest; the category of that Pi is assigned to P. If several Pi attain the maximum, one of them is chosen at random and its category is assigned to P.

By combining the nearest neighbor algorithm with the BLAST method, we can handle all proteins, with or without functional domain information. In the following sections, the combined method is termed the FD (functional domain) + BLAST method, the method using only functional domain information is termed the FD method, and BLAST using only sequence homology is termed the pure BLAST method.

Results and Discussion

The computation was performed on a Dell OptiPlex 260 machine with a 2.6 GHz Intel CPU and 2 GB of RAM. Some additional information on the data sets is as follows. As shown in Table 1, data set A, with no more than 25% sequence identity, contains 1900 peptidases; the more rigorous data set B, with no more than 10% sequence identity, contains 711 peptidases. Furthermore, we removed peptidases with no or very limited domain information in the Pfam database [http://www.sanger.ac.uk/Software/Pfam/] (ref 11) to form the Pfam data sets. For data set A, this gives a Pfam peptidase data set with a total of 1677 peptidases: 90 aspartic, 291 cysteine, 0 glutamic, 578 metallo, 689 serine and 29 threonine peptidases, as shown in Table 2. For data set B, it gives a Pfam peptidase data set with a total of 532 peptidases: 37 aspartic, 99 cysteine, 0 glutamic, 163 metallo, 223 serine and 10 threonine peptidases, as also shown in Table 2.

Table 2. The Data Set for Peptidase Classification

| data set | subset | aspartic | cysteine | glutamic | metallo | serine | threonine | overall |
|---|---|---|---|---|---|---|---|---|
| Peptidase data set A (identity cutoff = 25%) | PFAM | 90 | 291 | 0 | 578 | 689 | 29 | 1677 |
| Peptidase data set A (identity cutoff = 25%) | Non-PFAM | 20 | 38 | 1 | 56 | 106 | 2 | 223 |
| Peptidase data set A (identity cutoff = 25%) | Total | 110 | 329 | 1 | 634 | 795 | 31 | 1900 |
| Peptidase data set B (identity cutoff = 10%) | PFAM | 37 | 99 | 0 | 163 | 223 | 10 | 532 |
| Peptidase data set B (identity cutoff = 10%) | Non-PFAM | 20 | 34 | 2 | 46 | 74 | 3 | 179 |
| Peptidase data set B (identity cutoff = 10%) | Total | 57 | 133 | 2 | 209 | 297 | 13 | 711 |

We built two predictors in total: the first identifies peptidases among nonpeptidases, and the second classifies peptidases into the six categories. Jackknife cross-validation (refs 3, 4) was adopted to evaluate the prediction accuracy.

The results for the peptidase identifiers are given in Table 3. For data set A (no more than 25% amino acid sequence identity), the FD method gives prediction accuracies of 95.1%, 97.1%, and 96.5% for the positive data set, the negative data set, and the whole data set, respectively; the pure BLAST method gives 91.0%, 90.6%, and 90.7%; and the FD + BLAST method gives 92.6%, 94.2%, and 93.7%. For the more rigorous data set B (no more than 10% amino acid sequence identity), the corresponding accuracies are 95.9%, 98.0%, and 97.3% for the FD method; 76.8%, 87.2%, and 83.7% for the pure BLAST method; and 83.8%, 91.8%, and 89.2% for the FD + BLAST method. As the denominators in Table 3 show, the FD figures refer to the Pfam subsets only, whereas the pure BLAST and FD + BLAST figures refer to the complete data sets.

Table 3. The Performance of Peptidase/Nonpeptidase Identification

| data set | method | peptidase | nonpeptidase | overall |
|---|---|---|---|---|
| A (identity cutoff = 25%) | FD (a) | 95.1% (1595/1677) | 97.1% (3258/3354) | 96.5% (4853/5031) |
| A (identity cutoff = 25%) | BLAST (b) | 91.0% (1729/1900) | 90.6% (3443/3800) | 90.7% (5172/5700) |
| A (identity cutoff = 25%) | FD + BLAST (c) | 92.6% (1760/1900) | 94.2% (3580/3800) | 93.7% (5340/5700) |
| B (identity cutoff = 10%) | FD (a) | 95.9% (510/532) | 98.0% (1043/1064) | 97.3% (1553/1596) |
| B (identity cutoff = 10%) | BLAST (b) | 76.8% (546/711) | 87.2% (1240/1422) | 83.7% (1786/2133) |
| B (identity cutoff = 10%) | FD + BLAST (c) | 83.8% (596/711) | 91.8% (1306/1422) | 89.2% (1902/2133) |

(a) Predictor based on functional domain (FD). (b) Predictor based on pure BLAST method. (c) Predictor by combining the FD and BLAST method.

The results for the peptidase classifiers are given in Table 4. For data set A, the overall prediction accuracies of the FD method, the pure BLAST method, and the FD + BLAST method are 99.0%, 95.3%, and 96.5%, respectively. For data set B, they are 97.9%, 81.4%, and 87.5%, respectively. Please refer to Table 4 for the prediction accuracy for each individual peptidase category.

Table 4. The Performance of Peptidase Classification

| data set | method | aspartic | cysteine | glutamic | metallo | serine | threonine | overall |
|---|---|---|---|---|---|---|---|---|
| Peptidase data set A (cutoff = 25%) | FD (a) | 100% | 99.0% | / | 99.1% | 98.7% | 100% | 99.0% |
| Peptidase data set A (cutoff = 25%) | BLAST (b) | 88.2% | 92.1% | 0 | 96.2% | 97.2% | 90.3% | 95.3% |
| Peptidase data set A (cutoff = 25%) | FD + BLAST (c) | 90.9% | 95.7% | 0 | 97.5% | 97.1% | 93.5% | 96.5% |
| Peptidase data set B (cutoff = 10%) | FD (a) | 100% | 97.0% | / | 98.1% | 98.2% | 90.0% | 97.9% |
| Peptidase data set B (cutoff = 10%) | BLAST (b) | 71.9% | 63.9% | 100% | 80.4% | 93.9% | 30.8% | 81.4% |
| Peptidase data set B (cutoff = 10%) | FD + BLAST (c) | 75.4% | 82.0% | 100% | 87.1% | 93.3% | 69.2% | 87.5% |

(a) Predictor based on functional domain (FD). (b) Predictor based on pure BLAST method. (c) Predictor by combining the FD and BLAST method.

In all cases, we find that under the same conditions the FD method achieves the highest overall prediction accuracy, followed by the FD + BLAST method and then the pure BLAST method. We can conclude that the FD method is indeed superior to the traditional BLAST method both in separating peptidases from nonpeptidases and in predicting the category of a peptidase. The FD + BLAST method is better than the pure BLAST method because, for the peptidases that have functional domain information, its FD part performs better than pure BLAST on the same data. Because the FD + BLAST method can handle all peptidase data, with or without domain information, it is chosen in practice as our primary method for making predictions.

The FD method performs well on both data set A and data set B, while the pure BLAST method suffers considerably under the more rigorous cutoff. Because all amino acid sequences in data set B are very dissimilar to each other, the pure BLAST method tends to allocate peptidases with dissimilar sequences to different categories even when they actually belong to the same category. The pure BLAST method is good at handling peptidases with similar sequences, because peptidases whose amino acid sequences are similar are by nature more likely to come from the same category.
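The FD + BLAST combination and its jackknife evaluation can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the names `fd_distance`, `predict_fd_blast`, `jackknife_accuracy` and `toy_score` are hypothetical, `toy_score` is a crude stand-in for a BLAST HSP score (it only counts identical aligned positions), the six samples are invented, and restricting the FD branch to training proteins that have domain hits is an assumption. For brevity, ties are resolved by taking the first best match rather than a random one.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# One sample: (domain vector or None when there is no Pfam hit, sequence, category)
Sample = Tuple[Optional[Sequence[int]], str, str]


def fd_distance(p: Sequence[int], q: Sequence[int]) -> float:
    """Eq 3 for two non-zero domain vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def toy_score(a: str, b: str) -> int:
    """Stand-in for an HSP score: number of identical aligned positions."""
    return sum(x == y for x, y in zip(a, b))


def predict_fd_blast(query: Sample, training: List[Sample],
                     score: Callable[[str, str], float]) -> str:
    """Use the FD nearest neighbor when the query has domain hits, else eq 5."""
    vector, sequence, _ = query
    if vector is not None and any(vector):
        with_domains = [s for s in training if s[0] is not None and any(s[0])]
        if with_domains:
            return min(with_domains, key=lambda s: fd_distance(vector, s[0]))[2]
    # Fallback: category of the training sequence with the highest score.
    return max(training, key=lambda s: score(sequence, s[1]))[2]


def jackknife_accuracy(samples: List[Sample],
                       score: Callable[[str, str], float]) -> float:
    """Leave-one-out test: predict each sample from all the other samples."""
    correct = 0
    for i, sample in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        if predict_fd_blast(sample, training, score) == sample[2]:
            correct += 1
    return correct / len(samples)


# Invented toy data: four proteins with domain vectors, two without.
SAMPLES: List[Sample] = [
    ([1, 0, 0], "MKVLAA", "peptidase"),
    ([1, 1, 0], "MKVLGA", "peptidase"),
    ([0, 0, 1], "TTSPQR", "nonpeptidase"),
    ([0, 1, 1], "TTSPQW", "nonpeptidase"),
    (None, "MKVLAG", "peptidase"),
    (None, "TTSAQR", "nonpeptidase"),
]

print(jackknife_accuracy(SAMPLES, toy_score))  # 1.0
```

In a real setting the `score` argument would wrap an actual BLAST search; keeping it as a parameter lets the dispatch logic be tested without running BLAST.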

Conclusion

After analyzing the data sets, we find that sequences belonging to the same category are significantly similar and homologous. Alignment tools such as BLAST can therefore be used to search the training sequences for those similar to the query sequence and so identify the category to which the query belongs. BLAST still performs well after sequences with more than 25% identity are removed (overall prediction accuracy 90.7%; Table 3, cutoff = 25%). However, the BLASTP search method relies only on sequence similarity. The functional domains of proteins, on the other hand, are more conserved. A query sequence may not be similar enough overall to the other members of its family, yet some regions of it may be crucial for identifying it. These regions are invariant because important functional domains are conserved during evolution. These functional domains, also called motifs, can play a key role in the identification of peptidases and their types. Meanwhile, the peptidases in MEROPS are classified into families based on statistically significant similarity between protein sequences in the part termed the "peptidase unit", which is most directly responsible for activity. This explains why the prediction performance of the FD + BLAST method is better than that of the pure BLAST method.

Acknowledgment. This work is supported by the basic research grant of the Chinese Academy of Sciences (KSCX2-YWR-112).

Supporting Information Available: Tables listing the detailed information of the data set, Peptidase and Nonpeptidase Sequences. This material is available free of charge via the Internet at http://pubs.acs.org.

Journal of Proteome Research • Vol. 7, No. 10, 2008

References

(1) Rawlings, N. D.; Morton, F. R.; Barrett, A. J. MEROPS: the peptidase database. Nucleic Acids Res. 2006, 34 (Database issue), D270–2.
(2) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215 (3), 403–10.
(3) Cai, Y. D.; Chou, K. C. Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem. Biophys. Res. Commun. 2003, 305 (2), 407–11.
(4) Chou, K. C.; Cai, Y. D. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 2002, 277 (48), 45765–9.
(5) Salzberg, S.; Cost, S. Predicting protein secondary structure with a nearest-neighbor algorithm. J. Mol. Biol. 1992, 227 (2), 371–4.
(6) Lu, L.; Qian, Z.; Cai, Y. D.; Li, Y. ECS: an automatic enzyme classifier based on functional domain composition. Comput. Biol. Chem. 2007, 31 (3), 226–32.
(7) Jia, P.; Qian, Z.; Zeng, Z.; Cai, Y.; Li, Y. Prediction of subcellular protein localization based on functional domain composition. Biochem. Biophys. Res. Commun. 2007, 357 (2), 366–70.
(8) Wu, C. H.; Apweiler, R.; Bairoch, A.; Natale, D. A.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Mazumder, R.; O’Donovan, C.; Redaschi, N.; Suzek, B. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34 (Database issue), D187–91.
(9) Li, W.; Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22 (13), 1658–9.
(10) Wang, G.; Dunbrack, R. L. PISCES: a protein sequence culling server. Bioinformatics 2003, 19 (12), 1589–91.
(11) Finn, R. D.; Mistry, J.; Schuster-Bockler, B.; Griffiths-Jones, S.; Hollich, V.; Lassmann, T.; Moxon, S.; Marshall, M.; Khanna, A.; Durbin, R.; Eddy, S. R.; Sonnhammer, E. L.; Bateman, A. Pfam: clans, web tools and services. Nucleic Acids Res. 2006, 34 (Database issue), D247–51.
(12) Masseroli, M.; Bellistri, E.; Franceschini, A.; Pinciroli, F. Statistical analysis of genomic protein family and domain controlled annotations for functional investigation of classified gene lists. BMC Bioinformatics 2007, 8, 1–S14.
(13) Horan, K.; Lauricha, J.; Bailey-Serres, J.; Raikhel, N.; Girke, T. Genome cluster database. A sequence family analysis platform for Arabidopsis and rice. Plant Physiol. 2005, 138 (1), 47–54.
(14) Hayete, B.; Bienkowska, J. R. Gotrees: predicting go associations from protein domain composition using decision trees. Pac. Symp. Biocomput. 2005, 127–38.
(15) Cai, Y. D.; Doig, A. J. Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition. Bioinformatics 2004, 20 (8), 1292–300.
(16) Yu, X.; Wang, C.; Li, Y. Classification of protein quaternary structure by functional domain composition. BMC Bioinf. 2006, 7, 187.
(17) Chou, K. C.; Cai, Y. D. Predicting protein structural class by functional domain composition. Biochem. Biophys. Res. Commun. 2004, 321 (4), 1007–9.
(18) Chou, K. C.; Cai, Y. D. Prediction and classification of protein subcellular location-sequence-order effect and pseudo amino acid composition. J. Cell. Biochem. 2003, 90 (6), 1250–60.
(19) Cai, Y. D.; Chou, K. C. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics 2004, 20 (7), 1151–6.
