Using Functional Domain Composition To Predict Enzyme Family Classes
Yu-Dong Cai*,†,§ and Kuo-Chen Chou*,‡,§
Biomolecular Sciences Department, University of Manchester Institute of Science & Technology, Manchester, M60 1QD, United Kingdom, Tianjin Institute of Bioinformatics and Drug Discovery (TIBDD), Tianjin, China, and Gordon Life Science Institute, San Diego, California 92130
Received September 15, 2004
According to their main EC (Enzyme Commission) numbers, enzymes are classified into six main classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. A new method has been developed to predict the enzymatic attribute of proteins by using the functional domain composition to represent a given protein sequence. The advantage of doing so is that both sequence-order-related features and function-related features are naturally incorporated in the predictor. As a demonstration, the jackknife cross-validation test was performed on a dataset of proteins that share less than 20% sequence identity with one another, in order to remove homologous bias. The overall success rate thus obtained was 85% in identifying the enzyme family classes (including the identification of nonenzyme protein sequences). This rate is significantly higher than those obtained by other methods on such a stringent dataset, which indicates that using the functional domain composition to represent protein samples for statistical prediction is very promising and may become a powerful tool in bioinformatics and proteomics.
Keywords: classification of enzyme commission • enzymatic attribute • functional domain composition • 20% threshold cutoff • nearest neighbor predictor • bioinformatics • proteomics
I. Introduction
Enzymes are generally classified into six families:1 (1) oxidoreductases, which catalyze oxidoreduction reactions; (2) transferases, which transfer a group from one compound to another; (3) hydrolases, which catalyze the hydrolysis of various bonds; (4) lyases, which cleave C-C, C-O, C-N, and other bonds by means other than hydrolysis or oxidation; (5) isomerases, which catalyze geometrical or structural changes within one molecule; and (6) ligases, which catalyze the joining of two molecules coupled with the hydrolysis of a pyrophosphate bond in ATP or a similar triphosphate (Figure 1). Given a newly found protein sequence, the following two questions are often asked. Is the new protein an enzyme? If it is, to which enzyme family class does it belong? Both questions are closely related to the function of the protein as well as to its specificity and molecular mechanism, and hence are very important to both basic research and drug discovery. In particular, with the explosion of protein sequences entering databanks and the much slower progress in determining their functions by biochemical experiments, it is highly desirable to develop an automated method that can give fast answers to these questions.
* To whom correspondence should be addressed. E-mail: y.cai@umist.ac.uk;
[email protected].
† Biomolecular Sciences Department, University of Manchester Institute of Science & Technology.
‡ Tianjin Institute of Bioinformatics and Drug Discovery.
§ Gordon Life Science Institute.
10.1021/pr049835p CCC: $30.25 © 2005 American Chemical Society
Figure 1. Schematic drawing to show that enzymes are classified into oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases according to their main EC (Enzyme Commission) numbers.
Journal of Proteome Research 2005, 4, 109-111. Published on Web 12/22/2004.
Although there are many existing methods2-11 that can be used to predict the enzyme family class after some modification, most of them represent a protein sample by its amino acid composition. Hence, many important features associated with the sequence order are completely missed, which imposes an intrinsic limit on the achievable success rate of prediction. To establish a powerful prediction method, the first important step is to find an effective representation for a protein. In this sense, the entire protein sequence of course contains the most complete information. Unfortunately, when the entire sequence of a protein is used as its representation in a statistical prediction algorithm, one faces the difficulty of dealing with an almost infinite number of sample patterns, as elaborated by Chou.12 To formulate a feasible statistical prediction algorithm, a protein must be expressed in terms of a set of discrete numbers, such as the 20 amino acid components widely used by many previous investigators in various prediction algorithms.2-5,9,13-20 We are therefore confronted with a dilemma: if we wish to include the complete information, the prediction becomes infeasible; if we wish to make the prediction feasible, some important information must be ignored. In view of this, can we find a compromise, i.e., a new protein representation that consists of a set of discrete numbers but also retains as many important sequence-order-related features as possible? The present study was initiated in an attempt to address this problem. To incorporate the sequence-order-related features, the “functional domain composition” is used to represent a given protein, as detailed in the next section.
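The loss of sequence-order information under the amino acid composition can be illustrated with a minimal Python sketch. The function name and the two toy sequences are hypothetical and are not taken from the dataset; the sketch only shows that two different sequences can map to the same 20D vector.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """Return the 20 fractional residue frequencies of a protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Two toy sequences (hypothetical): seq_b is seq_a written in reverse order.
seq_a = "MKTAYIAKQR"
seq_b = "RQKAIYATKM"

# The sequences differ, yet their 20D composition vectors are identical.
print(seq_a != seq_b)
print(amino_acid_composition(seq_a) == amino_acid_composition(seq_b))
```

Any representation built only from residue counts has this property, which is the limitation the functional domain composition is meant to reduce.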
II. Materials and Method
The ENZYME database (ftp.expasy.ch)21 was used to construct the 6 enzyme family classes. To remove redundancy, only those sequences having less than 20% sequence identity were included. Thus, a total of 1000 sequences were taken from the ENZYME database. Of the 1000 sequences, 153 are oxidoreductases, 290 transferases, 385 hydrolases, 82 lyases, 33 isomerases, and 57 ligases. In addition, a total of 1301 nonenzyme protein sequences were randomly taken from SWISSPROT,22 subject to the same condition of less than 20% sequence identity. The accession numbers of the 1000 + 1301 = 2301 protein sequences are given in the online Supporting Information A.
To improve the quality of predicting enzyme family classes, the key is to capture the core features of a protein that are intimately related to its biological function. Since the enzyme family classes are defined according to molecular functions and acting objects, it is anticipated that the prediction quality will be significantly enhanced if the knowledge of functional domains can be used to define the sample of a protein. This can be realized through the integrated domain and motif database,23 i.e., the InterPro database at http://www.ebi.ac.uk/interpro. InterPro release 6.2 (April 24, 2003) contains 7785 entries with well-known structural and functional domain types. With each of the 7785 functional domain patterns as a vector base, a protein can be defined as a 7785D (dimensional) vector according to the following steps.
(1) Use the program IPRSCAN23 to search the InterPro database for a given protein. If there is a hit (e.g., IPR001938, meaning the protein contains a sequence segment very similar to that of the 1938th domain of the InterPro database), then the corresponding component (here the 1938th) of the protein in the 7785D functional domain space is assigned 1; otherwise, it is assigned 0.
(2) The protein can thus be explicitly formulated as follows:

$$
\mathbf{P} = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_j \\ \vdots \\ p_{7785} \end{bmatrix} \tag{1}
$$

where

$$
p_j = \begin{cases} 1, & \text{hit found in InterPro database} \\ 0, & \text{otherwise} \end{cases} \tag{2}
$$
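The two steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes the InterPro hits for a protein have already been obtained (for example from an IPRSCAN run), and it uses an explicit lookup table from entry ID to vector position. The function names and the four-entry toy list, which stands in for the 7785 entries, are hypothetical.

```python
def build_domain_index(entry_ids):
    """Map each InterPro entry ID to a fixed zero-based position in the vector."""
    return {entry_id: position for position, entry_id in enumerate(entry_ids)}

def domain_vector(hits, index):
    """Return the 0/1 functional domain composition vector for one protein."""
    vector = [0] * len(index)
    for entry_id in hits:
        position = index.get(entry_id)
        if position is not None:  # ignore hits outside the chosen entry list
            vector[position] = 1
    return vector

# Toy entry list standing in for the 7785 InterPro entries (illustrative only).
entries = ["IPR000001", "IPR000002", "IPR001938", "IPR002000"]
index = build_domain_index(entries)

# A protein with hits to two of the four entries.
example = domain_vector({"IPR001938", "IPR000001"}, index)
print(example)  # [1, 0, 1, 0]
```

A lookup table is used here instead of reading the position directly from the numeric part of the ID, so that the sketch does not depend on how entry numbers are assigned in a given InterPro release.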
Defined in this way, a protein corresponds to a 7785D vector P, with each of the 7785 functional domain patterns serving as a base of the vector space. In other words, rather than the 20D space of the amino acid composition approach used by many previous investigators,2-5,9,16,24 a protein is now represented in terms of its functional domain composition. By doing so, not only are some sequence-order-related features naturally incorporated in the representation, but some function-related features are as well.
To predict the enzyme family class, we adopted the NN (Nearest Neighbor) predictor,25,26 which is particularly useful when the distributions of the samples are unknown. The NN predictor is summarized below. Suppose there are N samples (P1, P2, ..., PN) that have been classified into categories 1, 2, ..., µ. For a query protein P, how can we predict the category to which it belongs? According to the nearest neighbor principle, the prediction can be formulated as follows. First, let us define a generalized distance between P and Pi (i = 1, 2, ..., N), given by

$$
D(\mathbf{P}, \mathbf{P}_i) = 1 - \frac{\mathbf{P} \cdot \mathbf{P}_i}{|\mathbf{P}|\,|\mathbf{P}_i|}, \quad (i = 1, 2, \ldots, N) \tag{3}
$$
where P·Pi is the dot product of vectors P and Pi, and |P| and |Pi| are their respective moduli. Obviously, when P ≡ Pi, we have D(P,Pi) = 0. Generally speaking, the generalized distance lies between 0 and 1; i.e., 0 ≤ D(P,Pi) ≤ 1. Accordingly, the NN predictor can be expressed as follows. If the generalized distance between P and Pk (k = 1, 2, ..., or N) is the smallest, i.e.,

$$
D(\mathbf{P}, \mathbf{P}_k) = \mathrm{Min}\{D(\mathbf{P}, \mathbf{P}_1), D(\mathbf{P}, \mathbf{P}_2), \ldots, D(\mathbf{P}, \mathbf{P}_N)\} \tag{4}
$$

where the operator Min means taking the minimal one among those in the brackets, then the query protein P is predicted to belong to the same category as Pk. If there is a tie, the query protein is not uniquely determined, but such cases rarely occur.
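Equations 3 and 4 can be sketched in a few lines of Python. This is an illustrative sketch under two assumptions that the text does not specify: a vector with no domain hits (zero modulus) is given the maximal distance of 1, and a tie is reported as `None` only when the nearest samples belong to different categories. The function names and toy vectors are hypothetical.

```python
import math

def generalized_distance(p, q):
    """Equation 3: one minus the cosine of the angle between p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # assumption: a protein with no domain hits is maximally distant
    return 1.0 - dot / (norm_p * norm_q)

def nn_predict(query, samples, labels):
    """Equation 4: label of the closest sample, or None if categories tie."""
    distances = [generalized_distance(query, s) for s in samples]
    smallest = min(distances)
    tied = {lab for d, lab in zip(distances, labels)
            if math.isclose(d, smallest, abs_tol=1e-12)}
    return tied.pop() if len(tied) == 1 else None

# Toy 4D training set (illustrative, not real InterPro data).
samples = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 0]]
labels = ["hydrolase", "ligase", "nonenzyme"]

print(nn_predict([1, 0, 0, 0], samples, labels))  # hydrolase
print(nn_predict([1, 1, 1, 0], samples, labels))  # None (hydrolase and nonenzyme tie)
```

For the first query only the hydrolase sample shares a domain, so its distance (about 0.29) is the smallest. For the second query the hydrolase and nonenzyme samples are equally close, which is the tie case described in the text.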
III. Results and Discussion
The computation was performed on a Silicon Graphics IRIS Indigo workstation (Elan 4000). As is well-known, in statistical prediction, the single independent dataset test, the sub-sampling test, and the jackknife test are the three methods often used for cross-validation. Of these three, the jackknife test is deemed the most rigorous and objective (see a review27 for a comprehensive discussion, and a monograph28 for the underlying mathematical principle). Therefore, the jackknife test has been used by more and more investigators5-11 in examining the power of various prediction methods. With the current approach, the success rates by the jackknife cross-validation for the 2301 proteins in the online Supporting Information A are given in Table 1, from which we can see that the overall success rate is 85.35%. For the same cross-validation test on such a strictly nonhomologous dataset, the corresponding overall success rates obtained by the simple geometry approaches2,3 based on the amino acid composition alone were only 10-12% (i.e., more than 70 percentage points lower than the aforementioned rate), indicating that the current approach is overwhelmingly superior to the existing methods in predicting enzyme family class.

Table 1. Success Rates in Identifying Enzyme Family Classes and Nonenzyme Proteins by Jackknife Test for the 2301 Proteins Listed in the Online Supporting Information A, Where None of the Proteins Have ≥20% Sequence Identity with Others

| category | success rate |
| --- | --- |
| oxidoreductase | 115/153 = 75.16% |
| transferase | 231/290 = 79.65% |
| hydrolase | 297/385 = 77.14% |
| lyase | 67/82 = 81.71% |
| isomerase | 21/33 = 63.63% |
| ligase | 51/57 = 89.47% |
| nonenzyme protein | 1182/1301 = 90.85% |
| overall | 1964/2301 = 85.35% |

Recently, Cai et al.29 used SVM (support vector machines) to predict enzyme family classification and also obtained some encouraging results. However, their study differs from the present one in three respects. First, no attempt was made to distinguish enzymes from nonenzymes on the basis of sequence, which is the first important problem investigated in this paper. Second, no cutoff procedure was applied to their datasets to remove samples with high sequence identity and avoid homologous bias. Third, no jackknife cross-validation was used to examine the prediction power.
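The jackknife test used here removes each protein in turn from the dataset, predicts its class from the remaining proteins, and tallies the hits per class. A minimal Python sketch is given below. It is self-contained and rests on assumptions not stated in the text: a zero vector is given distance 1, and a tie between categories is counted as a failed prediction. All names and the five toy samples are hypothetical.

```python
import math
from collections import Counter

def distance(p, q):
    """Generalized distance: one minus the cosine between p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 if norm == 0 else 1.0 - dot / norm  # assumption for zero vectors

def nearest_label(query, samples, labels):
    """Label of the nearest sample, or None when different categories tie."""
    distances = [distance(query, s) for s in samples]
    smallest = min(distances)
    tied = {lab for d, lab in zip(distances, labels)
            if math.isclose(d, smallest, abs_tol=1e-12)}
    return tied.pop() if len(tied) == 1 else None

def jackknife(samples, labels):
    """Leave each sample out in turn, predict it from the rest, tally per class."""
    correct, total = Counter(), Counter()
    for i, (query, truth) in enumerate(zip(samples, labels)):
        rest_samples = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        total[truth] += 1
        if nearest_label(query, rest_samples, rest_labels) == truth:
            correct[truth] += 1
    return correct, total

# Toy dataset (illustrative only).
samples = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 0]]
labels = ["enzyme", "enzyme", "nonenzyme", "nonenzyme", "enzyme"]

correct, total = jackknife(samples, labels)
for category in total:
    print(category, f"{correct[category]}/{total[category]}")
print("overall", f"{sum(correct.values())}/{sum(total.values())}")
```

On this toy set the last sample is equidistant from an enzyme and a nonenzyme neighbor, so it is counted as a miss; per-class ratios of this kind are what Table 1 reports for the real dataset.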
IV. Conclusion
The enzymatic attributes of newly found protein sequences are usually determined by either biochemical analysis of eukaryotic and prokaryotic genomes or microarray chips. These experimental methods are both time-consuming and costly. With the explosion of protein sequences entering databanks, it is highly desirable to develop an automated method to identify whether a given new sequence is an enzyme or a nonenzyme and, if it is an enzyme, to which enzyme family class it belongs. This is important because knowing the family to which an enzyme belongs may help deduce its catalytic mechanism and specificity, providing clues to the relevant biological function.
Using the functional domain composition to represent protein samples can significantly improve the quality of predicting the enzyme family class of a given protein sequence; the predictor developed here also covers the identification of nonenzyme protein sequences. The current method may become a useful high-throughput tool in bioinformatics and proteomics.
Supporting Information Available: The accession numbers of 1000 + 1301 = 2301 protein sequences classified into 6 enzyme family classes and a nonenzyme class. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) Webb, E. C. Enzyme Nomenclature; Academic Press: San Diego, 1992.
(2) Nakashima, H.; Nishikawa, K.; Ooi, T. J. Biochem. 1986, 99, 152-162.
(3) Chou, P. Y. In Prediction of Protein Structure and the Principles of Protein Conformation; Fasman, G. D., Ed.; Plenum Press: New York, 1989; pp 549-586.
(4) Chou, K. C. Proteins: Struct. Funct. Genet. 1995, 21, 319-344.
(5) Zhou, G. P. J. Protein Chem. 1998, 17, 729-738.
(6) Yuan, Z. FEBS Lett. 1999, 451, 23-26.
(7) Feng, Z. P. Biopolymers 2001, 58, 491-499.
(8) Hua, S.; Sun, Z. Bioinformatics 2001, 17, 721-728.
(9) Zhou, G. P.; Assa-Munt, N. Proteins: Struct. Funct. Genet. 2001, 44, 57-59.
(10) Pan, Y. X.; Zhang, Z. Z.; Guo, Z. M.; Feng, G. Y.; Huang, Z. D.; He, L. J. Protein Chem. 2003, 22, 395-402.
(11) Zhou, G. P.; Doctor, K. Proteins: Struct. Funct. Genet. 2003, 50, 44-48.
(12) Chou, K. C. Proteins: Struct. Funct. Genet. 2001, 43, 246-255 (Erratum: ibid. 2001, 44, 60).
(13) Klein, P.; Delisi, C. Biopolymers 1986, 25, 1659-1672.
(14) Klein, P. Biochim. Biophys. Acta 1986, 874, 205-215.
(15) Deleage, G.; Roux, B. Protein Eng. 1987, 1, 289-294.
(16) Chou, K. C.; Zhang, C. T. J. Biol. Chem. 1994, 269, 22014-22020.
(17) Kneller, D. G.; Cohen, F. E.; Langridge, R. J. Mol. Biol. 1990, 214, 171-182.
(18) Metfessel, B. A.; Saurugger, P. N.; Connelly, D. P.; Rich, S. T. Protein Sci. 1993, 2, 1171-1182.
(19) Chandonia, J. M.; Karplus, M. Protein Sci. 1995, 4, 275-285.
(20) Bahar, I.; Atilgan, A. R.; Jernigan, R. L.; Erman, B. Proteins: Struct. Funct. Genet. 1997, 29, 172-185.
(21) Bairoch, A. Nucleic Acids Res. 2000, 28, 304-305.
(22) Bairoch, A.; Apweiler, R. Nucleic Acids Res. 2000, 25, 31-36.
(23) Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Birney, E.; Biswas, M.; Bucher, P.; Cerutti, L.; Corpet, F.; Croning, M. D. R.; Durbin, R.; Falquet, L.; Fleischmann, W.; Gouzy, L.; Hermjakob, H.; Hulo, N.; Jonassen, I.; Kahn, D.; Kanapin, A.; Karavidopoulou, Y.; Lopez, R.; Marx, B.; Mulder, N. J.; Oinn, T. M.; Pagni, M.; Servant, F.; Sigrist, C. J. A.; Zdobnov, E. M. Nucleic Acids Res. 2001, 29, 37-40.
(24) Chou, J. J.; Zhang, C. T. J. Theor. Biol. 1993, 161, 251-262.
(25) Cover, T. M.; Hart, P. E. IEEE Trans. Inf. Theory 1967, IT-13, 21-27.
(26) Friedman, J. H.; Baskett, F.; Shustek, L. J. IEEE Trans. Inf. Theory 1975, C-24, 1000-1006.
(27) Chou, K. C.; Zhang, C. T. Crit. Rev. Biochem. Mol. Biol. 1995, 30, 275-349.
(28) Mardia, K. V.; Kent, J. T.; Bibby, J. M. Multivariate Analysis: Chapter 11 Discriminant Analysis; Chapter 12 Multivariate analysis of variance; Chapter 13 cluster analysis; Academic Press: London, 1979; pp 322-381.
(29) Cai, C. Z.; Han, L. Y.; Ji, Z. L.; Chen, Y. Z. Proteins: Struct. Funct. Bioinform. 2004, 55, 66-76.