Predicting Enzyme Subclass by Functional Domain Composition and Pseudo Amino Acid Composition

Yu-Dong Cai†,‡,§,| and Kuo-Chen Chou*,†,‡,|

Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China, Shanghai Centre for Bioinformatics Technology, 100 Qing-Zhou Road, Shanghai 200235, China, Biomolecular Sciences Department, University of Manchester Institute of Science & Technology, Manchester, M60 1QD, United Kingdom, and Gordon Life Science Institute, San Diego, California 92130

Received February 17, 2005
As a continuation of efforts to use the sequence approach to identify enzymatic function at a deeper level, investigations are extended from the main enzyme classes (Protein Sci. 2004, 13, 2857-2863) to their subclasses. This is indispensable if we wish to understand the molecular mechanism of an enzyme at a deeper level. For each of the 6 main enzyme classes (i.e., oxidoreductase, transferase, hydrolase, lyase, isomerase, and ligase), a subclass training dataset was constructed. To reduce homologous bias, a stringent cutoff was imposed so that all the entries included in the datasets have less than 40% sequence identity to each other. To catch the core features that are intimately related to biological function, a protein sample is represented by hybridizing the functional domain composition and the pseudo amino acid composition. On the basis of this hybrid representation, the FunD-PseAA predictor was established. Jackknife cross-validation tests show that the overall success rate in identifying the 21 subclasses of oxidoreductases is above 86%, and that the corresponding rates in identifying the subclasses of the other 5 main enzyme classes are 94-97%. These high success rates imply that the FunD-PseAA predictor may become a useful tool in bioinformatics and proteomics of the post-genomic era.

Keywords: ENZYME database • 40% cutoff • functional domain • pseudo amino acid composition • ISort predictor • FunD-PseAA predictor • bioinformatics • proteomics
* To whom correspondence should be addressed. E-mail: kchou@san.rr.com.
† Chinese Academy of Sciences. ‡ Shanghai Centre for Bioinformatics Technology. § University of Manchester Institute of Science & Technology. | Gordon Life Science Institute.
Journal of Proteome Research 2005, 4, 967-971. Published on Web 04/08/2005. 10.1021/pr0500399 CCC: $30.25 © 2005 American Chemical Society

I. Introduction

In a previous study,1 efforts were made to address the following two problems. For a newly found protein sequence, can we identify whether it is an enzyme or a nonenzyme? If it is an enzyme, which main class does it belong to? As demonstrated in that study,1 the success rate in distinguishing enzymes from nonenzymes, and the success rate in assigning an enzyme to one of the 6 main enzyme classes, were both above 93%. Even for a very stringent dataset consisting of proteins with less than 20% sequence identity to each other,2 the overall success rate was over 85%. Since each of the main enzyme classes has its own subclasses, a logical next question is: for an enzyme with a given main class, can we predict which subclass it belongs to? This is indispensable if we wish to understand the in-depth molecular mechanism of an enzyme.

Although the covariant-discriminant predictor was adopted to identify the subclass for oxidoreductases in an earlier study,3 the entire approach there was based on the protein amino acid composition alone. According to the classical definition, the amino acid composition of a protein consists of 20 components, each representing the occurrence frequency of one of the 20 native amino acids in it (see, e.g., refs 4-7). Obviously, if a protein sample is represented solely by its amino acid composition, all the detailed information about its sequence order and sequence length is ignored. Accordingly, although the results obtained in that study3 are quite encouraging, there is plenty of room to improve the prediction quality. For instance, by using the amphiphilic pseudo amino acid composition8 approach to take into account some partial sequence-order effects for the case studied in ref 3, a remarkable improvement has been observed.

Besides, according to their main EC (Enzyme Commission) numbers,9 enzymes are classified into the following 6 main classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. The oxidoreductase class investigated in refs 3 and 8 is just one of these 6 main classes. A powerful predictor should be able to identify the subclass for each of the 6 main enzyme classes as well. But what procedures do we need to this end? With the avalanche of genomic sequences in the postgenomic era, we are facing an exciting but frightening prospect: piles of sequences but only flakes of knowledge. How can the thousands upon thousands of sequences be annotated in a timely manner? The present study was initiated in an attempt to address these challenges.
II. Materials

The ENZYME database at ftp.expasy.ch (Release 35, June 2004) was used to construct the subclasses, in terms of their accession numbers, for each of the 6 main enzyme families. The corresponding sequences were obtained from the databases of UniProt/Swiss-Prot at www.ebi.ac.uk/swissprot (Release 44, 5-July-2004) and UniProt/TrEMBL at www.ebi.ac.uk/trembl (Release 27.0, 5-July-2004). To reduce homology bias, a redundancy cutoff operation was imposed by the program PISCES10 so that none of the included sequences had ≥40% identity to any other. Thus, the following 6 datasets were constructed.

The 1st dataset, SA, consists of 1697 oxidoreductases classified into the following 21 subclasses: (1) 407 acting on the CH-OH group of donors; (2) 128 acting on the aldehyde or oxo group of donors; (3) 119 acting on the CH-CH group of donors; (4) 71 acting on the CH-NH2 group of donors; (5) 95 acting on the CH-NH group of donors; (6) 262 acting on NADH or NADPH; (7) 40 acting on other nitrogenous compounds as donors; (8) 67 acting on a sulfur group of donors; (9) 80 acting on a heme group of donors; (10) 45 acting on diphenols and related substances as donors; (11) 52 acting on peroxide as donors; (12) 22 acting on hydrogen as donors; (13) 44 acting on single donors with incorporation of molecular oxygen; (14) 127 acting on paired donors, with incorporation of molecular oxygen; (15) 31 acting on superoxide as acceptor; (16) 12 oxidizing metal ions; (17) 44 acting on CH or CH2; (18) 25 acting on iron-sulfur proteins as donors; (19) 6 acting on phosphorus or arsenic in donors; (20) 8 acting on X-H and Y-H to form an X-Y bond; (21) 12 other oxidoreductases.
The 2nd dataset, SB, consists of 3582 transferases classified into 8 subclasses: (1) 438 transferring one-carbon groups, (2) 34 transferring aldehyde or ketonic groups, (3) 280 acyltransferases, (4) 390 glycosyltransferases, (5) 280 transferring alkyl or aryl groups (other than methyl groups), (6) 104 transferring nitrogenous groups, (7) 2002 transferring phosphorus-containing groups, and (8) 54 transferring sulfur-containing groups.

The 3rd dataset, SC, consists of 2902 hydrolases classified into 8 subclasses: (1) 1064 acting on ester bonds, (2) 476 glycosylases, (3) 16 acting on ether bonds, (4) 442 acting on peptide bonds (peptidases), (5) 367 acting on carbon-nitrogen bonds other than peptide bonds, (6) 519 acting on acid anhydrides, (7) 3 acting on carbon-carbon bonds, and (8) 15 acting on halide bonds.

The 4th dataset, SD, consists of 939 lyases classified into 6 subclasses: (1) 326 carbon-carbon lyases, (2) 450 carbon-oxygen lyases, (3) 52 carbon-nitrogen lyases, (4) 25 carbon-sulfur lyases, (5) 53 phosphorus-oxygen lyases, and (6) 33 other lyases.

The 5th dataset, SE, consists of 503 isomerases classified into 6 subclasses: (1) 95 racemases and epimerases, (2) 78 cis-trans-isomerases, (3) 176 intramolecular oxidoreductases, (4) 76 intramolecular transferases, (5) 8 intramolecular lyases, and (6) 70 other isomerases.

The 6th dataset, SF, consists of 840 ligases classified into 6 subclasses: (1) 404 forming carbon-oxygen bonds, (2) 34 forming carbon-sulfur bonds, (3) 341 forming carbon-nitrogen bonds, (4) 14 forming carbon-carbon bonds, (5) 37 forming phosphoric ester bonds, and (6) 10 forming nitrogen-metal bonds.
Journal of Proteome Research • Vol. 4, No. 3, 2005
The accession numbers of the proteins in datasets SA, SB, SC, SD, SE, and SF are given in Supporting Information A, B, C, D, E, and F, respectively.
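As a rough illustration of what a 40% redundancy cutoff does to a dataset, the sketch below greedily keeps a sequence only if its identity to every sequence already kept is below the cutoff. This is not how PISCES works internally; the function names `naive_identity` and `cull_redundant` are hypothetical, and the ungapped, position-by-position identity measure is a deliberately crude assumption standing in for a real alignment-based identity.

```python
from typing import Callable, List


def naive_identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions, without any alignment.

    Assumption for illustration only; real culling tools align sequences first.
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def cull_redundant(
    sequences: List[str],
    identity: Callable[[str, str], float] = naive_identity,
    cutoff: float = 0.40,
) -> List[str]:
    """Greedily keep sequences whose identity to every kept one is below cutoff."""
    kept: List[str] = []
    for seq in sequences:
        if all(identity(seq, other) < cutoff for other in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    demo = ["AAAA", "AAAT", "CCCC"]
    print(cull_redundant(demo))
```

The result of a greedy pass depends on the input order, so a production tool would typically apply additional criteria when deciding which member of a redundant pair to keep.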
III. Method

It was suggested through an unbiased analysis11 that more than 70% of the fragment pairs with above 50% sequence identity have different EC numbers (enzymatic functions), implying that enzyme function is much less conserved than anticipated. Therefore, for the aforementioned datasets, in which no entry has 40% or higher sequence identity to any other, a simple sequence-similarity approach is unlikely to yield a high success rate in identifying the enzyme subfamily class. To enhance the success rate, the key is to catch the core features of a sample that are intimately related to its biological function. Because the enzyme family classes are classified according to their molecular functions and acting objects, it is anticipated that the prediction quality will be significantly enhanced if we can find a feasible approach to use the knowledge of functional domains to define an enzyme sample. This can be realized through the integrated domain and motif database,12 or the InterPro database at http://www.ebi.ac.uk/interpro. InterPro release 6.2 (April 24, 2003) contains 7785 entries with well-known structural and functional domain types. With each of the 7785 functional domain patterns as a vector base, the sample of an enzyme can be defined as a 7785D (dimensional) vector according to the following steps.

Step 1. Use the program IPRSCAN12 to search the InterPro database for a given enzyme. If there is a hit (e.g., IPR001970, meaning the enzyme contains a sequence segment very similar to that of the 1970th domain of the InterPro database), then the 1970th component of the enzyme in the 7785D FunD (functional domain) space is assigned 1; otherwise, it is assigned 0.

Step 2. The enzyme can thus be explicitly formulated as follows:
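$$
E = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_j \\ \vdots \\ e_{7785} \end{bmatrix} \tag{1}
$$

where

$$
e_j = \begin{cases} 1, & \text{hit found in InterPro database} \\ 0, & \text{otherwise} \end{cases} \tag{2}
$$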
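Steps 1 and 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it takes the list of InterPro accessions already reported for an enzyme (running IPRSCAN itself is outside its scope), it maps an accession such as IPR001970 to component 1970 exactly as in the example of Step 1, and it silently ignores accessions whose number falls outside 1 to 7785. The names `fund_vector` and `is_naught` are hypothetical.

```python
import re
from typing import Iterable, List

N_DOMAINS = 7785  # number of InterPro entries used as vector bases in the text

_ACCESSION = re.compile(r"IPR(\d{6})")


def fund_vector(hit_accessions: Iterable[str], n_domains: int = N_DOMAINS) -> List[int]:
    """Build the binary FunD vector of eqs 1 and 2 from InterPro hit accessions."""
    vector = [0] * n_domains
    for accession in hit_accessions:
        match = _ACCESSION.fullmatch(accession)
        if match is None:
            raise ValueError(f"not an InterPro accession: {accession!r}")
        j = int(match.group(1))      # e.g. IPR001970 -> j = 1970
        if 1 <= j <= n_domains:      # assumption: out-of-range entries are skipped
            vector[j - 1] = 1        # component e_j, stored at zero-based index j - 1
    return vector


def is_naught(vector: List[int]) -> bool:
    """True when no hit was found, i.e. the vector is all zeros."""
    return not any(vector)


if __name__ == "__main__":
    e = fund_vector(["IPR001970"])
    print(len(e), e[1969], is_naught(e))
```

An enzyme for which `is_naught` returns true has no usable representation in this space, which is the situation the pseudo amino acid composition is introduced to handle.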
Defined in this way, an enzyme will correspond to a 7785D vector E with each of the 7785 functional domain patterns as a base for the vector space. In other words, rather than the 20D space of the amino acid composition approach as often used by many previous investigators,4-7,13-15 an enzyme is now represented in terms of the functional domain composition. By doing so, not only some sequence-related features but also some function-related features are naturally incorporated in the representation.

Step 3. If no hit is found for the entire InterPro database, the enzyme E formulated by eq 1 will correspond to a naught vector. To cope with such a circumstance, the enzyme is instead defined in the (20 + λ)D PseAA (Pseudo Amino Acid composition) space,16 as given below
$$
E = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{20} \\ x_{20+1} \\ \vdots \\ x_{20+\lambda} \end{bmatrix} \tag{3}
$$

where $x_1, x_2, \ldots, x_{20}$ represent the 20 components of the classical amino acid composition,5,14 while $x_{20+1}$ is the first-tier sequence order correlation factor, $x_{20+2}$ the second-tier sequence order correlation factor, and so forth.16 It is the additional λ components in eq 3 that incorporate some sequence-coupling effects into the vector representation of an enzyme. Generally speaking, the larger the number λ, the more sequence-coupling effects are incorporated. However, λ cannot exceed the length of an enzyme (i.e., the total number of its constituent residues). Also, if λ is too large, the overall success rate by jackknife tests might decrease owing to the reduction of the cluster-tolerant capacity.17 Therefore, λ may have different optimal values for different training datasets. For the current study, the optimal value of λ is 37. Given an enzyme, the (20 + 37) = 57 pseudo amino acid components in eq 3 can be derived by following the procedures described in the paper16 that originally introduced the concept of pseudo amino acid composition. Thus, any enzyme that corresponds to a naught vector in the 7785D FunD space (eq 2) can always be explicitly defined in the 57D PseAA space (eq 3). The concept of pseudo amino acid composition has also been used recently by other investigators for predicting protein subcellular location18 and membrane protein type.19,20

The prediction was performed with the ISort (Intimate Sorting) predictor, which can be described as follows. Suppose there are N enzymes ($E_1, E_2, \ldots, E_N$) which have been classified into categories 1, 2, ..., µ. Now, for a query enzyme E, how can we predict which subclass it belongs to? To deal with this problem, let us define the following scale to measure the similarity between E and $E_i$:

$$
\Lambda(E, E_i) = \frac{E \cdot E_i}{\|E\| \, \|E_i\|}, \quad (i = 1, 2, \ldots, N) \tag{4}
$$

where $E \cdot E_i$ is the dot product of vectors E and $E_i$, and $\|E\|$ and $\|E_i\|$ are their respective moduli. Obviously, when $E \equiv E_i$, we have $\Lambda(E, E_i) = 1$, meaning they have perfect or 100% similarity. Because none of the vector components is negative, the similarity is always within the range of 0 and 1; i.e., $0 \le \Lambda(E, E_i) \le 1$. Accordingly, the ISort predictor can be formulated as follows. If the similarity between E and $E_k$ (k = 1, 2, ..., or N) is the highest, i.e.,

$$
\Lambda(E, E_k) = \mathrm{Max}\{\Lambda(E, E_1), \Lambda(E, E_2), \ldots, \Lambda(E, E_N)\} \tag{5}
$$

where the operator Max means taking the maximum among those in the braces, then the query enzyme E is predicted to belong to the same category as $E_k$. If there is a tie, the query enzyme may not be uniquely determined, but such cases rarely occur. The ISort predictor is particularly useful when the distributions of the samples are unknown.

During the course of prediction, the following self-consistency principle should be followed. If a query enzyme could be defined in the 7785D FunD space (eq 1), then the prediction should be carried out based on those enzymes in the training set that could also be defined in the same 7785D FunD space. If the query enzyme in the 7785D FunD space was a naught vector and hence must be defined instead in the (20 + λ)D PseAA space (see eq 3), then the prediction should be conducted according to the principle that all the proteins in the training set be defined in the same (20 + λ)D PseAA space as well. Accordingly, the current ISort predictor actually consists of two subpredictors: (1) the ISort-7785D FunD predictor that operates in the 7785D FunD space, and (2) the ISort-57D PseAA predictor that operates in the 57D PseAA space with λ = 37. The entire process is called the FunD-PseAA hybridization approach.

IV. Results and Discussion

The computation was performed on a Silicon Graphics IRIS Indigo workstation (Elan 4000). For the enzymes listed in Supporting Information A, B, C, D, E, and F, we obtained the following results, respectively, according to Steps 1-3 of section III (Table 1): (1) of the 1697 oxidoreductases, 1510 had hits and hence were defined in the 7785D FunD space, and the remaining 187 were defined in the 57D PseAA space; (2) of the 3582 transferases, 3439 had hits and were defined in the 7785D FunD space, and 143 were defined in the 57D PseAA space; (3) of the 2902 hydrolases, 2694 were defined in the 7785D FunD space, and 208 in the 57D PseAA space; (4) of the 939 lyases, 895 were defined in the 7785D FunD space, and 44 in the 57D PseAA space; (5) of the 503 isomerases, 465 were defined in the 7785D FunD space, and 38 in the 57D PseAA space; (6) of the 840 ligases, 827 were defined in the 7785D FunD space, and 13 in the 57D PseAA space. This means that, if the definition of enzymes were based only on the functional domain database, a total of 10463 - 9830 = 633 enzymes in datasets SA, SB, SC, SD, SE, and SF would have no definition, leading to a failure in identifying their subfamily classes. That is why it is important to hybridize with the PseAA approach, by which not only can an enzyme always be defined but its sequence-coupling effects may also be considerably taken into account.16 Thus, the hybrid algorithm was operated according to the following procedure: if a query enzyme was defined in the 7785D FunD space, then the ISort-7785D FunD predictor was used to predict its subfamily class; otherwise, the ISort-57D PseAA predictor was used to make the prediction.

The demonstration was performed by the jackknife test. In statistical prediction, the single independent dataset test, the subsampling test, and the jackknife test are the three cross-validation methods often used to examine the power of a predictor.21 Of these three, the jackknife test is deemed the most objective and rigorous one and hence has been adopted by more and more investigators.14,15,18,22-27 The mathematical principle and a comprehensive discussion of this can be found in a monograph28 and a review paper,21 respectively. Accordingly, the real power of a predictor should be measured by the success rate of the jackknife test.

Table 1. Breakdown of the Enzyme Entries into the Group Defined in the 7785D FunD Space (eq 1) and the Group Defined in the 57D PseAA Space (eq 3)

| dataset | 7785D FunD space | 57D PseAA space | total |
|---|---|---|---|
| 1. oxidoreductase (a) | 1510 | 187 | 1697 |
| 2. transferase (b) | 3439 | 143 | 3582 |
| 3. hydrolase (c) | 2694 | 208 | 2902 |
| 4. lyase (d) | 895 | 44 | 939 |
| 5. isomerase (e) | 465 | 38 | 503 |
| 6. ligase (f) | 827 | 13 | 840 |

(a) From Supporting Information A. (b) From Supporting Information B. (c) From Supporting Information C. (d) From Supporting Information D. (e) From Supporting Information E. (f) From Supporting Information F.

Table 2. Success Rates in Identifying 21 Subclasses of Oxidoreductases by Jackknife Test

| subfamily class | no. of enzymes | no. of correct predictions | success rate (%) |
|---|---|---|---|
| (1) Acting on the CH-OH group of donors | 407 | 390 | 95.82 |
| (2) Acting on the aldehyde or oxo group of donors | 128 | 109 | 85.16 |
| (3) Acting on the CH-CH group of donors | 119 | 97 | 81.51 |
| (4) Acting on the CH-NH2 group of donors | 71 | 64 | 90.14 |
| (5) Acting on CH-NH group of donors | 95 | 78 | 82.11 |
| (6) Acting on NADH or NADPH | 262 | 222 | 84.73 |
| (7) Acting on other nitrogenous compounds as donors | 40 | 28 | 70.00 |
| (8) Acting on a sulfur group of donors | 67 | 61 | 91.04 |
| (9) Acting on a heme group of donors | 80 | 65 | 81.25 |
| (10) Acting on diphenols and related substances as donors | 45 | 40 | 88.89 |
| (11) Acting on peroxide as donors | 52 | 47 | 90.38 |
| (12) Acting on hydrogen as donors | 22 | 14 | 63.64 |
| (13) Acting on single donors with incorporation of molecular oxygen | 44 | 32 | 72.73 |
| (14) Acting on paired donors, with incorporation of molecular oxygen | 127 | 110 | 86.61 |
| (15) Acting on superoxide as acceptor | 31 | 29 | 93.55 |
| (16) Oxidizing metal ions | 12 | 11 | 91.67 |
| (17) Acting on CH or CH2 | 44 | 41 | 93.18 |
| (18) Acting on iron-sulfur proteins as donors | 25 | 19 | 76.00 |
| (19) Acting on phosphorus or arsenic in donors | 6 | 2 | 33.33 |
| (20) Acting on X-H and Y-H to form an X-Y bond | 8 | 3 | 37.50 |
| (21) Other oxidoreductases | 12 | 7 | 58.33 |
| overall | 1697 | 1469 | 86.56 |

Table 3. Success Rates in Identifying 8 Subclasses of Transferases by Jackknife Test

| subfamily class | no. of enzymes | no. of correct predictions | success rate (%) |
|---|---|---|---|
| (1) Transferring one-carbon groups | 438 | 418 | 95.43 |
| (2) Transferring aldehyde or ketonic groups | 34 | 34 | 100 |
| (3) Acyltransferase | 280 | 257 | 91.79 |
| (4) Glycosyl-transferase | 390 | 367 | 94.10 |
| (5) Transferring alkyl or aryl groups (other than methyl groups) | 280 | 274 | 97.85 |
| (6) Transferring nitrogenous groups | 104 | 103 | 99.04 |
| (7) Transferring phosphorus-containing groups | 2002 | 1984 | 99.10 |
| (8) Transferring sulfur-containing groups | 54 | 51 | 94.44 |
| overall | 3582 | 3488 | 97.38 |

Table 4. Success Rates in Identifying 8 Subclasses of Hydrolases by Jackknife Test

| subfamily class | no. of enzymes | no. of correct predictions | success rate (%) |
|---|---|---|---|
| (1) Acting on ester bonds | 1064 | 1005 | 94.45 |
| (2) Glycosylases | 476 | 469 | 98.52 |
| (3) Acting on ether bonds | 16 | 9 | 56.25 |
| (4) Acting on peptide bonds (peptidases) | 442 | 426 | 96.38 |
| (5) Acting on carbon-nitrogen bonds other than peptide | 367 | 354 | 96.46 |
| (6) Acting on acid anhydrides | 519 | 499 | 96.15 |
| (7) Acting on carbon-carbon bonds | 3 | 2 | 66.67 |
| (8) Acting on halide bonds | 15 | 11 | 73.33 |
| overall | 2902 | 2775 | 95.62 |

Table 5. Success Rates in Identifying 6 Subclasses of Lyases by Jackknife Test

| subfamily class | no. of enzymes | no. of correct predictions | success rate (%) |
|---|---|---|---|
| (1) Carbon-carbon lyases | 326 | 310 | 95.09 |
| (2) Carbon-oxygen lyases | 450 | 445 | 98.89 |
| (3) Carbon-nitrogen lyases | 52 | 50 | 96.15 |
| (4) Carbon-sulfur lyases | 25 | 19 | 76.00 |
| (5) Phosphorus-oxygen lyases | 53 | 53 | 100 |
| (6) Other lyases | 33 | 31 | 93.94 |
| overall | 939 | 908 | 96.70 |

Table 6. Success Rates in Identifying 6 Subclasses of Isomerases by Jackknife Test

| subfamily class | no. of enzymes | no. of correct predictions | success rate (%) |
|---|---|---|---|
| (1) Racemases and epimerases | 95 | 90 | 94.74 |
| (2) Cis-trans-isomerases | 78 | 78 | 100 |
| (3) Intramolecular oxidoreductases | 176 | 164 | 93.18 |
| (4) Intramolecular transferases | 76 | 72 | 94.74 |
| (5) Intramolecular lyases | 8 | 2 | 25.00 |
| (6) Other isomerases | 70 | 70 | 100 |
| overall | 503 | 476 | 94.63 |

Table 7. Success Rates in Identifying 6 Subclasses of Ligases by Jackknife Test

| subfamily class | no. of enzymes | no. of correct predictions | success rate (%) |
|---|---|---|---|
| (1) Forming carbon-oxygen bonds | 404 | 404 | 100 |
| (2) Forming carbon-sulfur bonds | 34 | 33 | 97.06 |
| (3) Forming carbon-nitrogen bonds | 341 | 333 | 97.65 |
| (4) Forming carbon-carbon bonds | 14 | 12 | 85.71 |
| (5) Forming phosphoric ester bonds | 37 | 32 | 86.49 |
| (6) Forming nitrogen-metal bonds | 10 | 8 | 80.00 |
| overall | 840 | 822 | 97.86 |
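The ISort rule (eqs 4 and 5) and the jackknife test described above can be sketched together in a few lines of Python. This is a minimal sketch, not the authors' program: the function names are hypothetical, ties are resolved by taking the first training sample that reaches the maximum similarity (the text leaves ties undetermined), and a zero vector is given similarity 0 as an assumption. All vectors passed in are assumed to live in one and the same space, so under the self-consistency principle the routine would be run separately for the FunD group and for the PseAA group.

```python
import math
from typing import List, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Eq 4: dot product divided by the product of the two moduli."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a naught vector is similar to nothing
    return dot / (norm_a * norm_b)


def isort_predict(query: Sequence[float],
                  train_vectors: List[Sequence[float]],
                  train_labels: List[str]) -> str:
    """Eq 5: return the label of the most similar training sample."""
    best = max(range(len(train_vectors)),
               key=lambda i: similarity(query, train_vectors[i]))
    return train_labels[best]


def jackknife_success_rate(vectors: List[Sequence[float]],
                           labels: List[str]) -> float:
    """Leave each sample out in turn and predict it from all the others."""
    correct = 0
    for i in range(len(vectors)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if isort_predict(vectors[i], rest_vectors, rest_labels) == labels[i]:
            correct += 1
    return correct / len(vectors)


if __name__ == "__main__":
    toy_vectors = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
    toy_labels = ["a", "a", "b", "b"]
    print(jackknife_success_rate(toy_vectors, toy_labels))
```

Because every sample is compared with every other one, the cost of the jackknife loop grows quadratically with the dataset size, which is unproblematic for datasets of a few thousand entries.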
The jackknife cross-validation results obtained by the current FunD-PseAA predictor for the datasets SA, SB, SC, SD, SE, and SF are given in Tables 2, 3, 4, 5, 6, and 7, respectively. As shown in these tables, the overall success rates are 86.56% for identifying oxidoreductases among their 21 subfamily classes, 97.38% for transferases among their 8 subfamily classes, 95.62% for hydrolases among their 8 subfamily classes, 96.70% for lyases among their 6 subfamily classes, 94.63% for isomerases among their 6 subfamily classes, and 97.86% for ligases among their 6 subfamily classes. Recently, Cai et al.29 also attempted to predict enzyme subfamily classification by SVM (support vector machines) and obtained some encouraging results as well. However, the success rates they reported were not derived from the jackknife test, the most rigorous and objective cross-validation procedure compared with the subsampling test and the independent dataset test.21 Furthermore, no cutoff procedure was applied to their datasets to winnow away samples with high sequence identity so as to avoid homologous bias.
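The overall rates quoted above follow directly from the "overall" rows of Tables 2-7 as correct predictions divided by dataset size. The short check below reproduces them; the dictionary name `RESULTS` and the helper `success_rate` are introduced here for illustration only, and the counts are copied from the tables.

```python
# (correct predictions, dataset size) taken from the "overall" rows of Tables 2-7
RESULTS = {
    "oxidoreductases": (1469, 1697),
    "transferases": (3488, 3582),
    "hydrolases": (2775, 2902),
    "lyases": (908, 939),
    "isomerases": (476, 503),
    "ligases": (822, 840),
}


def success_rate(correct: int, total: int) -> float:
    """Success rate in percent, rounded to two decimals as in the tables."""
    return round(100.0 * correct / total, 2)


if __name__ == "__main__":
    for name, (correct, total) in RESULTS.items():
        print(f"{name}: {success_rate(correct, total):.2f}%")
    # total number of enzymes across the six datasets
    print(sum(total for _, total in RESULTS.values()))
```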
V. Conclusion

In a previous study,1 we investigated how to identify whether a newly found protein sequence is an enzyme and, if it is, which main enzyme class it belongs to. To understand the molecular mechanism and acting object of an enzyme at a deeper level, it is equally important to identify the subclass attribute of the query enzyme. With the explosion of protein sequences entering into databanks, the urgency to extend the study to cover the subclass level is self-evident. The FunD-PseAA predictor developed here is very powerful in identifying the subclasses of enzymes. In classification prediction, the more classes there are to be identified, the lower the success rate tends to be. With the FunD-PseAA predictor, even for the case of identifying the 21 subclasses of oxidoreductases, the overall jackknife success rate can reach above 86%; for the other 5 main enzyme classes, which are classified into 6-8 subclasses, the corresponding rates are as high as 94-97%. These rates are very encouraging because they were derived by jackknife cross-validation on stringent datasets in which none of the entries had ≥40% sequence identity with any other. In particular, enzyme function is much less conserved than anticipated; i.e., the threshold for sequence similarity that implies similarity in enzymatic function is much higher than that for similarity in protein structure.11 An approach based simply on sequence similarity would hardly obtain a decent success rate even for a dataset consisting of samples with ≥50% sequence identity. This is because the function of an enzyme is extremely complicated and may involve very subtle structural details as well as many other physicochemical factors. The reason the current FunD-PseAA predictor can yield such high success rates is that it catches the core features of the statistical samples concerned.
With the explosion of protein sequences entering into databanks and the relatively much slower process in determining their enzymatic attributes by biochemical experiments, the FunD-PseAA predictor may become a useful high throughput tool for proteomics and bioinformatics.
Acknowledgment. The work was partly supported by Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, and Shanghai Centre for Bioinformatics Technology. The authors wish to thank the two anonymous reviewers, whose comments were very helpful in strengthening the presentation of this paper.
Supporting Information Available: Accession numbers of the proteins in datasets SA, SB, SC, SD, SE, and SF, and enzymes listed A-F. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Chou, K. C.; Cai, Y. D. Protein Sci. 2004, 13, 2857-2863.
(2) Cai, Y. D.; Chou, K. C. J. Proteome Res. 2005, 4, 109-111.
(3) Chou, K. C.; Elrod, D. W. J. Proteome Res. 2003, 2, 183-190.
(4) Nakashima, H.; Nishikawa, K.; Ooi, T. J. Biochem. 1986, 99, 152-162.
(5) Chou, J. J.; Zhang, C. T. J. Theor. Biol. 1993, 161, 251-262.
(6) Chou, K. C.; Zhang, C. T. J. Biol. Chem. 1994, 269, 22014-22020.
(7) Chou, K. C. Proteins: Struct., Funct., Genet. 1995, 21, 319-344.
(8) Chou, K. C. Bioinformatics 2005, 21, 10-19.
(9) Webb, E. C. Enzyme Nomenclature; Academic Press: San Diego, 1992.
(10) Wang, G.; Dunbrack, R. L., Jr. Bioinformatics 2003, 19, 1589-1591.
(11) Rost, B. J. Mol. Biol. 2002, 318, 595-608.
(12) Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Birney, E.; Biswas, M.; Bucher, P.; Cerutti, L.; Corpet, F.; Croning, M. D. R.; Durbin, R.; Falquet, L.; Fleischmann, W.; Gouzy, L.; Hermjakob, H.; Hulo, N.; Jonassen, I.; Kahn, D.; Kanapin, A.; Karavidopoulou, Y.; Lopez, R.; Marx, B.; Mulder, N. J.; Oinn, T. M.; Pagni, M.; Servant, F.; Sigrist, C. J. A.; Zdobnov, E. M. Nucleic Acids Res. 2001, 29, 37-40.
(13) Chou, P. Y. In Prediction of Protein Structure and the Principles of Protein Conformation; Fasman, G. D., Ed.; Plenum Press: New York, 1989; pp 549-586.
(14) Zhou, G. P. J. Protein Chem. 1998, 17, 729-738.
(15) Zhou, G. P.; Assa-Munt, N. Proteins: Struct., Funct., Genet. 2001, 44, 57-59.
(16) Chou, K. C. Proteins: Struct., Funct., Genet. (Erratum: ibid., 2001, Vol. 44, 60) 2001, 43, 246-255.
(17) Chou, K. C. Biochem. Biophys. Res. Commun. 1999, 264, 216-224.
(18) Pan, Y. X.; Zhang, Z. Z.; Guo, Z. M.; Feng, G. Y.; Huang, Z. D.; He, L. J. Protein Chem. 2003, 22, 395-402.
(19) Wang, M.; Yang, J.; Liu, G. P.; Xu, Z. J.; Chou, K. C. Protein Eng., Design, Select. 2004, 17, 509-516.
(20) Wang, M.; Yang, J.; Xu, Z. J.; Chou, K. C. J. Theor. Biol. 2004, 232, 7-15.
(21) Chou, K. C.; Zhang, C. T. Crit. Rev. Biochem. Mol. Biol. 1995, 30, 275-349.
(22) Yuan, Z. FEBS Lett. 1999, 451, 23-26.
(23) Feng, Z. P. Biopolymers 2001, 58, 491-499.
(24) Hua, S.; Sun, Z. Bioinformatics 2001, 17, 721-728.
(25) Luo, R. Y.; Feng, Z. P.; Liu, J. K. Eur. J. Biochem. 2002, 269, 4219-4225.
(26) Zhou, G. P.; Doctor, K. Proteins: Struct., Funct., Genet. 2003, 50, 44-48.
(27) Chou, K. C.; Cai, Y. D. Biochem. Biophys. Res. Comm. (Corrigendum: ibid., 2005, 329, 1362) 2004, 321, 1007-1009.
(28) Mardia, K. V.; Kent, J. T.; Bibby, J. M. Multivariate Analysis; Academic Press: London, 1979; Chapter 11 Discriminant Analysis; Chapter 12 Multivariate analysis of variance; Chapter 13 cluster analysis; pp. 322-381.
(29) Cai, C. Z.; Han, L. Y.; Ji, Z. L.; Chen, Y. Z. Proteins: Struct., Funct., Bioinform. 2004, 55, 66-76.