
Genome-Wide Profile of Oxidoreductases in Viruses, Prokaryotes, and Eukaryotes

Richard Kho, Joseph V. Newman, Richard M. Jack, Hugo O. Villar, and Mark R. Hansen*

Triad Therapeutics, Inc., 9381 Judicial Drive, San Diego, California 92121

Received July 9, 2003

Enzymes that utilize nicotinamide adenine dinucleotide (NAD) or its 2′-phosphate derivative (NADP) are found throughout the kingdoms of life. These enzymes are fundamental to many biochemical pathways, including central intermediary metabolism and mechanisms for cell survival and defense. The complete genomes of 25 organisms representing bacteria, protists, fungi, plants, and animals, and 811 viruses, were mined to identify and classify NAD(P)-dependent enzymes. An average of 3.4% of the proteins in these genomes was categorized as NAD(P)-utilizing proteins, with highest prevalence in the medium-chain oxidoreductase and short-chain oxidoreductase families. In general, the distribution of these enzymes by oxidoreductase family was correlated to the number of different catalytic mechanisms in each family. Organisms with smaller genomes encoded a larger proportion of NAD(P)-dependent enzymes in their proteome (∼6%) as compared to the larger genomes of eukaryotes (∼3%). Among viruses, those with large, double-strand DNA genomes were shown to encode oxidoreductases. Gram-positive and Gram-negative bacteria showed some differences in the distribution of NAD(P)-dependent proteins. Several organisms such as M. tuberculosis, P. falciparum, and A. thaliana showed unique distributions of oxidoreductases corresponding to some phenotypic features.

Keywords: comparative genome analysis • nicotinamide adenine dinucleotide • NAD • oxidoreductase • chemogenomics

Introduction

Nicotinamide adenine dinucleotide (NAD) and its 2′-phosphate derivative (NADP) are cofactors found in numerous biochemical electron-transfer reactions. Proteins that utilize NAD(P) are typically oxidoreductases and represent ∼4.5% of all sequences in the Swiss-Prot protein database.1 Electron-transfer reactions involving NAD(P) are fundamental to energy production and central intermediary metabolism, secondary metabolism, and xenobiotic degradation.2-8 NAD(P) is also essential as a substrate in processes such as ADP-ribosylation and DNA repair,9,10 can act as an intracellular signal,11 and has even been implicated as a direct antioxidant that sequesters oxygen and nitrogen radical species.12 Because of their wide distribution and importance to a variety of pathways, NAD(P)-dependent proteins represent points for drug intervention to alleviate disorders that can be addressed by modulation of these enzymes, or for infectious diseases treatable by inactivation of enzymes essential to pathogenicity. Numerous oxidoreductases such as 3-hydroxy-3-methylglutaryl-CoA reductase13,14 and dihydrofolate reductase15 have been successfully targeted for a variety of conditions including hypercholesterolemia and cancer, respectively, and continue to be of interest for other indications as well.16-19 The identification, categorization, and determination of the distribution of NAD(P)-dependent enzymes in complete genomes is the focus of this study.

* To whom correspondence should be addressed. Phone: 858-457-0100. Fax: 858-457-7893.

Journal of Proteome Research 2003, 2, 626-632

Published on Web 09/11/2003

In this article, we build on our previous work20 and apply it to a large number of organisms. Methods employing pairwise sequence similarities among proteins in a dataset allowed us to subdivide oxidoreductase sequences into 94 sequence families that correlate with structural fold, which in turn correlates with NAD(P)-binding mode in all cases where structural information was available.20 Each oxidoreductase family contains one or more sequence families that show clear evolutionary relationships based on structure and function. In the present study, the genomes of 25 organisms representing all five kingdoms of life, and 811 viral genomes, were mined for NAD(P)-utilizing enzymes that belong to these families. The genome mining method utilizes a reference set of biologically relevant sequence clusters developed using transitive homology relationships.20-23 The results were compared and found to be consistent with those obtained using Pfam hidden Markov models and the HMMER software package for families that have representatives in the Pfam database.24,25 Understanding the distribution of oxidoreductases in complete genomes can guide drug target selection and drug discovery efforts and provide insight into the functions of these enzymes.

10.1021/pr034051h CCC: $25.00 © 2003 American Chemical Society

research articles

Methods

Reference Datasets and Genomic Data. Protein sequence data were obtained from the Swiss-Prot and TrEMBL databases (Releases 40 and 22, respectively, Jan. 2003; http://www.expasy.org/sprot/).1 Protein structures and their domain classifications were retrieved from the Protein Data Bank (http://www.rcsb.org/pdb/)26 and the Structural Classification of Proteins (SCOP) database (http://scop.mrc-lmb.cam.ac.uk/scop/).27 Available Pfam hidden Markov models representing the NAD(P)-utilizing proteins were obtained from the Pfam Server (Release 7.7b; http://pfam.wustl.edu/).24 Enzyme Commission (EC) numbers of protein functions were taken from the ExPASy Enzyme database (http://www.expasy.org/enzyme/).28 The publicly available genomic and protein sequence data of completed genomes were obtained from the “Entrez Genome” system at the National Center for Biotechnology Information (http://www.ncbi.nih.gov/Entrez/),29 with the following exceptions: the Arabidopsis thaliana proteome was from the Arabidopsis Information Resource (http://www.arabidopsis.org/),30 Drosophila melanogaster from the Berkeley Drosophila Genome Project (http://www.fruitfly.org/),31 Plasmodium falciparum from PlasmoDB (http://plasmodb.org/),32 Saccharomyces cerevisiae from the Saccharomyces Genome Database (http://www.yeastgenome.org/),33 Schizosaccharomyces pombe,34 Yersinia pestis,35 and Caenorhabditis elegans36 from the Sanger Institute (http://www.sanger.ac.uk/), and Mus musculus from Ensembl (Feb. 2002 freeze; http://www.ensembl.org/Mus_musculus/).37 Genscan predicted protein sequences of the human genome were acquired from the UC Santa Cruz Genome Bioinformatics Server (“genscanPep.txt.gz”; Nov. 2002 freeze, Dec. 20, 2002 update; http://genome.ucsc.edu/).38

Sequence Clustering and Genome Mining for Assignment of NAD(P)-Dependent Proteins. Sequence clustering of the NAD(P)-dependent proteins was performed using methods analogous to those previously described,20 a brief outline of which follows. Proteins that were annotated in the Swiss-Prot and TrEMBL databases as NAD(P)-dependent proteins were retrieved. Removal of redundant and incomplete sequences resulted in 4020 total sequences.
A binary string (or vector) that we call a protein key was generated to represent the similarity of a given protein to each sequence in the entire dataset. On and off bits represented homologous and nonhomologous proteins based on comparisons performed using the BLAST algorithm,39 with an empirically derived E-value cutoff of 10⁻² as a threshold for homology for sequence clustering purposes. Utilization of E-values in the range of 1 to 10⁻⁷, and algorithms other than BLAST such as FASTA40 and Smith-Waterman,41 did not improve clustering results. The protein keys allow identification of homologues based on a network of sequence neighbors that each protein shares, rather than on a single protein-protein sequence comparison. Approaches that group sequences based on such transitive relationships have been shown to yield biochemically relevant structural classifications when sufficiently large datasets are studied.20-23 The collection of protein keys served as input for a divisive hierarchical clustering algorithm as implemented in the TWINS algorithm.42 The resulting clusters were used for genome mining as described below.

In the genome mining step of the procedure, a protein key for each sequence in the query genomes was generated in a fashion analogous to the clustering step described above. Each bit in the protein key represented the given protein’s homology (on bit; BLAST E-value within the 10⁻² cutoff) to proteins in the reference dataset. A vector comparison algorithm was devised that uses Tanimoto coefficients43 to determine membership (if any) of a given query sequence (or protein key) in one of the clusters of the reference dataset. Given the highest scoring cluster, if a query sequence had a Tanimoto coefficient greater than or equal to 0.1 to more than 10% of the cluster, it was considered a member. These thresholds were empirically determined to minimize false positives while maximizing true positives as described below.

To derive the cutoffs and optimize parameters, two sets of sequences were selected from Swiss-Prot for testing with either the vector comparison algorithm or with Pfam HMMs and HMMER: (1) a set of 1000 random sequences from Swiss-Prot that were not annotated as using “NAD” or “NADP”, and (2) a set of 1000 random NAD(P)-dependent enzymes based on Swiss-Prot annotation. In the optimal case, the vector comparison algorithm misassigned 20 out of 1000 from set 1 as hits and correctly assigned 960 out of 1000 from set 2 as hits, resulting in a specificity of 0.980 and sensitivity of 0.960. Using the Pfam HMMs, there were 14 hits from set 1 and 911 hits from set 2, giving a specificity of 0.986 and sensitivity of 0.911. Nine of the false positives (hits to set 1) were common to both methods. These data are consistent with the fact that homology search algorithms become less specific when sensitivity is increased, i.e., more true hits usually also lead to more false hits.44,45 The vector comparison algorithm was used for the genome mining studies because it had improved sensitivity and comparable specificity while taking about 3-fold less computation time.
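The membership rule and the benchmark arithmetic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the set-based representation of a protein key, the "at or below the cutoff" reading of the E-value threshold, and the choice to rank clusters by the fraction of members that meet the Tanimoto threshold are all assumptions, and the toy clusters are hypothetical data.

```python
from typing import Dict, List, Optional, Set, Tuple

# A protein key is modeled here as the set of reference-sequence indices
# whose bit is "on" (assumption: the paper describes a binary string).
ProteinKey = Set[int]


def protein_key(evalues: Dict[int, float], cutoff: float = 1e-2) -> ProteinKey:
    """Build a protein key from BLAST E-values against the reference set.

    Assumption: a bit is on when the E-value is at or below the cutoff.
    """
    return {index for index, evalue in evalues.items() if evalue <= cutoff}


def tanimoto(a: ProteinKey, b: ProteinKey) -> float:
    """Tanimoto coefficient: shared on bits divided by the union of on bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def assign_cluster(
    query: ProteinKey,
    clusters: Dict[str, List[ProteinKey]],
    min_tanimoto: float = 0.1,
    min_fraction: float = 0.10,
) -> Optional[str]:
    """Return the best cluster if the query resembles more than 10% of it."""
    best_name, best_fraction = None, 0.0
    for name, members in clusters.items():
        if not members:
            continue
        similar = sum(1 for key in members if tanimoto(query, key) >= min_tanimoto)
        fraction = similar / len(members)
        # Assumption: "highest scoring" means the largest similar fraction.
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name if best_fraction > min_fraction else None


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical reference clusters, for illustration only.
clusters = {
    "cluster_A": [{0, 1, 2, 3}, {1, 2}, {7, 8}],
    "cluster_B": [{9}, {10, 11}],
}
query = protein_key({0: 1e-30, 1: 1e-5, 2: 5e-3, 3: 0.5})
assigned = assign_cluster(query, clusters)

# Benchmark counts reported in the text for the vector comparison algorithm.
sens, spec = sensitivity_specificity(tp=960, fn=40, tn=980, fp=20)
```

With the counts reported in the text (960 of 1000 true hits recovered, 20 of 1000 negatives misassigned), the last line evaluates to a sensitivity of 0.960 and a specificity of 0.980.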

Results and Discussion

Content and Distribution of NAD(P)-Dependent Enzymes in Complete Genomes. The Swiss-Prot protein database, version 40 (Dec., 2002), contains ∼4.5% NAD(P)-dependent enzymes based on annotation.1 The database is representative of a broad range of organisms (about 49% eukaryotes, 39% bacteria, 7% viruses and 6% archaea), and as such, it was expected that the genomes analyzed in this study would also encode ∼4.5% NAD(P)-utilizing enzymes. However, in this analysis, the content of NAD(P)-dependent enzymes based on the total number of proteins in the 25 nonviral genomes was 3.4% (Table 1). The median value for the content of NAD(P)-utilizing enzymes in all genomes was 4.4%, ranging from 1.0% in P. falciparum to 6.6% in E. coli K12. Larger genomes typically had lower percentages of NAD(P)-utilizing enzymes than smaller genomes (see Table 1). This is consistent with the observation that most oxidoreductases are involved in primary and intermediary metabolism (vide supra). Bacteria contain the necessary metabolic, functional, and defense mechanisms in a relatively minimal genome,46,47 whereas eukaryotes have specialized systems beyond the basic metabolic and functional machinery such as signal transduction pathways and complex nervous systems, which result in decreased relative abundance of oxidoreductases. The total content of NAD(P)-dependent proteins in whole genomes conveys information about the rough genetic makeup of an organism, but greater detail can be obtained by examining the distributions in subclasses. Table 2 shows a percent-wise breakdown of NAD(P)-utilizing enzymes by oxidoreductase family for all of the genomes mined in this study.
The table highlights the most widely studied and best characterized of the oxidoreductase families, and represents 85% of all NAD(P)-dependent proteins in Swiss-Prot.20 The largest oxidoreductase families are the medium-chain and short-chain oxidoreductases, accounting for 34% of all NAD(P)-utilizing enzymes in the mined genomes. These enzymes comprise a significant portion of the NAD(P)-dependent proteins in intermediary

Table 1. NAD(P)-Dependent Protein Content in Complete Genomes Ordered by Organism Type and Number of Proteins in Genome

species             type        proteins in genome   no. of NAD(P)-utilizing proteins (a)   % NAD(P)-utilizing proteins
H. pylori           Gram −      1566                 75                                     4.8
C. jejuni           Gram −      1634                 84                                     5.1
H. influenzae       Gram −      1709                 89                                     5.2
V. cholerae         Gram −      3828                 163                                    4.3
Y. pestis           Gram −      3885                 243                                    6.3
E. coli K12         Gram −      4289                 281                                    6.6
S. typhimurium      Gram −      4600                 249                                    5.4
E. coli O157:H7     Gram −      5361                 284                                    5.2
P. aeruginosa       Gram −      5565                 355                                    6.4
M. genitalium       Gram +      480                  14                                     2.9
S. pneumoniae       Gram +      2094                 79                                     3.8
S. aureus           Gram +      2593                 132                                    5.1
C. acetobutylicum   Gram +      3672                 161                                    4.4
B. subtilis         Gram +      4100                 243                                    5.9
M. tuberculosis     Gram +      4187                 252                                    6.0
B. anthracis        Gram +      5311                 266                                    5.0
P. falciparum       protozoan   5334                 51                                     1.0
S. pombe            yeast       4968                 163                                    3.3
C. albicans         yeast       9168                 285                                    3.1
S. cerevisiae       yeast       9374                 271                                    2.9
D. melanogaster     fruit fly   14332                351                                    2.4
C. elegans          worm        21197                375                                    1.8
A. thaliana         plant       27288                823                                    3.0
M. musculus         mouse       28097 (b)            657                                    2.3
H. sapiens          human       43223 (b)            1461                                   3.4
total                           217855               7407                                   3.4 (c)

(a) Numbers of NAD(P)-utilizing proteins based on mining of complete genomes. (b) Number of mouse and human proteins from predicted translations of latest drafts. (c) Average of 3.4% based on total numbers from last row of table. Median for all genomes is 4.4%.
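The two summary figures in footnote c differ because one is a pooled ratio (total NAD(P)-utilizing proteins over total proteins) and the other is the median of the per-genome percentages. The short Python sketch below recomputes both from the counts in Table 1; the variable and function names are illustrative choices, not part of the original analysis.

```python
from statistics import median

# (proteins in genome, NAD(P)-utilizing proteins), transcribed from Table 1.
TABLE1 = {
    "H. pylori": (1566, 75),
    "C. jejuni": (1634, 84),
    "H. influenzae": (1709, 89),
    "V. cholerae": (3828, 163),
    "Y. pestis": (3885, 243),
    "E. coli K12": (4289, 281),
    "S. typhimurium": (4600, 249),
    "E. coli O157:H7": (5361, 284),
    "P. aeruginosa": (5565, 355),
    "M. genitalium": (480, 14),
    "S. pneumoniae": (2094, 79),
    "S. aureus": (2593, 132),
    "C. acetobutylicum": (3672, 161),
    "B. subtilis": (4100, 243),
    "M. tuberculosis": (4187, 252),
    "B. anthracis": (5311, 266),
    "P. falciparum": (5334, 51),
    "S. pombe": (4968, 163),
    "C. albicans": (9168, 285),
    "S. cerevisiae": (9374, 271),
    "D. melanogaster": (14332, 351),
    "C. elegans": (21197, 375),
    "A. thaliana": (27288, 823),
    "M. musculus": (28097, 657),
    "H. sapiens": (43223, 1461),
}


def percent(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


total_proteins = sum(proteins for proteins, _ in TABLE1.values())
total_nad = sum(nad for _, nad in TABLE1.values())

# Pooled average: dominated by the large eukaryotic genomes.
pooled_percent = percent(total_nad, total_proteins)

# Median of per-genome percentages: every genome counts equally.
per_genome = [round(percent(nad, proteins), 1) for proteins, nad in TABLE1.values()]
median_percent = median(per_genome)
```

Because the large eukaryotic genomes contribute most of the proteins, the pooled figure sits well below the per-genome median, which is the pattern the footnote reports.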

metabolism, which is consistent with a higher percentage of the oxidoreductases belonging to these groups. The results corroborate previous studies on the distribution of medium-chain and short-chain oxidoreductases performed with a smaller subset of genomes.48 The remainder of the oxidoreductase families is populated in an order roughly analogous to the number of different catalytic functions in each family. The disulfide oxidoreductases, for example, account for 6.1% of NAD(P)-utilizing enzymes, and include enzymes that catalyze 24 different E. C. functions, whereas the NADP-dependent catalase and dihydrofolate reductase groups each represent one catalytic function and about 1% of the NAD(P)-utilizing enzymes.20

Distribution of Oxidoreductases in Viruses. Viruses are characterized by minimal genomes with total nucleic acid content averaging about 25 kilobases (from NCBI’s Entrez Genome virus collection),29 with extremes from ∼1 kilobase49 to 800 kilobases.50 Out of 13 214 protein sequences from 811 viruses, there were 23 sequences identified as members of the oxidoreductase families. The very low number of oxidoreductases is not surprising, as the vast majority of viruses lack any kind of metabolic machinery.51 Interestingly, all 23 hits were from double-strand DNA (dsDNA) viruses having large genomes on the order of 200 to 300 kilobases. Furthermore, all but two of the viruses encoding these NAD(P)-dependent enzymes were vertebrate viruses belonging to the Poxviridae, Herpesviridae, and Phycodnaviridae families. The presence of oxidoreductases in dsDNA viruses, as opposed to the simpler single-strand DNA and RNA viruses, can be attributed to the level of molecular complexity of the different viruses. Double-strand DNA viruses are generally the largest and most complex in function, and therefore can afford to encode extra proteins to complement host enzymes.51-53

The distribution of the viral NAD(P)-dependent proteins by oxidoreductase family is also interesting. Ten of the twenty-three hits are short-chain oxidoreductase enzymes (43%), and seven of these ten are encoded by members of Poxviridae, including the cowpox and vaccinia viruses. The short-chain oxidoreductase proteins include hydroxysteroid dehydrogenases involved in the production of a virulence factor,54 and enzymes for formation of glycoproteins important in fusion with host cell receptors.51,55 Three proteins belonging to the dihydrofolate reductase family were identified in genomes of the Herpesviridae family. Herpes viruses were the first examples of mammalian viruses to encode this enzyme for de novo synthesis of purines.56 The most distinctive virus that encodes NAD(P)-dependent proteins was Paramecium bursaria Chlorella virus 1, a dsDNA virus of the Phycodnaviridae family that infects algae.55,57 This virus encoded more oxidoreductases than any other virus, with four of the five hits belonging to the medium-chain and short-chain oxidoreductase families. Two enzymes in the medium-chain oxidoreductase family were identified as UDP-glucose dehydrogenase and a homologue of D-lactate dehydrogenase, the former of which is involved in capsid protein glycosylation.57 The two enzymes of the short-chain oxidoreductase family were fucose synthase and GDP-D-mannose dehydratase. Having these two short-chain oxidoreductases, Chlorella virus is unique in its ability to synthesize its own nucleotide sugars for glycosylation, rather than depending on the host for GDP-sugars.57

Oxidoreductases in Gram-Positive Bacteria. The Gram-positive bacteria (M. genitalium, S. pneumoniae, S. aureus, C. acetobutylicum, B. subtilis, M. tuberculosis, and B. anthracis) contained an average of 4.7% NAD(P)-dependent enzymes in their genomes.
No dramatic differences in the distribution of NAD(P)-utilizing enzymes by family were seen among the Gram-positive bacteria, with the exceptions of M. tuberculosis and M. genitalium. There is a higher proportion of short-chain oxidoreductase enzymes in M. tuberculosis than in the other Gram-positives analyzed, as well as all other bacterial genomes mined in this study (see Table 2). Furthermore, although other bacteria have greater proportions of medium-chain oxidoreductase enzymes than short-chain oxidoreductases, the distribution is reversed in M. tuberculosis. The prevalence of short-chain oxidoreductases involved in fatty acid and sugar metabolism is consistent with the M. tuberculosis phenotype, which has a very complex cell envelope composed of complex fatty acids and glycolipids.58,59 In the case of the Gram-positive bacterium M. genitalium, which encodes the smallest known genome of any free-living organism,60 our analysis shows that it has a greater proportion of disulfide oxidoreductases than other genomes. Three of its 14 NAD(P)-dependent enzymes belong to the disulfide oxidoreductase family (21%), whereas the average content of disulfide oxidoreductases in all genomes is 6.1% (Table 2). The high proportion of disulfide oxidoreductases in this minimal genome affirms the importance of this group of enzymes. This is not surprising considering the role of disulfide oxidoreductases in regulation of redox status and carbohydrate metabolism. The other NAD(P)-dependent enzymes of M. genitalium are also involved in pathways necessary for survival as exemplified by the medium-chain and short-chain oxidoreductases,

Table 2. Distribution of NAD(P)-Dependent Proteins by Percent (a) in Each Oxidoreductase Family (b)

Column key: NUO, NADH:ubiquinone oxidoreductases; MC, medium-chain oxidoreductases; SC, short-chain oxidoreductases; CAT, NADP-dependent catalase; AKR, aldo-keto reductases; DHFR, dihydrofolate reductase; DSO, disulfide oxidoreductases; CPR, NADPH-Cyp450 reductase-like; ICD, isocitrate/isopropylmalate DH; HMG, Hmg-CoA reductase; ALDH, aldehyde DH; IMPDH, inosine-5′-monophosphate dehydrogenases; ME, NAD(P) malic enzyme.

genome              NUO   MC    SC    CAT   AKR   DHFR  DSO   CPR   ICD   HMG   ALDH  IMPDH ME
viruses             0     8.7   43    0     0     13    0     0     0     0     0     0     0
H. pylori           19    20    13    2.7   1.3   0     2.7   1.3   1.3   0     0     0     2.7
C. jejuni           18    20    15    1.2   0     0     3.6   0     1.2   0     1.2   1.2   1.2
H. influenzae       11    21    9     1.1   0     1.1   4.5   3.4   1.1   0     0     1.1   3.4
V. cholerae         7.4   23    9.2   0.6   1.2   0.6   6.1   6.7   0.6   0.6   3.1   1.2   1.8
Y. pestis           8.9   23    13    0.4   3.3   0.4   5.8   5.8   1.2   0.4   0.8   0.8   1.2
E. coli K12         13    21    9.3   0.4   3.2   0.4   5.3   4.3   1.1   0     2.5   0.7   1
S. typhimurium      13    20    10    0.8   3.2   0.4   5.2   4.8   0.8   0     2.4   0.8   1.2
E. coli O157:H7     14    19    11    0.4   3.9   0.4   5.6   3.9   1.1   0     2.1   0.7   1
P. aeruginosa       7.6   20    16    1.1   2.3   0.3   6.5   5.1   0.6   0     3.9   0.6   0.9
M. genitalium       0     21    7.1   0     0     7.1   21    0     0     0     0     0     0
S. pneumoniae       0     23    13    0     2.5   1.3   7.6   5.1   0     1.3   0     0     2.5
S. aureus           3.8   25    12    0.8   4.5   0.8   9.8   2.3   1.5   0.8   1.5   0.8   1.5
C. acetobutylicum   5     19    15    0     1.9   0.6   8.1   3.7   1.2   0     0.6   1.2   1.9
B. subtilis         2.5   23    17    1.2   3.3   0.4   6.2   2.5   1.2   0     0.8   1.6   1.2
M. tuberculosis     7     18    23    0     0.8   0.4   5.6   2.8   0.8   0     3.2   0.4   2
B. anthracis        5.6   26    11    1.5   3.4   0.4   8.3   1.9   0.8   0     0.4   1.5   2.3
P. falciparum       0     25    10    0     2     2     9.8   12    2     0     0     0     2
S. pombe            1.2   26    17    0.6   6.7   0.6   4.9   6.1   3.1   0.6   1.2   0.6   0.6
C. albicans         5.3   22    15    0.4   6     0.4   6     4.2   2.1   0.4   2.5   0.4   0.7
S. cerevisiae       0.7   28    8.9   1.5   4.1   0.7   6.3   5.9   4.4   0.7   4.4   0.4   1.9
D. melanogaster     7.4   13    17    0.6   3.1   0.3   2.8   2.3   2.3   0.3   2.6   1.7   0.9
C. elegans          5.9   13    24    1.3   4.3   0.3   4.3   1.9   1.6   0.3   2.4   0.3   0.5
A. thaliana         2.7   13    18    0.6   2.7   0.4   4.7   2.6   1.3   0.2   0.9   0.7   1
M. musculus         4.4   21    13    0.2   6.2   0.2   2.3   3.2   1.1   0.2   4     1.2   1.1
H. sapiens          2.6   16    3.8   0.1   1.8   0.5   8.4   6.2   0.5   0.1   0     0.3   1.2
average             6.4   20    14    0.6   2.7   1.3   6.1   3.7   1.3   0.2   1.6   0.7   1.4

(a) Percentages based on number of proteins in each family compared to total number of NAD(P)-utilizing proteins in each genome. (b) See ref 20 for detailed composition of these oxidoreductase families.


and by dihydrofolate reductase, which is involved in nucleic acid synthesis.61

Oxidoreductases in Gram-Negative Bacteria. The Gram-negative bacteria (H. pylori, C. jejuni, H. influenzae, V. cholerae, S. typhimurium, Y. pestis, E. coli K12, E. coli O157:H7, and P. aeruginosa) averaged 5.5% NAD(P)-dependent enzymes in their genomes. There was not a statistically significant difference compared to the total content of NAD(P)-utilizing proteins in Gram-positive bacteria (4.7%, with a range of 2.9 to 6.0%). There was, however, a difference between Gram-negative and Gram-positive bacteria in the percentage of NADH:ubiquinone oxidoreductase/Complex I enzymes of the respiratory chain. An average of 12.4% of all NAD(P)-dependent proteins in Gram-negative bacteria belonged to the NADH:ubiquinone oxidoreductase/Complex I family, whereas in Gram-positives, only an average of 3.4% were in this family. There are two possible explanations for the disparity. First, two members of the Gram-positive group, M. genitalium and S. pneumoniae, completely lack proteins of Complex I, hence lowering the average for the entire group. M. genitalium, as described above, has a minimal genome and lacks the respiratory chain for ATP synthesis,60 whereas S. pneumoniae is a lactic acid bacterium that relies on fermentation rather than oxidative phosphorylation for ATP production.62 Second, many of the Gram-negative bacteria have redundant mechanisms for electron transfer and respiration, leading to flexibility for dealing with environmental conditions.63 H. pylori and C. jejuni, for example, appear to have additional Fe-S clusters in Complex I that are absent from other bacteria,64 whereas E. coli has two different types of NADH dehydrogenase complexes.65

Oxidoreductases of Eukaryotes. Although significant differences in the overall content of NAD(P)-dependent proteins were not seen between the two types of bacteria, there was a significant difference between bacteria and eukaryotes, which only averaged 2.6% (range of 1.0 to 3.4%) of their genomes devoted to NAD(P)-dependent enzymes. In the case of the malaria parasite P. falciparum, which is evolutionarily distant from the other eukaryotic organisms, the unusually low proportion of NAD(P)-utilizing enzymes (1%) seems to be due to the fact that it has a smaller percentage of its genome devoted to enzymes, as opposed to structural proteins, signal peptides, and systems for evading host immune response.66 Most of the eukaryotes had distributions of NAD(P)-dependent proteins in proportions similar to the remainder of the genomes, including prokaryotes. The exception was an interesting distribution of medium-chain versus short-chain oxidoreductases in A. thaliana, C. elegans, and D. melanogaster. These three organisms each had a greater proportion of short-chain than medium-chain oxidoreductase enzymes, whereas the remainder of eukaryotes all had much higher percentages of the latter (Table 2). It appears that the prevalence of short-chain oxidoreductases is due to local gene duplications that were more frequent in this family than in the other types. All three of these organisms have been shown to have genes organized in tandem arrays,36,67,68 and gene duplications specific to short-chain oxidoreductases may have functional significance. In A. thaliana, for example, local tandem repeats have led to multiple genes encoding 12 homologues of a short-chain oxidoreductase enzyme, tropinone reductase, which is involved in plant secondary metabolism.67,69
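The group averages quoted in the bacterial and eukaryotic sections (5.5%, 4.7%, and 2.6%) can be recomputed from the percentage column of Table 1, as in the Python sketch below. The text does not say which statistical test was used, so the Welch t statistic here is only an assumed stand-in for comparing how far apart the groups are relative to their spread; it is not the authors' stated procedure, and the names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Percent NAD(P)-utilizing proteins per genome, from the last column of Table 1.
GRAM_NEGATIVE = [4.8, 5.1, 5.2, 4.3, 6.3, 6.6, 5.4, 5.2, 6.4]
GRAM_POSITIVE = [2.9, 3.8, 5.1, 4.4, 5.9, 6.0, 5.0]
EUKARYOTES = [1.0, 3.3, 3.1, 2.9, 2.4, 1.8, 3.0, 2.3, 3.4]
BACTERIA = GRAM_NEGATIVE + GRAM_POSITIVE


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances.

    Assumption: used here only as an illustrative measure of separation.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


group_means = {
    "Gram-negative": mean(GRAM_NEGATIVE),
    "Gram-positive": mean(GRAM_POSITIVE),
    "eukaryotes": mean(EUKARYOTES),
}

t_gram = welch_t(GRAM_NEGATIVE, GRAM_POSITIVE)
t_bacteria_vs_eukaryotes = welch_t(BACTERIA, EUKARYOTES)
```

Under this assumed test, the separation between bacteria and eukaryotes is several times larger than the separation between the two bacterial groups, which agrees in direction with the comparison drawn in the text.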


Conclusions

As continued efforts in genomics provide a constant influx of data, improved methods for interpretation and utilization of those data are becoming invaluable for new discoveries. A sequence-based classification scheme producing biochemically relevant groups with correlations to structural fold and ligand-binding mode provides an alternative means of leveraging the available information. Mining whole genomes to identify potential targets and classifying them into related protein families allows for characterization by prevalence of these proteins across genomes. Broad-spectrum antibiotic or anti-infective therapies can be pursued if target proteins of common function can be identified in a broad range of pathogens. Comparative analysis also helps to prioritize potential drug targets, while recognizing possible human homologues that could be associated with toxicity if inhibited. Another advantage of this classification scheme is that groups of proteins that display common ligand-binding modes can be simultaneously targeted for functional characterization, target identification, and inhibitor design.70,71 Development of chemical leads suitable for inhibition of one member of a group can provide useful starting points for targeting the remainder of druggable proteins in that group. An improvement of the classification might involve further subdivision of large oxidoreductase families such as the medium-chain and short-chain oxidoreductases to be able to identify specific functions and substrate specificities. This is a topic of continued investigation and will lead to further development of robust tools for chemogenomics and drug discovery.

References

(1) Bairoch, A.; Apweiler, R. Nucleic Acids Res. 2000, 28, 45-48.
(2) Kirschbaum, J. J. Chem. Educ. 1968, 45, 28-37.
(3) Rapoport, S. Essays Biochem. 1968, 4, 69-103.
(4) Griffiths, D. E. Essays Biochem. 1965, 1, 91-120.
(5) Seigler, D. S. Plant Secondary Metabolism; Chapman and Hall, Kluwer Academic Publishers: New York, 1999.
(6) Putter, J. Eur J. Drug Metab. Pharmacokinet. 1979, 4, 1-7.
(7) Danielson, P. B. Curr. Drug Metab. 2002, 3, 561-597.
(8) Holmgren, A. Antioxid. Redox Signal. 2000, 2, 811-820.
(9) Shall, S. Biochimie 1995, 77, 313-318.
(10) Timson, D. J.; Singleton, M. R.; Wigley, D. B. Mutat. Res. 2000, 460, 301-18.
(11) Ziegler, M. Eur J. Biochem. 2000, 267, 1550-1564.
(12) Kirsch, M.; De Groot, H. FASEB J. 2001, 15, 1569-1574.
(13) Lablanche, J. M. Curr. Med. Res. Opin. 2001, 16, 285-295.
(14) Simons, J. Fortune 2003, 147, 58-68.
(15) Schweitzer, B. I.; Dicker, A. P.; Bertino, J. R. FASEB J. 1990, 4, 2441-2452.
(16) Payne, D. J.; Miller, W. H.; Berry, V.; Brosky, J.; Burgess, W. J.; Chen, E.; DeWolf, W. E., Jr.; Fosberry, A. P.; Greenwood, R.; Head, M. S.; Heerding, D. A.; Janson, C. A.; Jaworski, D. D.; Keller, P. M.; Manley, P. J.; Moore, T. D.; Newlander, K. A.; Pearson, S.; Polizzi, B. J.; Qiu, X.; Rittenhouse, S. F.; Slater-Radosti, C.; Salyers, K. L.; Seefeld, M. A.; Smyth, M. G.; Takata, D. T.; Uzinskas, I. N.; Vaidya, K.; Wallis, N. G.; Winram, S. B.; Yuan, C. C.; Huffman, W. F. Antimicrob. Agents Chemother. 2002, 46, 3118-3124.
(17) Lell, B.; Ruangweerayut, R.; Wiesner, J.; Missinou, M. A.; Schindler, A.; Baranek, T.; Hintz, M.; Hutchinson, D.; Jomaa, H.; Kremsner, P. G. Antimicrob. Agents Chemother. 2003, 47, 735-738.
(18) Hedstrom, L. Curr. Med. Chem. 1999, 6, 545-560.
(19) Heath, R. J.; White, S. W.; Rock, C. O. Appl. Microbiol. Biotechnol. 2002, 58, 695-703.
(20) Kho, R.; Baker, B. L.; Newman, J. V.; Jack, R. M.; Sem, D. S.; Villar, H. O.; Hansen, M. R. Proteins 2003, 50, 589-599.
(21) Park, J.; Teichmann, S. A.; Hubbard, T.; Chothia, C. J. Mol. Biol. 1997, 273, 349-354.
(22) Gerstein, M. Bioinformatics 1998, 14, 707-714.
(23) Bolten, E.; Schliep, A.; Schneckener, S.; Schomburg, D.; Schrader, R. Bioinformatics 2001, 17, 935-941.

(24) Bateman, A.; Birney, E.; Cerruti, L.; Durbin, R.; Etwiller, L.; Eddy, S. R.; Griffiths-Jones, S.; Howe, K. L.; Marshall, M.; Sonnhammer, E. L. Nucleic Acids Res. 2002, 30, 276-280.
(25) Eddy, S. R. Bioinformatics 1998, 14, 755-763.
(26) Berman, H. M.; Battistuz, T.; Bhat, T. N.; Bluhm, W. F.; Bourne, P. E.; Burkhardt, K.; Feng, Z.; Gilliland, G. L.; Iype, L.; Jain, S.; Fagan, P.; Marvin, J.; Padilla, D.; Ravichandran, V.; Schneider, B.; Thanki, N.; Weissig, H.; Westbrook, J. D.; Zardecki, C. Acta Crystallogr. D Biol. Crystallogr. 2002, 58, 899-907.
(27) Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C. J. Mol. Biol. 1995, 247, 536-540.
(28) Bairoch, A. Nucleic Acids Res. 2000, 28, 304-305.
(29) Wheeler, D. L.; Church, D. M.; Federhen, S.; Lash, A. E.; Madden, T. L.; Pontius, J. U.; Schuler, G. D.; Schriml, L. M.; Sequeira, E.; Tatusova, T. A.; Wagner, L. Nucleic Acids Res. 2003, 31, 28-33.
(30) Rhee, S. Y.; Beavis, W.; Berardini, T. Z.; Chen, G.; Dixon, D.; Doyle, A.; Garcia-Hernandez, M.; Huala, E.; Lander, G.; Montoya, M.; Miller, N.; Mueller, L. A.; Mundodi, S.; Reiser, L.; Tacklind, J.; Weems, D. C.; Wu, Y.; Xu, I.; Yoo, D.; Yoon, J.; Zhang, P. Nucleic Acids Res. 2003, 31, 224-228.
(31) Rubin, G. M.; Hong, L.; Brokstein, P.; Evans-Holm, M.; Frise, E.; Stapleton, M.; Harvey, D. A. Science 2000, 287, 2222-2224.
(32) Bahl, A.; Brunk, B.; Coppel, R. L.; Crabtree, J.; Diskin, S. J.; Fraunholz, M. J.; Grant, G. R.; Gupta, D.; Huestis, R. L.; Kissinger, J. C.; Labo, P.; Li, L.; McWeeney, S. K.; Milgram, A. J.; Roos, D. S.; Schug, J.; Stoeckert, C. J., Jr. Nucleic Acids Res. 2002, 30, 87-90.
(33) Dolinski, K.; Balakrishnan, R.; Christie, K. R.; Costanzo, M. C.; Dwight, S. S.; Engel, S. R.; Fisk, D. G.; Hirschman, J. E.; Hong, E. L.; Issel-Tarver, L.; Sethuraman, A.; Theesfeld, C. L.; Binkley, G.; Lane, C.; Schroeder, M.; Dong, S.; Weng, S.; Andrada, R.; Botstein, D.; Cherry, J. M. Saccharomyces Genome Database. ftp://genomeftp.stanford.edu/pub/yeast/SacchDB/. Jan. 2003.
(34) Wood, V.; Gwilliam, R.; Rajandream, M. A.; Lyne, M.; Lyne, R.; Stewart, A.; Sgouros, J.; Peat, N.; Hayles, J.; Baker, S.; Basham, D.; Bowman, S.; Brooks, K.; Brown, D.; Brown, S.; Chillingworth, T.; Churcher, C.; Collins, M.; Connor, R.; Cronin, A.; Davis, P.; Feltwell, T.; Fraser, A.; Gentles, S.; Goble, A.; Hamlin, N.; Harris, D.; Hidalgo, J.; Hodgson, G.; Holroyd, S.; Hornsby, T.; Howarth, S.; Huckle, E. J.; Hunt, S.; Jagels, K.; James, K.; Jones, L.; Jones, M.; Leather, S.; McDonald, S.; McLean, J.; Mooney, P.; Moule, S.; Mungall, K.; Murphy, L.; Niblett, D.; Odell, C.; Oliver, K.; O’Neil, S.; Pearson, D.; Quail, M. A.; Rabbinowitsch, E.; Rutherford, K.; Rutter, S.; Saunders, D.; Seeger, K.; Sharp, S.; Skelton, J.; Simmonds, M.; Squares, R.; Squares, S.; Stevens, K.; Taylor, K.; Taylor, R. G.; Tivey, A.; Walsh, S.; Warren, T.; Whitehead, S.; Woodward, J.; Volckaert, G.; Aert, R.; Robben, J.; Grymonprez, B.; Weltjens, I.; Vanstreels, E.; Rieger, M.; Schafer, M.; Muller-Auer, S.; Gabel, C.; Fuchs, M.; Dusterhoft, A.; Fritzc, C.; Holzer, E.; Moestl, D.; Hilbert, H.; Borzym, K.; Langer, I.; Beck, A.; Lehrach, H.; Reinhardt, R.; Pohl, T. M.; Eger, P.; Zimmermann, W.; Wedler, H.; Wambutt, R.; Purnelle, B.; Goffeau, A.; Cadieu, E.; Dreano, S.; Gloux, S. Nature 2002, 415, 871-880.
(35) Parkhill, J.; Wren, B. W.; Thomson, N. R.; Titball, R. W.; Holden, M. T.; Prentice, M. B.; Sebaihia, M.; James, K. D.; Churcher, C.; Mungall, K. L.; Baker, S.; Basham, D.; Bentley, S. D.; Brooks, K.; Cerdeno-Tarraga, A. M.; Chillingworth, T.; Cronin, A.; Davies, R. M.; Davis, P.; Dougan, G.; Feltwell, T.; Hamlin, N.; Holroyd, S.; Jagels, K.; Karlyshev, A. V.; Leather, S.; Moule, S.; Oyston, P. C.; Quail, M.; Rutherford, K.; Simmonds, M.; Skelton, J.; Stevens, K.; Whitehead, S.; Barrell, B. G. Nature 2001, 413, 523-527.
(36) C. elegans Sequencing Consortium. Science 1998, 282, 2012-2018.
(37) Brooksbank, C.; Camon, E.; Harris, M. A.; Magrane, M.; Martin, M. J.; Mulder, N.; O’Donovan, C.; Parkinson, H.; Tuli, M. A.; Apweiler, R.; Birney, E.; Brazma, A.; Henrick, K.; Lopez, R.; Stoesser, G.; Stoehr, P.; Cameron, G. Nucleic Acids Res. 2003, 31, 43-50.
(38) Kent, W. J.; Sugnet, C. W.; Furey, T. S.; Roskin, K. M.; Pringle, T. H.; Zahler, A. M.; Haussler, D. Genome Res. 2002, 12, 996-1006.
(39) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. J. Mol. Biol. 1990, 215, 403-410.
(40) Pearson, W. R.; Lipman, D. J. Proc. Natl. Acad. Sci. U.S.A. 1988, 85, 2444-2448.
(41) Smith, T. F.; Waterman, M. S. J. Mol. Biol. 1981, 147, 195-197.
(42) Kaufman, L.; Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: New York, 1990; pp 1-67, 199-279.
(43) Willett, P.; Winterman, V. Quant. Struct. Activ. Relat. 1986, 5, 18-25.
(44) Brenner, S. E.; Chothia, C.; Hubbard, T. J. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 6073-6078.

(45) Levitt, M.; Gerstein, M. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 5913-5920.
(46) Harrison, P. M.; Gerstein, M. J. Mol. Biol. 2002, 318, 1155-1174.
(47) Casjens, S. J. Mol. Microbiol. Biotechnol. 2000, 2, 401-410.
(48) Jornvall, H.; Hoog, J. O.; Persson, B. FEBS Lett. 1999, 445, 261-264.
(49) Katul, L.; Maiss, E.; Vetten, H. J. J. Gen. Virol. 1995, 76 (Pt 2), 475-479.
(50) La Scola, B.; Audic, S.; Robert, C.; Jungang, L.; de Lamballerie, X.; Drancourt, M.; Birtles, R.; Claverie, J. M.; Raoult, D. Science 2003, 299, 2033.
(51) Cann, A. J. Principles of Molecular Virology, 3rd ed; Academic Press: San Diego, 2001.
(52) McGeoch, D. J.; Cook, S.; Dolan, A.; Jamieson, F. E.; Telford, E. A. J. Mol. Biol. 1995, 247, 443-458.
(53) Chen, F.; Suttle, C. A. Virology 1996, 219, 170-178.
(54) Moore, J. B.; Smith, G. L. Embo J. 1992, 11, 1973-1980.
(55) Li, Y.; Lu, Z.; Burbank, D. E.; Kutish, G. F.; Rock, D. L.; Van Etten, J. L. Virology 1995, 212, 134-150.
(56) Trimble, J. J.; Murthy, S. C.; Bakker, A.; Grassmann, R.; Desrosiers, R. C. Science 1988, 239, 1145-1147.
(57) Tonetti, M.; Zanardi, D.; Gurnon, J. R.; Fruscione, F.; Armirotti, A.; Damonte, G.; Sturla, L.; De Flora, A.; Van Etten, J. L. J. Biol. Chem. 2003, 278, 21 559-21 565.
(58) Cole, S. T.; Brosch, R.; Parkhill, J.; Garnier, T.; Churcher, C.; Harris, D.; Gordon, S. V.; Eiglmeier, K.; Gas, S.; Barry, C. E., 3rd; Tekaia, F.; Badcock, K.; Basham, D.; Brown, D.; Chillingworth, T.; Connor, R.; Davies, R.; Devlin, K.; Feltwell, T.; Gentles, S.; Hamlin, N.; Holroyd, S.; Hornsby, T.; Jagels, K.; Barrell, B. G.; et al. Nature 1998, 393, 537-544.
(59) Kolattukudy, P. E.; Fernandes, N. D.; Azad, A. K.; Fitzmaurice, A. M.; Sirakova, T. D. Mol. Microbiol. 1997, 24, 263-270.
(60) Fraser, C. M.; Gocayne, J. D.; White, O.; Adams, M. D.; Clayton, R. A.; Fleischmann, R. D.; Bult, C. J.; Kerlavage, A. R.; Sutton, G.; Kelley, J. M. Science 1995, 270, 397-403.
(61) Blakley, R. L. In Folates and Pterines; Blakley, R. L., Benkovic, S. J., Eds.; Wiley: New York, 1984; pp 191-253.
(62) Hoskins, J.; Alborn, W. E., Jr.; Arnold, J.; Blaszczak, L. C.; Burgett, S.; DeHoff, B. S.; Estrem, S. T.; Fritz, L.; Fu, D. J.; Fuller, W.; Geringer, C.; Gilmour, R.; Glass, J. S.; Khoja, H.; Kraft, A. R.; Lagace, R. E.; LeBlanc, D. J.; Lee, L. N.; Lefkowitz, E. J.; Lu, J.; Matsushima, P.; McAhren, S. M.; McHenney, M.; McLeaster, K.; Mundy, C. W.; Nicas, T. I.; Norris, F. H.; O’Gara, M.; Peery, R. B.; Robertson, G. T.; Rockey, P.; Sun, P. M.; Winkler, M. E.; Yang, Y.; Young-Bellido, M.; Zhao, G.; Zook, C. A.; Baltz, R. H.; Jaskunas, S. R.; Rosteck, P. R., Jr.; Skatrud, P. L.; Glass, J. I. J. Bacteriol. 2001, 183, 5709-5717.
(63) Poole, R. K.; Cook, G. M. Adv. Microb. Physiol. 2000, 43, 165-224.
(64) Smith, M. A.; Finel, M.; Korolik, V.; Mendz, G. L. Arch. Microbiol. 2000, 174, 1-10.
(65) Friedrich, T. Biochim. Biophys. Acta 1998, 1364, 134-146.
(66) Gardner, M. J.; Hall, N.; Fung, E.; White, O.; Berriman, M.; Hyman, R. W.; Carlton, J. M.; Pain, A.; Nelson, K. E.; Bowman, S.; Paulsen, I. T.; James, K.; Eisen, J. A.; Rutherford, K.; Salzberg, S. L.; Craig, A.; Kyes, S.; Chan, M. S.; Nene, V.; Shallom, S. J.; Suh, B.; Peterson, J.; Angiuoli, S.; Pertea, M.; Allen, J.; Selengut, J.; Haft, D.; Mather, M. W.; Vaidya, A. B.; Martin, D. M.; Fairlamb, A. H.; Fraunholz, M. J.; Roos, D. S.; Ralph, S. A.; McFadden, G. I.; Cummings, L. M.; Subramanian, G. M.; Mungall, C.; Venter, J. C.; Carucci, D. J.; Hoffman, S. L.; Newbold, C.; Davis, R. W.; Fraser, C. M.; Barrell, B. Nature 2002, 419, 498-511.
(67) Arabidopsis Genome Initiative. Nature 2000, 408, 796-815.
(68) Adams, M. D.; Celniker, S. E.; Holt, R. A.; Evans, C. A.; Gocayne, J. D.; Amanatides, P. G.; Scherer, S. E.; Li, P. W.; Hoskins, R. A.; Galle, R. F.; George, R. A.; Lewis, S. E.; Richards, S.; Ashburner, M.; Henderson, S. N.; Sutton, G. G.; Wortman, J. R.; Yandell, M. D.; Zhang, Q.; Chen, L. X.; Brandon, R. C.; Rogers, Y. H.; Blazej, R. G.; Champe, M.; Pfeiffer, B. D.; Wan, K. H.; Doyle, C.; Baxter, E. G.; Helt, G.; Nelson, C. R.; Gabor, G. L.; Abril, J. F.; Agbayani, A.; An, H. J.; Andrews-Pfannkoch, C.; Baldwin, D.; Ballew, R. M.; Basu, A.; Baxendale, J.; Bayraktaroglu, L.; Beasley, E. M.; Beeson, K. Y.; Benos, P. V.; Berman, B. P.; Bhandari, D.; Bolshakov, S.; Borkova, D.; Botchan, M. R.; Bouck, J.; Brokstein, P.; Brottier, P.; Burtis, K. C.; Busam, D. A.; Butler, H.; Cadieu, E.; Center, A.; Chandra, I.; Cherry, J. M.; Cawley, S.; Dahlke, C.; Davenport, L. B.; Davies, P.; de Pablos, B.; Delcher, A.; Deng, Z.; Mays, A. D.; Dew, I.; Dietz, S. M.; Dodson, K.; Doup, L. E.; Downes, M.; Dugan-Rocha, S.; Dunkov, B. C.; Dunn, P.; Durbin, K. J.; Evangelista, C. C.; Ferraz, C.; Ferriera, S.; Fleischmann, W.; Fosler, C.; Gabrielian, A. E.; Garg, N. S.; Gelbart, W. M.; Glasser, K.; Glodek, A.; Gong, F.; Gorrell, J. H.; Gu, Z.; Guan, P.; Harris, M.; Harris, N. L.; Harvey, D.; Heiman, T. J.; Hernandez, J. R.; Houck, J.; Hostin, D.; Houston, K. A.; Howland, T. J.; Wei, M. H.; Ibegwam, C. Science 2000, 287, 2185-2195.
(69) Lin, X.; Kaul, S.; Rounsley, S.; Shea, T. P.; Benito, M. I.; Town, C. D.; Fujii, C. Y.; Mason, T.; Bowman, C. L.; Barnstead, M.; Feldblyum, T. V.; Buell, C. R.; Ketchum, K. A.; Lee, J.; Ronning, C. M.; Koo, H. L.; Moffat, K. S.; Cronin, L. A.; Shen, M.; Pai, G.; Van Aken, S.; Umayam, L.; Tallon, L. J.; Gill, J. E.; Venter, J. C. Nature 1999, 402, 761-768.
(70) Sem, D. S.; Yu, L.; Coutts, S. M.; Jack, R. J. Cell. Biochem. Suppl. 2001, Suppl. 37, 99-105.
(71) Zheng, X. F.; Chan, T. F. Curr. Issues Mol. Biol. 2002, 4, 33-43.

Journal of Proteome Research • Vol. 2, No. 6, 2003

PR034051H