Combination of Several Bioinformatics Approaches for the Identification of New Putative Glycosyltransferases in Arabidopsis

Sara Fasmer Hansen,† Emmanuel Bettler,‡ Michaela Wimmerová,§ Anne Imberty,† Olivier Lerouxel,† and Christelle Breton*,†

CERMAV-CNRS (affiliated to ICMG), University of Grenoble, F-38041 Grenoble, France, CNRS, UMR 5086, IBCP, University of Lyon, F-69367 Lyon, France, and Department of Biochemistry and NCBR, Faculty of Sciences, Masaryk University, 61137 Brno, Czech Republic

Received September 24, 2008

Approximately 450 glycosyltransferase (GT) sequences have already been identified in the Arabidopsis genome and are organized into 40 sequence-based families, but the vast majority of these gene products remain biochemically uncharacterized open reading frames. Given the complexity of the cell wall carbohydrate network, it can be inferred that some of the biosynthetic genes have not yet been identified by classical bioinformatics approaches. With the objective of identifying new plant GT genes, we designed a bioinformatic strategy based on several remote homology detection methods that act at the 1D, 2D, and 3D levels. Together, these methods led to the identification of more than 150 candidate protein sequences. Among them, 20 are considered putative glycosyltransferases that should be investigated further, since known GT signatures were clearly identified.

Keywords: fold recognition • glycosyltransferase • DUF266 • DUF246 • profile HMMs • structural overlap

* To whom correspondence should be addressed. Christelle Breton, Molecular Glycobiology, CERMAV-CNRS, BP53, F-38041 Grenoble cedex 9, France. Tel: +33476037635. Fax: +33476547203. E-mail: [email protected]. † University of Grenoble. ‡ University of Lyon. § Masaryk University.

10.1021/pr800808m. © 2009 American Chemical Society. Journal of Proteome Research 2009, 8, 743–753. Published on Web 12/16/2008.

Introduction

Glycosyltransferases (GTs; EC 2.4.x.y) constitute a large family of enzymes that are involved in the biosynthesis of oligosaccharides, polysaccharides, and glycoconjugates.1 These molecules of enormous diversity mediate a wide range of functions from structure and storage to signaling. Particularly abundant are the GTs that transfer a sugar residue from an activated nucleotide sugar donor to specific acceptor molecules, forming glycosidic bonds. Acceptors cover various chemical classes of compounds including carbohydrates, proteins, lipids, DNA, and numerous small molecules such as antibiotics, flavonols, and steroids. In eukaryotes, most of the glycosylation reactions that generate the diversity of oligosaccharide structures occur in the Golgi apparatus.

In plants, building the complex polysaccharide network found in the cell wall requires a large set of GTs because of the diversity of linkage types connecting the monosaccharides within these polymers. However, relatively few of these genes have been identified, and the molecular basis of cell wall biosynthesis remains poorly understood.2 While cellulose (and also callose) is assembled at the plasma membrane and deposited directly into the wall,3 other matrix polysaccharides are synthesized in the Golgi and delivered to the wall in secretory vesicles.4 Golgi-resident GTs are typically type II membrane proteins with a large C-terminal globular catalytic domain facing the luminal side.5,6 However, other protein architectures have been described for GTs, including type I membrane proteins (located in the endoplasmic reticulum) and integral membrane proteins (e.g., cellulose synthases).2 Plant UGTs that perform the numerous glycosylation reactions of secondary metabolites are cytosolic enzymes, in contrast to their mammalian counterparts, which are ER-localized type I membrane proteins.7

GTs have been classified into families on the basis of amino acid sequence similarities; the classification is available at http://www.cazy.org/.8 At the present time (September 2008), the CAZy database comprises more than 34 000 confirmed and putative GT sequences that have been divided into 90 families (denoted as GTx). However, the number of families is continually increasing with the discovery and biochemical characterization of new GT genes. Approximately 450 Arabidopsis thaliana sequences have already been listed in this database, spread over 40 GT families. Since fewer than 20% of these genes are annotated, identifying the precise function of every putative plant GT represents a huge task. By comparison, the human genome has only 240 GT genes, of which more than 80% are annotated. The large number of GT sequences found in plant genomes can be explained by the complex structure of the plant cell wall, the large number of glycosylated secondary metabolites, and also by genome duplication events.9

Structural information is currently available for 30 GT families. In contrast to glycosylhydrolases, which adopt a large variety of folds, including all α, all β, or mixed α/β structures, GT folds have been observed to consist mainly of α/β/α sandwiches,10 similar or very close to the Rossmann domain, a classical structural motif consisting of a six-stranded parallel β-sheet with 321456 topology and found in many nucleotide-binding proteins.11 Until recently, only two structural superfamilies had been described for GTs, named GT-A and GT-B. The GT-A family, which includes most of the Golgi-resident GTs that have been crystallized to date, consists of one Rossmann-like domain. The central β-sheet is flanked by a smaller one, and the association of both creates the active site. Most GT-A enzymes also contain a conserved Asp-x-Asp (DxD) or equivalent motif that coordinates the phosphate atoms in nucleotide donors via a divalent cation (Mn²⁺ or Mg²⁺).10 The GT-B fold family, which includes most of the prokaryotic enzymes involved in the biosynthesis of secondary metabolites and bacterial cell wall components, consists of a tandem of Rossmann domains with the catalytic site located between them.

Recently, variants of the GT-A and GT-B folds have been reported. Variations of the GT-A fold are observed in the structures of bacterial sialyltransferases (family GT42).12,13 These enzymes display a different type of α/β/α sandwich (a seven-stranded parallel β-sheet with 8712456 topology). Two variants of the GT-B fold have also been recently described for a mammalian α1,6-fucosyltransferase14 and a bacterial α1,3-fucosyltransferase15 belonging to GT families 23 and 10, respectively. Very recently, completely different folds have been observed for glycosyltransferases that utilize lipid-phosphate activated donor substrates.
The crystal structures of the GT domain of the peptidoglycan glycosyltransferases from Staphylococcus aureus16 and Aquifex aeolicus17 display structural similarity to the bacteriophage λ-lysozyme, and the STT3 catalytic subunit of oligosaccharyltransferase from Pyrococcus furiosus shows a very different and modular protein architecture.18 Novel folds can be expected for those enzymes that are not constrained by the need to bind a nucleotide.19

Fold recognition is a theoretical approach that produces an alignment of one sequence with one structure by a process called “threading”.20 Although there is little or no sequence similarity between GT families, the limited number of observed folds facilitates the use of fold recognition methods to predict whether GT-A, GT-B, or neither of them is the most probable fold for a given sequence. When performed on selected sequences representing all GT families present in the CAZy database, such “threading” analyses predicted that many GT families with no available structural information would adopt the GT-A or GT-B fold.10,21,22 Fold recognition has also been used in a more innovative manner to identify GTs in the translated genome of Mycobacterium tuberculosis23 or to assign 3D structures and functions to unannotated genes in the Mimivirus genome.24

Given the complexity of the cell wall carbohydrate network, it can be inferred that some of the genes have not yet been identified using classical bioinformatics approaches. The aim of the present study is to search for new GT genes in the Arabidopsis genome that are not classified in the CAZy database. Taking advantage of the growing number of GT structures demonstrating a surprisingly high level of fold conservation, we designed a bioinformatic strategy to search the Arabidopsis translated genome (referred to as the “proteome” throughout this study) for new candidate GT sequences (Figure 1).
This strategy relied on the use of remote homology detection methods at the 1D, 2D, and 3D levels. It is based on the use of (i) the profile hidden Markov model (HMM) method to search for remote homologues in the Arabidopsis proteome,25 (ii) the well-known PSI-BLAST program linked to the structural overlap (Sov) approach, which acts at the 2D level,26 and (iii) ProFit, a fold recognition program that acts at the 3D level.27 Together, these methods allowed the identification of more than 150 candidate genes. A detailed analysis of the retrieved protein sequences indicated that 20 of them are good candidates since known GT signatures were clearly identified.

Figure 1. Bioinformatic strategy used to search the Arabidopsis proteome (translated genome) for new putative glycosyltransferases.

Figure 2. Protein data resources. (A) Arabidopsis protein sequences were retrieved from the TAIR database (TAIR6). The 5-step filtering process led to the test Arabidopsis proteome comprising 5315 accessions. (B) The test 3D library comprises the PDB files from GT and non-GT structures that were split into 6 fold groups. (C) The profile HMMs library comprises the HMM consensus sequences created for each of the 316 subgroups resulting from the extraction of protein sequences from the CAZy database.

Materials and Methods

The Data Resources. The Arabidopsis proteome was downloaded from The Arabidopsis Information Resource (http://www.arabidopsis.org/, TAIR6 release, 30 690 protein sequences). Precalculated data from the TAIR ftp server were used to reduce the size of the proteome, as indicated in Figure 2. TargetP 1.1 was used to predict the subcellular localization of protein sequences and to exclude those with a predicted mitochondrial targeting peptide (mTP) or chloroplast transit peptide (cTP), with a cutoff value corresponding to a specificity of 0.95. For plant sequences, both the sensitivity and the specificity of cTP and mTP prediction are 80-90%.28 Transmembrane domains in protein sequences were predicted using the TMHMM 2.0 server.29,30 Information about gene annotation and the presence of ESTs was also taken from the TAIR ftp server. The test “3D library” was made by extracting from the PDB database (http://www.rcsb.org/, April 2007) a set of 96 reference protein structures including GT and non-GT proteins selected on the basis of SCOP classification (http://scop.mrc-lmb.cam.ac.uk/scop/).31 When several PDB files were available for the same protein, the most appropriate one was selected on the basis of resolution and the presence of substrates.

Profile Hidden Markov Models (HMM). All protein sequences from the CAZy database (July 2007, 90 GT families) were retrieved and sorted by family number. Via the NPS@ web server, ClustalW32 was first used to generate multiple sequence alignments within each family. To facilitate batch submission, the large GT families were divided into subgroups. A library of profile HMMs was then generated automatically from the ClustalW output for each subgroup using HMMer. Profile HMMs (in HMMER format) were then used to search the test Arabidopsis proteome to identify, through HMMsearch, any distant relationship to known GT families.25,29 ClustalW and HMMer were run at the NPS@ web server (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_server.html).33 An E-value below 0.1 was used as the threshold to select positive hits in the Arabidopsis proteome.

BLAST and PSI-BLAST. PSI-BLAST is a sensitive sequence similarity search method performed in an iterative manner.34 Searching was done at the NPS@ Web site by using all 5315 protein sequences from the test Arabidopsis proteome as queries against the sequences extracted from the 3D library (96 sequences), and all of the hits with an E-value below 0.35 were kept for secondary structure prediction and Sov calculation. PSI-BLAST searches were done with the number of iterations set to 7. As part of the validation process, BLAST was used for rapid searching of the nonredundant protein database with each selected gene (www.ncbi.nlm.nih.gov/blast/Blast.cgi).35

Secondary Structure and Structural Overlap (Sov) Parameter Calculation.
Secondary structures in proteins of known 3D structure that constitute the test 3D library were defined using the DSSP program.36 Secondary structure prediction was performed for each protein sequence of the test Arabidopsis proteome using three methods: SOPMA,37 DSC,38 and PHD.39 These methods use information derived from multiple sequence alignment. Only three states were considered (helix, sheet, and coil) to define a consensus prediction, which was shown to give better prediction accuracy.40 Sov is a measure based on secondary structure prediction that discriminates between related and unrelated proteins in the 10-30% sequence identity range, using the output obtained from PSI-BLAST. For each aligned sequence pair (query sequence and 3D template sequence), the agreement between secondary structures (predicted versus observed) was estimated by calculating the Sov parameter39 as most recently defined.41 By applying a Sov threshold of 60%, it is possible to validate the homology between two proteins at a success rate of 95%.26 The outputs that gave PSI-BLAST E-values below 0.35 and Sov values above 60 were selected. Sov calculations were performed with the NPS@ software.

3D-Fold Recognition. Sequences from the “test Arabidopsis proteome” were searched against the 3D library using the ProFit program (ProCeryon Biosciences, Austria) with default values. Several scores are calculated for each alignment generated (combined score, pair score, surface score, threading index). The threading index (ThdIdx) is considered to better assess the global quality of the model. It is a combination of the z-sequence score (sequence substitution component) and the z-combined score (a weighted combination of surface and pairwise potential components of the alignment matrix) normalized by query sequence length. The program was tested on a series of glycosyltransferase and non-glycosyltransferase protein sequences that were ranked according to their ThdIdx value. A threshold value of 50 appears to be necessary to avoid false positives. All positive hits selected from the three different bioinformatic approaches described in Figure 1 were submitted to another fold recognition program called PHYRE.42 PHYRE is a fully automatic program that performs profile-profile matching together with predicted secondary structure matching (http://www.sbg.bio.ic.ac.uk/phyre/index.cgi).

Hydrophobic Cluster Analysis (HCA). HCA is a graphical method based on the detection and comparison of hydrophobic clusters that are presumed to correspond to the regular secondary structure elements constituting the architecture of globular proteins. HCA plots were obtained from the DrawHCA server (http://bioserv.rpbs.jussieu.fr).43
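The three-state consensus prediction described under Secondary Structure and Structural Overlap (Sov) Parameter Calculation can be illustrated with a short sketch. The text does not state how the SOPMA, DSC, and PHD outputs are merged, so the per-residue majority vote used here, the fallback to coil on a three-way tie, and the `H`/`E`/`C` letter coding are all assumptions made for illustration only; the toy strings are not real predictions.

```python
from collections import Counter

STATES = {"H", "E", "C"}  # helix, sheet (strand), coil: assumed letter coding


def consensus_prediction(predictions, tie_state="C"):
    """Per-residue majority vote over equally long three-state strings.

    Assumption: a state wins only with a strict majority of the votes;
    otherwise the residue is assigned `tie_state` (coil by default).
    """
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        if not set(column) <= STATES:
            raise ValueError(f"unexpected state in column {column}")
        state, votes = Counter(column).most_common(1)[0]
        consensus.append(state if votes * 2 > len(column) else tie_state)
    return "".join(consensus)


# Toy predictions for an 11-residue sequence (illustrative values only).
sopma = "CHHHHCCEEEC"
dsc = "CHHHCCCEEEC"
phd = "CCHHHCEEEEC"
print(consensus_prediction([sopma, dsc, phd]))  # CHHHHCCEEEC
```

A consensus string of this kind would then be compared with the DSSP assignment of the aligned template when the Sov parameter is computed; the Sov calculation itself is not reproduced here.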

Results

Protein Data Resources. The bioinformatic strategy used to identify new GT candidate genes (Figure 1) requires protein data resources as starting materials. Because some of the methods are computationally time-consuming, the size of the Arabidopsis proteome was reduced as shown in Figure 2A. The initial Arabidopsis proteome extracted from the TAIR database (TAIR Release 6) comprised 30 690 protein sequences. The very large sequences (>1000 amino acids) were first removed since the largest GTs characterized to date are far from reaching this size. Precalculated information available from the TAIR ftp server was used for additional filters. Thus, we made use of TargetP results28 to detect possible N-terminal sorting signals and to exclude the protein sequences that are predicted to localize to the mitochondrion or chloroplast. Likewise, because most GTs are membrane proteins that localize in the ER and Golgi compartments, only protein sequences harboring at least one transmembrane domain (predicted by TMHMM 2.0)29 were considered. These steps reduced the number of sequences to 5666. The final step consisted of removing protein sequences that are either redundant (identical sequences) or already present in the CAZy database. The test Arabidopsis proteome finally comprised 5315 protein sequences.

A library of 3D structures (PDB files) was set up for use in the ProFit approach. This library is made up of structures falling into six fold families (Figure 2B). Two groups correspond to the PDB files of all crystal structures of GTs (available in May 2007), including 21 GT-A and 23 GT-B enzymes as well as one GT-A and two GT-B variants. Two other groups correspond to non-GT proteins that are involved in carbohydrate biosynthesis and that have been reported to adopt a fold similar to GTs.21 These include 22 GT-A like and 10 GT-B like proteins.
We also include in the library eight unrelated proteins that adopt the typical Rossmann fold (a structural motif also found in GTs), as well as a random set of 12 protein structures with different 3D architectures (all α and all β) that were selected from the SCOP database.31 A total of 96 structures have been selected with sizes ranging from 162 to 879 amino acids (Supplementary Table 1 in Supporting Information).

All protein sequences from the CAZy database (http://www.cazy.org/, July 2007, 28 135 entries) were retrieved and sorted by family number. The large GT families that typically comprise several thousand protein sequences were divided into subgroups to facilitate batch submission when using the multiple sequence alignment program ClustalW (316 subgroups were defined for the whole CAZy database).32 For each subgroup, profile hidden Markov models (HMMs) were built using HMMer, which turns the multiple sequence alignment generated by ClustalW into a position HMM profile suitable for searching databases for remotely homologous sequences.25 Profiles of scores are calculated for each of the structural environment classes based on their observed frequencies in a database of structures, and this resulted in a profile HMMs library (Figure 2C).

Profile Hidden Markov Models (Profile HMMs). Profile HMMs are statistical models of the primary structure consensus of a sequence family and are designed to improve the sensitivity of database searching. Each query sequence from the test Arabidopsis proteome was searched against the library of profile HMMs using the HMMer program25 available via the NPS@ web service33 (Figure 1). This approach aimed at identifying any distant relationship, through the HMM “consensus sequence signatures”, to known GT families (sequence comparison at the 1D level). With the use of this approach, a total of 51 sequences with an E-value below 0.1 were retrieved (Supplementary Table 2 in Supporting Information). Positive hits were obtained with profile HMMs established for CAZy families 4, 14, 47, 50, 58, 65, 75, and 87. The results were refined by searching the TAIR database for any information such as annotation and expression data (presence of ESTs). Sequences with a clear annotation to unrelated protein families were eliminated, as well as those for which no EST is reported. Twenty-eight out of the 51 sequences that are assigned to GT87 are annotated as membrane transporters or as containing plant DUF6 or DUF803 domains. In the PFAM database (http://pfam.sanger.ac.uk/), DUF annotation refers to the grouping of genes of unknown function that are sequence related. Protein sequences annotated DUF6 and DUF803 are considered to belong to the same clan, called the Drug/Metabolite transporter superfamily. It is worth mentioning that this clan also includes sugar and nucleotide-sugar transporters. Family GT87 comprises bacterial mannosyltransferases that use polyprenol-P-mannose as the sugar donor. These enzymes are predicted to be integral membrane proteins with multiple transmembrane domains, a topology similar to that of membrane transporters. Because sequence homology could be related to the topology rather than to the function, none of the Arabidopsis accessions displaying a GT87 HMM signature were considered further. The 21 sequences resulting from the filtering are listed in Table 1.

Table 1. Twenty-One Putative Glycosyltransferase Sequences Selected As a Result of Profile HMMs Search of the Test Arabidopsis Proteome

sequence name AGI IDb | protein length | E-value | GT family | ESTa | annotations
At5g03795.1 | 518 | 3.8 × 10⁻¹³⁸ | GT47 | 8 | Similar to exostosin family protein. Contains Exostosin-like domain
At4g01210.1 | 981 | 2.3 × 10⁻⁹ | GT4 | 41 | Glycosyltransferase family protein. Similar to unknown proteins. Contains Glycosyltransferase group 1 domain
At1g68390.1 | 408 | 2.3 × 10⁻⁵ | GT14 | 59 | Similar to unknown proteins. Contains plant DUF266 domain
At5g16170.1 | 411 | 3.2 × 10⁻⁵ | GT14 | 11 | Similar to unknown proteins. Contains plant DUF266 domain
At3g21310.1 | 383 | 3.4 × 10⁻⁵ | GT14 | 7 | Similar to unknown proteins. Contains plant DUF266 domain
At1g10880.1 | 651 | 4.1 × 10⁻⁵ | GT14 | 4 | Similar to unknown proteins. Contains plant DUF266 domain
At1g73810.1 | 418 | 6.6 × 10⁻⁵ | GT14 | 5 | Similar to unknown proteins. Contains N-terminal plant DUF266 domain
At1g68380.1 | 392 | 6.8 × 10⁻⁵ | GT14 | 8 | Similar to unknown proteins. Contains plant DUF266 domain
At1g51770.1 | 406 | 8 × 10⁻⁵ | GT14 | 37 | Similar to unknown proteins. Contains plant DUF266 domain
At5g25970.1 | 436 | 2.4 × 10⁻⁴ | GT14 | 32 | Similar to unknown proteins. Contains plant DUF266 domain
At4g30060.1 | 401 | 3.6 × 10⁻⁴ | GT14 | 8 | Similar to unknown proteins. Contains plant DUF266 domain
At5g11730.1 | 386 | 4.6 × 10⁻⁴ | GT14 | 21 | Similar to unknown proteins. Contains plant DUF266 domain
At1g10280.1 | 412 | 1.7 × 10⁻³ | GT14 | 16 | Similar to unknown proteins. Contains plant DUF266 domain
At3g52060.1 | 346 | 1.1 × 10⁻² | GT14 | 129 | Similar to unknown proteins. Contains plant DUF266 domain
At5g25330.1 | 366 | 1.4 × 10⁻² | GT14 | 1 | Similar to unknown proteins. Contains plant DUF266 domain
At3g57420.1 | 765 | 1.7 × 10⁻² | GT75 | 40 | Similar to unknown proteins.
At1g51630.1 | 423 | 2 × 10⁻² | GT65 | 47 | Similar to unknown proteins. Contains plant DUF246 domain
At2g41770.1 | 771 | 2.5 × 10⁻² | GT75 | 24 | Similar to unknown proteins.
At3g02250.1 | 512 | 5.3 × 10⁻² | GT65 | 35 | Similar to unknown proteins. Contains plant DUF246 domain
At1g31130.1 | 321 | 6.6 × 10⁻² | GT58 | 84 | Similar to unknown proteins. Contains family not named (PTHR22597) domain
At5g15740.1 | 508 | 9.9 × 10⁻² | GT65 | 33 | Similar to unknown proteins. Contains plant DUF246 domain

a Expressed Sequence Tags as indicated in TAIR. b Strong positive candidates are indicated in bold.

Positive hits were found for GT families 4, 14, 47, 58, 65, and 75. The two highest-scoring sequences, At5g03795.1 (GT47) and At4g01210.1 (GT4), were also identified by other approaches and will be discussed later. Thirteen out of the 21 sequences are considered close to GT14, a polyspecific GT family that currently comprises mammalian β-glucosyltransferase, β-N-acetylglucosaminyltransferases, and β-xylosyltransferases involved in the synthesis of O-glycans in mammals, and many related sequences of unknown function from eukaryotes and prokaryotes (including 11 Arabidopsis sequences). As shown in Table 1, these putative plant GT sequences are annotated as containing a DUF266 domain, a plant domain of unknown function that is described in the PFAM database as “likely to be glycosyltransferase related”. Three sequences, annotated DUF246 with no indication of function, are close to GT65, a monospecific GT family that comprises the mammalian protein-O-fucosyltransferase 1 (Pofut1).44 The remaining sequences are related to families GT58 and GT75.
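As a quick consistency check on Table 1, the family assignments and E-values can be tallied programmatically. The sketch below only transcribes the AGI IDs, GT families, and E-values from the table; the variable names are illustrative.

```python
from collections import Counter

# (AGI ID, GT family, E-value) transcribed from Table 1.
TABLE1 = [
    ("At5g03795.1", "GT47", 3.8e-138), ("At4g01210.1", "GT4", 2.3e-9),
    ("At1g68390.1", "GT14", 2.3e-5), ("At5g16170.1", "GT14", 3.2e-5),
    ("At3g21310.1", "GT14", 3.4e-5), ("At1g10880.1", "GT14", 4.1e-5),
    ("At1g73810.1", "GT14", 6.6e-5), ("At1g68380.1", "GT14", 6.8e-5),
    ("At1g51770.1", "GT14", 8e-5), ("At5g25970.1", "GT14", 2.4e-4),
    ("At4g30060.1", "GT14", 3.6e-4), ("At5g11730.1", "GT14", 4.6e-4),
    ("At1g10280.1", "GT14", 1.7e-3), ("At3g52060.1", "GT14", 1.1e-2),
    ("At5g25330.1", "GT14", 1.4e-2), ("At3g57420.1", "GT75", 1.7e-2),
    ("At1g51630.1", "GT65", 2e-2), ("At2g41770.1", "GT75", 2.5e-2),
    ("At3g02250.1", "GT65", 5.3e-2), ("At1g31130.1", "GT58", 6.6e-2),
    ("At5g15740.1", "GT65", 9.9e-2),
]

E_VALUE_CUTOFF = 0.1  # threshold used to select positive HMM hits

family_counts = Counter(family for _, family, _ in TABLE1)
all_below_cutoff = all(e < E_VALUE_CUTOFF for _, _, e in TABLE1)

print(len(TABLE1))            # 21
print(family_counts["GT14"])  # 13
print(family_counts["GT65"])  # 3
print(all_below_cutoff)       # True
```

The counts agree with the text: 13 of the 21 sequences fall in GT14 and three in GT65, and every listed hit lies below the 0.1 E-value threshold.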
Family GT75 corresponds to the plant RGP (reversibly glycosylated polypeptide) protein family, which is described as having self-glucosylating glucosyltransferase activity.45 However, no experimental evidence has been obtained to show that RGPs are GTs, and recently it was demonstrated that one related protein from rice displays a UDP-arabinopyranose mutase activity.46 The At1g31130.1 accession is related to the GT58 family, which comprises eukaryotic mannosyltransferases using dolichol-P-mannose as a sugar donor. These proteins, like GT87 proteins, are integral membrane proteins with multiple transmembrane domains. From this primary analysis, only those sequences with HMM signatures for families GT4, GT14, GT47, and GT65 will be further investigated (see below).

PSI-BLAST and Structural Overlap (Sov). The combined use of PSI-BLAST and structural overlap (Sov) parameter calculation, which relies on sequence comparison at the 2D level, has already proved to be sensitive for identifying remote homologies at a low level of sequence similarity.26 It has been established that in the 10-30% sequence identity range (twilight zone), when the Sov parameter is above 60%, almost all proteins can be correctly assigned as structurally similar on the basis of predicted secondary structures. Sequences from the test Arabidopsis proteome were searched against the 1D sequences (extracted from the 3D library) using PSI-BLAST (Figure 1), and for each aligned sequence pair, the agreement between secondary structures was estimated. Only those sequences that showed an E-value below 0.35 by PSI-BLAST and a Sov value above 60 were retained.26 This approach led to the selection of 56 candidate genes (Supplementary Table 3 in Supporting Information). As for the profile HMM approach, the results were further refined using the annotation and expression data available in the TAIR database. This led to the selection of 10 protein sequences (Table 2) that were further analyzed using a combination of methods (BLAST, TMHMM, HCA, and PHYRE). The aim was to identify sequence features indicative of a putative GT (i.e., type II membrane topology, the presence of peptide motifs such as the DxD motif, predicted α/β 3D structure).

Table 2. Ten Putative Glycosyltransferase Sequences Selected As a Result of Combined PSI-BLAST and Structural Overlapping Approach on the Test Arabidopsis Proteome

sequence AGI IDb | prot. length | PSI-BLAST E-value | Sov value | PDB code | fold | ESTa | annotations
At5g15725.1 | 95 | 0.34 | 94.3 | 1V84 | GT-A | 3 | Unknown protein
At2g36630.1 | 459 | 6.1 × 10⁻² | 81.5 | 2EX0 | GT-B | 31 | Similar to unknown proteins. Similar to membrane protein-like protein. Contains DUF81 domain
At4g01210.1 | 981 | 2.0 × 10⁻⁴ | 81.0 | 2GEK | GT-B | 41 | Glycosyltransferase family protein. Similar to unknown proteins. Contains Glycosyltransferase group 1 domain
At3g28320.1 | 280 | 0.32 | 78.8 | 2C1Z | GT-B | 1 | Similar to AT14A A. thaliana. Similar to proteins of unknown function. Contains Phosphoinositide-binding clathrin adaptor, N-terminal domain. Contains DUF677 domain
At1g05370.1 | 439 | 2.5 × 10⁻² | 72.2 | 1V84 | GT-A | 2 | Similar to unknown proteins. Contains N-terminal Phosphatidylinositol transfer protein-like domain. Contains C-terminal cellular retinaldehyde-binding/triple domain
At5g03795.1 | 518 | 0.2 | 69.3 | 2NZX | GT-B | 8 | Similar to exostosin family protein. Similar to EXO C. melo. Contains Exostosin-like domain
At4g14360.1 | 608 | 5.4 × 10⁻² | 64.3 | 2BVL | GT-A | 32 | Dehydration-responsive protein-related. Contains DUF248 domain. Contains putative methyltransferase domain. Contains Generic methyltransferase domain
At4g14100.1 | 206 | 0.2 | 61.5 | 2DE0 | GT-B | 14 | Transferase, transferring glycosyl groups. Similar to unknown proteins. Contains Cystatin/monellin domain (SSF54403)
At5g23920.1 | 229 | 9.3 × 10⁻² | 61.3 | 2IYA | GT-B | 19 | Similar to unknown proteins
At1g26710.1 | 168 | 0.21 | 60.1 | 1FO8 | GT-A | 3 | Similar to unknown protein

a Expressed Sequence Tags as indicated in TAIR. b Strong positive candidates are indicated in bold.
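The selection rule for this approach (PSI-BLAST E-value below 0.35 and Sov above 60) can be written as a small predicate and checked against the values reported in Table 2. The `Hit` record and function name are illustrative; the two rejected hits at the end are hypothetical values added only to show the rule failing.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str      # Arabidopsis AGI ID
    template: str   # PDB code of the aligned 3D-library sequence
    evalue: float   # PSI-BLAST E-value
    sov: float      # Sov value (%)


def passes_selection(hit, max_evalue=0.35, min_sov=60.0):
    """Keep a hit only if both thresholds stated in the text are met."""
    return hit.evalue < max_evalue and hit.sov > min_sov


# Values transcribed from Table 2.
TABLE2_HITS = [
    Hit("At5g15725.1", "1V84", 0.34, 94.3),
    Hit("At2g36630.1", "2EX0", 6.1e-2, 81.5),
    Hit("At4g01210.1", "2GEK", 2.0e-4, 81.0),
    Hit("At3g28320.1", "2C1Z", 0.32, 78.8),
    Hit("At1g05370.1", "1V84", 2.5e-2, 72.2),
    Hit("At5g03795.1", "2NZX", 0.2, 69.3),
    Hit("At4g14360.1", "2BVL", 5.4e-2, 64.3),
    Hit("At4g14100.1", "2DE0", 0.2, 61.5),
    Hit("At5g23920.1", "2IYA", 9.3e-2, 61.3),
    Hit("At1g26710.1", "1FO8", 0.21, 60.1),
]

print(all(passes_selection(h) for h in TABLE2_HITS))          # True
# Hypothetical hits that fail one criterion each (not from the study).
print(passes_selection(Hit("toyA", "tmplX", 0.50, 80.0)))     # False
print(passes_selection(Hit("toyB", "tmplY", 0.01, 55.0)))     # False
```

Requiring both conditions means a marginal E-value is tolerated only when the predicted and observed secondary structures agree well, which is the rationale given for working in the twilight zone.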
At this stage, only two sequences were considered strong positive candidates: At4g01210.1 and At5g03795.1, which were also found using the profile HMM approach and which display an HMM GT signature for GT4 and GT47, respectively. Two other sequences, At4g14100.1 and At4g14360.1, can be considered potential candidates. Although we were unable to identify any GT signature, these two sequences appear to fulfill other interesting criteria such as type II membrane topology and a predicted α/β fold (data not shown).

Fold Recognition Analysis. The fold recognition program used in the present work is ProFit (ProCeryon Biosciences GmbH), which allows sequence/structure comparison at the 3D level.27 It is based on the use of knowledge-based potentials that are derived from our local 3D library (including GT and non-GT structures). In practice, all of the protein sequences derived from the test Arabidopsis proteome were “threaded” on each representative of the 3D library to determine the ones that are more likely to adopt one of the known GT folds. A total of 510 240 models were built by aligning the 5315 Arabidopsis sequences against the 96 structures of the test 3D library. The models were scored and ranked according to their threading index (ThdIdx) values, which are considered to better assess the global quality of a model. The ThdIdx values calculated for each model are scattered between -25.29 and 71.35, but different maximum values are obtained for each fold family (Figure 3A). The method was calibrated using a defined set of GT and non-GT protein sequences to determine the optimal cutoff value. To be highly selective and minimize false positives, a ThdIdx value of 50 was selected. Fifty-nine protein sequences gave a ThdIdx value above 50 for a glycosyltransferase fold (either GT-A or GT-B fold) (Supplementary Table 4 in Supporting Information). As shown in Figure 3B, maximal ThdIdx values for other fold groups are less than 50 for these 59 candidate sequences. The same filtering process used for the PSI-BLAST/Sov approach was applied, and this led to the selection of eight candidate sequences (Table 3).
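The number of threading models and the ThdIdx cutoff can be illustrated with a short sketch. The library composition uses the group sizes given in the text, under the assumption that the one GT-A and two GT-B variants are already counted within the 21 GT-A and 23 GT-B structures (this is the reading under which the groups sum to 96). The record layout and the toy scores are hypothetical and do not represent ProFit output.

```python
# Fold-group sizes of the test 3D library as stated in the text.
# Assumption: the GT-A and GT-B counts already include the variant folds.
LIBRARY = {
    "GT-A": 21,
    "GT-B": 23,
    "GT-A like": 22,
    "GT-B like": 10,
    "Rossmann": 8,
    "random": 12,
}
N_QUERIES = 5315  # sequences in the test Arabidopsis proteome

n_templates = sum(LIBRARY.values())
n_models = N_QUERIES * n_templates  # one model per query/template pair

GT_FOLDS = {"GT-A", "GT-B"}


def best_gt_model(models, cutoff=50.0):
    """Return the top-scoring GT-A/GT-B model if its ThdIdx exceeds the cutoff.

    `models` is an iterable of (template, fold_group, thdidx) tuples for a
    single query sequence; returns None when no GT model passes.
    """
    gt_models = [m for m in models if m[1] in GT_FOLDS]
    if not gt_models:
        return None
    best = max(gt_models, key=lambda m: m[2])
    return best if best[2] > cutoff else None


# Toy scores for one query (illustrative values only).
toy_models = [
    ("tmplA", "GT-B", 64.4),
    ("tmplB", "GT-B like", 47.2),
    ("tmplC", "random", -3.1),
]
print(n_templates, n_models)      # 96 510240
print(best_gt_model(toy_models))  # ('tmplA', 'GT-B', 64.4)
```

The product 5315 × 96 reproduces the 510 240 models quoted above.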
Among the resulting sequences, and after extensive analysis, four sequences were considered as positive hits (At5g28910.2, At5g03795.1, At5g25330.1, and At5g22070.1) and another sequence as a potential candidate (At3g03210.1). At5g03795.1 was also found using the two other bioinformatics approaches (Tables 1 and 2), and At5g25330.1 using the profile HMM method (Table 1). A larger number of positive hits would Journal of Proteome Research • Vol. 8, No. 2, 2009 747

research articles

Fasmer Hansen et al.

Figure 3. Fold recognition analysis of the test Arabidopsis proteome. (A) Minimum and maximum threading index (ThdIdx) values calculated for each model generated for the 5315 Arabidopsis protein sequences within each fold group. (B) Minimum and maximum ThdIdx values calculated for each model generated for the 59 selected sequences within each fold group. The selected sequences are those that gave a ThdIdx above 50 for a GT-A or GT-B fold.

Table 3. Eight Putative Glycosyltransferase Sequences Selected As a Result of 3D-Fold Recognition Analysis of the Test Arabidopsis Proteome

| sequence name, AGI ID (b) | protein length | threading index | PDB code | fold | EST (a) | annotations |
| --- | --- | --- | --- | --- | --- | --- |
| At5g28910.2 | 535 | 71.35 | 2DE0 | GT-B | 3 | Similar to unknown proteins |
| At5g03795.1 | 518 | 64.39 | 2IYF | GT-B | 8 | Similar to exostosin family protein. Similar to EXO C. melo. Contains Exostosin-like domain |
| At5g25330.1 | 366 | 54.60 | 2GAK | GT-A | 1 | Similar to unknown proteins. Contains plant DUF266 domain |
| At5g64020.1 | 408 | 53.44 | 1FO8 | GT-A | 16 | Similar to unknown proteins. Similar to leaf senescence protein-like in O. sativa. Contains plant DUF231 domain |
| At5g01360.1 | 434 | 53.41 | 1XV5 | GT-B | 6 | Similar to unknown proteins. Contains plant DUF231 domain |
| At5g22070.1 | 362 | 52.69 | 1RRV | GT-B | 29 | Similar to unknown proteins. Contains plant DUF266 domain |
| At5g37300.1 | 481 | 51.81 | 2EX0 | GT-B | 14 | Similar to unknown proteins. Contains plant DUF1298 domain |
| At3g03210.1 | 368 | 50.79 | 1LL2 | GT-A | 13 | Similar to conserved hypothetical protein from M. truncatula |

(a) Expressed Sequence Tags as indicated in TAIR. (b) Strong positive candidates are indicated in bold.

certainly have been obtained using a lower cutoff value (i.e., in the threading index range 45-50), but with a higher risk of false positives and greater difficulty in clearly separating the GT-A/GT-B folds from the GT-A like/GT-B like folds.
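The effect of the cutoff on the size of the candidate list can be illustrated with the ThdIdx values listed in Table 3. The sketch below is a hedged illustration only: since no values below 50 are reported for these sequences, it can show only what stricter thresholds would retain, and the thresholds 52, 55, and 60 as well as the helper name `retained` are arbitrary choices made for this example.

```python
# ThdIdx values as listed in Table 3.
table3_thdidx = {
    "At5g28910.2": 71.35,
    "At5g03795.1": 64.39,
    "At5g25330.1": 54.60,
    "At5g64020.1": 53.44,
    "At5g01360.1": 53.41,
    "At5g22070.1": 52.69,
    "At5g37300.1": 51.81,
    "At3g03210.1": 50.79,
}


def retained(scores, cutoff):
    """Return the sequence names scoring above `cutoff`, best score first."""
    kept = [name for name, score in scores.items() if score > cutoff]
    return sorted(kept, key=lambda name: scores[name], reverse=True)


# 50 is the cutoff used in the study; the other thresholds are illustrative.
for cutoff in (50, 52, 55, 60):
    print(cutoff, len(retained(table3_thdidx, cutoff)))
# 50 8
# 52 6
# 55 2
# 60 2
```

Six of the eight sequences in Table 3 lie between 50 and 55, so the retained list is sensitive to where the threshold is placed in that range.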

Evaluation of the Candidate GT Sequences and Discussion

The bioinformatic strategy used in the present study (Figure 1) allowed us to identify more than 150 putative GT sequences (listed in Supplementary Tables 2-4 in Supporting Information). Removal of protein sequences with a clear non-GT annotation, as well as those with no EST, led to the selection of approximately 40 protein sequences (listed in Tables 1-3).


Among them, 20 protein sequences were considered strongly positive after extensive sequence analysis, and they are discussed further below.

At5g03795.1 is the only accession that was found using all three bioinformatic approaches. Its annotation in TAIR indicates that it is similar to exostosin (EXT) family proteins. EXT genes have been shown to encode bifunctional GTs involved in chain elongation during heparan sulfate biosynthesis.47 The N-terminal domain, which adds β-1,4-glucuronic acid residues, belongs to family GT47, while the C-terminal domain, which adds α-1,4-N-acetylglucosaminyl residues, belongs to family GT64. Family GT47, which is predicted to adopt a GT-B fold,10 currently comprises 39 Arabidopsis sequences. Among them,


Figure 4. Phylogenetic tree of the A. thaliana DUF266 protein family. DUF266 annotated protein sequences were retrieved from TAIR and aligned using ClustalW. The phylogenetic tree was created using Mega4.1.64 Accessions in bold characters correspond to the selected sequences of the multiple alignment displayed in Figure 5.

only three are known Arabidopsis GTs involved in plant cell wall polysaccharide biosynthesis, with specificities including xyloglucan β-galactosyltransferase,48 arabinan α-L-arabinosyltransferase,49 and xylogalacturonan β-xylosyltransferase.50 Although the At5g03795.1 accession number is not present in the CAZy database, a close examination of its peptide sequence revealed that it is partly identical to a nonannotated Arabidopsis protein sequence (At5g03800) of family GT47. The explanation is that the original At5g03800 gene was incorrectly assigned and was later split into the two current loci, At5g03800 and At5g03795, in the TAIR6 release. The correct accession number is therefore At5g03795.1 (protein code Q9FFN2), and it belongs to subgroup C of family GT47.50 In retrospect, this sequence was considered a robust blind test for the three bioinformatics methods.

At4g01210.1 was identified using PSI-BLAST/Sov and profile HMM search. It is annotated as being similar to Glycosyltransferase 1 domain (PF00534 in the PFAM database). Both its HMM signature and the best hit obtained using PSI-BLAST/Sov are indicative of a distant relationship to GT4 protein members. Family GT4, which belongs to the GT-B structural superfamily, is one of the largest families in CAZy and comprises numerous and very different GT activities. The At4g01210.1 protein sequence comprises 981 amino acids with one TMD at its N-terminus, and the GT-B domain is predicted to cover the region [150-520]. An excellent structural conservation between the predicted secondary structure elements of the GT-B domain and those

observed in various crystal structures from family GT4 was found using the fold recognition program PHYRE (data not shown). From these results, At4g01210.1 can be confidently assigned to the GT-B fold family. A BLAST search of the C-terminal region of At4g01210.1 [520-981] failed to give any indication of its function, if any.

At5g25330.1 was found using profile HMM and the ProFit fold recognition program. It is annotated as containing a plant DUF266 domain. A similar annotation is observed for 12 other protein coding regions listed in Table 1, and one in Table 3. A search of the TAIR database for other DUF266 sequences indicated that this family comprises 23 different accessions (Figure 4), with overall amino acid sequence identity ranging from 15 to 80%. As stated above, this protein domain is believed to be distantly related to members of the polyspecific family GT14. Using PHYRE and the hydrophobic cluster analysis method (HCA), we confirmed the structural similarity between DUF266-annotated genes and the murine Core 2 β6-N-acetylglucosaminyltransferase (C2GnT, a key enzyme in the biosynthesis of branched O-glycans) that has been recently crystallized.51 Structural data show that this enzyme possesses the canonical GT-A fold but lacks the characteristic metal ion binding DxD motif. Figure 5 shows the level of sequence conservation in the most conserved regions between DUF266 protein members and C2GnT. The first region mostly corresponds to the Rossmann-type nucleotide binding domain encompassing the first 120 amino acids of the catalytic domain. The second region is


Figure 5. Multiple sequence alignment of DUF266 protein members with the murine Core 2 β1,6-N-acetylglucosaminyltransferase (C2GnT) of family GT14. The selected Arabidopsis protein sequences are considered to be representative members of the DUF266 protein family (see Figure 4). Residues that are invariant are indicated in white on a black background and other conserved positions are shaded in gray. Protein sequence and secondary structure prediction for C2GnT were taken from the PDB file 2GAK. Secondary structures are represented as arrows for β-strands and cylinders for α-helices, and they are named as in Pak et al.51 The putative catalytic residue (Glu) is marked with an asterisk.

Figure 6. The three conserved peptide regions of the large fucosyltransferase superfamily. (A) Alignment of At5g28910.2 with fucosyltransferase sequences from families GT11, GT23, GT37, GT65, and GT68 that constitute the large fucosyltransferase superfamily characterized by the presence of three conserved peptide regions, labeled I-III.52,53 In the human FUT8 enzyme, mutation of the residues marked with an asterisk into alanine yielded an inactive enzyme.14 (B) Alignment of DUF246 annotated A. thaliana sequences (Table 1) with members of family GT65. The only invariant Arg residue is indicated in white on a black background. The other conserved positions throughout this large family are shaded in gray. The accession numbers (from UniProtKB/Swiss-Prot) for the non-plant sequences are: P19526 (human, hFUT1), P91200 (Caenorhabditis elegans, CE2FT1), Q9X3N7 (Helicobacter pylori, HpFucT2), Q43966 (Azorhizobium caulinodans, AcNodZ), Q9H488 (human, hPofut1), Q9V6X7 (Drosophila melanogaster, DmPofut1), Q9Y2G5 (human, hPofut2).

a common structural motif that is observed in all GT-A enzymes and that typically comprises residues that interact with both the donor and acceptor substrates.10 Of particular interest is the conserved acidic position, Glu-320 in C2GnT, that is expected to be the catalytic base.51 The third region corresponds to the C-terminus of the catalytic domain and comprises three invariant residues. One of them, Lys-401 in C2GnT, is positioned to interact with the diphosphate group of the nucleotide sugar and is also found conserved in the plant DUF266 family.

At5g28910.2 gave a very high threading index value for the mammalian α6-fucosyltransferase (FUT8) that classifies into GT23 (PDB code 2DE0, a variant of the GT-B fold). This plant sequence comprises 535 amino acids with one TMD at its N-terminus. With the use of PHYRE and HCA, we were able to


clearly identify the three peptide regions that are specific for the large fucosyltransferase superfamily.52,53 This superfamily comprises all of the known α2- and α6-fucosyltransferases (from prokaryotes and eukaryotes) as well as the protein-O-fucosyltransferases, corresponding to various GT families in CAZy (GT11, GT23, GT37, GT65, and GT68). Family GT37 is represented by the plant xyloglucan α2-fucosyltransferases. Figure 6A displays the sequence alignment of the At5g28910.2 sequence with representative members of this fucosyltransferase superfamily. Of particular interest is peptide motif I, which comprises the invariant arginine (R) catalytic residue14,54 that is also conserved in At5g28910.2. When BLAST was used, only one sequence (At5g28960) was identified in the Arabidopsis genome as being similar to At5g28910.2. However, the three protein sequences listed in Table 1 that are annotated as


containing a DUF246 domain have been shown to display a GT65 HMM signature, and thus, they are expected to belong to the same clan. The monospecific family GT65 is represented by the protein-O-fucosyltransferase 1 (Pofut1), an enzyme that adds a fucose residue to Ser or Thr residues in Epidermal Growth Factor repeats.55 Family GT65 also comprises a unique nonannotated A. thaliana sequence (At3g05320, 445 aa), which displays a type II membrane protein topology. BLAST analysis of At3g05320 detected a DUF246 domain but failed to detect significant sequence similarity with the human Pofut1 protein sequence, thus indicating a distant relationship with the animal Pofut1 sequences. A search of DUF246 annotated sequences in TAIR revealed that it is a large multigene family of unknown function comprising 34 protein coding sequences (see Supplementary Figure 1 in Supporting Information). Since GT65 belongs to the large fucosyltransferase superfamily, we searched the three DUF246-annotated proteins of Table 1 for structural similarities and for the presence of specific peptide motifs such as those presented in Figure 6A. Figure 6B shows the presence of the three conserved peptide regions (I-III), thus confirming that they belong, like At5g28910.2, to the fucosyltransferase superfamily. Nevertheless, it remains to be determined whether the presence of such conserved motifs is always indicative of a fucosyltransferase activity.

The most promising candidate sequences identified in the present study have the typical type II membrane topology of Golgi-resident GTs. For some of them (At1g51630.1, At4g01210.1, At5g11730.1), recent data based on the use of a proteomics technique for the subcellular localization of A. thaliana integral membrane proteins confirmed their Golgi localization56,57 (K. Lilley, P. Dupree, personal communication).
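The kind of check behind the statements on invariant residues (for example, the single invariant Arg in Figure 6B) can be sketched as a scan for alignment columns in which every sequence carries the same residue. The following is a minimal sketch, assuming the alignment is already available as equal-length gapped strings; the function name `invariant_columns` and the toy alignment are invented for illustration and do not reproduce the alignments of Figures 5 and 6.

```python
def invariant_columns(alignment):
    """Return {column_index: residue} for fully conserved, gap-free columns.

    `alignment` maps sequence names to aligned strings of equal length,
    with '-' marking gaps. Column indices are 0-based.
    """
    rows = list(alignment.values())
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned sequences must all have the same length")

    conserved = {}
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        # A column is invariant when it holds one residue type and no gap.
        if len(residues) == 1 and "-" not in residues:
            conserved[index] = column[0]
    return conserved


# Toy alignment: names and residues are invented for illustration only.
toy_alignment = {
    "seq_a": "HPRQW-L",
    "seq_b": "YPRDWAL",
    "seq_c": "FARQ-AL",
}

print(invariant_columns(toy_alignment))  # {2: 'R', 6: 'L'}
```

Strict identity is the most conservative criterion; the gray shading for "other conserved positions" in the figures would require a looser rule (for instance, residue classes), which this sketch does not attempt.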
In an effort to identify new plant cell wall GTs, Egelund and colleagues58 adopted a bioinformatic approach based on the use of successive filters (from 1D to 3D). TMHMM was used first to retrieve protein sequences with one or two N-terminal transmembrane domains, followed by the SUPERFAMILY prediction server, which uses a library of HMMs representing all proteins of known structure.59 The resulting sequences were further analyzed using the fold recognition servers 3D-PSSM60 and mGenTHREADER.61 As a result of this study, 27 putative GTs were obtained, and four of them later gave rise to a new family in CAZy, family GT77, when their biochemical function was elucidated.62,63 The approach we used is different since it relies on the use of separate (and not successive) methods that act at the 1D, 2D, and 3D level. The comparison demonstrates how strongly the strategy and filters used to search the translated genome of Arabidopsis influence the outcome, since only two sequences were retrieved by both studies (At4g01210.1 and At5g11730.1).

Our list of candidate genes is probably far from complete. The use of filters to reduce the proteome size probably led to the removal of other potential candidates (e.g., type II membrane proteins may be missed if the first exon is missing in the gene model). Although genome annotation has made remarkable progress since the first draft of the plant model genome, and annotation in the TAIR database is considered to be of a high standard, many genes probably remain incorrectly annotated. Therefore, a close examination of all protein sequences listed in Supplementary Tables 2-4 in Supporting Information is advisable before discarding those sequences that are annotated as similar to other (non-GT) protein functions. However, in the present study, our main objective was to identify a few strong candidates among all putative

sequences and to tentatively demonstrate that they belong to the large GT family.
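Because the three methods were run separately rather than as successive filters, their hit lists can be compared directly. The sketch below does this bookkeeping with Python sets. It is deliberately partial: the per-method sets contain only the accessions whose method of discovery is stated in this section (plus the eight ProFit hits of Table 3), not the full contents of Tables 1 and 2, and the variable names are illustrative.

```python
from collections import Counter

# Partial hit lists: only memberships stated in this section are included.
hits = {
    "profile HMM": {"At5g03795.1", "At5g25330.1", "At4g01210.1"},
    "PSI-BLAST/Sov": {"At5g03795.1", "At4g01210.1"},
    "ProFit": {
        "At5g28910.2", "At5g03795.1", "At5g25330.1", "At5g64020.1",
        "At5g01360.1", "At5g22070.1", "At5g37300.1", "At3g03210.1",
    },
}

# Number of methods that retrieved each accession.
counts = Counter(acc for accessions in hits.values() for acc in accessions)

found_by_all = {acc for acc, n in counts.items() if n == len(hits)}
found_by_several = {
    acc: {method for method, accessions in hits.items() if acc in accessions}
    for acc, n in counts.items()
    if n >= 2
}

print(found_by_all)  # {'At5g03795.1'}
for acc in sorted(found_by_several):
    print(acc, sorted(found_by_several[acc]))
```

On these partial lists, At5g03795.1 is the only accession retrieved by all three methods, in line with the text, while At4g01210.1 and At5g25330.1 are each retrieved by two. The same set operations would apply when comparing a candidate list with that of another study.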

Conclusion

With the use of various bioinformatic strategies, we and others58 demonstrated that the number of plant GT genes is underestimated. In the present study, 20 A. thaliana sequences were identified as new GT candidates. We demonstrated a clear GT signature for the DUF266 and DUF246 protein families that are distantly related to CAZy families GT14 and GT65, respectively, and for At5g28910.2 that is related to a large fucosyltransferase superfamily. Although these results will contribute to the general annotation of the A. thaliana genome, one of the challenges currently facing glycobiologists in this field is the biochemical characterization of the proteins encoded by these numerous GT genes.

Abbreviations: EST, expressed sequence tag; GT, glycosyltransferase; HCA, hydrophobic cluster analysis; ThdIdx, threading index; TMD, transmembrane domain.

Note Added: During the reviewing process, a paper from Zhou et al. was published online in the Plant Journal. The authors report on the isolation of a brittle culm mutant (bc10) from rice. The loss-of-function mutation of BC10 causes a reduction in the levels of cellulose and arabinogalactan proteins in cell walls. The gene responsible for the phenotype is a member of the DUF266 gene family, and preliminary data suggest that BC10 is a glycosyltransferase, thus supporting the validity of the bioinformatic approach used in the present study.

Acknowledgment. This work was supported by the sixth Framework Programme of the European Union (Contract number MRTN-CT-2004-512265, "Wallnet") and the Ministry of Education of the Czech Republic (Contract number MSM0021622413). Ph.D. student Nicolas Garnier (IBCP, Lyon) is acknowledged for his help in creating Python scripts.

Supporting Information Available: Supplementary Table 1, library of 96 crystal structures used in the fold recognition and PSI-BLAST/structural overlapping approaches. 3D structures fall into 6 fold groups referred to as GT-A, GT-B, GT-A like, GT-B like, Rossmann, and Random. Structures of the Rossmann and Random fold groups were selected from the SCOP database.31 Supplementary Table 2, 51 putative glycosyltransferase sequences identified as a result of profile HMM search of the test Arabidopsis proteome. Supplementary Table 3, 56 putative glycosyltransferase sequences identified as a result of the combined PSI-BLAST and structural overlapping approach applied to the test Arabidopsis proteome. Supplementary Table 4, 59 putative glycosyltransferase sequences identified as a result of 3D-fold recognition analysis of the test Arabidopsis proteome. Supplementary Figure 1, phylogenetic tree of the Arabidopsis thaliana DUF246 protein family. DUF246 annotated protein sequences were retrieved from TAIR and aligned using ClustalW. The phylogenetic tree was created using Mega4.1.64 This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Taniguchi, N.; Honke, K.; Fukuda, M. Handbook of Glycosyltransferase and Related Genes; Springer: Tokyo, 2002.
(2) Lerouxel, O.; Cavalier, D. M.; Liepman, A. H.; Keegstra, K. Biosynthesis of plant cell wall polysaccharides - a complex process. Curr. Opin. Plant Biol. 2006, 9 (6), 621–630.

(3) Somerville, C. Cellulose synthesis in higher plants. Annu. Rev. Cell Dev. Biol. 2006, 22, 53–78.
(4) Carpita, N.; McCann, M. The cell wall. In Biochemistry and Molecular Biology of Plants; Buchanan, B. B., Gruissem, W., Jones, R. L., Eds.; American Society of Plant Physiologists: Rockville, MD, 2000; pp 52–108.
(5) Breton, C.; Mucha, J.; Jeanneau, C. Structural and functional features of glycosyltransferases. Biochimie 2001, 83 (8), 713–718.
(6) Paulson, J. C.; Colley, K. J. Glycosyltransferases. Structure, localization, and control of cell type-specific glycosylation. J. Biol. Chem. 1989, 264 (30), 17615–17618.
(7) Radominska-Pandya, A.; Ouzzine, M.; Fournel-Gigleux, S.; Magdalou, J. Structure of UDP-glucuronosyltransferases in membranes. Methods Enzymol. 2005, 400, 116–147.
(8) Coutinho, P. M.; Deleury, E.; Davies, G. J.; Henrissat, B. An evolving hierarchical family classification for glycosyltransferases. J. Mol. Biol. 2003, 328 (2), 307–317.
(9) Coutinho, P. M.; Stam, M.; Blanc, E.; Henrissat, B. Why are there so many carbohydrate-active enzyme-related genes in plants. Trends Plant Sci. 2003, 8 (12), 563–565.
(10) Breton, C.; Šnajdrová, L.; Jeanneau, C.; Koca, J.; Imberty, A. Structures and mechanisms of glycosyltransferases. Glycobiology 2006, 16 (2), 29R–37R.
(11) Lesk, A. M. NAD-binding domains of dehydrogenases. Curr. Opin. Struct. Biol. 1995, 5 (6), 775–783.
(12) Chiu, C. P.; Lairson, L. L.; Gilbert, M.; Wakarchuk, W. W.; Withers, S. G.; Strynadka, N. C. Structural analysis of the alpha-2,3-sialyltransferase Cst-I from Campylobacter jejuni in apo and substrate-analogue bound forms. Biochemistry 2007, 46 (24), 7196–7204.
(13) Chiu, C. P.; Watts, A. G.; Lairson, L. L.; Gilbert, M.; Lim, D.; Wakarchuk, W. W.; Withers, S. G.; Strynadka, N. C. Structural analysis of the sialyltransferase CstII from Campylobacter jejuni in complex with a substrate analog. Nat. Struct. Mol. Biol. 2004, 11 (2), 163–170.
(14) Ihara, H.; Ikeda, Y.; Toma, S.; Wang, X.; Suzuki, T.; Gu, J.; Miyoshi, E.; Tsukihara, T.; Honke, K.; Matsumoto, A.; Nakagawa, A.; Taniguchi, N. Crystal structure of mammalian alpha1,6-fucosyltransferase, FUT8. Glycobiology 2007, 17 (5), 455–466.
(15) Sun, H. Y.; Lin, S. W.; Ko, T. P.; Pan, J. F.; Liu, C. L.; Lin, C. N.; Wang, A. H.; Lin, C. H. Structure and mechanism of H. pylori fucosyltransferase: A basis for lipopolysaccharide variation and inhibitor design. J. Biol. Chem. 2007, 282 (13), 9973–9982.
(16) Lovering, A. L.; de Castro, L. H.; Lim, D.; Strynadka, N. C. Structural insight into the transglycosylation step of bacterial cell-wall biosynthesis. Science 2007, 315 (5817), 1402–1405.
(17) Yuan, Y.; Barrett, D.; Zhang, Y.; Kahne, D.; Sliz, P.; Walker, S. Crystal structure of a peptidoglycan glycosyltransferase suggests a model for processive glycan chain synthesis. Proc. Natl. Acad. Sci. U.S.A. 2007, 104 (13), 5348–5353.
(18) Igura, M.; Maita, N.; Kamishikiryo, J.; Yamada, M.; Obita, T.; Maenaka, K.; Kohda, D. Structure-guided identification of a new catalytic motif of oligosaccharyltransferase. EMBO J. 2008, 27 (1), 234–243.
(19) Lairson, L. L.; Henrissat, B.; Davies, G. J.; Withers, S. G. Glycosyltransferases: structures, functions, and mechanisms. Annu. Rev. Biochem. 2008, 77, 521–555.
(20) Godzik, A. Fold recognition methods. Methods Biochem. Anal. 2003, 44, 525–546.
(21) Breton, C.; Heissigerova, H.; Jeanneau, C.; Moravcova, J.; Imberty, A. Comparative aspects of glycosyltransferases. Biochem. Soc. Symp. 2002, 69, 23–32.
(22) Franco, O. L.; Rigden, D. J. Fold recognition analysis of glycosyltransferase families: further members of structural superfamilies. Glycobiology 2003, 13 (10), 707–712.
(23) Wimmerova, M.; Engelsen, S. B.; Bettler, E.; Breton, C.; Imberty, A. Combining fold recognition and exploratory data analysis for searching for glycosyltransferases in the genome of Mycobacterium tuberculosis. Biochimie 2003, 85, 691–700.
(24) Saini, H. K.; Fischer, D. Structural and functional insights into Mimivirus ORFans. BMC Genomics 2007, 8, 115.
(25) Eddy, S. R. Profile hidden Markov models. Bioinformatics 1998, 14 (9), 755–763.
(26) Geourjon, C.; Combet, C.; Blanchet, C.; Deleage, G. Identification of related proteins with weak sequence identity using secondary structure information. Protein Sci. 2001, 10 (4), 788–797.
(27) Sippl, M. J.; Flockner, H. Threading thrills and threats. Structure 1996, 4 (1), 15–19.
(28) Emanuelsson, O.; Brunak, S.; von Heijne, G.; Nielsen, H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc. 2007, 2 (4), 953–971.

(29) Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E. L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 2001, 305 (3), 567–580.
(30) Sadovskaya, N. S.; Sutormin, R. A.; Gelfand, M. S. Recognition of transmembrane segments in proteins: review and consistency-based benchmarking of internet servers. J. Bioinform. Comput. Biol. 2006, 4 (5), 1033–1056.
(31) Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995, 247 (4), 536–540.
(32) Thompson, J. D.; Higgins, D. G.; Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22), 4673–4680.
(33) Combet, C.; Blanchet, C.; Geourjon, C.; Deléage, G. NPS@: Network Protein Sequence Analysis. Trends Biochem. Sci. 2000, 25 (3), 147–150.
(34) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17), 3389–3402.
(35) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215 (3), 403–410.
(36) Kabsch, W.; Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22 (12), 2577–2637.
(37) Geourjon, C.; Deleage, G. SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments. Comput. Appl. Biosci. 1995, 11 (6), 681–684.
(38) King, R. D.; Saqi, M.; Sayle, R.; Sternberg, M. J. DSC: public domain protein secondary structure predication. Comput. Appl. Biosci. 1997, 13 (4), 473–474.
(39) Rost, B.; Sander, C.; Schneider, R. PHD--an automatic mail server for protein secondary structure prediction. Comput. Appl. Biosci. 1994, 10 (1), 53–60.
(40) Errami, M.; Geourjon, C.; Deleage, G. Detection of unrelated proteins in sequences multiple alignments by using predicted secondary structures. Bioinformatics 2003, 19 (4), 506–512.
(41) Zemla, A.; Venclovas, C.; Fidelis, K.; Rost, B. A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins 1999, 34 (2), 220–223.
(42) Bennett-Lovsey, R. M.; Herbert, A. D.; Sternberg, M. J.; Kelley, L. A. Exploring the extremes of sequence/structure space with ensemble fold recognition in the program Phyre. Proteins 2008, 70 (3), 611–625.
(43) Gaboriaud, C.; Bissery, V.; Benchetrit, T.; Mornon, J. P. Hydrophobic cluster analysis: an efficient new way to compare and analyse amino acid sequences. FEBS Lett. 1987, 224 (1), 149–155.
(44) Shi, S.; Stanley, P. Protein O-fucosyltransferase 1 is an essential component of Notch signaling pathways. Proc. Natl. Acad. Sci. U.S.A. 2003, 100 (9), 5234–5239.
(45) Dhugga, K. S.; Tiwari, S. C.; Ray, P. M. A reversibly glycosylated polypeptide (RGP1) possibly involved in plant cell wall synthesis: purification, gene cloning, and trans-Golgi localization. Proc. Natl. Acad. Sci. U.S.A. 1997, 94 (14), 7679–7684.
(46) Konishi, T.; Takeda, T.; Miyazaki, Y.; Ohnishi-Kameyama, M.; Hayashi, T.; O'Neill, M. A.; Ishii, T. A plant mutase that interconverts UDP-arabinofuranose and UDP-arabinopyranose. Glycobiology 2007, 17 (3), 345–354.
(47) Lind, T.; Tufaro, F.; McCormick, C.; Lindahl, U.; Lidholt, K. The putative tumor suppressors EXT1 and EXT2 are glycosyltransferases required for the biosynthesis of heparan sulfate. J. Biol. Chem. 1998, 273 (41), 26265–26268.
(48) Madson, M.; Dunand, C.; Li, X.; Verma, R.; Vanzin, G. F.; Caplan, J.; Shoue, D. A.; Carpita, N. C.; Reiter, W. D. The MUR3 gene of Arabidopsis encodes a xyloglucan galactosyltransferase that is evolutionarily related to animal exostosins. Plant Cell 2003, 15 (7), 1662–1670.
(49) Harholt, J.; Jensen, J. K.; Sorensen, S. O.; Orfila, C.; Pauly, M.; Scheller, H. V. ARABINAN DEFICIENT 1 is a putative arabinosyltransferase involved in biosynthesis of pectic arabinan in Arabidopsis. Plant Physiol. 2006, 140 (1), 49–58.
(50) Jensen, J. K.; Sorensen, S. O.; Harholt, J.; Geshi, N.; Sakuragi, Y.; Moller, I.; Zandleven, J.; Bernal, A. J.; Jensen, N. B.; Sorensen, C.; Pauly, M.; Beldman, G.; Willats, W. G.; Scheller, H. V. Identification of a xylogalacturonan xylosyltransferase involved in pectin biosynthesis in Arabidopsis. Plant Cell 2008, 20 (5), 1289–1302.

(51) Pak, J. E.; Arnoux, P.; Zhou, S.; Sivarajah, P.; Satkunarajah, M.; Xing, X.; Rini, J. M. X-ray crystal structure of leukocyte type core 2 beta1,6-N-acetylglucosaminyltransferase. Evidence for a convergence of metal ion-independent glycosyltransferase mechanism. J. Biol. Chem. 2006, 281 (36), 26693–26701.
(52) Martinez-Duncker, I.; Mollicone, R.; Candelier, J. J.; Breton, C.; Oriol, R. A new superfamily of protein-O-fucosyltransferases, alpha2-fucosyltransferases, and alpha6-fucosyltransferases: phylogeny and identification of conserved peptide motifs. Glycobiology 2003, 13 (12), 1C–5C.
(53) Oriol, R.; Mollicone, R.; Cailleau, A.; Balanzino, L.; Breton, C. Divergent evolution of fucosyltransferase genes from vertebrates, invertebrates and bacteria. Glycobiology 1999, 9 (4), 323–334.
(54) Chazalet, V.; Uehara, K.; Geremia, R. A.; Breton, C. Identification of essential amino acids in the Azorhizobium caulinodans fucosyltransferase NodZ. J. Bacteriol. 2001, 183 (24), 7067–7075.
(55) Stahl, M.; Uemura, K.; Ge, C.; Shi, S.; Tashima, Y.; Stanley, P. Roles of Pofut1 and O-fucose in mammalian Notch signaling. J. Biol. Chem. 2008, 283 (20), 13638–13651.
(56) Dunkley, T. P.; Watson, R.; Griffin, J. L.; Dupree, P.; Lilley, K. S. Localization of organelle proteins by isotope tagging (LOPIT). Mol. Cell. Proteomics 2004, 3 (11), 1128–1134.
(57) Sadowski, P. G.; Dunkley, T. P.; Shadforth, I. P.; Dupree, P.; Bessant, C.; Griffin, J. L.; Lilley, K. S. Quantitative proteomic approach to study subcellular localization of membrane proteins. Nat. Protoc. 2006, 1 (4), 1778–1789.
(58) Egelund, J.; Skjot, M.; Geshi, N.; Ulvskov, P.; Petersen, B. L. A complementary bioinformatics approach to identify potential plant cell wall glycosyltransferase-encoding genes. Plant Physiol. 2004, 136 (1), 2609–2620.
(59) Gough, J.; Karplus, K.; Hughey, R.; Chothia, C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 2001, 313 (4), 903–919.
(60) Kelley, L. A.; MacCallum, R. M.; Sternberg, M. J. E. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 2000, 299 (2), 499–520.
(61) McGuffin, L. J.; Jones, D. T. Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics 2003, 19 (7), 874–881.
(62) Egelund, J.; Obel, N.; Ulvskov, P.; Geshi, N.; Pauly, M.; Bacic, A.; Petersen, B. L. Molecular characterization of two Arabidopsis thaliana glycosyltransferase mutants, rra1 and rra2, which have a reduced residual arabinose content in a polymer tightly associated with the cellulosic wall residue. Plant Mol. Biol. 2007, 64 (4), 439–451.
(63) Egelund, J.; Petersen, B. L.; Motawia, M. S.; Damager, I.; Faik, A.; Olsen, C. E.; Ishii, T.; Clausen, H.; Ulvskov, P.; Geshi, N. Arabidopsis thaliana RGXT1 and RGXT2 encode Golgi-localized (1,3)-alpha-D-xylosyltransferases involved in the synthesis of pectic rhamnogalacturonan-II. Plant Cell 2006, 18 (10), 2593–2607.
(64) Tamura, K.; Dudley, J.; Nei, M.; Kumar, S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 2007, 24 (8), 1596–1599.