The Bologna Annotation Resource: a Non Hierarchical Method for the

The procedure that exploits information from molecular function Gene Ontology and PDB/SCOP databases allows increasing the annotated genome fraction w...
0 downloads 0 Views 3MB Size
The Bologna Annotation Resource: a Non Hierarchical Method for the Functional and Structural Annotation of Protein Sequences Relying on a Comparative Large-Scale Genome Analysis Lisa Bartoli,† Ludovica Montanucci,† Raffaele Fronza,† Pier Luigi Martelli,† Piero Fariselli,† Luciana Carota,‡ Giacinto Donvito,§ Giorgio P. Maggi,§,| and Rita Casadio*,† Biocomputing Group, CIRB/Dept of Biology, University of Bologna, Italy, CNAF-INFN, National Institute of Nuclear Physics, Bologna, Italy, BA-INFN, National Institute of Nuclear Physics, Bari, Italy, and Physics Department, Politecnico di Bari, Italy Received March 4, 2009

Abstract: Protein sequence annotation is a major challenge in the postgenomic era. Thanks to the availability of complete genomes and proteomes, protein annotation has recently taken invaluable advantage from crossgenome comparisons. In this work, we describe a new non hierarchical clustering procedure characterized by a stringent metric which ensures a reliable transfer of function between related proteins even in the case of multidomain and distantly related proteins. The method takes advantage of the comparative analysis of 599 completely sequenced genomes, both from prokaryotes and eukaryotes, and of a GO and PDB/SCOP mapping over the clusters. A statistical validation of our method demonstrates that our clustering technique captures the essential information shared between homologous and distantly related protein sequences. By this, uncharacterized proteins can be safely annotated by inheriting the annotation of the cluster. We validate our method by blindly annotating other 201 genomes and finally we develop BAR (the Bologna Annotation Resource), a prediction server for protein functional annotation based on a total of 800 genomes (publicly available at http:// microserf.biocomp.unibo.it/bar/). Keywords: protein functional annotation • cross-genome comparison • alignment coverage • Grid technology

Introduction Functional annotation of protein sequences is a major requirement in the postgenomic era. This is particularly true given the large number of ongoing genome sequencing projects which require a fast and efficient automatic annotation process. Historically function prediction was based on the relationship between protein sequence and structure: proteins that * To whom correspondence should be addressed. Via Irnerio 42, 40126 Bologna, Italy. Tel.: +39 051 2094005. Fax: +39 051 242576. E-mail: casadio@ biocomp.unibo.it. † University of Bologna. ‡ CNAF-INFN, National Institute of Nuclear Physics. § BA-INFN, National Institute of Nuclear Physics. | Physics Department, Politecnico di Bari.

4362 Journal of Proteome Research 2009, 8, 4362–4371 Published on Web 06/24/2009

share some 40-50% of sequence homology, are also likely to share a similar function.1 This observation is at the basis of a number of methods developed to functionally annotate proteins starting from their structure (for example DALI,2 CE,3 and the most recent CATHEDRAL,4 EVEREST5). In automatic protein structure classification, however, it is not trivial to determine the level of structural similarity that corresponds to functional similarity. Greene and co-workers,6 analyzing the SCOP database (http://scop.mrc-lmb.cam.ac.uk/ scop/) observed that although most domains with a common fold show a similar function, some “superfolds” (such as the Rossmann fold) can correspond to about 50 different functions. A sequence similarity search is a common initial step of all the annotation methods including RefSeq,7 VEGA8 and PEDANT.9 However, in the absence of a golden rule for correlating function to sequence, different constraints were ruled out for determining to which extent functional and structural information can be transferred between pairs of related protein sequences. Depending on how the function conservation is defined, the level of protein sequence similarity that allows functional annotation may vary from 30-40% down to 20-25%, provided that folding is also conserved.10 A pairwise sequence identity higher than 40% was suggested as a confident threshold to transfer the first three digits of an EC number11,12 (the Enzyme Commission number is a four digit numerical classification scheme for enzymes). As a general trend, sequence similarity search allows clustering procedures by which protein chains are collected into sets of similarity. Clustering is therefore a basic task in automatic processing of large data sets of protein sequences: annotation methods following this procedure are classified as hierarchical and non hierarchical ones, depending on the clustering technique. Hierarchical clustering methods aim to categorize data items into a tree-structured hierarchical organization. This is a possible approach to deal with different levels of similarity required to annotate different protein families. Earlier attempts of hierarchical clustering include SYSTERS,13 Picasso,14 and iProClass.15 CluSTr16,17 and ProtoNet18,19 are presently the hierachical algorithms that take advantage of the largest number of fully sequenced genomes for protein sequence classification into families. CluSTr (that includes the UniProt Knowledgebase (www.uniprot.org), all International Protein Index (IPI, www.ebi.ac.uk/IPI/IPIhelp.html) and all the com10.1021/pr900204r CCC: $40.75

 2009 American Chemical Society

technical notes

Functional and Structural Annotation of Protein Sequences pletely sequenced genomes retrieved from Integr8, (www.ebi. ac.uk/integr8/EBI-Integr8-HomePage.do)), clusters proteins starting from a similarity matrix of all-against-all protein sequence alignments computed with the Smith-Waterman algorithm. Z-scores for evaluating the similarity between potentially related proteins are then computed with a Monte Carlo simulation and proteins are grouped using a singlelinkage algorithm by choosing different levels of protein similarities.16,17 ProtoNet is an automatic and unsupervised agglomerative clustering system that builds a hierarchical tree of proteins based only on sequence similarity by means of an Unweighted Pair Group Method with Arithmetic mean (UPGMA). The latest data set of ProtoNet consists of all the protein sequences taken from UniProt Knowledgebase (release 8.1).18,19 Both these methods use hierarchical algorithms to group sequences into clusters, ultimately relying on the sequence identity of the aligned proteins as taken into account by selecting different E-value thresholds (E-values are set to 100 and to 1e-40 by CluSTr and ProtoNet, respectively). Non hierarchical methods generate a classification by partitioning a data set, giving a set of (generally) nonoverlapping groups having no hierarchical relationships between them. This clustering procedure is extremely used in fields such as multispectral image and microarrays analysis.20 Few applications are available for sequence analysis. Methods were developed based on a standard k-means approach with proteins represented by vectors21 and based on random sampling of subsequences partitioned into homogeneous groups of proteins.22 In the contest of protein annotation, the TribeMCL algorithm23 allows a fast annotation that is rather independent of the presence of multidomain proteins, promiscuous domains and fragmented proteins. This method includes an all-againstall BLAST comparison with an E-value threshold of 1 × 10-10. The results (represented with a binary matrix) cluster into protein families after successive rounds of Smith-Waterman dynamic programming alignments. Summing up, the most common and widely exploited approach for automatic annotation of unknown sequences relies on the so-called “inheritance through homology” based on the notion that similar sequences share similar functions and structures. However, additional levels of complexity are due to: (i) proteins that contain multiple domains; (ii) proteins that share common domains and that do not necessarily share the same function; (iii) the finding that different combinations of shared domains can lead to different biological roles. Multidomain proteins are less functionally conserved than single-domain ones, with the exception of those proteins in which structures show an overlapping combination of domain folds.24 As a consequence, methods based on the “inheritance through homology” may run into the risk of pulling sequences in the same cluster and erroneously transferring functions, particularly when dealing with putative multidomain proteins. Therefore both a high sequence identity value and the overlap extent among the aligned sequences (coverage) should be considered when applying sequence comparison techniques to problems of annotation. So far, none of the available methods explicitly faces the problem of multidomain protein annotation (only ProtoNet refers to EVEREST, an external database containing protein domains). In this paper we propose a fast and reliable automatic method for bridging the gap between protein sequence, structure and function. We develop a non hierarchical clustering procedure after BLAST extensive cross-genome comparison,

where pairs of sequences are selected by imposing very strict criteria of similarity and coverage, justified from what we discussed above. We then map on the clusters the Gene Ontology molecular functions and determine with statistical validation the functional specificity of the clusters. This allows an increasing functional annotation with respect to UniProt of some 30% of a set including 2 624 555 sequences from 599 genomes (S599). We also map on the clusters the PDB database, and its SCOP annotation allowing, when possible, a sequence to structure and function fine grain annotation that explicitly distinguishes among mono and multidomain proteins. This is possible for some 20% of S599. For validating our procedure, we blindly annotate some other 201 genomes (S201), recovering most of the UniProt annotation (in this case 13% of the total number of sequences). With our system the amount of annotated S201 sequences increases of another 37% and some 25% is also endowed with a PDB/SCOP structure. Overall, our results indicate that the method fully exploits the sequence to function and structure information, as it is recovered from the presently available reference databases. Finally we implement an annotation system that upon request can annotate any new sequence provided that its alignment toward the clusters verifies some stringent criteria. The annotation system (BAR) is based on the cross-comparison of some 800 genomes and it is publicly available at http://microserf.biocomp.unibo.it/bar/.

Materials and Methods Data Sets. The data set for generating annotation clusters includes 599 completely sequenced proteomes (S599), among which 551 are from prokaryotic organisms (18 Archaea and 533 Bacteria) and 48 are from Eukaryotes (2 Protozoa, 9 Fungi, 4 Plants and 33 Animals). The data set contains 2 624 555 protein sequences: 65% (1 713 574 sequences) from Prokaryotes and 35% (910 981 sequences) from Eukaryotes. Another 201 genomes, comprising 4 Eukaryotes and 197 Prokaryotes, were adopted as a blind test for the annotation procedure (S201), including 724 854 protein sequences (11% of which are from Eukaryotes and 89% from Prokaryotes). The bacterial genomes were downloaded from the NCBI genome resource database (ftp://ftp.ncbi.nih.gov/genomes/Bacteria) while the eukaryotic ones were taken both from the RefSeq at NCBI (ftp://ftp.ncbi. nih.gov/refseq/release/) and from the Ensembl Genome Project (ftp://ftp.ensembl.org/pub/). A complete list of the organisms used in this study is provided in the Supporting Information. The total set of 800 genomes (S800) is presently at the basis of BAR (the Bologna Annotation Resource), the freely available web server. Sequence Alignments. An all-against-all pairwise comparison of the 2 624 555 protein sequences of S599 was first performed with the BLAST program.25 To consider only highscored alignments, we set the E-value to 1 × 10-10, a high restrictive threshold. In addition, in order to obtain reproducible results, the BLAST database size was kept constant for each independent run (100 000 sequences). Default values were adopted for the alignment parameters (gap opening ) -11, gap extension ) -1, matrix ) BLOSUM62). All the BLAST comparisons, which represent the data production step of our method, are carried out in parallel on the Grid. The Grid middleware is a suitable tool for handling the data size problem and for accessing, sharing and processing the retrieved and computed data (grid.infn.it/ modules/wfchannel/). Journal of Proteome Research • Vol. 8, No. 9, 2009 4363

technical notes

Bartoli et al.

Clustering the BLAST Results. The protein space, encoded in the computed alignments, was represented by means of an undirected graph structure. The nodes of the graph are the protein sequences. Given two protein sequences that share a BLAST hit, we define the coverage of the match as: Coverage ) I/U, where I is the length of the intersection of the aligned regions on the two sequences and U is the overall length of the alignment (namely the sum of the lengths of the two sequences minus the intersection length). An edge is established between two nodes only when the two corresponding proteins share BLAST hits that simultaneously satisfy the following constraints: Sequence identity g 40 % Coverage g 90 %

(1)

By this each pair of sequences shares simultaneously a high sequence identity value for a large fraction of the alignment (as indicated by the high coverage value). After building the graph, we clustered proteins by computing the connected components of the graph with a transitive closure algorithm.26 Given a graph, a connected component is a subset of its nodes such that all nodes in that set are connected (reachable by some path) and that no pairs of nodes belonging to different subsets are mutually connected.26 Therefore these components are by definition disjointed, so to say that no protein sequence is present in two different clusters. With the clustering procedure, each connected component includes all the pairs of sequences satisfying eq 1 and more importantly, it includes also chains that are not directly linked but are connected through a path of proteins which undergo the imposed criteria. For this reason, a cluster can also contain distantly related sequences. We define a cluster as a connected component of the graph whose dimension is g2. When sequences are found alone after clustering, they are called singletons. Mapping of GO and PDB on the Clusters. To investigate the functional and the structural information enclosed in the computed clusters we mapped the three major databases, the UniProt Knowledgebase27 (release 12.6, Swiss-Prot+TrEMBL, www.uniprot.org), the Protein Data Bank28 (PDB, March 2008, www.rcsb.org) and the Gene Ontology29 (GO, March 2008, www.geneontology.org) over the clusters. A correspondence between a sequence from our database and an entry of one of the selected databases was uniquely identified when the correspondent CRC64 checksum30 was identical. After retrieving the sequence by means of the CRC64 checksum we checked if it was identical to the original one in our data set. In case of extremely rare conflicts (multiple matches of CRC64 checksum) we selected the identical sequence. The UniProt database is the most complete resource of annotated sequences. UniProt annotation is a labor-intensive process that involves assessment of information from published articles along with use of a variety of programs/algorithms (for details see www.uniprot.org/docs/annbioch.txt and references therein). UniProt provides relations with other databases, including the Gene Ontology (GO).29 GO provides a set of hierarchical controlled vocabulary according to three categories: Biological process, Cellular component and Molecular function. Not all the UniProt files contain GO terms. When possible GO terms are manually or electronically assigned to a UniProt entry (www.uniprot.org/manual/gene_ontology). Our strategy aims at expanding this possibility. 4364

Journal of Proteome Research • Vol. 8, No. 9, 2009

After clustering, we determine how many sequences in the 187 594 clusters have a UniProt counterpart. Seventy-six percent of S599 sequences have an associated UniProt accession (40% of the entire UniProt): 44% of the sequences spread over the obtained clusters and 32% is found in singletons. To determine the functions associated to the sequences, we make use of the GO29 terms as indicated in UniProt when available, and specifically focus on the molecular function ontology (presently including 8351 functional terms) which is likely to be evolutionary conserved and best describes the molecular activities carried out by proteins and protein complexes. When considering the GO terms we also take into account all their parent terms31 (the data structure underlying the GO ontology is a Directed Acyclic Graph (DAG) that allows a child to have multiple parents) for capturing all the molecular function biochemical complexity. To handle with the DAG structure, we devised a two step procedure. First, we directly associated, when available, each protein sequence within a cluster to the corresponding GO term/s given by the UniProt annotation. Then for each sequence, we extended all the possible branches of the GO hierarchy by recursively walking along the parent branches of the molecular function GO tree. For exploring the distribution of the protein domains in the clusters, we took advantage of the PDBsum database,32 since it provides the correspondence between the protein chains contained in the PDB and the UniProt accession identifiers. In this way, we were able to establish an association between the protein sequences of our data set and the PDB protein structures (about 50% of PDB). Statistical Evaluation of GO Terms by Means of a P-Value Analysis. To assess whether a GO term is significant for a cluster, we performed a statistical test by computing the P-values. If N is the number of sequences in a given cluster which correspond to the same specific GO term, the P-value is the probability of finding N or more proteins that have a given annotation by chance, given the dimension of the cluster, the dimension of the database and the overall number of sequences in the database with the given annotation. For each GO term within a cluster the corresponding P-value is evaluated as:

min(K,P)

p-value )

∑ i)N

( ) ( )( ) () K i

D-K P-i D P

(2)

where D is the dimension of the database (total number of sequences with at least one associated GO term), P is the dimension of the cluster (total number of sequences of the cluster with at least one associated GO term), K is the number of sequences in all the database which have associated the same specific GO term. To compute the binomial expressions at the numerator in eq 2, we applied the Staˇnicaˇ approximation procedure.33 To account for the multiplicity of the GO terms in each cluster, we applied the Bonferroni correction to the computed P-values.34 Given the high dimensionality of the problem, a careful statistical analysis is required to investigate whether the GO terms present in each cluster are statistically significant and therefore whether the associated molecular function can be considered specific for the given cluster. To address this problem, we performed a statistical evaluation by computing

Functional and Structural Annotation of Protein Sequences

technical notes

Figure 1. Cumulative distributions of the Bonferroni corrected P-values. A logarithmic scale is adopted for representing the Bonferroni corrected P-values. Dotted line represents the cumulative function of the observed P-values. The solid line is the cumulative distribution of the average of the P-values of a 100 random benchmark set. Error bars represent standard deviations.

P-values and adopting a bootstrapping procedure. For this, we considered which range of values would be computed for the P-value, when the GO terms were randomly distributed among the clusters (the baseline random distribution) and we compared the observed distribution of the P-values with the baseline random distribution. The random distribution is a function of the data set composition, being computed by preserving the total number of clusters, the total number of GO terms and the cluster dimensions. To estimate a P-value threshold we compared the cumulative distributions of the real and random computed P-values (Figure 1). By this, a GO term is statistically significant for a cluster (i.e., that GO term is cluster-specific) if its associated P-value is smaller than the threshold value for which the real and the random curves are significantly different (Figure 1). An optimal threshold of the P-value is the one for which the observed curve significantly differs from the random ones. In particular, given our clustering procedure, the value for the observed cumulative distribution corresponding to a Pvalue of 0.001 is 31 606 (namely, the number of clusters with at least a GO term with P-value e 0.001), while the mean of the random cumulative distributions is 25 (namely, the number of randomly generated clusters having a minimum P-value e 0.001). For a P-value threshold of 0.001 we are expecting 8 noncorrectly assigned GO terms out of 10 000. On these bases we adopted 0.001 as a suitable threshold of P-value in order to guarantee statistical significance to the GO terms/clusters association and define the functional specificity of the cluster at hand.

Results and Discussion The main novelty of our non hierarchical clustering procedure consists in the organization of the protein sequences into clusters according to a very strict criterion that considers constraints based simultaneously on high sequence similarity (g40%) and high sequence coverage (g90%). We also included in the clusters, when possible, PDB structures, their SCOP classification and GO terms in order to exploit annotations in terms of sequence to structure, to function, and to structure and function. The annotation process is statistically validated

Figure 2. Scheme of the BAR methodology. Representation of the main steps of the BAR sequence annotation procedure.

by computing the level of confidence at which the GO annotation transfer process is performed. The general scheme of our method is shown in Figure 2. To exploit the added value of our procedure over other annotation methods already described and available in the web, in the following we will describe: (I) the clustering procedure and its statistical validation performed on an initial set comprising 599 genomes (S599); (II) a blind test on a set comprising some other 201 genomes (S201); (III) a final application leading to BAR and comprising a total sum of 800 genomes (S800). I. Clustering Procedure and Cluster Content: Clusters vs Singletons. Our clustering procedure, considering the initial S599 set, produced 187 594 clusters, which include a total of 1 963 704 protein sequences (75% of S599). It should be stressed that given the cluster procedure (see Materials and Methods) each protein chain belongs only to one cluster and that all the clusters are disjoint. The number of sequences in the clusters ranges from 2 (minimum cluster dimension allowed) to 15 875 (the dimension of the biggest connected component). 99% of the clusters have a number of sequences e200. The remaining 25% of S599 (for a total of 660 851 sequences) are the so-called singletons, namely sequences without any BLAST match satisfying our restrictive constraints (eq 1). Among the clusters, 122 686 (65%) contain sequences deriving from only prokaryotic organisms while 63 230 (34%) are specific for Eukaryotes. Only 1678 clusters (1%) contain sequences from both prokaryotic and eukaryotic organisms. In Table 1, clusters are grouped as a function of the standard deviation of the protein sequence length distribution of each cluster. This statistics highlights that clusters are quite homogeneous in protein length: the sequence length variability of most of the clusters (over 90%, containing some 87% of the total number of sequences in the clusters) is Journal of Proteome Research • Vol. 8, No. 9, 2009 4365

technical notes Table 1. Non Hierarchical Clustering of S599

a The constraints adopted for clustering of S599 are described in Materials and Methods (eq 1). Groups of clusters, indicated with capital letters, are obtained by collecting protein clusters as a function of the computed standard deviation of protein length in each cluster. The total number of clusters is 187 594 for a total sum of 1 963 704 protein sequences (75% of S599). b For completeness, the total number of protein sequences termed singletons is also reported (H) comprising about 25% of the total S599 proteome.+The percentage refers to the total number of clusters and sequences in the clusters.

e40 residues that is the minimum number of residues contained in a structural domain according to the SCOP structure classification. These results indicate the efficacy of our selected and stringent criteria (eq 1). Sequence identity values as low as 1-20% can be detected in nearly all clusters containing a number of sequences g10. Some 38% of the clusters, corresponding to 69% of the sequences in the clusters are annotated with the procedure described in the following. Ia. From Sequence to Function: GO Mapping. We define a sequence to be functionally annotated when it is endowed in UniProt with a GO term of the molecular function ontology, being this more specific for protein biochemical functions. With our procedure we use the GO terms found in UniProt to annotate the clusters. A statistical validation is however necessary to ensure functional homogeneity to the clusters. By this to each UniProt associated GO term a computed P-value is associated. When the P-value is e0.001 the GO term is specific in describing the cluster associated function. After our procedure we may distinguish between clusters with associated GO terms directly transferred from UniProt that may or may not be statistically validated and clusters without any GO term. Within the clusters with GO terms sequences without any annotation inherit annotation. The quality of this annotation depends on the P-value associated to the different GO terms. Singletons may be or not be annotated. For annotated singletons the P-value computation is not feasible. In Table 2 we list how the GO annotated sequences distribute among the clusters. Forty percent of the sequences of S599 have at least one associated GO term in UniProt (35% populate 38% of the clusters and 5% fall into singletons) and according to our definition are therefore functionally annotated. The percentage (number) of protein chains directly annotated in UniProt is listed as “direct” in Table 2. After clustering, another 15% of S599 sequences “inherit” annotation (Table 2), 9.5% of which with a cluster-specific GO term/s (P-value e0.001). For each cluster, specific GO terms range on average from 1 to 50, with the exception of the biggest cluster (comprising 15 875 sequences) which has associated 108 cluster-specific GO terms (related to ABC transporter/ATP-binding protein). Interestingly enough, after statistical validation a small percentage (0.1%) 4366

Journal of Proteome Research • Vol. 8, No. 9, 2009

Bartoli et al. of sequences changed significantly their associated GO term with respect to UniProt. These annotations routinely correspond to an electronic transfer of GO terms to UniProt (see examples in Supporting Information). The most represented and statistically validated (corresponding to GO terms with P-value e 0.001) molecular functions are listed in Figure 3, where the abundance of Catalytic, Binding and Transporter is evident. Among Catalytic the most frequent are Transferase, Hydrolase and Oxidoreductase. A complete analysis of the total number of annotated proteins for the top level terms (Figure 3) is detailed in Table 1S (Supporting Information), with children when present. For each annotation level, the number of proteins annotated with BAR is compared to that annotated in UniProt. The important results out of this effort over 599 genomes are: (i) a statistical validation of the GO terms directly derived from the UniProt annotated sequences within the clusters (directly annotated, “direct” in Table 2); (ii) a more rigorous annotation of some UniProt annotated sequences that by entering a cluster inherit statistically validated GO terms different from UniProt (change annotation, “changed” in Table 2); (iii) the statistically validated annotation of 15% of S599 sequences (397 131 UniProt non annotated sequences of S599, “inherited” in Table 2). Ib. From Sequence to Structure: PDB/SCOP Mapping. One added value of our procedure is to include information derived from PDB. Roughly 50% of the PDB structures can be presently included in our computed clusters: a total of 18 357 protein sequences of our data set correspond to 23 050 PDB files for a total of 28 978 PDB chains. These chains are distributed over 5734 clusters (about 3% of the total number of clusters). Only 151 singletons (0.02% of the total number of singletons) are endowed with the correspondent PDB chains (Table 3), for a total of 559 structures. To evaluate the structural congruency within the clusters, each PDB chain was then linked to its corresponding SCOP35 identifier/s (a SCOP identifier is a four digit code that allows classification of protein domains; the same SCOP identifier guarantees high structural similarity among protein domains); 21 090 PDB chains are endowed with SCOP identifiers in 3961 clusters. We found that 3064 (of the total 5734) clusters contain a unique SCOP identifier (monodomain identifier) for a total of 14,340 PDB chains and 334 647 protein sequences (third column in Table 3). Eight-hundred ninety-seven clusters are endowed with multiple SCOP identifiers (multidomain SCOP identifiers, ranging from 2 to 6 SCOP identifiers per cluster), containing 6750 PDB chains and 167 779 sequences. When transferring a template structure to a given sequence a major problem is the relative overlap of the target and template: the higher the overlap, the higher the probability of obtaining a template overlapping the whole protein sequence. In our clusters, given the clustering procedure, and as shown above (Table 1), sequences of similar length tend to be associated in the same cluster. Consequently, when PDB/SCOP are mapped into the clusters, a very high overlapping among template and cluster sequences is detected. Most of the clusters endowed with a PDB/SCOP label are also endowed with a high overlap (g0.90 in Table 3). The amount of clusters with an overlap e0.8 is negligible. The overlap is high also in the case of SCOP multidomain proteins (rightmost column of Table 3), indicating that within clusters even multidomain protein topologies can be safely inherited. After the PDB/SCOP mapping of the clusters (considering also that a small fraction of

technical notes

Functional and Structural Annotation of Protein Sequences Table 2. From Sequence to Function: Mapping of Molecular Function GO terms into the Clusters

a Percent values are computed with respect to the total number of clusters of S599 (187 594) and to the total sum of protein sequences of S599 (2 624 555). b Total number of singletons is 660 851. For the P-value threshold see Material and Methods. “Direct” annotation: sequences with at least a GO term. “Inherited” annotation: sequences that inherit GO term/s in the cluster. “Changed” annotation: sequences that after statistical validation change functional annotation with respect to UniProt (see Material and Methods). c Without GO terms.

Figure 3. Histogram of the main statistically validated GO molecular functions in S599. The figure shows the mapping of the 13 different molecular function GO tree main branches over the sequences of the S599 data set that fall into clusters with function specific GO terms. X axis: number of hits of each functions over the clusters. Y axis: Functions (catalytic activity, binding activity, transporter activity, transcription regulator activity, transducer activity, structural molecule activity, electron carrier activity, transcription regulator activity, enzyme regulator activity, antioxidant activity, nutrient reservoir activity, auxiliary transport activity and metallochaperone activity).

PDBs is lacking a SCOP label) we can safely transfer a template to nearly 23% of the total number of S599 sequences, including SCOP mono and multidomain proteins. Ic. From Sequence to Function and Structure. As a final result, our annotation procedure can produce three main categories of annotation (Table 4): PDB and GO; PDB without GO; GO without PDB, and no annotation. These categories, with the numbers of clusters and sequences in the clusters are listed in the foremost left and right corners of Table 4. After SCOP labeling and GO statistical validation, the richest annotation (PDB and GO) gives rise to six different types, listed in the A cells of Table 4. In turn, PDB without GO splits into three categories (only PDB, PDB/SCOP monodomain and PDB/ SCOP multidomain, C cells Table 4); GO without PDB is further divided into two categories (cluster-specific and non specific GO terms; B cells in Table 4). We can therefore assign 11 finegrain levels of annotation in the clusters. As a final result, 27.5% of the clustered S599 protein sequences inherits GO and PDB labels (21% of the total S599

protein chains), most of which with a cluster-specific GO term (P-value e 0.001) (Table 4). Twenty-eight percent of S599 has statistically significant associated GO terms (without PDB) and a remaining 11% has GO with a P-value > 0.001. In total about 52% of S599 protein sequences is included in this annotation process, while the direct UniProt annotation covers only 21%. A net increase of 31% protein sequence is obtained with this annotation procedure. These results suggest that our method allows a rigorous and safe annotation of S599 sequences, with the advantage of statistically validating and transferring the associated GO terms, and the template/s structures. II. Blind Test on a Data Set of 201 Genomes. To test our method we performed a blind test on 201 genomes (S201) not included in the clustering procedure. In this case we do not take advantage of the UniProt annotation. The blind test consists in aligning all the S201 sequences toward our clusters and consequently annotate them when the stringent criteria of eq 1 are satisfied. The never-seen-before sequences of S201 Journal of Proteome Research • Vol. 8, No. 9, 2009 4367

technical notes

Bartoli et al.

Table 3. Template Overlap over the Clusters Containing at Least a PDB Structure

a,b See Table 2. The total number of clusters of S599 with at least a PDB template is 5734 for a total sum of 596 411 protein sequences. “Direct” annotation: sequences with at least a PDB structure. “Inherited” annotation: sequences that inherit a structural template in the cluster.+The PDB structure is monodomain or multidomain according to the SCOP classification. c Overlap: a measure that indicates whether the structural template of a cluster can be adopted as a template for the protein sequences within the same cluster. For each cluster, the coverage is the average of the ratios: lPDB/l, where l is the length of each protein sequence and lPDB is: (2nd column) the length of the longest protein with a PDB structure; (3rd and 4th columns) the length of the longest protein with at least one domain annotated with SCOP (third and fourth columns in Table 3). The overlap is separately computed for all the clusters with a PDB structure (5734 clusters), for the clusters with a monodomain (3064 clusters) and multidomain (897 clusters) structural template.

Table 4. Statistical Validation of the Inheritance Annotation Process in S599 Clusters

a Total number of clusters of S599 is 187 594 for a total sum of 1 963 704 protein sequences. Direct annotation and Inherited annotation is as in Table 2 and 3. Three different types of annotation are possible: GO and PDB; GO without PDB; PDB without GO. The rightmost bottom corner contains the number of clusters and sequences that is without annotation. Each type of annotation can be statistically validated and the PDB annotation can be associated to a SCOP classification; by this 11 annotation types can be highlighted (A with GO and PDB; B with GO without PDB and C with PDB without GO).

were aligned with those contained in the clusters with the same constraints as before: a sequence ended up in a precomputed cluster provided that both constraints of eq 1 were simultaneously satisfied with respect to one of the S599 sequences in the clusters. Over 67% of the S201 protein sequences is assigned to property-specific clusters (P-value e 0.001) and for over 50% 4368

Journal of Proteome Research • Vol. 8, No. 9, 2009

of these, a structure template is also provided. Another 9.5% is endowed with a non cluster-specific GO term/s (P-value > 0.001). Some 54% of the total S201 protein sequences is annotated and this is to be compared with 13% of annotation directly obtained from UniProt. The coincidence of our annotations with that of UniProt is striking (“direct” in Table 5).

technical notes

Functional and Structural Annotation of Protein Sequences Table 5. Performance of the Method: a Fine Grain Annotation of S201

a Percent values are computed with respect to the total S201 sequences found in the S599 clusters (495 440 sequences of S201 data set are mapped on the 187 594 clusters of S599). The table lists the different levels of annotation as described in Table 4 (legends as in Table 4).

Figure 4. BAR annotation of the genome of Pongo pygmaeus abelii. The Pongo pygmaeus abelii genome was downloaded from Ensembl (www.ensembl.org). It consists of 23 409 protein sequences among which 19 154 have a match in the S599 clusters. Some 44% of the total could be annotated with our procedure. Bars indicate in a color code the different annotation types as a function of the different values of protein length standard deviation (from A to G, as defined in Table 1). Color code: (i) cluster specific GO term and PDB/SCOP monodomain (red); and PDB/SCOP multidomain (brown); and PDB without SCOP annotation (orange); without PDB (yellow); (ii) GO term with P-value greater than the selected threshold (P-value > 0.001) and PDB/SCOP monodomain (blue); and PDB/SCOP multidomain (magenta); and PDB without SCOP annotation (purple); without PDB (pink); (iii) without GO and with PDB/SCOP monodomain (black); and PDB/SCOP multidomain (dark-gray); and PDB without SCOP annotation (light-gray); and (iv) no annotation (white).

However our method allows to fine grain annotate some 37% more sequences of S201 (“inherited” in Table 5), and changes a negligible fraction of UniProt annotation (“changed” in Table 5). III. Application: BAR, the web server for protein functional annotation as derived from S800. Finally we implemented a web server for protein functional annotation considering S800. Reclustering of 800 genomes changed very little the overall picture described above. The total number of clusters increased 1.2 fold as compared to that obtained with S599 (4% of the old clusters glued into one cluster), and clusterspecificity was re-evaluated, according to the procedure detailed above. The overall change detected in increasing the

number of genomes by about one-half (S599 plus S201) was very little, suggesting that a good rate of system updating may be twice per year (considering also PDB updating). Presently for every sequence submission, the system allows the chain to be aligned against the annotated clusters and singletons. Alignments between sequences with identity g40% on at least the 90% of the alignment are marked with a red symbol (these are the pairwise alignments that will allow a safe inheritance of the annotation). For each cluster a general statistics of its dimensions is reported (including the number of sequences, number of genomes (Prokaryotes and Eukaryotes), number of UniProt accessions and of PDB structures included in the cluster). In order to functionally classify the cluster, the list of Journal of Proteome Research • Vol. 8, No. 9, 2009 4369

technical notes statistically validated GO terms, if present, is also provided together with the corresponding P-values and the description of the functions as reported in the GO database. As an example of what can be computed with our method, we show the annotation that we obtained for the genome of Pongo pygmaeus abelii, downloaded from Ensembl. This genome is barely annotated in UniProt (2%). The fine grain annotation of this genome after our procedure is color coded in Figure 4 and amounts to about 44% of the total.

Conclusions Genome annotation is a key issue in modern computational biology, being the starting point toward the understanding of the complex processes involved in biological networks. In this study we developed and validated a system that contributes to sequence annotation by taking advantage of a validated transfer through inheritance procedure of GO terms and PDB templates. Clusters, after a cross-genome comparison with BLAST, are built on the basis of the two stringent constraints on sequence identity and coverage of the aligned sequences. This allows a fine grain division of the whole set of proteomes we used, that ensures cluster homogeneity in terms of sequence length. GO terms were mapped into the clusters and were statistically validated by computing the associated P-values and a specific P-value threshold for determining cluster function specificity. Finally the PDB/SCOP mapping into the clusters gives the advantage of transferring not only function but also structures even to distantly related sequences. In this respect the high level of coverage of structure templates on the sequence length of the cluster ensures that multidomain proteins when present can be templates for sequences of similar length. However in our paper the most interesting and useful conclusion is that on the basis of correct mathematical procedures we can extend the coverage of structure and function to the world of sequences well above it can be obtained by exploiting the usual major databases. At the best of our knowledge, this is the first annotation procedure that includes the possibility of transferring statistically validated functions and structures to sequences considering information available in the present databases of molecular functions and structures.

Acknowledgment. We thank the following grants: FIRB 2003 LIBI-International Laboratory of Bioinformatics and the support to the Bologna node of the Biosapiens Network of Excellence project within the European Union’s VI Framework Programme (contract number LSGH-CT-2003503265). Supporting Information Available: List of all the species taken into account in this work; table (Table 1S) with the total number of child terms for each top level molecular function GO term. This material is available free of charge via the Internet at http://pubs.acs.org. References (1) Chothia, C.; Lesk, A. M. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986, 5 (4), 823–826. (2) Holm, L.; Sander, C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 1993, 233, 123–138. (3) Shindyalov, I. N.; Bourne, P. E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11, 739–747.

4370

Journal of Proteome Research • Vol. 8, No. 9, 2009

Bartoli et al. (4) Redfern, O.; Harrison, A.; Dallman, T.; Pearl, F. M. G.; Orengo, C. CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLOS Comput. Biol. 2007, 3, e232. (5) Portugaly, E.; Harel, A.; Linial, N.; Linial, M. EVEREST: automatic identification and classification of protein domains in all protein sequences. BMC Bioinf. 2006, 7, 277. (6) Greene, L. H.; Lewis, T. E.; Addou, S.; Cuff, A.; Dallman, T.; Dibley, M.; Redfern, O.; Pearl, F.; Nambudiry, R.; Reid, A.; Sillitoe, I.; Yeats, C.; Thornton, J. M.; Orengo, C. A. The CATH domain structure data base: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 2007, D291–297. (7) Pruitt, K. D.; Tatusova, T.; Maglott, D. R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence data base of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35, D61– 5. (8) Wilming, L. G.; Gilbert, J. G. R.; Howe, K:; Trevanion, S.; Hubbard, T.; Harrow, J. L. The vertebrate genome annotation (Vega) data base. Nucleic Acids Res. 2008, 36, D753–60. (9) Walter, M. C.; Rattei, T.; Arnold, R.; Gu ¨ ldener, U.; Mu ¨ nsterko¨tter, M.; Nenova, K.; Kastenmu ¨ ller, G.; Tischler, P.; Wo¨lling, A.; Volz, A.; Pongratz, N.; Jost, R.; Mewes, H. W.; Frishman, D. PEDANT covers all complete RefSeq genomes. Nucleic Acids Res. 2009, 37, D408–11. (10) Wilson, C. A.; Kreychman, J.; Gerstein, M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol. 2000, 297, 233–249. (11) Rost, B. Enzyme function less conserved than anticipated. J. Mol. Biol. 2002, 318, 595–608. (12) Tian, W.; Skolnick, J. How well is enzyme function conserved as a function of pairwise sequence identity. J. Mol. Biol. 2002, 333, 863– 882. (13) Krause, A.; Stoye, J.; Vingron, M. The SYSTERS protein sequence cluster set. Nucleic Acids Res. 2002, 28, 270–272. (14) Hedger, A.; Holm, L. Picasso: generating a covering set of protein family profiles. Bioinformatics 2001, 17, 272–279. (15) Wu, C. H.; Huang, H.; Nikolskaya, A.; Hu, Z.; Barker, W. C. The iProClass integrated data base for protein functional analysis. Nucleic Acids Res. 2001, 29, 52–54. (16) Kriventseva, E. V.; Fleischmann, W.; Zdobnov, E. M.; Apweiler, R. CluSTr: a data base of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res. 2001, 29, 33–36. (17) Petryszak, R.; Kretschmann, E.; Wieser, D.; Apweiler, R. The predictive power of the CluSTr data base. Bioinformatics 2005, 3604–3609. (18) Kaplan, N.; Sasson, O.; Inbar, U.; Friedlich, M.; Fromer, M.; Fleischer, H.; Portugaly, E.; Linial, N.; Linial, M. ProtoNet 4.0: a hierarchical classification of one million protein sequences. Nucleic Acids Res. 2005, 33, D216–D218. (19) Loewenstein, Y.; Portugaly, E.; Fromer, M.; Linial, M. Efficient algorithms for accurate hierarchical clustering of huge data sets: tackling the entire protein space. Bioinformatics 2008, 24, i41– i49. (20) Xu, R.; Wunsch, D. Clustering; Wiley-IEEE Press: Hoboken, NJ, 2009. (21) Guralnik, V.; Karypis, G. A scalable algorithm for clustering sequential data. SIGKDD Workshop on Bioinformatics, BIOKDD2001. (22) Sperisen, P.; Pagni, M. JACOP: a simple and robust method for the automated classification of protein sequences with modular architecture. BMC Bioinf. 2005, 6, 216. (23) Enright, A. J.; Van Dongen, S.; Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30, 1575–1584. (24) Hegyi, H.; Gerstein, M. Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. Genome Res. 2001, 11, 1632–1640. (25) Altshul, S. F.; Gish, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–10. (26) Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.; Stein, C. Introduction to algorithms, 2nd ed.; MIT Press: Cambridge, MA, 2001. (27) Bairoch, A.; Apweiler, R. The SWISS-PROT protein sequence data base and its supplement TrEMBL. Nucleic Acids Res. 2000, 28, 45– 48. (28) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242. (29) The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000, 25, 25–29.

technical notes

Functional and Structural Annotation of Protein Sequences (30) Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical recipes in C, 2nd ed.; Cambridge University Press: Cambridge, 1992. (31) Barutcuoglu, Z.; Schapire, R. E.; Troyanskaya, O. G. Hierarchical multi-label prediction of gene function. Bioinformatics 2006, 22 (7), 830–836. (32) Laskowski, R. A.; Hutchinson, E. G.; Michie, A. D.; Wallace, A. C.; Jones, M. L.; Thornton, J. M. PDBsum: A Web-based database of summaries and analyses of all PDB structures. Trends Biochem. Sci. 1997, 22, 488–490.

(33) Stanica, P. Good Lower and Upper Bounds on Binomial Coefficients. JIPAM 2001, 2 (3), 30. (34) Moore, D. S.; McCabe, G. P. Introduction to the practice of statistics, 5th ed; W. H. Freeman and Company: New York, 2006. (35) Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C. SCOP: a structural classification of proteins data base for the investigation of sequences and structures. J. Mol. Biol. 1995, 247, 536–540.

PR900204R

Journal of Proteome Research • Vol. 8, No. 9, 2009 4371