Combining Hierarchical Clustering and Self-Organizing Maps for

Self-organizing maps (SOM) constitute an alternative to classical clustering methods because of its linear run times and superior performance to deal ...
0 downloads 0 Views 193KB Size
Combining Hierarchical Clustering and Self-Organizing Maps for Exploratory Analysis of Gene Expression Patterns Javier Herrero and Joaquı´n Dopazo* Bioinformatics Unit, Spanish National Cancer Center (CNIO), Melchor Ferna´ndez Almagro 3, 28029 Madrid, Spain Received March 26, 2002

Abstract: Self-organizing maps (SOM) constitute an alternative to classical clustering methods because of its linear run times and superior performance to deal with noisy data. Nevertheless, the clustering obtained with SOM is dependent on the relative sizes of the clusters. Here, we show how the combination of SOM with hierarchical clustering methods constitutes an excellent tool for exploratory analysis of massive data like DNA microarray expression patterns. Keywords: DNA array, gene expression patterns, hierarchical clustering, SOM, SOTA

DNA microarray technology opens up the possibility of measuring the expression level of thousands of genes in a single experiment.1 Serial experiments measuring gene expression at different conditions, times, or distinct experiments with diverse tissues, patients, etc. allow gene expression profiles to be obtained under the different experimental conditions studied. Initial experiments suggest that genes having similar expression profiles tend to be playing similar, or at least related, roles in the cell. Since thousand of genes are involved in such experiments, the use of an efficient technique that allows detection of the clusters of co-expressing genes is key for the analysis of microarray expression data. Aggregative hierarchical clustering has been extensively used for this purpose,2,3 although several authors4 have claimed that it suffers from a lack of robustness. Another common problem shared by clustering methods based on the calculation of a distance matrix is their slow run times, which are, in the best case, quadratic.5 This can constitute a real difficulty when thousands of items are to be analyzed. A different problem, which is not often mentioned, is derived from the size of the data set: there are obvious limitations in the visual inspection of a hierarchical tree with thousands of branches connecting thousand of items. Some authors4,6,7 proposed the use of neural networks as a convenient alternative to aggregative hierarchical cluster methods. Unsupervised neural networks, and in particular self-organizing maps (SOM),8 circumvent some of the above-mentioned problems. They can deal with data sets containing noisy, ill-defined items with irrelevant variables and outliers and whose statistical distributions do not need to be * To whom correspondence should be addressed. Tel: 34 912246919. Fax: 34 912246972. E-mail: [email protected]. 10.1021/pr025521v CCC: $22.00

 2002 American Chemical Society

parametric ones. SOM are robust and reasonably fast and can be easily scaled to large data sets. SOM, in its original form, converts the nonlinear statistical relationships between highdimensional data into simple geometric relationships of their image points on a low-dimensional map (usually a twodimensional arrangement of nodes). That is, SOM compresses the information of the original data set, but preserving, when possible, the topology.8 Nevertheless, if the aim is to find the different clusters of co-expressing genes contained in the data set, this approach presents several problems. First, the training of the network depends on the number of items in each cluster. Thus, the clustering obtained by SOM, based on the nodes of the network, will be more dependent on the sizes of the groups than in their actual differences among profiles. This is because SOM is not clustering, but producing a reduced representation of the original data set. If irrelevant data (e.g., invariant, “flat” profiles) or some particular type of profile is overepresented, SOM will produce an output in which genes displaying this particular profile will populate the vast majority of nodes. The level of resolution for the different clusters will be different and depends on the number of representatives in the data set. Finally, the lack of a hierarchical structure makes it impossible to detect higher order relationships between clusters of profiles. In an attempt of imposing some structure based on the intercluster differences, To¨ronen et al.6 applied Sammon’s mapping9 to the resulting map. Nevertheless, the picture provided by such mapping only puts in relieve that most of the cells in the SOM map are very similar, and consequently uninformative from the point of view of the characterization of the different classes of profiles. In addition, hierarchical clustering with a neural network can be achieved using SOTA.7 This method splits the data set in a hierarchy of clusters and sub-clusters by means of a self-organizing process (like in SOM) but using a binary tree topology with a splitting scheme based on inter-cluster distances.10 SOTA starts with a network of two neurons connected through an intermediate “mother” neuron, and after a training procedure similar to the SOM case, splits the data set into two groups. Then, one of the neurons (the one with the most heterogeneous pare of the data associated) splits, and the training re-start with three terminal nodes connected among them by means of a binary tree structure. SOTA allows stopping the growth of the hierarchy at a given level of variability, which permits a better visualization of the actual number of different patterns, regardless of the number of representatives of each pattern in the data set. Journal of Proteome Research 2002, 1, 467-470

467

Published on Web 07/11/2002

Exploratory Analysis of Gene Expression Patterns

technical notes

Figure 1. Clustering of gene expression profiles of yeast genes during a diauxic shift (http://cmgm.stanford.edu/pbrown/explore/ index.html). Ratios between the expression level at the starting point and at seven time points at 2 h intervals were calculated. The 6102 gene patterns were log2-transformed. (A) Hexagonal SOM map of 5 × 8 obtained for the data. The parameters used for running vfind program of SOMPACK package are for the first/second training (see documentation of the package): training length, 1000/10000; initial learning rate, 0.05/0.02; initial radius, 10/3; using a learning function linear and an unweighted quantitation error function. (B) Average linkage tree obtained using the values of the cells in the SOM in part A. Labels in the tree correspond to genes that share regulatory properties as described by DeRisi et al.12 in Figure 5. (C) SOTA tree obtained analyzing directly the 6102 genes. The parameters used were euclidean distance, variability threshold ) 4; error threshold ) 0.0001; update factors ) 0.01/0.005/0.001. Labels are as in part B. 468

Journal of Proteome Research • Vol. 1, No. 5, 2002

technical notes Here, we show how a combination of a method that reduces the dimensionality11 of the data set to a relatively small number of clusters (like SOM) with a hierarchical clustering technique provides a very efficient tool for exploratory analysis of microarray gene expression profiles. We have used the gene expression profiles of 6102 yeast genes, during a diauxic shift,12 as an example of unbalanced dataset to put in relieve the relative advantages and disadvantages of the different methods. We used the SOM package (http://www.cis.hut.fi/research/ som_lvq_pak.shtml) to reduce the number of items of the original dataset from 6102 expression profiles to only 40 average profiles. We used a hexagonal network of 5 × 8 cells, in which each cell was a vector of size 7, each component corresponding to one of the experimental conditions at which gene expression was measured. Once the network was trained (Figure 1A), the average values in the nodes were used as input values for an aggregative clustering method, like UPGMA2,5 in this case. The final picture is a hierarchical tree (Figure 1B) in which the branch lengths are proportional to the differences of the average expression patterns of the clusters under the bifurcations of the tree. The SOM map (Figure 1A) showed mainly “flat” patterns, corresponding to genes not affected by the experimental conditions, flanked by a few activation and repression patterns. The analysis of branch lengths in the tree in Figure 1B shows three big clusters representative of the three main different trends: the two bottom main branches, composed of three clusters each, correspond to activation patterns (bottom) and to repression patterns (middle). The remaining clusters (big upper branch) contain essentially unaffected genes. The advantage of applying hierarchical clustering is that branch lengths allow easy definition of these three behaviors irrespective of the number of clusters SOM has assigned to each one. Depending on the size of the network and the structure of the data, some nodes in the SOM may remain unoccupied after the training process. In some cases, it might be interesting to use also these nodes in the analysis to see what patterns between clusters might look like. The data set used is quite useful for showing the properties of the proposed approach, but it is quite lineal in the sense that is composed mainly by upregulated, downregulated, and flat patterns. The approach has been applied to other datasets displaying more complex structure of patterns such as the corresponding to the transcriptional program of sporulation in budding yeast.13 In the additional information web page (http://bioinfo.cnio.es/wwwsomtree), the corresponding figure shows how the results obtained fit to the corresponding ones found by the authors. Application of SOTA7 shows how the 6120 genes can be separated in a few classes of patterns (Figure 1C). From top to bottom, we have two different patterns of repression, another two patterns, corresponding essentially to unaffected genes (that accounts for the behavior of more than 4000 genes), and a number of different activation patterns. Despite the enormous differences in number of members in the clusters, SOTA was able to find the different average patterns of gene expression. SOTA found more different classes of activation patterns, not resolved in a similar detail when applying average linkage over the SOM average patterns. This is due to the fact that there are more repressed genes, and SOM has used more nodes for them. Despite this, SOM, post-processed with the hierarchical cluster method and taken into account the length of the branches, produced a cluster structure closer to the more

Herrero and Dopazo

Figure 2. Run times. UPGMA applied directly to the data and to average profiles of SOM with different sizes (4 × 8, 8 × 8, 8 × 16, 16 × 16, and 16 × 32) for different numbers of gene expression patterns. The expression patterns were randomly chosen from the original dataset of profiles of 6400 yeast genes, during a diauxic shift,12 and correspond to 10%, 20%, and so on, until the dataset is complete. While UPGMA run time increases quickly, the combination of SOM post-processed with UPGMA keeps run times at a very acceptable level. Run times have been obtained in a SGI Origin 200. Simulated datasets (data not shown) confirm the observed trends.

realistic picture provided by SOTA. The labels in Figure 1, parts A and C, correspond to genes showing common regulatory properties (see Figure 5 in DeRisi et al.12). In the case of SOM with post-processing the genes corresponding to upregulated classes fall in the same cluster. A similar trend can be observed for the case of downregulated genes, although ribosomal classes fall in two neighbor clusters. Since SOTA provides more resolution with less clusters, some of the classes fall in different clusters. Apart from the ribosomal class, the redox class also is split and falls in two neighbor clusters. In the additional information (http://bioinfo.cnio.es/wwwsomtree), we have studied in more detail the dataset by removing all “flat” patterns. All of the groups defined by DeRisi et et al.12 fall in the same (or very close) clusters except for the one corresponding to redox class. Nevertheless, the apparent dispersion is not too drastic because all the branches connecting the clusters are short. In fact, redox group maps in a single cluster in the SOTA picture, while only early induction genes have split just a little bit more than in the case of SOM post-processed with hierarchical clustering. In any case, if SOM is preferred as the clustering method for some reason, it is convenient to post-process the results with any clustering method that produces a final picture based on the differences among the values of average clusters. Obviously, SOTA can also be used as a post-processing method. In this case, similar patterns can be collapsed into single clusters, and the resulting tree will reflect the differences among the distinct classes of expression patterns. The second advantage of using a hierarchical clustering method in combination with SOM, for exploratory analysis of expression profiles is run time. SOM run time is approximately linear on the number of data analyzed. Figure 2 shows the increase in run times as the number of expression patterns to analyze grows. If UPGMA is directly applied to the raw data, it shows a behavior clearly quadratic (at least, and probably slower). On the other hand, if the data are pre-clustered using Journal of Proteome Research • Vol. 1, No. 5, 2002 469

technical notes

Exploratory Analysis of Gene Expression Patterns

SOM and UPGMA is applied to the average values of the clusters, the reduction in the final number of items to analyze after this pre-processing makes the run times nearly linear. Depending on the implementation of UPGMA, which can be optimized, the slope of its run time can change but, in any case, it becomes slower than the SOM/UPGMA combination beyond a given number of genes.14 The run times of the SOM/ UPGMA combination are appropriate for exploratory analysis. Despite the fact that run times do not increase drastically by using bigger SOM maps, the use of a moderate size (no more than 10 × 10) produces better results. Mapping the data set into many neurons using SOM will tend to exaggerate the asymmetries of the data. In addition, mapping the data set onto a large network is not a good idea when the aim is reducing its dimensionality. Sammon maps have been used to represent SOM maps in a proportional scale that allows one to visualize aggregations among clusters.6 Although the picture produced is more informative than the SOM map (in the additional information web page http://bioinfo.cnio.es/wwwsomtree, the graph shows aggregation of clusters in two main superclusters), the interpretation is cumbersome, and the graph lacks some interesting properties of the hierarchical tree. New technologies, like DNA microarrays, allow analysis of biological systems from a genomic perspective, but at the expense of producing a huge amount of data. In these cases, dimensionality reduction is a desirable property.11 Unfortunately, most of the procedures used for data analysis were not designed to deal with large amounts of data with high redundancy. The use of a previous step of preclustering with SOM, followed by the application of hierarchical clustering methods, constitutes a fast and reasonably accurate method for exploratory analysis of large amounts of data. Given the speed of the procedure, it can be used interactively even for several thousands of expression profiles. We have developed an interactive web interface that implements this combination of methods, which can be freely used (URL: http:// bioinfo.cnio.es/wwwsomtree).

470

Journal of Proteome Research • Vol. 1, No. 5, 2002

Acknowledgment. J.H. is supported by a CNIO fellowship.

References (1) Brown, P. O.; Botsein, D. Exploring the new world of the genome with DNA microarrays. Nature Biotechnol. 1999, 14, 1675-1680. (2) Eisen, M.; Spellman, P. L.; Brown, P. O.; Botsein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 14863-14868. (3) Wen, X.; Fuhrman, S.; Michaels, G. S.; Carr, D. B.; Smith, S.; Barker, J. L.; Somogyi, R. Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 334-339. (4) Tamayo, P.; Slonim, D.; Mesirov, J.; Zhu, Q.; Kitareewan, S.; Dmitrovsky, E.; Lander, E. S.; Golub, T. R. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 2907-2912. (5) Hartigan, J. A. Clustering algorithms; Wiley: New York, 1975. (6) To¨ro¨nen, P.; Kolehmainen, M.; Wong, G.; Castre´n, E. Analysis of gene expression data using self-organizing maps. FEBS Lett. 1999, 451, 142-146. (7) Herrero, J.; Valencia, A.; Dopazo, J. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 2001, 17, 126-136. (8) Kohonen, T. Self-organizing maps; Springer-Verlag; Berlin, 1997. (9) Sammon, J. W. A nonlinear mapping for data structure analysis. IEEE Trans. Comput. 1969, C18, 401-409. (10) Dopazo, J.; Carazo, J. M. Phylogenetic reconstruction using a growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol .1997, 44, 226-233. (11) Herwig, R.; Poutska, A. J.; Mu ¨ ller, C.; Bull, C.; Lehrach, H.; O’Brien, J. Large-scale clustering of cDNA-fingerprinting data. Genome Res. 1999, 9, 1093-1105. (12) DeRisi, J. L.; Iyer, V. R.; Brown, P. O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278, 680-686. (13) Chu, S.; DeRisi, J.; Eisen, M.; Mulholland, J.; Botsein, D.; Brown, P. O.; Herskowitz, I. The Transcriptional Program of Sporulation in Budding Yeast. Science 1998, 282, 699-705 (14) Cummings, C. A. Application of SOTA, a growing neural network algorithm, to gene expression profile clustering. Briefing Bioinf. 2001, 2, 402-404.

PR025521V