Evaluation of Clustering Algorithms for Protein Complex and Protein

Mar 24, 2009 - Assembling protein complexes and protein interaction networks from affinity purification-based proteomics data sets remains a challenge...
Evaluation of Clustering Algorithms for Protein Complex and Protein Interaction Network Assembly

Mihaela E. Sardiu, Laurence Florens, and Michael P. Washburn*

Stowers Institute for Medical Research, Kansas City, Missouri 64110

Received January 30, 2009

Assembling protein complexes and protein interaction networks from affinity purification-based proteomics data sets remains a challenge. When little a priori knowledge of the complexes exists, it is difficult to place proteins in the proper locations and evaluate the results of clustering approaches. Here we have systematically compared multiple hierarchical and partitioning clustering approaches using a well-characterized but highly complex human protein interaction network data set centered around the conserved AAA+ ATPases Tip49a and Tip49b. This network provides a challenge to clustering algorithms because Tip49a and Tip49b are present in four distinct complexes, the network contains modules, and the network has multiple attachments. We compared the use of binary data, quantitative proteomics data in the form of normalized spectral abundance factors, and the Z-score normalization. In our analysis, a partitioning approach indicated the major modules in a network. Next, while Euclidean distance was sensitive to scaling, with data transformation, all the attachments in a data set were recovered in one branch of a dendrogram. Finally, when Pearson correlation and hierarchical clustering were used, complexes were well separated and their attachments were placed in the proper locations. Each of these three approaches provided distinct information useful for assembly of a network of multiple protein complexes.

Keywords: quantitative proteomics • hierarchical clustering • partition clustering • normalized spectral abundance factor • Z-score normalization • protein interaction networks

Introduction

Major efforts are underway to define human protein interaction networks. The two general approaches are to use yeast two-hybrid assays1 and affinity purification along with tandem mass spectrometry.2,3 The two approaches yield distinctly different data sets. Yeast two-hybrid-based approaches intrinsically yield binary data but can be effectively scaled to cover a large portion of interaction space.1 Affinity purification followed by tandem mass spectrometry is more time-consuming and expensive, but it directly measures the associations of a protein and can capture entire complexes, or multiple complexes, with each purification.2,3 Because of these differences, distinct approaches for assembling interaction networks based on each type of data have been generated.2-5

An important component of any protein network assembly approach is the use of clustering algorithms.2,3,6 Clustering methods can broadly be classified into two types, hierarchical and partitional clustering, according to the method adopted to define the clusters.7,8 There are two kinds of hierarchical techniques, agglomerative and divisive. Agglomerative algorithms begin with each element as a separate cluster and successively merge them into larger clusters.7,8 In contrast, divisive algorithms begin with the whole data set as a single cluster and sequentially subdivide it into smaller clusters.7,8 The traditional representation of this hierarchy is a tree data structure called a dendrogram, in which each branch represents a group of genes/proteins that are more closely related to each other. In contrast to hierarchical methods, partitioning methods subdivide the data into a predetermined number of subsets, without any implied hierarchical relationship between these clusters.7,8 Therefore, whenever a partition method is used, the desired number of clusters has to be specified a priori, or tested in an iterative fashion where the number of clusters is changed in each iteration.

The choice of a suitable clustering algorithm depends both on the type of data and on the particular purpose. Sometimes several algorithms are applicable, and a priori arguments may not suffice to narrow down the choice to a single method. Therefore, it is recommended to run all suitable clustering algorithms and then carefully analyze the results. However, especially for novel or less-characterized complexes, selecting the correct method will be difficult; hence, the availability of benchmarks for cluster analysis will ease this decision.

We have recently described a probabilistic approach for assembling protein interaction networks based on quantitative proteomics.2 Specifically, we used the spectral counting based approach termed normalized spectral abundance factors to characterize the human protein interaction network centered on the conserved AAA+ ATPases Tip49a and Tip49b.2 This network contains four protein complexes with roles in chromatin remodeling including SRCAP,9 INO80,10 and TRRAP/TIP60,11 and one complex involved in nutrient sensing and TOR signaling named Uri/Prefoldin.12 Furthermore, this network contains modules and multiple attachments that present organizational challenges.2 In the present study, we restricted our attention to 55 proteins that were determined to be connected with the four complexes, of which 43 proteins were deemed core components of the complexes.2 We systematically compared different clustering approaches using this small-medium proteomics data set to determine which approaches provide the most insight into complex assembly and network organization.

In doing so, we tested three approaches for entering data into the clustering algorithms. The first approach used binary values where a value of 1 was used if a protein is present and a value of 0 was used if a protein is absent. Next, we used the label-free spectral count based method named the normalized spectral abundance factor.13-15 Finally, we tested the Z-score normalization, where the mean is subtracted from each quantitative value and the result is then divided by the standard deviation, so that the mean of each column is zero and the standard deviation is one.16,17 Each type of input value was then applied to a variety of clustering approaches to systematically compare and contrast the results.

* To whom correspondence should be addressed. Michael P. Washburn, Stowers Institute for Medical Research, 1000 E. 50th St., Kansas City, MO 64110. Phone: 816-925-4457. Fax: 816-926-4694. E-mail: mpw@stowers-institute.org.

Journal of Proteome Research 2009, 8, 2944–2952. Published on Web 03/24/2009. 10.1021/pr900073d CCC: $40.75 © 2009 American Chemical Society

Results and Discussion

TIP49a/b Data Set. To evaluate the clustering results, we took advantage of the fact that the data set utilized in the current study was previously well characterized both by computational methods and by experimental validation.2,9,10,12,18 This network consists of a total of four complexes (hINO80, SRCAP, TRRAP/TIP60, and URI/Prefoldin), and an important characteristic of these complexes is the number of common proteins shared between them,2 which provides a challenge for clustering analysis. Three of the four complexes (hINO80, SRCAP, and TRRAP/TIP60), which are implicated in chromatin modification and remodeling,9-11 shared three proteins (TIP49a/b and BAF53) and therefore formed a module (hereafter referred to as “module1”).2 In addition, SRCAP and TRRAP/TIP60 shared another four proteins (YL-1, DMAP1, GAS41, and H2AZ) and hence formed another module (hereafter referred to as “module2”).2 In contrast, the URI/Prefoldin complex, known to be involved in nutrient sensing and TOR signaling,12 shares only two proteins (TIP49a/b) with the other three complexes. Among the 55 proteins that were identified as being connected to the four complexes, a few proteins were detected only in a single purification (LEPRE1, UQCRH, LOC388272, ANP32E, and HBB) and therefore were considered attachments to the specific purification (hereafter referred to as “attachments1”).2 Other proteins were shown to be related to a few subunits of the complexes (LIN9, FLJ20436, FLJ20729, SIT49a/b, ZnF-HI2, NUFIP, and DPCD) and thus were defined as attachments to the specific complexes (hereafter referred to as “attachments2”).2 Given these partitions, we decided to evaluate the clustering results independently, by assessing how well the core components (nonshared proteins of the complexes), modules, and attachments were detected and correctly assigned.
To obtain an objective measure of the clustering algorithms, we enumerated the total number of misplaced proteins among the core components, the modules, and the attachments.

Clustering Methods Used. First, hierarchical clustering with five different agglomerative procedures (UPGMA, WPGMA, complete linkage (CO), single linkage (SL), and Ward’s) was tested in this study. Note that these procedures differ in the way the intercluster distance is defined (the “linkage function”). Pearson correlation and Euclidean distances were used for this evaluation. The software PermutMatrix was employed for the assessment of the agglomerative procedures.19 Regarding the clustering based on partitioning, the following algorithms were applied in our and other studies since they are considered to be solid performers for clustering analysis7,20-23 and are freely available through various R libraries (www.r-project.org). A partition method called K-medoid PAM,8,23 K-means,23,24 and a fuzzy logic based method FANNY8,23 were used to evaluate partitioning methods. Two metric distances were applied with partitioning methods, Manhattan and Euclidean, which are currently the only available options for these libraries.

Table 1. Effects of the Data Normalization Methods and Hierarchical Clustering Algorithms on the Human Tip49a/b Data Set

data set                    distance measurement    best algorithm (agglomerative clustering approaches)
Binary                      Pearson                 UPGMA/WPGMA
NSAF                        Pearson                 UPGMA/WPGMA
Z-score normalization       Pearson                 WARD
Binary                      Euclidean               UPGMA/WPGMA/WARD/Single Linkage and Complete Linkage
NSAF ln transformed         Euclidean               UPGMA/WPGMA/WARD and Complete Linkage
Z-score rank transformed    Euclidean               UPGMA/WARD

Hierarchical Clustering with the Pearson Correlation. The hierarchical clustering approach is well accepted and often used in the analysis of transcription7,23,25 and proteomics data2,3,6,26 and was thus chosen as the starting point of our analysis. We used five different agglomerative procedures with two different distances (Pearson, Euclidean) to separately cluster the three matrices consisting of binary values, NSAF values, and Z-scores, respectively (see Methods). The results obtained from each clustering procedure using each of the matrices were compared to each other, and the best algorithm together with the best distance metric is reported in Table 1. To start, we focused on using binary data to recover the core components, modules, and attachments of the four complexes present in the data. The first approach used binary values where a value of 1 was used if a protein is present and a value of 0 was used if a protein is absent.
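The role of the linkage function can be made concrete with a small sketch. The code below is an illustration only, not PermutMatrix's implementation: the function name `linkage_distance` and the toy distance matrix are assumptions introduced here. It covers the three linkage rules that depend only on pairwise distances (single, complete, and UPGMA); WPGMA and Ward's method also depend on the merge history or on cluster centroids and are left out.

```python
def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters given a pairwise distance matrix.

    cluster_a and cluster_b are lists of item indices; dist[i][j] is the
    distance between items i and j (Pearson-based or Euclidean, for example).
    """
    pairwise = [dist[i][j] for i in cluster_a for j in cluster_b]
    if method == "single":      # closest pair of members
        return min(pairwise)
    if method == "complete":    # farthest pair of members
        return max(pairwise)
    if method == "average":     # UPGMA: unweighted mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical symmetric distance matrix for four proteins.
toy_dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 8],
    [4, 2, 0, 3],
    [6, 8, 3, 0],
]

single = linkage_distance([0, 1], [2, 3], toy_dist, "single")
complete = linkage_distance([0, 1], [2, 3], toy_dist, "complete")
average = linkage_distance([0, 1], [2, 3], toy_dist, "average")
print(single, complete, average)
```

The same pair of clusters is 2, 8, or 5 units apart depending on the rule, which is why different agglomerative procedures can produce different dendrograms from identical input.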
When Pearson correlation was used as a distance, the UPGMA and WPGMA algorithms gave the best result in recovering the complexes (Table 1). By examining the dendrogram resulting from UPGMA, we observed that the core components of the four complexes were generally well recovered, except for the SRCAP complex, which had one misplaced core protein, ZnF-HIT1 (Figure 1A). In terms of the modules, the components of the first module were found in close proximity to each other (under the same main branch), while for the second module, only YL-1, GAS41, and DMAP1 were located under the same tree branch, whereas H2AZ clustered with BAF53 (Figure 1A). Although we assigned BAF53 together with TIP49a/b to module1, it is likely to be retained close to the subunits of the second module since it is also shared between the SRCAP and TRRAP/TIP60 complexes. Also, the H2AZ protein can be recovered with module1 since it showed a similar pattern to BAF53 (Figure 1A). Regarding the attachments, hierarchical clustering with Pearson correlation placed the majority of the attachments incorrectly. This is visualized by the arrows and lines to the right of Figure 1A. For example, Sit49ab was far from its proper attachment location next to Tip49a and Tip49b.

Next, the data was clustered using the NSAF values of identified proteins using the same procedures as described in the previous paragraph for binary values. We inspected all five different agglomerative procedures using Pearson correlation as a distance metric and found that UPGMA and WPGMA performed the best (Table 1). All core subunits were correctly situated with each of the two algorithms, with the results for UPGMA shown in Figure 1B. Concerning the modules, module1 was broken apart by separating TIP49a/b from BAF53, while all the subunits of module2 were located in proximity to each other under the same main branch of the dendrogram on the left of Figure 1B. The attachments were generally observed in the vicinity of the proteins that they associate with, as shown by the arrows and lines on the side of Figure 1B. For example, Sit49ab was properly placed next to Tip49a and Tip49b in this case (Figure 1B) but was improperly placed when using binary values (Figure 1A). The exception to this was UQCRH and FLJ20729, which was also misplaced using binary values and Z-score normalizations (Figure 1). However, it should be noted that when FLJ20729 was used as bait, it showed a clear association with the hINO80 complex, whereas as a prey (pulled down by a different bait) it exhibited a strong connection with NUFIP, both in terms of pattern and in terms of abundance level. In such a situation, it is recommended to also cluster the baits and then decide on the correct association.

Figure 1. Representation of the unified TIP49a/b complexes. Hierarchical clustering separated the 55 proteins into the core, modules, and attachments of the four protein complexes (hINO80, SRCAP, TRRAP/TIP60, and URI/Prefoldin). The core components and the attachments of the complexes are colored as: hINO80, purple; URI/Prefoldin, red; TRRAP/TIP60, green; and SRCAP, blue. All the modules are depicted in orange. Data was clustered using: (A) binary values and UPGMA with Pearson correlation, (B) normalized spectral abundance factor (NSAF) and UPGMA with Pearson correlation (this portion of Figure 3 is largely reproduced from Sardiu et al.2 with modifications), and (C) Z-scores with Ward and Pearson correlation. In all portions of the figure attachments are linked to their appropriate proteins with the arrows on the side.

Figure 2. Hierarchical clustering of binary data. Hierarchical clustering was used to separate the core subunits of the complexes, modules, and attachments using binary values and the UPGMA algorithm combined with Euclidean distance. The core components and the attachments of the complexes are colored as: hINO80, purple; URI/Prefoldin, red; TRRAP/TIP60, green; and SRCAP, blue. All of the modules are depicted in orange. The attachments are shown by a bracket on the side of the figure.

Finally, the data set was converted into Z-scores, another common normalization method. When Pearson correlation was used as the distance metric, the Ward algorithm performed best compared with the others (Table 1). The use of Z-scores with the Ward algorithm is shown in Figure 1C. In general, the complexes were properly separated and most of the attachments were in close proximity to their proper locations, with the exception of Sit49ab (Figure 1C). Overall, using Pearson correlation with hierarchical clustering, NSAF values with the UPGMA/WPGMA methods performed the best, followed by Z-scores with the Ward method.

Hierarchical Clustering with Euclidean Distance. Since Euclidean distance was previously used to determine components of protein complexes based on relative abundance from mass spectrometry,27 we next examined the impact of this distance metric combined with different agglomerative procedures on the clustering result. In contrast to Pearson correlation, when Euclidean distance was used as a distance metric with any of the agglomerative procedures and binary values, all the attachments were recovered under the same tree (Figure 2). The cores and modules were well positioned and separated (Figure 2). In contrast, when using NSAF values with Euclidean distance and the UPGMA method, although a few proteins from the same complex were found close to each other, the complexes themselves were poorly recovered and the tree was uninformative (Figure 3A). One important attribute of the Euclidean distance is that it is sensitive to scaling and differences in average expression level, whereas Pearson correlation is not. Because the difference between maximum and minimum NSAF values in our data set is small, Euclidean distance does not produce good results on the NSAF values. An alternative solution would be to apply the same type of transformation as for microarray data analysis, that is, to apply a logarithm to the NSAF values to attain a larger scale.
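The effect of the logarithm on the scale of the data can be seen on a small made-up profile. The sketch below is illustrative only: the NSAF values are hypothetical, `ln_transform` is a name introduced here, and the default offset is the 0.000001 that the text reports adding so that zero entries can be log transformed.

```python
import math


def ln_transform(values, offset=1e-6):
    """Natural log of each value after adding a small offset.

    The offset keeps zero entries finite: a zero becomes ln(offset).
    """
    return [math.log(v + offset) for v in values]


# Hypothetical NSAF profile of one protein across five purifications.
nsaf_profile = [0.0, 0.002, 0.01, 0.05, 0.0]
transformed = ln_transform(nsaf_profile)

# Spread (max minus min) before and after the transformation.
raw_spread = max(nsaf_profile) - min(nsaf_profile)
log_spread = max(transformed) - min(transformed)
print(f"spread before: {raw_spread:.3f}, after ln: {log_spread:.2f}")
```

On this toy profile the raw values span only 0.05 units, while the transformed values span more than 10, so absent and present proteins are far apart on the new scale.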
Because we are dealing with a sparse matrix that is populated with many zeros, we added a very small number (0.000001) to all of the NSAF values before the logarithm was applied. We subsequently clustered the new values using Euclidean distance and the five different agglomerative procedures and observed that four out of five methods presented the same result (Table 1). Using natural log transformed NSAF values, Euclidean distance, and the UPGMA method, all the core components were correctly placed and all of the attachments were clustered under the same tree (Figure 3B). Regarding the modules, TIP49a/b were clustered with BAF53 under the same main branch (module1), while for the second module, H2AZ was located far apart from DMAP1, GAS41, and YL-1 (Figure 3B). When Euclidean distance was used, we encountered the same problem with Z-scores as for nontransformed NSAF values, that is, we did not obtain any satisfactory clusters and the dendrograms were uninformative (Figure 4A). However, in the case of Z-scores it is impossible to transform the data by applying a logarithm because several of the values are negative. One possibility is to rank the values based on the Z-scores and then perform the agglomerative clustering using Euclidean distance. Euclidean distance performed well on the ranked values and positioned all the core proteins correctly and all the attachments under the same main branch (Figure 4B). Concerning the modules, the subunits of the first module were clustered together, while for the second module, H2AZ was clustered near BAF53 (Figure 4B).

The following is a summary of the application of hierarchical clustering to this data set. First, the Pearson correlation worked well on all three types of data in terms of recovering the core components, while for the determination of the modules binary and NSAF values performed better. When Pearson correlation was used on NSAF values and Z-scores, H2AZ was clustered with the components of module1, while in the case of Euclidean distance in combination with all three types of data, BAF53 was clustered near the components of the second module. As for the attachments, clustering based on NSAF values gave the best result by placing the attachments in the vicinity of their associated proteins. If the goal is to recover all the attachments in a single cluster, Euclidean distance worked well on the binary values, the transformed NSAFs, and the transformed Z-scores.

Figure 3. Hierarchical clustering of NSAF values. Hierarchical clustering was used to separate the core subunits of the complexes, modules, and attachments using: (A) NSAF values and UPGMA with Euclidean distance and (B) the natural logarithm transformed NSAF values with UPGMA and Euclidean distance. The attachments are shown as grouped by a bracket on the side of the figure. The core components and the attachments of the complexes are colored as: hINO80, purple; URI/Prefoldin, red; TRRAP/TIP60, green; and SRCAP, blue. All the modules are depicted in orange.

Partitioning Clustering. In contrast to hierarchical methods, partitioning methods subdivide the data into a predetermined number of subsets, without any implied hierarchical relationship between these clusters.7,8 Therefore, whenever a partition method is used, the desired number of clusters has to be specified a priori, or tested in an iterative fashion where the number of clusters is changed in each iteration. In this study, we compared three different partitioning clustering approaches, namely, the methods PAM,8,23 FANNY,23,24 and K-means,8,23 and used two distances, that is, Euclidean and Manhattan. Since all three methods require specifying the preferred number of clusters (k) a priori, we chose several values for k ranging from k = 4, the number of complexes in the data, up to k = 8 (Supplementary Table 1, Supporting Information).

Figure 4. Hierarchical clustering of Z-scores. Hierarchical clustering was used to separate the core subunits of the complexes, modules, and attachments using: (A) Z-scores and UPGMA with Euclidean distance and (B) the rank transformed Z-scores with UPGMA and Euclidean distance. The attachments are shown as grouped by a bracket on the side of the figure. The core components and the attachments of the complexes are colored as: hINO80, purple; URI/Prefoldin, red; TRRAP/TIP60, green; and SRCAP, blue. All the modules are depicted in orange.

When binary variables were used as input for the partition analysis, we observed that for k = 4 and 5, the fuzzy approach implemented in the FANNY program performed the best (Supplementary Table 1, Supporting Information). For k = 4, FANNY combined with Euclidean distance produced four good clusters: (1) the core of hINO80 and module1; (2) the core components of SRCAP and TRRAP/TIP60 as well as module2 (proteins shared between SRCAP and TRRAP/TIP60); (3) the URI/Prefoldin complex alone; and (4) all the attachments in a single cluster. In the case of k = 5, the FANNY algorithm generated the same clustering results using Euclidean and Manhattan distance, giving rise to the following five clusters: (1) the core components of hINO80 and module1; (2) the two core components of the SRCAP complex as well as module2; (3) the core subunits of the URI/Prefoldin complex; (4) the core components of TRRAP/TIP60 and one core subunit of the SRCAP complex; and (5) all the attachments. Note that when k = 5, the core components of the TRRAP/TIP60 complex were assigned to a single cluster, in contrast to k = 4, where the two complexes (SRCAP and TRRAP/TIP60) that share the most components (module2) were all clustered together. We initially expected increasing values of k to yield a better separation between the modules and the core components of the complexes as well as between the attachments present in a single or multiple purifications. However, the results indicated that when binary values are used for the partition clustering, greater values of k result in a breaking of the core complexes.

We next subjected the transformed NSAF values to the three partition methods. For k = 4 and 5, fuzzy clustering produced the most correct partition compared with the other algorithms. Furthermore, Euclidean and Manhattan distances generated identical results for k = 4, which were similar to those obtained from the binary analysis (Supplementary Table 1, Supporting Information). However, for k = 6, PAM and fuzzy methods performed better on the transformed NSAF values compared to the binary values. With Z-scores, the fuzzy method yielded the most correct partitions, followed by the PAM algorithm. In contrast, the results obtained from the K-means algorithm with Z-scores were unsatisfactory compared with the other two methods with Z-scores (Supplementary Table 1, Supporting Information). One possible explanation for this result is that the PAM and FANNY algorithms are more robust than K-means and perform better on sparse human proteomics data sets like the one we used, which is populated with many zeros. In general, the partition clustering based on binary values and NSAF values produced (especially for k = 4 and 5) superior partition clustering results compared to the ranked Z-score values (see Table 2).

Table 2. Effects of the Data Normalization Method and Partitional Clustering Algorithms on the Human Tip49a/b Data Set

method     distance     binary       NSAF         Z-score
PAM        Manhattan    k = 5        k = 4-6      k = 4,5
PAM        Euclidean    k = 4,5      k = 4-6      k = 4
Fuzzy      Manhattan    k = 4,5      k = 4-6      k = 4,6
Fuzzy      Euclidean    k = 4,5      k = 4-6      k = 5
K-means    Manhattan    k = 4,5      k = 4,5      k = 4,5
K-means    Euclidean    k = 4,5      k = 4-6      k = 4

The fuzzy algorithm implemented in FANNY23 has the advantage of avoiding “hard” decisions. More explicitly, instead of assigning a protein to only one cluster, it allocates proteins to multiple clusters by assigning a so-called membership coefficient (percentage) for each of the proteins to each cluster (k), requiring that the coefficients sum to 1 (100%). To further illustrate the significance of the fuzzy algorithm results on proteomics data, we plotted the membership coefficients of the subunits of module1 and module2 for each of the five clusters obtained for k = 5 using the FANNY method and Euclidean distance on the binary values, transformed NSAF values, and ranked Z-scores (Figure 5A-C). As described earlier, because Euclidean distance was used with FANNY, both natural log transformed NSAF values and ranked Z-scores were the necessary inputs. If the clusters were correctly partitioned, the subunits of the modules (proteins that are shared between multiple complexes) would be expected to be present at similarly high coefficients in the clusters containing the complexes that the module associates with, while for the other complexes the coefficients should be very low or zero. For instance, in the case of the transformed NSAF values used as input in the fuzzy algorithm, the proteins TIP49a/b were present with similar coefficients in all five clusters (Figure 5B). This was in accordance with our prediction, given that these two proteins are related to the majority of the components of the clusters (i.e., are present in nearly all the purifications).2 Similarly, the protein BAF53 showed similar coefficients in four of the clusters and exhibited a lower coefficient only in the cluster that contained the components of the URI/Prefoldin complex (BAF53 was still present in UXT1 and PDRG purifications) (Figure 5B). In the case of module2, GAS41, DMAP1, YL-1, and H2AZ scored the highest membership with cluster2, which contained the subunits of the SRCAP complex (Figure 5B).

Figure 5. Membership coefficients for FANNY algorithm. The membership coefficients obtained from the FANNY fuzzy method are illustrated only for the proteins that are shared between multiple complexes and that were consequently separated into different modules: module1 consisting of TIP49a/b and BAF53 proteins, and module2 comprising DMAP1, GAS41, YL-1, and H2AZ proteins. The FANNY program was run using k = 5 and: (A) binary values with Euclidean distance, (B) logarithm transformed NSAF values with Euclidean distance, and (C) the rank transformed Z-scores with Euclidean distance. Cluster1 (black) corresponded to the components of the hINO80 complex, cluster2 (blue) consisted of the core subunits of the SRCAP complex, cluster3 (gray) comprised the core components of the URI/Prefoldin complex, cluster4 (purple) corresponded to the TRRAP/TIP60 complex, and cluster5 (light blue) consisted of all the attachments.

To summarize the use of partition methods, when k = 4 or 5 was used, both binary and transformed NSAF values gave satisfactory results for clustering using fuzzy methods in order to obtain correct partitions. In terms of distances, Euclidean was overall more consistent in giving correct partitions, although Manhattan produced identical results in a few situations (Supplementary Table 1, Supporting Information).
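The idea of membership coefficients that sum to 1 can be illustrated with a short sketch. This is not the FANNY algorithm, which minimizes its own objective function; the code instead uses the membership formula of fuzzy c-means with fixed, hypothetical cluster centers, purely to show how a shared protein receives similar coefficients in several clusters while a complex-specific protein does not. The function names, the centers, and the binary profiles are all assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(point, centers, fuzzifier=2.0):
    """Fuzzy c-means style membership of one point in each cluster.

    Returns one coefficient per center; the coefficients sum to 1.
    """
    dists = [euclidean(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (fuzzifier - 1.0)
    inverse = [d ** -power for d in dists]
    total = sum(inverse)
    return [w / total for w in inverse]


# Hypothetical cluster centers over four purifications (binary presence).
centers = [
    [1, 1, 0, 0],  # complex A detected in purifications 1 and 2
    [0, 0, 1, 1],  # complex B detected in purifications 3 and 4
]

shared_protein = [1, 1, 1, 1]    # present in every purification
specific_protein = [1, 1, 0, 1]  # mostly follows complex A

shared = memberships(shared_protein, centers)
specific = memberships(specific_protein, centers)
print("shared:", [round(u, 2) for u in shared])
print("specific:", [round(u, 2) for u in specific])
```

In this toy setting the shared protein is split evenly between the two clusters, whereas the specific protein is assigned mostly to the first one, which mirrors the pattern expected for module subunits versus core subunits.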


Conclusions

Clustering approaches play an important role in evaluating proteomics data sets.19,28 In particular, clustering approaches are valuable for determining the proteins in a complex from multiple purifications26,27 and as one tool for assembling protein interaction networks.2,6 However, many different clustering approaches exist that have been used, for example, in evaluating microarray data sets.7,23 In this body of work we set out to evaluate various clustering approaches using a previously well-defined protein interaction network data set to determine which clustering approaches work with which data types and which clustering approaches provide valuable insights when assembling complexes into networks. We compared the use of binary data, label-free quantitative proteomics data in the form of NSAF values, and the Z-score normalization method. From this analysis three major observations were made. First, for data sets containing multiple complexes that share several proteins, a fuzzy logic partitioning method like FANNY can recover the major modules in a network by providing membership coefficients, which demonstrated that BAF53, Tip49a, Tip49b, and H2AZ were components of multiple complexes. Next, the use of Euclidean distance with hierarchical methods like UPGMA can place all the attachments from a data set into a single branch of a dendrogram. This was the case for binary values, but since Euclidean distance was sensitive to scaling, both NSAF values and Z-scores needed to be transformed to achieve this result. Finally, when Pearson correlation and hierarchical clustering were used, complexes were well separated and their attachments were placed in proper locations, and this was especially the case with the use of NSAF values. In summary, there is no perfect clustering approach able to solve all challenges in protein complex characterization.

Going forward, as protein interaction network data sets are acquired and assembled, we believe that these three approaches will be valuable and will help make distinct observations depending on the goal. The FANNY approach may be used to determine the major modules of a network, hierarchical clustering using Euclidean distance may be used for identifying attachments, and hierarchical clustering using Pearson correlation may be used to properly place the attachments.

Materials and Methods

Data Set Description. The mammalian Tip49a (Rvb1) and Tip49b (Rvb2) proteins belong to an evolutionarily conserved family of AAA+ ATPases and are involved in multiple protein complexes. In S. cerevisiae, Tip49a and Tip49b are subunits of two distinct ATP-dependent chromatin remodeling complexes, SWR1 and INO80. In higher organisms, TIP49a/b are components of at least four multiprotein complexes that play roles in chromatin remodeling (SRCAP, INO80, TRRAP/TIP60) or nutrient sensing (Uri/Prefoldin). The generation and the analysis of the data set were previously described in ref 2.

Normalization and Scaling Description. The raw data was assembled into a matrix $A_{n \times m}$ (55 × 27) in which the columns represent the individual purifications (“baits”) and the rows represent the associated proteins (“preys”, which if detected also include the bait). The elements of the matrix $A_{n \times m}$ are the spectral counts. The first normalization procedure used is known as the normalized spectral abundance factor (NSAF), defined as:

$$\hat{X}_{ik} = \frac{(SpC)_{ik}/L_{ik}}{\sum_{i=1}^{n} (SpC)_{ik}/L_{ik}} \tag{1}$$

with $1 \le i \le n$ and $k = 1, 2, \ldots, m$, where $L$ represents the length of a protein. The new matrix elements now lie between 0 and 1. The second scaling produces a new matrix (Z-scores) in which each column has a mean of zero and a standard deviation of one. A general scaling of $X_{ik}$ (defined as spectral count/length) to produce $\hat{X}_{ik}$ can be defined using two real numbers $\alpha_k$ and $\beta_k$, for $k = 1, 2, \ldots, m$, termed scaling and displacement factors, respectively, where $\alpha_k > 0$. Namely, for $k = 1, 2, \ldots, m$, we define the scaled components as:

$$\hat{X}_{ik} = \alpha_k (X_{ik} - \beta_k), \qquad 1 \le i \le n \tag{2}$$

The displacement and scaling factors are defined as:

$$\beta_k = \frac{1}{n} \sum_{i=1}^{n} X_{ik} \tag{3}$$

$$\alpha_k = 1 \Big/ \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_{ik} - \beta_k)^2} \tag{4}$$
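The two normalizations in eqs 1-4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this study: the function names `nsaf` and `zscore_columns` and the toy counts and lengths are hypothetical, each protein is assumed to have a single length regardless of the purification, and the standard deviation uses the 1/n form of eq 4.

```python
import math


def nsaf(spectral_counts, lengths):
    """Column-wise NSAF (eq 1): SpC/L divided by the column sum of SpC/L.

    spectral_counts: rows are preys, columns are purifications (baits).
    lengths: one protein length per row (assumed the same in every column).
    """
    n_rows, n_cols = len(spectral_counts), len(spectral_counts[0])
    saf = [[spectral_counts[i][k] / lengths[i] for k in range(n_cols)]
           for i in range(n_rows)]
    col_sums = [sum(saf[i][k] for i in range(n_rows)) for k in range(n_cols)]
    # An all-zero column is left at zero rather than dividing by zero.
    return [[saf[i][k] / col_sums[k] if col_sums[k] > 0 else 0.0
             for k in range(n_cols)] for i in range(n_rows)]


def zscore_columns(x):
    """Scale each column to mean 0 and standard deviation 1 (eqs 2-4)."""
    n_rows, n_cols = len(x), len(x[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for k in range(n_cols):
        col = [x[i][k] for i in range(n_rows)]
        shift = sum(col) / n_rows                            # displacement, eq 3
        var = sum((v - shift) ** 2 for v in col) / n_rows
        scale = 1.0 / math.sqrt(var) if var > 0 else 0.0     # scaling, eq 4
        for i in range(n_rows):
            out[i][k] = scale * (col[i] - shift)             # eq 2
    return out


# Toy matrix: 3 preys x 2 purifications (made-up counts and lengths).
counts = [[10, 0], [20, 5], [0, 15]]
lengths = [100, 400, 300]
nsaf_matrix = nsaf(counts, lengths)
x = [[counts[i][k] / lengths[i] for k in range(2)] for i in range(3)]
z_matrix = zscore_columns(x)
```

Each NSAF column sums to 1, so values are comparable across purifications of different depth, while each Z-score column is centered on zero and can therefore contain negative values.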

Distance Measure. The measured relationship between raw vectors depends on how the distance or similarity between them is determined. Here we consider three distance measures that are commonly used for gene and protein expression data. A simple measure is the Manhattan distance, defined as:

$$d(a, b) = \sum_{i} |a(i) - b(i)| \tag{5}$$

A more common measure is the Euclidean distance, computed as:

$$d^2(a, b) = \sum_{i=1}^{N} [a(i) - b(i)]^2 = d^2(b, a) \tag{6}$$
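Equations 5 and 6 translate directly into code. The sketch below is illustrative only: the function names and toy vectors are hypothetical. It returns the Euclidean distance itself, whereas eq 6 gives its square. The last lines show that multiplying both vectors by a constant multiplies the Manhattan distance by the same constant.

```python
import math


def manhattan(a, b):
    """Manhattan distance (eq 5): sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance: square root of the sum in eq 6."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors (made-up values): b has the same profile shape as a,
# but twice the magnitude.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

d_man = manhattan(a, b)   # |1-2| + |2-4| + |3-6|
d_euc = euclidean(a, b)   # sqrt(1 + 4 + 9)

# Rescaling both vectors by 10 rescales the distance by 10.
d_man_scaled = manhattan([10 * x for x in a], [10 * x for x in b])
```

Here `a` and `b` have identical profile shapes yet a nonzero distance, because both measures respond to magnitude as well as shape.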

Euclidean and Manhattan distances, however, do not distinguish whether $a(i) \ge 0$ or $a(i) < 0$. Furthermore, both are sensitive to scaling and to differences in average expression level. Pearson correlation is another method of choice for cluster analysis:

$$r(a, b) = \frac{\sum_{i} (a_i - \bar{a})(b_i - \bar{b})}{(n - 1) S_a S_b} \tag{7}$$

$S_a$ is the standard deviation of vector a, and $S_b$ is the standard deviation of vector b.

Software Used. The software PermutMatrix was employed for the assessment of the agglomerative procedures.19 For clustering based on partitioning, the following algorithms, applied in our and other studies, were chosen because they are considered solid performers for clustering analysis20-22 and are freely available through various R libraries (www.r-project.org): the K-medoid partition method PAM,8,23 K-means,23,24 and the fuzzy logic based method FANNY.8,23 Two distance metrics, Manhattan and Euclidean, were applied with the partitioning methods; these are currently the only available options for these libraries (http://stat.ethz.ch/R-manual/Rpatched/library/cluster/html/00Index.html).

Journal of Proteome Research • Vol. 8, No. 6, 2009
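To make eq 7 and its use in agglomerative clustering concrete, the following sketch computes the Pearson correlation and then performs average-linkage (UPGMA) agglomeration on the distance 1 - r. It is a simplified illustration, not the PermutMatrix implementation: the function names, the choice of 1 - r as the distance, and the toy profiles are assumptions, and constant vectors (zero standard deviation) are not handled.

```python
import math


def pearson(a, b):
    """Pearson correlation (eq 7) using sample standard deviations."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    s_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / ((n - 1) * s_a * s_b)


def upgma(profiles):
    """Average-linkage agglomeration on the distance 1 - r.

    Returns the merge history as (cluster_1, cluster_2, distance) tuples,
    where each cluster is a tuple of row indices into `profiles`.
    """
    n = len(profiles)
    dist = {(i, j): 1.0 - pearson(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # UPGMA: mean distance over all pairs of original members.
                pairs = [(min(i, j), max(i, j))
                         for i in clusters[x] for j in clusters[y]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (x, y)] + [merged]
    return merges


# Toy profiles (made-up values): rows 0 and 1 rise, rows 2 and 3 fall.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.0, 2.5],
]
merges = upgma(profiles)
```

Because the correlation ignores overall magnitude, rows 0 and 1 merge first even though row 1 is twice as large, which is the behavior that separates Pearson-based clustering from the scale-sensitive distances.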

Acknowledgment. This work was supported by the Stowers Institute for Medical Research.

Supporting Information Available: Supplementary Table 1 and Supplementary Figure 1. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Rual, J. F.; Venkatesan, K.; Hao, T.; Hirozane-Kishikawa, T.; Dricot, A.; Li, N.; Berriz, G. F.; Gibbons, F. D.; Dreze, M.; Ayivi-Guedehoussou, N.; Klitgord, N.; Simon, C.; Boxem, M.; Milstein, S.; Rosenberg, J.; Goldberg, D. S.; Zhang, L. V.; Wong, S. L.; Franklin, G.; Li, S.; Albala, J. S.; Lim, J.; Fraughton, C.; Llamosas, E.; Cevik, S.; Bex, C.; Lamesch, P.; Sikorski, R. S.; Vandenhaute, J.; Zoghbi, H. Y.; Smolyar, A.; Bosak, S.; Sequerra, R.; Doucette-Stamm, L.; Cusick, M. E.; Hill, D. E.; Roth, F. P.; Vidal, M. Towards a proteome-scale map of the human protein-protein interaction network. Nature (London) 2005, 437 (7062), 1173–8.
(2) Sardiu, M. E.; Cai, Y.; Jin, J.; Swanson, S. K.; Conaway, R. C.; Conaway, J. W.; Florens, L.; Washburn, M. P. Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proc. Natl. Acad. Sci. U.S.A. 2008, 105 (5), 1454–9.
(3) Ewing, R. M.; Chu, P.; Elisma, F.; Li, H.; Taylor, P.; Climie, S.; McBroom-Cerajewski, L.; Robinson, M. D.; O’Connor, L.; Li, M.; Taylor, R.; Dharsee, M.; Ho, Y.; Heilbut, A.; Moore, L.; Zhang, S.; Ornatsky, O.; Bukhman, Y. V.; Ethier, M.; Sheng, Y.; Vasilescu, J.; Abu-Farha, M.; Lambert, J. P.; Duewel, H. S.; Stewart, I. I.; Kuehl, B.; Hogue, K.; Colwill, K.; Gladwish, K.; Muskat, B.; Kinach, R.; Adams, S. L.; Moran, M. F.; Morin, G. B.; Topaloglou, T.; Figeys, D. Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol. Syst. Biol. 2007, 3, 89.
(4) Venkatesan, K.; Rual, J. F.; Vazquez, A.; Stelzl, U.; Lemmens, I.; Hirozane-Kishikawa, T.; Hao, T.; Zenkner, M.; Xin, X.; Goh, K. I.; Yildirim, M. A.; Simonis, N.; Heinzmann, K.; Gebreab, F.; Sahalie, J. M.; Cevik, S.; Simon, C.; de Smet, A. S.; Dann, E.; Smolyar, A.; Vinayagam, A.; Yu, H.; Szeto, D.; Borick, H.; Dricot, A.; Klitgord, N.; Murray, R. R.; Lin, C.; Lalowski, M.; Timm, J.; Rau, K.; Boone, C.; Braun, P.; Cusick, M. E.; Roth, F. P.; Hill, D. E.; Tavernier, J.; Wanker, E. E.; Barabasi, A. L.; Vidal, M. An empirical framework for binary interactome mapping. Nat. Methods 2009, 6 (1), 83–90.
(5) Braun, P.; Tasan, M.; Dreze, M.; Barrios-Rodiles, M.; Lemmens, I.; Yu, H.; Sahalie, J. M.; Murray, R. R.; Roncari, L.; de Smet, A. S.; Venkatesan, K.; Rual, J. F.; Vandenhaute, J.; Cusick, M. E.; Pawson, T.; Hill, D. E.; Tavernier, J.; Wrana, J. L.; Roth, F. P.; Vidal, M. An experimentally derived confidence score for binary protein-protein interactions. Nat. Methods 2009, 6 (1), 91–7.
(6) Collins, S. R.; Kemmeren, P.; Zhao, X. C.; Greenblatt, J. F.; Spencer, F.; Holstege, F. C.; Weissman, J. S.; Krogan, N. J. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol. Cell. Proteomics 2007, 6 (3), 439–50.
(7) Gollub, J.; Sherlock, G. Clustering microarray data. Methods Enzymol. 2006, 411, 194–213.
(8) Kaufman, L.; Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis; Wiley: New York, 1990; p 342.
(9) Ruhl, D. D.; Jin, J.; Cai, Y.; Swanson, S.; Florens, L.; Washburn, M. P.; Conaway, R. C.; Conaway, J. W.; Chrivia, J. C. Purification of a human SRCAP complex that remodels chromatin by incorporating the histone variant H2A.Z into nucleosomes. Biochemistry 2006, 45 (17), 5671–7.
(10) Jin, J.; Cai, Y.; Yao, T.; Gottschalk, A. J.; Florens, L.; Swanson, S. K.; Gutierrez, J. L.; Coleman, M. K.; Workman, J. L.; Mushegian, A.; Washburn, M. P.; Conaway, R. C.; Conaway, J. W. A mammalian chromatin remodeling complex with similarities to the yeast INO80 complex. J. Biol. Chem. 2005, 280 (50), 41207–12.
(11) Cai, Y.; Jin, J.; Florens, L.; Swanson, S. K.; Kusch, T.; Li, B.; Workman, J. L.; Washburn, M. P.; Conaway, R. C.; Conaway, J. W. The mammalian YL1 protein is a shared subunit of the TRRAP/TIP60 histone acetyltransferase and SRCAP complexes. J. Biol. Chem. 2005, 280 (14), 13665–70.
(12) Gstaiger, M.; Luke, B.; Hess, D.; Oakeley, E. J.; Wirbelauer, C.; Blondel, M.; Vigneron, M.; Peter, M.; Krek, W. Control of nutrient-sensitive transcription programs by the unconventional prefoldin URI. Science 2003, 302, 1208–12.
(13) Paoletti, A. C.; Parmely, T. J.; Tomomori-Sato, C.; Sato, S.; Zhu, D.; Conaway, R. C.; Conaway, J. W.; Florens, L.; Washburn, M. P. Quantitative proteomic analysis of distinct mammalian Mediator complexes using normalized spectral abundance factors. Proc. Natl. Acad. Sci. U.S.A. 2006, 103 (50), 18928–33.
(14) Pavelka, N.; Fournier, M. L.; Swanson, S. K.; Pelizzola, M.; Ricciardi-Castagnoli, P.; Florens, L.; Washburn, M. P. Statistical similarities between transcriptomics and quantitative shotgun proteomics data. Mol. Cell. Proteomics 2008, 7 (4), 631–44.
(15) Zybailov, B.; Mosley, A. L.; Sardiu, M. E.; Coleman, M. K.; Florens, L.; Washburn, M. P. Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae. J. Proteome Res. 2006, 5 (9), 2339–47.
(16) Cheadle, C.; Vawter, M. P.; Freed, W. J.; Becker, K. G. Analysis of microarray data using Z score transformation. J. Mol. Diagn. 2003, 5 (2), 73–81.
(17) Adkins, J. N.; Monroe, M. E.; Auberry, K. J.; Shen, Y.; Jacobs, J. M.; Camp, D. G., 2nd; Vitzthum, F.; Rodland, K. D.; Zangar, R. C.; Smith, R. D.; Pounds, J. G. A proteomic study of the HUPO Plasma Proteome Project’s pilot samples using an accurate mass and time tag strategy. Proteomics 2005, 5 (13), 3454–66.
(18) Cai, Y.; Jin, J.; Yao, T.; Gottschalk, A. J.; Swanson, S. K.; Wu, S.; Shi, Y.; Washburn, M. P.; Florens, L.; Conaway, R. C.; Conaway, J. W. YY1 functions with INO80 to activate transcription. Nat. Struct. Mol. Biol. 2007, 14 (9), 872–4.
(19) Meunier, B.; Dumas, E.; Piec, I.; Bechet, D.; Hebraud, M.; Hocquette, J. F. Assessment of hierarchical clustering methodologies for proteomic data mining. J. Proteome Res. 2007, 6 (1), 358–66.
(20) Tavazoie, S.; Hughes, J. D.; Campbell, M. J.; Cho, R. J.; Church, G. M. Systematic determination of genetic network architecture. Nat. Genet. 1999, 22 (3), 281–5.
(21) Kim, S. Y.; Lee, J. W.; Bae, J. S. Effect of data normalization on fuzzy clustering of DNA microarray data. BMC Bioinformatics 2006, 7, 134.
(22) Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. On clustering validation techniques. J. Intell. Inf. Syst. 2001, 17, 107–145.
(23) Datta, S.; Datta, S. Evaluation of clustering algorithms for gene expression data. BMC Bioinformatics 2006, 7 (Suppl 4), S17.
(24) MacQueen, J. B. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. Math. Stat. Probability 1967, 1, 281–297.
(25) Do, J. H.; Choi, D. K. Clustering approaches to identifying gene expression patterns from DNA microarray data. Mol. Cells 2008, 25 (2), 279–88.
(26) McAfee, K. J.; Duncan, D. T.; Assink, M.; Link, A. J. Analyzing proteomes and protein function using graphical comparative analysis of tandem mass spectrometry results. Mol. Cell. Proteomics 2006, 5 (8), 1497–513.
(27) Powell, D. W.; Weaver, C. M.; Jennings, J. L.; McAfee, K. J.; He, Y.; Weil, P. A.; Link, A. J. Cluster analysis of mass spectrometry data reveals a novel component of SAGA. Mol. Cell. Biol. 2004, 24 (16), 7249–59.
(28) Kislinger, T.; Cox, B.; Kannan, A.; Chung, C.; Hu, P.; Ignatchenko, A.; Scott, M. S.; Gramolini, A. O.; Morris, Q.; Hallett, M. T.; Rossant, J.; Hughes, T. R.; Frey, B.; Emili, A. Global survey of organ and organelle protein expression in mouse: combined proteomic and transcriptomic profiling. Cell 2006, 125 (1), 173–86.
