Identifying Protein Complexes Using Hybrid Properties

Lei Chen,§,‡ Xiaohe Shi,| Xiangyin Kong,| Zhenbing Zeng,§ and Yu-Dong Cai*,†

Institute of Systems Biology, Shanghai University, Shanghai 200444, People's Republic of China; Centre for Computational Systems Biology, Fudan University, Shanghai 200433, People's Republic of China; Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, People's Republic of China; and Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai 200025, People's Republic of China

Received June 25, 2009

* To whom correspondence should be addressed. Tel: 0086-21-66136132. Fax: 0086-21-66136109. E-mail: [email protected]. † Shanghai University. ‡ Fudan University. § East China Normal University. | Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine.

Protein complexes, integrating multiple gene products, perform all sorts of fundamental biological functions in cells. Much effort has been put into identifying protein complexes using computational approaches, the vast majority of which attempt to find densely connected regions in the protein-protein interaction (PPI) network/graph. In this research, we take an alternative approach, analyzing protein complexes using hybrid features, and present a method to determine whether multiple (more than two) yeast proteins can form a protein complex. The data set consists of 493 positive protein complexes and 9878 negative protein complexes. Every complex is represented by graph features, computed from the graph (web) of interactions formed by the proteins in the complex, together with features derived from biological properties, including protein length, biochemical properties and physicochemical properties. These features are filtered and optimized by the Minimum Redundancy Maximum Relevance method, Incremental Feature Selection and Forward Feature Selection, built on a prediction/identification model, the Nearest Neighbor Algorithm. The jackknife cross-validation test is employed to evaluate the identification accuracy. As a result, the highest accuracy for identifying real protein complexes using the filtered features is 69.17%, and feature analysis shows that, among the adopted features, graph features play the main role in the determination of protein complexes.

Keywords: Protein complex • mRMR (Minimum Redundancy Maximum Relevance) • Protein-protein interaction (PPI) • Nearest Neighbor Algorithm (NNA) • Jackknife cross-validation test • Feature selection

1. Introduction

Protein complexes are fundamental to the biological processes within a cell. Correctly identifying the protein complexes in an organism is therefore helpful for understanding its molecular mechanisms. In this paper, protein complexes determined by experiments are the first-hand materials for the prediction of unknown protein complexes, although these raw materials are not perfect, since some protein complexes may escape detection by experiments.1-3

Protein complexes need to be coded first as digital vectors (i.e., features) in order to be processed by mathematical models. Features derived from hybrid properties, namely graph properties and biological properties, are used to code protein complexes in this paper. Biological properties include protein length, amino acid composition, protein secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability and solvent accessibility. Graph properties come from the protein-protein interaction (PPI) network/graph, where the vertices denote proteins and the edges denote the interactions between proteins. Many approaches make use of the PPI network to discover protein complexes, usually by searching for densely connected regions (such as cliques);4-7 for example, King8 used a cost-based clustering algorithm to partition the node set of a PPI network into clusters. However, many complexes are not dense in the PPI graph, such as linear paths. Rives9 proposed a protein complex identification algorithm that searches the shortest paths between proteins in the yeast filament protein network and observed that the proteins along the shortest path are more likely to form filament protein complexes. Because the many kinds of topologies present in protein complexes and the tremendous variation in complex size make it difficult to enumerate specific topologies, instead of searching all possible graph topologies to find protein complexes, one can simply extract graph features that are able to represent various topologies and use these features to identify protein complexes. A recently published paper1 used graph features to search for complexes, in which the topology features are learned from known complexes to predict unknown ones. This paper analyzes graph features that may be important for determining protein complexes. The edges of the graphs are weighted by the likelihood of possible interactions, estimated through the gene ontology (GO) consortium, between every pair of proteins in the network.



Figure 1. The size distribution of positive and negative complexes.

In addition to graph features, the biological properties of the proteins in each complex may also contribute to the determination of protein complexes. In this research, we combine these two kinds of properties and analyze which are the important properties and subproperties of protein complexes. The features derived from the hybrid properties are input into an identification model to determine whether a given protein complex is valid. A data set, consisting of 493 positive/valid protein complexes (known protein complexes) and 9878 negative/invalid protein complexes of yeast, is constructed for the study. Each complex is represented by a 295-dimensional feature vector. To optimize and analyze the 295 features, we combine Minimum Redundancy Maximum Relevance (mRMR), Incremental Feature Selection (IFS) and Forward Feature Selection (FFS) to select an optimized feature set for the prediction. The Nearest Neighbor Algorithm is used as the identification/classification model, and the prediction accuracy is evaluated by the jackknife cross-validation test. As a result, 152 features were selected as the optimized feature set, and the prediction accuracy is 69.17% for the positive protein complexes. Feature analysis showed that graph properties play the main roles in the determination of protein complexes, while, to a lesser extent, the biochemical and physicochemical properties also contribute toward the prediction.

2. Materials and Methods

2.1. Data Set. The data set consists of positive protein complexes (known protein complexes) and negative protein complexes of yeast. The positive protein complexes are downloaded from http://www.cs.cmu.edu/~qyj/SuperComplex/.1 After restricting the size of the protein complexes to between 3 and 20, 493 positive protein complexes are obtained. Negative complexes are generated by randomly choosing proteins to be the components of the complexes. Because valid protein complexes are very rare compared to the vast number of possible negative protein complexes, negative complexes are produced to be about 20 times as many as the positive ones, and there is still very little chance that those negative complexes are actually positive. Figure 1 shows the size distribution of the positive and negative complexes. The detailed data set can be found in Supplemental Materials I.
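As an illustration of this construction, the following Python sketch generates negative complexes by randomly sampling proteins; the size range of 3-20 and the 20:1 negative-to-positive ratio follow the description above, while the function and variable names are hypothetical and the check against known complexes is an extra safeguard not described in the paper (which relies on the low probability of sampling a true complex).

```python
import random

def generate_negative_complexes(proteins, positive_complexes, ratio=20,
                                min_size=3, max_size=20, seed=0):
    """Randomly sample protein sets as negative complexes (sketch).

    `proteins` is the list of yeast protein identifiers and
    `positive_complexes` is a list of known complexes (iterables of proteins).
    """
    rng = random.Random(seed)
    known = {frozenset(c) for c in positive_complexes}
    negatives = []
    target = ratio * len(positive_complexes)
    while len(negatives) < target:
        size = rng.randint(min_size, max_size)
        candidate = frozenset(rng.sample(proteins, size))
        if candidate not in known:   # optional safeguard against known complexes
            negatives.append(candidate)
    return negatives
```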

2.2. Complex Features. Graph properties and biological properties are used to code protein complexes. In this study, 29 graph features are extracted from each graph that represents a protein complex, and 266 biological features, derived from protein length, biochemical properties and physicochemical properties, are extracted from the protein sequences of all the proteins in a complex, giving 29 + 266 = 295 features altogether. Please refer to Supplemental Materials II for these features. The edges between each pair of proteins are weighted by the likelihood that they are able to interact with each other, as explained in detail in section 2.3. The 295 features are divided into 12 feature groups as follows (a sketch that computes several of the graph features in groups 1-7 is given after group 7):

1. Graph size and graph density: Let G = (V,E) be a complex graph with |V| vertices and |E| edges. The graph size is the number of proteins in the complex. Let |E|m = |V|(|V| - 1)/2 be the theoretical maximum number of possible edges in G. The graph density is defined as |E| divided by |E|m.10

2. Degree statistics: The degree of a vertex is the number of its neighbors. We take the mean degree, variance of degrees, median degree and maximum degree as the vertex-degree features.11

3. Edge weight statistics: Let G = (V,w(E)) be a weighted complete graph where each edge carries a weight w in the range [0,1]. Since w(e) = 0 for some edges e ∈ E, features are computed for two cases: (a) all edges are considered, including those with zero weight (the mean and variance of all weights are taken as features); (b) only edges with nonzero weight are considered (the mean and variance of the nonzero weights are taken as features).10

4. Topological change: Let G = (V,w(E)) be a weighted complete graph. This group of features is obtained by measuring the topological changes when different weight cutoffs are applied to the graph. The weight cutoffs are 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8. Let Gi = (V,Ei) (i = 1-8) be the graph that keeps only the edges with weights higher than i/10, that is, Ei = {e | w(e) > i/10}. Topological changes are measured as Ti = (|Ei| - |Ei+1|)/|Ei|, i = 1-7. If |Ei| = 0, let Ti = 0.

5. Degree correlation and clustering: Let G = (V,E) be a complex graph with V = {v1, v2, ..., vn}. For each vertex vi, denote its neighbors as Vi' = {vi1, vi2, ..., vik} and let Hi = (Vi', Ei) be the induced subgraph of G. Define Di = |Ei|/k (if k = 0, Di = 0) and Ci = 2|Ei|/[k(k - 1)] (if k ≤ 1, Ci = 0). The features in this group are the mean, variance and maximum of D1, ..., Dn and of C1, ..., Cn, respectively.11,12

6. Topological: Let G = (V,E) be a complex graph with V = {v1, v2, ..., vn}. For each pair of vertices vi, vj (i ≠ j), denote nij as the number of common neighbors of vi and vj (plus 1 if edge vivj ∈ E, as there is then a direct link between the two vertices) and denote ni as the number of neighbors of vi. Let Tij = nij/ni (if ni = 0, then Tij = 0). For each vertex vi, let Ti be the mean of Ti1, Ti2, ..., Tin. The topological features are defined as the mean, variance and maximum of T1, ..., Tn.12

7. Singular values: Let G = (V,E) be a complex graph and A be its adjacency matrix. The three largest singular values of A are taken as features.10
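The graph features above are standard graph statistics that can be computed directly from a weight matrix. The following Python sketch illustrates a few of them (graph size and density, degree statistics, edge-weight statistics and the three largest singular values); numpy, the cutoff used to decide which weighted edges count as edges, and the function name are assumptions of this sketch, not details stated in the paper.

```python
import numpy as np

def graph_features(W, weight_cutoff=0.0):
    """Compute a few of the graph features described above from a symmetric
    weight matrix W (entries in [0, 1], zero diagonal). Hypothetical helper."""
    n = W.shape[0]                           # graph size (number of proteins)
    A = (W > weight_cutoff).astype(float)    # unweighted adjacency at a cutoff
    max_edges = n * (n - 1) / 2.0
    density = A.sum() / 2.0 / max_edges      # group 1: graph density

    degrees = A.sum(axis=1)                  # group 2: degree statistics
    degree_stats = (degrees.mean(), degrees.var(),
                    np.median(degrees), degrees.max())

    upper = W[np.triu_indices(n, k=1)]       # group 3: edge weight statistics
    nonzero = upper[upper > 0]
    weight_stats = (upper.mean(), upper.var(),
                    nonzero.mean() if nonzero.size else 0.0,
                    nonzero.var() if nonzero.size else 0.0)

    singular = np.linalg.svd(A, compute_uv=False)[:3]  # group 7: top singular values
    return n, density, degree_stats, weight_stats, singular
```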

The following feature groups concern biological properties, including protein length, biochemical properties and physicochemical properties. Biochemical properties include amino acid composition and secondary structure, while physicochemical properties include hydrophobicity, normalized van der Waals volume, polarity, polarizability and solvent accessibility. The use of these biological properties was motivated by previous work on proteins.13-16 Suppose a complex consists of n proteins; the mean and maximum values of the biological features over the n proteins are taken as the complex features.

8. Protein length: the number of amino acids in a protein sequence. The detailed protein sequences can be found in Supplemental Materials III.17

9. Hydrophobicity, normalized van der Waals volume, polarity and polarizability: 21 features can be extracted from each of these physicochemical properties.18,19 Here we only describe how the features are obtained for the hydrophobicity property; features for the other properties are obtained in the same way. Each amino acid is assigned to one of three categories: polar (P), neutral (N) or hydrophobic (H). For a given protein sequence, each amino acid is substituted by P, N or H, and the resulting sequence is called the protein pseudosequence. Composition (C) is the percentage of P, N and H in the whole pseudosequence. Transition (T) is the changing frequency between any two characters (P and N, P and H, N and H). Distribution (D) is the sequence segment (in percentage of the pseudosequence length) needed to contain the first, 25%, 50%, 75% and the last of the Ps, Ns and Hs, respectively. In total, there are 3 features for C, 3 features for T and 15 features for D, that is, 21 features. A sketch of this composition/transition/distribution encoding is given below, after the normalization description.

10. Solvent accessibility: each protein sequence is coded by two letters (H and E). The composition (C) for H, the transition (T) between H and E, and the five distributions (D) for H are used for this property, resulting in 7 features in total.

11. Secondary structure: each protein sequence is coded by three letters, as for the hydrophobicity property. For details, please see refs 20 and 21. In total, 21 features are derived from this property.

12. Amino acid composition: the percentage of each amino acid occurring in the whole sequence. In total, 20 amino acid composition features are extracted.

Table 1 shows the number of features in feature groups 9-12. Before taking the mean and maximum values of the features in feature groups 9-12, normalization is applied to bring the values onto a standard scale. The normalization function is U_ij = (u_ij - ū_j)/T_j, where T_j = [Σ_{i=1}^{N} (u_ij - ū_j)² / (N - 1)]^{1/2} is the standard deviation of the jth feature and ū_j = Σ_{i=1}^{N} u_ij / N is the mean value of the jth feature.
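To make the composition/transition/distribution (CTD) description concrete, the following Python sketch encodes one sequence for one property and returns the 3 + 3 + 15 = 21 features. The residue grouping itself is not reproduced here and should be taken from refs 18 and 19; the function name and the exact handling of fractional positions are assumptions of this sketch.

```python
import math

def ctd_features(sequence, groups):
    """Composition/Transition/Distribution features for one property (sketch).

    `groups` maps each of three class labels (e.g., 'P', 'N', 'H') to the set
    of amino acids in that class.
    """
    labels = sorted(groups)
    to_label = {aa: lab for lab in labels for aa in groups[lab]}
    pseudo = [to_label[aa] for aa in sequence if aa in to_label]
    n = len(pseudo)

    # Composition: percentage of each class in the pseudosequence.
    comp = [100.0 * pseudo.count(lab) / n for lab in labels]

    # Transition: frequency of changes between each unordered pair of classes.
    pairs = [(labels[0], labels[1]), (labels[0], labels[2]), (labels[1], labels[2])]
    trans = [100.0 * sum(1 for a, b in zip(pseudo, pseudo[1:])
                         if {a, b} == {x, y}) / (n - 1) for x, y in pairs]

    # Distribution: position (as % of length) of the 1st, 25%, 50%, 75% and
    # last occurrence of each class.
    dist = []
    for lab in labels:
        positions = [i + 1 for i, c in enumerate(pseudo) if c == lab]
        for frac in (0.0, 0.25, 0.50, 0.75, 1.0):
            if not positions:
                dist.append(0.0)
            else:
                idx = max(math.ceil(frac * len(positions)) - 1, 0)
                dist.append(100.0 * positions[idx] / n)
    return comp + trans + dist
```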


Table 1. Number of Features in Feature Groups 9-12

property                          C    T    D    total
Hydrophobicity                    3    3    15   21
Normalized van der Waals volume   3    3    15   21
Polarity                          3    3    15   21
Polarizability                    3    3    15   21
Solvent accessibility             1    1    5    7
Secondary structure               3    3    15   21
Amino acid composition            20   -    -    20
Total                             -    -    -    132

Table 2. The Distribution of the 295 Features

group ID   group name                                                                      number of features
1          Graph size and graph density                                                    2
2          Degree statistics                                                               4
3          Edge weight statistics                                                          4
4          Topological change                                                              7
5          Degree correlation and clustering                                               6
6          Topological                                                                     3
7          Singular values                                                                 3
8          Protein length                                                                  2 × 1 = 2
9          Hydrophobicity, normalized van der Waals volume, polarity and polarizability    2 × 4 × 21 = 168
10         Solvent accessibility                                                           2 × 7 = 14
11         Secondary structure                                                             2 × 21 = 42
12         Amino acid compositions                                                         2 × 20 = 40

The total number of complex features is 2 + 4 + 4 + 7 + 6 + 3 + 3 + 2 × (1 + 4 × 21 + 21 + 7 + 20) = 29 + 2 × 133 = 295. Table 2 shows the distribution of these 295 features.

2.3. Interactions between Each Pair of Proteins. As described in section 2.2, the computation of some features requires the weights of the graph edges. Weights quantify the likelihood of the interaction between every pair of proteins, and they can be estimated by encoding the proteins using the gene ontology (GO) consortium. "Ontology" is a specification of a conceptualization that refers to the subject of existence. GO is established according to the following three criteria: (I) biological process, referring to a biological objective to which the gene or gene product contributes; (II) molecular function, defined as the biochemical activity of a gene product; (III) cellular component, referring to the place in the cell where a gene product is active. It is very common for the same protein, or proteins in the same subfamily, to form protein complexes; for example, the proteins Ste2p and Ste3p form a complex between activated G protein-coupled receptors in yeast cellular mating.22 It is also common for proteins in heterofamilies to form protein complexes if they share a conserved motif; for example, the proteins Ctf19, Mcm21, and Okp1 form a heterocomplex in the budding yeast kinetochore.23 Complicated protein complexes may be formed by multiple proteins, some of which share the same biological processes and some of which come from the same subfamily; for example, the Dsl1p complex, involved in Golgi-ER retrograde transport, includes Dsl1p, Dsl3p, Q/t-SNARE proteins, and so forth.24 Thus the GO consortium is considered a very helpful vehicle for investigating protein-protein interactions,25 because these three criteria reflect the attributes of genes, gene products, gene-product groups and subcellular localization.26-28 The steps of the GO (gene ontology) encoding system are described as follows:


1. With the Uniprot2GO mapping provided by GOA Uniprot 34.0 on November 21, 2005 (http://www.ebi.ac.uk/GOA/),29 containing 9525 GO items, the functional GO annotations of the proteins are obtained.

2. Each protein is represented by the values of the 9525 GO items, composing a 9525-dimensional (9525D) vector: if a given protein hits the ith GO item, the ith component of the 9525D vector is set to 1, and otherwise to 0.

3. Thus, a protein sample can be formulated as T = (t1, t2, ..., ti, ..., t9525), where ti = 1 if the sample hits the ith GO item and ti = 0 otherwise.

In this research, 4726 proteins were investigated. Because some GO items are zero for every protein, the data can be packed more compactly by removing the all-zero items. As a result, each protein is represented by a 2936-dimensional vector (please refer to Supplemental Materials IV for details). Let p = (p1, p2, ..., p2936) and q = (q1, q2, ..., q2936) be two proteins represented by 2936-dimensional vectors. Proteins in protein complexes tend to have more common hits in the GO items because, as discussed above, such proteins are more likely to share the same biological process and molecular function in the same cellular component. Thus, the likelihood of the interaction between p and q, that is, the weight of the edge pq, is computed by the following formula:

w(p, q) = (p · q) / (|p| · |q|)

where p · q is the dot product of p and q, and |p| and |q| are the moduli of p and q, respectively.
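This weight is simply the cosine similarity of the two binary GO vectors. A minimal Python sketch follows; numpy and the function name are assumptions of this illustration.

```python
import numpy as np

def go_edge_weight(p, q):
    """Cosine similarity between two binary GO annotation vectors (sketch).

    p and q are 0/1 numpy arrays of equal length (2936 GO items after
    removing all-zero items); the result lies in [0, 1] and is used as
    the weight of the edge between the two proteins.
    """
    norm = np.linalg.norm(p) * np.linalg.norm(q)
    if norm == 0:                 # a protein with no GO annotation
        return 0.0
    return float(np.dot(p, q) / norm)
```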

2.4. Minimum Redundancy Maximum Relevance (mRMR). Feature selection can reduce the feature dimension and improve the efficiency of a learning machine. In this research, mRMR, first proposed by Peng,30 is employed because it balances minimum redundancy against maximum relevance. Maximum relevance guarantees that the features contributing most to the classification are selected, while minimum redundancy guarantees that features whose predictive ability is already covered by the selected features are excluded. mRMR adds one feature at a time to the feature list; in each round, the feature with maximum relevance and minimum redundancy is selected. As a result, a feature list ordered by selection round is obtained. Both redundancy and relevance are computed through mutual information (MI), which is defined as follows:

I(x, y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy

where x and y are two random variables, p(x, y) is their joint probability distribution, and p(x) and p(y) are the marginal probabilities of x and y, respectively. Let Ω denote the whole feature set, let Ωs denote the already-selected feature set with m features, and let Ωt denote the to-be-selected feature set with n features. The relevance of a feature f to the target variable h is computed as I(f, h), and the redundancy between a feature f and the already-selected set Ωs is computed as r(f, Ωs) = (1/m) Σ_{fi ∈ Ωs} I(f, fi). If m = 0, the redundancy is r(f, Ωs) = 0. For each feature f in Ωt, compute the following equation:

R(f, Ωs) = I(f, h) - r(f, Ωs)

To maximize relevance and minimize redundancy, we select the feature f′ ∈ Ωt such that R(f′, Ωs) = max_{f ∈ Ωt} R(f, Ωs), move f′ into Ωs and remove it from Ωt. This step is repeated, each time selecting the most relevant and least redundant feature from Ωt and putting it into Ωs, until all features are in Ωs. Thus, for a feature pool Ω with N (N = n + m) features, the mRMR program executes N rounds and provides an ordered feature list F = [f0, f1, ..., fk, ..., fN-1], where k denotes the round in which the feature is selected. A minimal sketch of this selection loop is given below.
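The following Python sketch shows the greedy mRMR ordering described above, assuming a precomputed mutual information estimator. In practice the authors used the downloadable mRMR program (see section 3.1), so this is only an illustration, and the helper names are hypothetical.

```python
def mrmr_order(features, target, mutual_info):
    """Greedy mRMR ordering (sketch).

    `features` maps feature names to value vectors, `target` is the class
    vector, and `mutual_info(a, b)` returns the mutual information between
    two vectors (e.g., estimated from discretized data).
    """
    remaining = set(features)            # Omega_t, the to-be-selected set
    selected = []                        # Omega_s, ordered by selection round
    relevance = {f: mutual_info(features[f], target) for f in remaining}
    while remaining:
        def score(f):
            # redundancy: mean MI with the already-selected features
            if not selected:
                redundancy = 0.0
            else:
                redundancy = sum(mutual_info(features[f], features[s])
                                 for s in selected) / len(selected)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```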

2.5. Nearest Neighbor Algorithm (NNA). In this research, NNA,31 which has been widely applied in the bioinformatics area,32-36 was adopted to predict the class of a protein complex (positive or negative). The "nearness" of two feature vectors c1 and c2 is defined by the distance

d(c1, c2) = 1 - (c1 · c2) / (|c1| · |c2|)

where c1 · c2 is the dot product of the two vectors c1 and c2, and |c1| and |c2| are their moduli. The smaller d(c1, c2), the nearer the two vectors are. In the NNA, suppose there are m training protein complexes, each of which is either positive or negative, and a new protein complex needs to be classified as positive or negative. The distances between each of the m training complexes and the new complex are calculated, and the nearest neighbor of the new complex is found. If the nearest neighbor is positive/negative, the new protein complex is assigned to be positive/negative.

2.6. Jackknife Cross-Validation Test. The prediction model is tested by the jackknife cross-validation test.37 In such a test, every sample in the data set is singled out in turn as the testing data and the remaining samples are used to train the prediction model. Thus, every sample is tested exactly once. The accuracy is calculated as the percentage of complexes that are correctly classified. Since NNA is not strictly a learning algorithm and is not oriented toward maximizing the overall accuracy, we consider it sufficient to compute only the accuracy of identifying the positive protein complexes, which is the main focus of this research. This accuracy is defined as "the number of correctly predicted positive complexes"/"the total number of positive complexes".

2.7. Incremental Feature Selection (IFS). From mRMR, we obtain an ordered feature list F = [f0, f1, ..., fk, ..., fN-1]. We define the ith feature set as Fi = {f0, f1, ..., fi} (0 ≤ i ≤ N - 1), that is, Fi contains the first i + 1 features of F. For every i (0 ≤ i ≤ N - 1), we run the NNA with the features in Fi and obtain an accuracy of correctly identifying the positive protein complexes, evaluated by the jackknife cross-validation test. As a result, we plot a curve named the IFS curve, with the identification accuracy as its y-axis and the index i of Fi as its x-axis. A sketch of the NNA classifier evaluated with the jackknife test is given below.
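A minimal Python sketch of the NNA classifier under jackknife (leave-one-out) cross-validation, reporting the positive-class accuracy as defined above; numpy and the function name are assumptions of this sketch.

```python
import numpy as np

def jackknife_positive_accuracy(X, y):
    """Leave-one-out NNA evaluation (sketch).

    X is an (n_samples, n_features) array of complex feature vectors and
    y is a numpy array of labels (1 = positive complex, 0 = negative).
    Returns the fraction of positive complexes whose nearest neighbor
    (by d = 1 - cosine similarity) is also positive.
    """
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0                      # guard against zero vectors
    cosine = (X @ X.T) / np.outer(norms, norms)  # pairwise cosine similarities
    dist = 1.0 - cosine
    np.fill_diagonal(dist, np.inf)               # exclude the sample itself
    nearest = dist.argmin(axis=1)                # index of each nearest neighbor
    predicted = y[nearest]
    positives = (y == 1)
    return float((predicted[positives] == 1).mean())
```

To evaluate a candidate feature set such as Fi, the same routine can be applied to the corresponding columns of X.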


Figure 2. Distribution of the most relevant features.

Figure 3. IFS curve.

Figure 4. IFS and FFS curves between index 130 and 159.

2.8. Forward Feature Selection (FFS). To find the optimal features, we perform a forward feature selection. On the IFS curve, we empirically determine a point that lies around the inflection point of the curve. If the x-coordinate of this point is k, we select Fk as the initial feature set and take the remaining features as FR. Forward feature selection then tries to select further features from FR. The procedure is sketched as follows (a code sketch is given after the list):

1. Set FK = {f0, f1, ..., fk} and FR = {fk+1, ..., fN-1}.
2. For each feature f ∈ FR, calculate the accuracy using the feature set FK ∪ {f}.
3. Select the feature f′ ∈ FR that achieves the maximum accuracy when added to FK.
4. If the identification accuracy using FK ∪ {f′} is smaller than that using FK, stop and output FK; otherwise update FK = FK ∪ {f′} and FR = FR - {f′}, and go back to step 2.

As a result, we plot a curve named the FFS curve, with the identification accuracy as its y-axis and the index i of Fi as its x-axis. The final feature set FK is taken as the optimal feature set.
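A Python sketch of this forward selection loop, assuming an `accuracy(feature_subset)` function that runs the NNA with jackknife cross-validation on the chosen features (for example, the jackknife sketch above restricted to those columns); the function and variable names are hypothetical.

```python
def forward_feature_selection(ordered_features, k, accuracy):
    """Forward feature selection starting from the first k+1 mRMR features.

    `ordered_features` is the mRMR-ordered feature list F, `k` is the index
    chosen near the inflection point of the IFS curve, and `accuracy(feats)`
    returns the jackknife identification accuracy for a feature subset.
    """
    selected = list(ordered_features[:k + 1])     # F_K
    remaining = list(ordered_features[k + 1:])    # F_R
    best_acc = accuracy(selected)
    while remaining:
        # Try each remaining feature and keep the one that helps most.
        scored = [(accuracy(selected + [f]), f) for f in remaining]
        acc, f_best = max(scored, key=lambda t: t[0])
        if acc < best_acc:                        # no improvement: stop
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_acc = acc
    return selected, best_acc
```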

3. Results and Discussion

3.1. Results of mRMR. The mRMR program was downloaded from http://research.janelia.org/peng/proj/mRMR/ and run with default parameters. The result of the mRMR program contains two feature lists: the MaxRel feature list and the mRMR feature list. See Supplemental Materials V for the two detailed lists. For the MaxRel feature list, the most relevant 10% of the features (30 in total) were investigated. They contain graph properties, amino acid compositions, hydrophobicity, polarity, polarizability and normalized van der Waals volume. Among these, more than 60% of the features belong to graph properties, strongly indicating that the identification is most relevant to the graph features. The distribution of these features is shown in Figure 2.

3.2. Results of IFS and FFS. Figure 3 shows the IFS curve. To improve the efficiency of the computation, IFS was executed with variable step sizes to search for the highest accuracy, as follows:

1. Calculate the accuracy with the feature sets F10, ..., F290, using a step of 10 features;
2. Find the index of the feature set achieving the maximum accuracy, which is 140 in this research;
3. Refine the accuracy around F140 by calculating the accuracies for the feature sets F131, F132, ..., F159.

The highest accuracy of IFS is 67.34%, using 141 features. Since F140 may not be the true inflection point due to chance, we set the inflection point of the IFS curve to be 134, a bit earlier than F140; that is, k is set to 134 in the FFS procedure.


Figure 4 shows the IFS curve and the FFS curve between index 130 and 159. The highest accuracy of FFS is 69.17%, using 152 features, which is 1.83% higher than that of IFS. For the readers' interest, the accuracy on negative complexes and the total accuracy using these optimized 152 features are 96.84% and 95.53%, respectively. The detailed data of IFS and FFS can be found in Supplemental Materials VI.

3.3. The Most Important Features. IFS and FFS produce the optimized 152 features for the prediction. However, not all features improve the accuracy when they are added; for example, when the eighth feature, weight_edge_variance(with_missing_edge), is added, the identification accuracy is 48.88%, while it is 49.29% when it is not added. Features that decrease the identification accuracy are considered less important than the other features. After filtering the optimized 152 features, 101 features considered more important for complex identification are selected for further analysis. The filtered 101 features can be found in Supplemental Materials VII. In total, 9 graph features, out of the 19 graph features in the most relevant 10% of features, are part of the 101 most important features, indicating that there is severe redundancy among graph features. The 9 graph features are all at the forepart of the 101 features, and the forepart features contribute most toward the identification, indicating that graph features are the most important features for protein complex identification. Table 3 shows the distribution of the 101 features.

3.4. Analysis of the Important Features. The feature contributing most as an individual is weight_edge_mean (with_missing_edges), which is the mean of the weighted edges in a complex graph, including zero weights. If fewer zero-weighted edges or more heavily weighted edges are present in the graph, this feature tends to have a greater value. Fewer zero-weighted edges mean the graph is more densely connected, and more heavily weighted edges mean that the proteins are bound together more strongly or with higher confidence. These factors all increase the likelihood of the presence of a protein complex. The second most contributing feature is weight_edge_variance (without_missing_edges), which is the variance of the weighted edges, excluding edges with zero weights.


Table 3. The Distribution of the 101 Filtered Features

category                          number of features
Graph features                    9
Hydrophobicity                    12
Normalized van der Waals volume   11
Polarity                          13
Polarizability                    13
Secondary structure               13
Solvent accessibility             7
Amino acid composition            23
Total                             101

We speculate that the edges of a valid protein complex all tend to have higher weights, leading to a smaller variance of the weighted edges. The third most contributing feature is the polarity of the amino acids. The longer and more complementary the binding sites of the proteins, the majority of which would be polar, the more strongly the proteins are bound; and the distribution of the polarities of the amino acids strongly influences the conformation of a protein and consequently its binding sites. Although polarity is deemed a strong factor in the formation of protein complexes, the interpretation of polarity information from the primary structure of a protein is very primitive, which prevents accurately inferring a protein's tertiary structure from its primary structure. The fourth most contributing feature is the topological mean, which quantifies the various topologies of protein complexes. For a connected graph, a linear graph (the proteins form a linear path) has the minimum topological mean, while a complete graph has the maximum. A densely connected graph tends to have a higher topological mean, indicating a higher likelihood of forming a protein complex. The fifth most contributing feature is topological_change_0.6_0.7, which is the change in the number of edges when the weight cutoff is raised from 0.6 to 0.7. We expect that a valid complex has more edges with higher weights, and speculate that a greater value of this feature is more likely to come from a valid complex. Most of the forefront features, which contribute most to the identification, are graph features, indicating that graph features are the most important. The tertiary structure and the binding sites of a protein cannot be coded properly from the primary structure using the biological features, which prevents these features from identifying protein complexes well.

4. Conclusion

In this research, we analyze the graph properties and biological properties of protein complexes and construct a prediction model using the filtered features. An advantage of the model is that it does not require direct information about the topologies of protein complexes; instead, the topologies are coded by graph properties. A dedicated feature selection procedure is applied to choose an optimized feature set, and feature analysis shows that the graph properties are the most important for the identification. This indicates that the GO (Gene Ontology), from which the weights of the edges of the graph are derived, relates strongly to the formation of protein complexes. It also indicates that, given the rough coding of the biological properties of amino acids, the topologies of protein complexes are the main tool for their identification. The fact that the biological properties perform poorly in the identification suggests that the higher-level structures (e.g., secondary and tertiary structure) of proteins cannot be accurately represented by the primary structure under the current coding techniques. The experimentally determined protein interaction network has not been used in this research, and future work could combine the experimentally determined protein interactions with the GO-estimated interactions to further improve the identification.

Acknowledgment. This work was supported by the National Basic Research Program of China (2004CB518603) and a grant from the Shanghai Commission for Science and Technology (KSCX2-YW-R-112).

Supporting Information Available: Supplemental Materials I.txt, listing the detailed components of the protein complexes; Supplemental Materials II.txt, listing the 295-dimensional vector of each complex; Supplemental Materials III.txt, listing the protein sequences; Supplemental Materials IV.txt, listing the GO data of each protein; Supplemental Materials V.txt, listing the MaxRel feature list and the mRMR feature list; Supplemental Materials VI.txt, listing the accuracies of IFS and FFS; Supplemental Materials VII.txt, listing the filtered 101 features. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Qi, Y.; Balem, F.; Faloutsos, C.; Seetharaman, J. K.; Joseph, Z. B. Protein complex identification by supervised graph local clustering. Bioinformatics 2008, 24, 250–258.
(2) Gavin, A. C. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415, 141–147.
(3) Ho, Y. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415, 180–183.
(4) Adamcsek, B.; Palla, G.; Farkas, I. J.; Derenyi, I.; Vicsek, T. CFinder: Locating cliques and overlapping modules in biological networks. Bioinformatics 2006, 22, 1021–1023.
(5) Bader, G. D.; Hogue, C. W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinf. 2003, 4, 2.
(6) Zotenko, E.; Guimaraes, K. S.; Jothi, R.; Przytycka, T. M. Decomposition of overlapping protein complexes: A graph theoretical method for analyzing static and dynamic protein associations. Algorithms Mol. Biol. 2006, 1–7.
(7) Spirin, V.; Mirny, L. A. Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 12123–12128.
(8) King, A. D.; Przulj, N.; Jurisica, I. Protein complex prediction via cost-based clustering. Bioinformatics 2004, 20, 3013–3020.
(9) Rives, A. W.; Galitski, T. Modular organization of cellular networks. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 1128–1133.
(10) Chakrabarti, D. Tools for Large Graph Mining. Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, 2005.
(11) Barabasi, A. L.; Oltvai, Z. N. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 2004, 5, 101–103.
(12) Stelzl, U.; Worm, U.; Lalowski, M.; Haenig, C.; Brembeck, F.; Goehler, H.; Stroedicke, M.; Zenkner, M.; Schoenherr, A.; Koeppen, S. A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122, 957–968.
(13) Bock, J. R.; Gough, D. A. Predicting protein-protein interactions from primary structure. Bioinformatics 2001, 17 (5), 455–460.
(14) Cai, C. Z.; Han, L. Y.; Ji, Z. L.; Chen, X.; Chen, Y. Z. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003, 31 (13), 3692–3697.
(15) Cai, C. Z.; Wang, W. L.; Sun, L. Z.; Chen, Y. Z. Protein function classification via support vector machine approach. Math. Biol. Sci. 2003, 185 (2), 111–122.
(16) Yu, X. J.; Cao, J. P.; Cai, Y. D.; Shi, T. L.; Li, Y. X. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J. Theor. Biol. 2006, 240 (2), 175–184.
(17) Cherry, J. M.; Ball, C.; Weng, S.; Juvik, G.; Schmidt, R.; Adler, C.; Dunn, B.; Dwight, S.; Riles, L.; Mortimer, R. K.; Botstein, D. Genetic and physical maps of Saccharomyces cerevisiae. Nature 1997, 387, 67–73.
(18) Dubchak, I.; Muchnik, I.; Holbrook, S. R.; Kim, S. H. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. U.S.A. 1995, 92, 8700–8704.


(19) Dubchak, I.; Muchnik, I.; Mayor, C.; Dralyuk, I.; Kim, S. H. Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins 1999, 35, 401–407.
(20) Cheng, J.; Randall, A. Z.; Sweredoski, M. J.; Baldi, P. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 2005, 33, W72–W76.
(21) Frishman, D.; Argos, P. Seventy-five percent accuracy in protein secondary structure prediction. Proteins 1997, 27 (3), 329–335.
(22) Shi, C.; Kaminskyj, S.; Caldwell, S.; Loewen, M. C. A role for a complex between activated G protein-coupled receptors in yeast cellular mating. Proc. Natl. Acad. Sci. U.S.A. 2007, 104 (13), 5395–5400.
(23) Ortiz, J.; Stemmann, O.; Rank, S.; Lechner, J. A putative protein complex consisting of Ctf19, Mcm21, and Okp1 represents a missing link in the budding yeast kinetochore. Genes Dev. 1999, 13 (9), 1140–1155.
(24) Kraynack, B. A.; Chan, A.; Rosenthal, E.; Essid, M.; Umansky, B.; Waters, M. G.; Schmitt, H. D. Dsl1p, Tip20p, and the novel Dsl3(Sec39) protein are required for the stability of the Q/t-SNARE complex at the endoplasmic reticulum in yeast. Mol. Biol. Cell 2005, 16 (9), 3963–3977.
(25) Chou, K. C.; Cai, Y. D. Predicting Protein-Protein Interactions from Sequences in a Hybridization Space. J. Proteome Res. 2006, 5, 316–322.
(26) Ashburner, M. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000, 25, 25–29.
(27) Chou, J. J.; Li, H.; Salvessen, G. S.; Yuan, J.; Wagner, G. Solution structure of BID, an intracellular amplifier of apoptotic signaling. Cell 1999, 96, 615–624.
(28) Oxenoid, K.; Chou, J. J. The structure of phospholamban pentamer reveals a channel-like architecture in membranes. Proc. Natl. Acad. Sci. U.S.A. 2005, 102, 10870–10875.


(29) Camon, E.; Magrane, M.; Barrell, D.; Binns, D.; Fleischmann, W.; Kersey, P.; Mulder, N.; Oinn, T.; Maslen, J.; Cox, A.; Apweiler, R. The gene ontology annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 2003, 13, 662–672.
(30) Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
(31) Salzberg, S.; Cost, S. Predicting protein secondary structure with a nearest-neighbor algorithm. J. Mol. Biol. 1992, 227, 371–374.
(32) Cai, Y. D.; Qian, Z. L.; Lu, L.; Feng, K. Y.; Meng, X.; Niu, B.; Zhao, G. D.; Lu, W. C. Prediction of compounds' biological function (metabolic pathways) based on functional group composition. Mol. Diversity 2008, 12, 131–137.
(33) Qian, Z. L.; Lu, L. L.; Liu, X. J.; Cai, Y. D.; Li, Y. X. An approach to predict transcription factor DNA binding site specificity based upon gene and transcription factor functional categorization. Bioinformatics 2007, 23, 2449–2454.
(34) Salamov, A. A.; Solovyev, V. V. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithms and Multiple Sequence Alignments. J. Mol. Biol. 1995, 247, 11–15.
(35) Yi, T. M.; Lander, E. S. Protein secondary structure prediction using nearest-neighbor methods. J. Mol. Biol. 1993, 232, 1117–1129.
(36) Kim, S. Protein β-turn prediction using nearest-neighbor method. Bioinformatics 2004, 20, 40–44.
(37) Chou, K. C.; Zhang, C. T. Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 1995, 30 (4), 275–349.
