Prediction of Membrane Protein Types in a Hybrid Space

Peilin Jia,‡,§ Ziliang Qian,‡,§ Kaiyan Feng,# Wencong Lu,∇,O Yixue Li,*,§,4,⊥ and Yudong Cai*,†

CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, China, Graduate School of the Chinese Academy of Sciences, 19 Yuquan Road, Beijing 100039, China, Bioinformatics Center, Key Lab of Molecular Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China, Shanghai Center for Bioinformation Technology, 100 Qinzhou Road, 200235 Shanghai, China, College of Life Science & Biotechnology, Shanghai Jiao Tong University, Division of Imaging Science & Biomedical Engineering, Room G424 Stopford Building, The University of Manchester, M13 9PT, United Kingdom, Department of Chemistry, College of Sciences, Shanghai University, 99 Shang-Da Road, Shanghai 200444, China, and School of Materials Science and Engineering, Shanghai University, 149 Yan-Chang Road, Shanghai 200444, China

Received November 8, 2007

Prediction of the types of membrane proteins is of great importance both for genome-wide annotation and for experimental researchers seeking to understand proteins’ functions. We describe a new strategy for predicting the types of membrane proteins using the Nearest Neighbor Algorithm. We introduce a bipartite feature space consisting of two kinds of disjoint vectors: proteins’ domain profiles and proteins’ physiochemical characters. A jackknife cross-validation test shows that combining both kinds of features greatly improves the prediction accuracy. Furthermore, the contribution of the physiochemical features to the classification of membrane proteins has been explored using the feature selection method called “mRMR” (Minimum Redundancy, Maximum Relevance) (IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27 (8), 1226-1238). A more compact set of the features that contribute most to membrane protein classification is obtained. The analyses highlight both hydrophobicity and polarity as the most important features. The predictor with the 56 most contributive features achieves an acceptable prediction accuracy of 87.02%. An online prediction service is freely available on our Web site, http://pcal.biosino.org/TransmembraneProteinClassification.html.

Keywords: Membrane Protein • Nearest Neighbor Algorithm • Feature Selection

* To whom correspondence should be addressed. E-mails: (Yixue Li) yxli@sibs.ac.cn; (Yudong Cai) [email protected].
‡ Graduate School of the Chinese Academy of Sciences.
§ Key Lab of Molecular Systems Biology, Shanghai Institutes for Biological Sciences.
# The University of Manchester.
O School of Materials Science and Engineering, Shanghai University.
4 Shanghai Center for Bioinformation Technology.
⊥ Shanghai Jiao Tong University.
∇ Department of Chemistry, College of Sciences, Shanghai University.
† CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences.
10.1021/pr700715c CCC: $40.75 © 2008 American Chemical Society. Journal of Proteome Research 2008, 7, 1131–1137. Published on Web 02/09/2008.

Introduction

Membrane proteins play various roles in biological cells; for example, some work as pumps or channels that transport molecules into and/or out of cells, and some provide the skeleton for the lipid bilayer membranes. On the basis of their function and topology, membrane proteins can be divided into the following six types: (1) single-pass type I membrane; (2) single-pass type II membrane; (3) multipass membrane protein; (4) cell membrane, lipid-anchor; (5) cell membrane, GPI-anchor; (6) peripheral membrane. Knowledge of membrane protein types often provides great help for understanding their function. However, determining this information experimentally is difficult, either because of the intrinsic biochemical properties of membrane proteins or because of the ever-growing body of new proteins. As a result, an effective algorithm for predicting membrane proteins’ types is of great help in understanding their functions.

The past decade has seen many studies on identifying the types of membrane proteins using bioinformatics technology. Anders Krogh predicted the topology of membrane proteins using a hidden Markov model.2 Cai, Y. D. and Chou, K. C. carried out a series of studies aimed at improving the prediction accuracy of membrane proteins’ types.3–6 Recently, Xiao-Guang Yang also used amino acid and peptide composition to predict membrane protein types.7 However, few works have explored the relationship between physiochemical features or domains and the types of membrane proteins, or have identified the informative features that contribute most to discriminating membrane protein types.

Feature selection has played an important role in the investigation of machine learning and data mining problems. It can effectively combat the curse of dimensionality, improve a learning machine’s generalization, relieve the computational burden, and help accomplish a data mining task. Feature selection has been widely applied in the field of microarray data analysis, which operates in high-dimensional feature spaces. Yet, few works have introduced feature selection into membrane protein discrimination to highlight the important features or to improve the prediction accuracy. Recently, Keun-Joon Park performed feature selection in work on the discrimination of outer membrane proteins.8 However, deeper application and more detailed analysis are still needed for the problem of membrane proteins.

In this work, we first classified the six types of membrane proteins on a manually constructed data set, working in a bipartite space containing domain profiles and physiochemical characters. After that, to further improve the prediction accuracy and to elucidate the most important features, we incorporated a previously established algorithm, mRMR (Minimum Redundancy, Maximum Relevance),1 plus our own forward feature selection strategy, to perform feature selection on our working data set. The algorithm we used is the nearest neighbor algorithm (NNA), and the jackknife test was performed throughout our work. The total accuracy reached 87.02%. A web server has also been provided at http://pcal.biosino.org/TransmembraneProteinClassification.html.

Table 1. Description of the Data Set

membrane protein type                    total
I: single-pass type I membrane             576
II: single-pass type II membrane           254
III: multipass membrane protein           2727
IV: membrane protein; lipid-anchor         101
V: membrane protein; GPI-anchor            168
VI: peripheral membrane protein            712
Total                                     4538

Materials and Methods

1. Data Set. 1.1. Data Collection. We manually constructed our working data set from Swiss-Prot (http://cn.expasy.org/, release 51.2), mainly according to the annotation line “subcellular location”.9 Only Swiss-Prot entries carrying the “subcellular location” annotation were kept. Our strategy consists of a preliminary classification, done automatically by running a string-match script against a list of suggested keywords, followed by manual verification of each annotation entry. A detailed explanation of how the data set was constructed can be found in our previous work.10

1.2. Removal of Highly Similar Sequences. We then removed sequences with a high degree of similarity using the program CD-HIT,11 which executes an all-to-all check to generate a nonredundant data set. With the cutoff value set to 0.4, CD-HIT returned a nonredundant data set in which no two proteins share more than 40% similarity. We excluded sequences containing the amino acid codes B or X, as well as sequences longer than 5000 or shorter than 50 residues. The resulting data set is shown in Table 1, with a detailed name list provided in the Supporting Information.

2. Method. The nearest neighbor algorithm (NNA) is a special case of k-nearest neighbor where k = 1. The basic idea is to classify objects based on the closest training samples in the feature space. NNA stands out for being easy to execute and for placing no restriction on the distribution of the training data.12,13
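The nearest neighbor rule, combined with the jackknife test and the cosine-based distance that the text defines in eq 2, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy data are hypothetical, and the handling of an all-zero vector (distance 1) is an assumption the paper does not discuss.

```python
import math

def cosine_distance(p, q):
    """D(p, q) = 1 - (p . q) / (||p|| * ||q||), following eq 2."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    # Assumption: a zero vector has no direction, so treat it as distance 1.
    return 1.0 if norm == 0.0 else 1.0 - dot / norm

def nearest_neighbor_label(query_index, vectors, labels):
    """Label of the closest sample other than the query itself (x != k)."""
    best_index = min(
        (k for k in range(len(vectors)) if k != query_index),
        key=lambda k: cosine_distance(vectors[query_index], vectors[k]),
    )
    return labels[best_index]

def jackknife_accuracy(vectors, labels):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    hits = sum(
        nearest_neighbor_label(i, vectors, labels) == labels[i]
        for i in range(len(vectors))
    )
    return hits / len(vectors)

# Toy data (hypothetical, not taken from the paper's data set).
toy_vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.8, 0.2]]
toy_labels = ["I", "I", "II", "II"]
print(jackknife_accuracy(toy_vectors, toy_labels))
```

Excluding the query from its own neighbor search is what turns the plain 1-nearest-neighbor rule into a jackknife test: every protein is predicted only from the remaining N - 1 proteins.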


The training samples are denoted as vectors distributed in a multidimensional feature space, which is partitioned into regions according to the labels of the training data set. According to NNA, a test sample, denoted by $P_x$, is assigned to class c, the class of $P_k$, where $P_k$ satisfies eq 1:

$$D(P_x, P_k) = \mathrm{Min}\{D(P_x, P_1), D(P_x, P_2), \ldots, D(P_x, P_N)\}, \quad (x \neq k) \tag{1}$$

N is the sample size of the whole data set. The operator Min means taking the smallest value from those in the brackets.13 In our case, we defined the distance as:

$$D(P_x, P_i) = 1 - \frac{P_x \cdot P_i}{\|P_x\| \cdot \|P_i\|}, \quad (i = 1, 2, \ldots, N) \tag{2}$$

where $P_x \cdot P_i$ is the dot product of vectors $P_x$ and $P_i$, with $\|P_x\|$ and $\|P_i\|$ as their moduli, respectively. The distance always lies between 0 and 2, and the smaller the distance, the more similar $P_x$ and $P_i$ are. In particular, when $P_x \equiv P_i$, then $D(P_x, P_i) = 0$.

3. Construction of the Hybrid Vector Space: Domain Profile Space and Sequence-Based Physiochemical Parameter Space. Two kinds of feature spaces have been constructed for a subject sequence, based on either its domain profile or its physiochemical characters. The domain profile space is constructed from all the Pfam domains14 that the training data set covers, each Pfam entry serving as one dimension. A protein in this space is numerically represented by

$$P = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_i \\ \vdots \\ a_D \end{bmatrix} \tag{3}$$

where $a_i = 1$ if the corresponding Pfam entry appears in the protein sequence and $a_i = 0$ if the protein does not have this Pfam domain. D is the total number of Pfam entries covered by our data set; in this case, D = 671.

In the physiochemical space, the feature vector of a protein is calculated from the following seven residue properties: sequence amino acid composition, hydrophobicity, predicted secondary structure, solvent accessibility, normalized van der Waals volume, polarity, and polarizability. Amino acid composition measures the percentage of each of the 20 amino acids in the whole sequence. Secondary structure is classified into three groups (helix, strand, and coil) using the prediction tool Predator.15 Solvent accessibility is represented by two groups, buried or exposed, as predicted by PredAcc.16 Analogously, for hydrophobicity, normalized van der Waals volume, polarity, and polarizability, amino acid residues are divided into three groups according to their value range.17

Three descriptors are constructed to describe the global character of the properties presented above,17 namely, composition (C), transition (T), and distribution (D). C represents the percentage of amino acids belonging to a certain property group. T indicates the frequency with which an amino acid of one group is followed by an amino acid of a different group. Five specific sites are defined to describe the distribution of a property along the whole protein sequence: the positions at which the first, 25%, 50%, 75%, and 100% of the amino acids of a certain property group are located. D measures the distribution of the properties by expressing the positions of these five sites as a percentage of the whole sequence. Therefore, the amino acid composition contributes 20 dimensions to the whole vector space and the solvent accessibility 7 dimensions, while each of the other five properties contributes 21 dimensions (Table 2). Thus, the vector space constructed in this way has (5 × 21) + 20 + 7 = 132 dimensions, and a protein defined in this space is represented by a 132D feature vector (please see Supporting Information). This kind of descriptor has been used successfully in several other protein analysis problems, such as protein–protein interaction18 and protein secondary structure prediction.17 Readers can consult a series of previous studies for more detail.19,20

Table 2. Physicochemical Feature Vectors and Their Dimensionality

                                              number of vectors (a)
property                                      C      T      D    total
Hydrophobicity (HY)                           3      3     15      21
Secondary structure (SS)                      3      3     15      21
Solvent accessibility (SA)                    1      1      5       7
Normalized van der Waals volume (VD)          3      3     15      21
Polarity (PR)                                 3      3     15      21
Polarizability (PZ)                           3      3     15      21
Amino Acids Composition (CM)                        20             20

(a) C, composition; T, transition; D, distribution.

Our strategy tries to describe a protein by its domain profile. If a protein has no domain defined in the Pfam database, it is referred to the physiochemical space. In addition, two special cases are treated exceptionally. First, if a protein contains a unique Pfam entry that appears in no other protein of the data set, it is moved to the physiochemical space. This is because, in a vector space, if all the values of a feature are 0 except for one point (one protein, in our case), the presence of this feature provides no information for the prediction of any other protein, and vice versa. In this case, removing the feature is good for dimension reduction, with only a slight influence on the prediction accuracy. Second, proteins with identical Pfam profiles but assigned to different classes have also been moved out of the domain space, because these double assignments would complicate the prediction. That is, these two kinds of proteins, as well as those with no Pfam definitions, are all handled together in the physiochemical space.

4. mRMR and Forward Feature Selection. To choose the best feature combination for a learning machine, ideally all possible feature combinations should be evaluated.
However, this exhaustive search is only practical for data with small feature sets, since it requires too much computation if many features are present. Data with huge feature sets need a different approach to approximate the best feature combination. An intuitive approach is to add one feature at a time, each time choosing the feature that is best among all the one-addition combinations, until a reasonable solution is reached. However, if thousands of features need to be analyzed, this forward/backward search becomes inappropriate because of its computational cost. In Peng’s work,1 an mRMR filter is first used to preselect an estimated optimized subset from the whole feature set, and a forward/backward search is then applied to this much smaller subset to obtain the final optimized feature set.

A good estimated optimized feature set should be chosen to maximize its relevance, that is, each feature in the chosen set should contribute well to the classification of the data, and to minimize its redundancy, that is, the chosen features should not correlate much with each other. If a feature by itself cannot separate the classes, it should be excluded, because it will not do its job, not even when it is combined with other features. If one feature is highly correlated with another feature (the extreme case being two identical features), one of them should be excluded, as it adds an extra dimension without adding much useful information.

The mRMR filter is one of the methods that try to balance minimum redundancy and maximum relevance. It tries to select an optimized feature set by examining the mutual information between the features themselves and between the features and the class variable. The mutual information between the features reveals how strongly one feature is related to the others. Including a feature that relates little to the others minimizes the redundancy. The mutual information between a feature and the class variable reveals how strongly the feature is related to the class variable. The more strongly it is related to the class variable, the better it is likely to contribute to the classification. Including a feature that strongly relates to the class variable maximizes the relevance. After mRMR has been applied to preselect an estimated optimized feature set, a refinement using forward/backward feature selection is required to further optimize the feature set.

A detailed description of mRMR is given below. Given a data set with n rows (samples) and m columns (features), $\Omega$, $S$, and $\Omega_S$ are used to represent the whole feature set, the selected features, and the nonselected features, respectively, and the class variable is called c (class). Thus, $\Omega = \{x_i, i = 1, \ldots, m\}$.
(1) Calculating the Score of Features’ Relevance and Redundancy.21 For categorical features, the mutual information I between two variables x and y is defined based on their joint probabilistic density function p(x,y) and the respective marginal probabilistic density functions p(x) and p(y):

$$I(x, y) = \sum_{i,j \in N} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \tag{4}$$

where i and j are two samples in the data set. If continuous features exist in the feature set, they must be discretized before the mutual information is calculated. In this work, we chose a threshold of 1.0 in data discretization, and each continuous feature was changed into a 3-state categorical feature, taking “mean ± standard deviation” as the boundaries of the 3 states. To measure the discrimination power of the features, the mutual information $I(c, x_i)$ between the classification variable c and the independent variable $x_i$ is calculated. Thus, $I(c, x_i)$ quantifies the relevance of $x_i$ for the classification task. The maximum relevance condition is to maximize the total relevance of all variables in S:

$$\max(D), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(c, x_i) \tag{5}$$

where D is the mean value of the mutual information between the target variable c and the features in S, and |S| is the number of features in S. Mutual information is again used to measure the level of “similarity” between variables in S. The redundancy is calculated by the following formula:

$$\min(R), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j) \tag{6}$$


Table 3. Composition of the Data Set

membrane protein type                        part A            part B                   total
                                             (domain space)    (physiochemical space)
I: single-pass type I membrane protein          348               228                     576
II: single-pass type II membrane protein        133               121                     254
III: multipass membrane protein                2086               641                    2727
IV: lipid-anchor membrane protein                19                82                     101
V: GPI-anchor membrane protein                   72                96                     168
VI: peripheral membrane protein                 306               406                     712
Total                                          2964              1574                    4538

where $I(x_i, x_j)$ is the mutual information between $x_i$ and $x_j$ in S, and R is the resulting redundancy score.

(2) The Combination Criterion. $\nabla_{MID}$ (mutual information difference) is defined to combine the maximum relevance and the minimum redundancy, and can be expressed as:

$$\max(\nabla_{MID}), \quad \nabla_{MID} = D - R \tag{7}$$
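As an illustration of eqs 4 to 7, the following minimal Python sketch discretizes a continuous feature into three states and computes the mutual information and the D - R score for a set of selected features. It is a sketch under stated assumptions rather than the authors' implementation: the function names and the state codes (-1, 0, 1) are hypothetical, probabilities are estimated as empirical frequencies, the population standard deviation is used, and the natural logarithm is assumed because the text does not fix the base.

```python
import math
from collections import Counter

def discretize(values, threshold=1.0):
    """Map a continuous feature to 3 states using mean +/- threshold * std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    low, high = mean - threshold * std, mean + threshold * std
    return [-1 if v < low else (1 if v > high else 0) for v in values]

def mutual_information(x, y):
    """I(x, y) of two categorical sequences from empirical frequencies (eq 4)."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (count / n) * math.log((count / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), count in joint.items()
    )

def mid_score(selected_columns, classes):
    """D - R for a selected feature set S given as a list of columns (eqs 5-7)."""
    size = len(selected_columns)
    relevance = sum(mutual_information(classes, col) for col in selected_columns) / size
    redundancy = sum(
        mutual_information(ci, cj)
        for ci in selected_columns
        for cj in selected_columns
    ) / size ** 2
    return relevance - redundancy
```

For example, `discretize([0, 0, 0, 0, 10])` has mean 2 and standard deviation 4, so only the last value exceeds the upper boundary of 6 and falls into the high state.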

In practice, suppose we already have $S_{m-1}$, the feature set with m - 1 features. The task is to select the mth feature from the set $\Omega_S$. This is done by selecting the feature that maximizes $\nabla_{MID}$. The respective incremental algorithm optimizes the following condition:

$$\max_{x_j \in (\Omega - S_{m-1})} \left[ I(x_j, c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j, x_i) \right] \tag{8}$$
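The incremental condition of eq 8 can be sketched as a greedy ranking loop. The code below is a hypothetical illustration, not the mRMR program used in the paper: `mrmr_rank` and its arguments are names introduced here, features are assumed to be already discretized, and mutual information is estimated from empirical frequencies with the natural logarithm.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information of two categorical sequences (eq 4)."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (count / n) * math.log((count / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), count in joint.items()
    )

def mrmr_rank(columns, classes, n_select):
    """Return feature indices in the order chosen by the MID criterion."""
    relevance = [mutual_information(col, classes) for col in columns]
    # First feature: the one with the highest relevance I(c, x_i).
    selected = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(selected) < min(n_select, len(columns)):
        remaining = [j for j in range(len(columns)) if j not in selected]

        def gain(j):
            # Relevance minus mean redundancy with the features chosen so far.
            redundancy = sum(
                mutual_information(columns[j], columns[i]) for i in selected
            ) / len(selected)
            return relevance[j] - redundancy

        selected.append(max(remaining, key=gain))
    return selected

# Toy example (hypothetical): feature 1 duplicates feature 0,
# while feature 2 carries complementary information about the class.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
classes = [0, 1, 2, 3]
print(mrmr_rank([a, list(a), b], classes, 3))
```

In the toy example the duplicated feature is ranked last: its relevance is cancelled by its redundancy with the feature already selected, which is the behavior the criterion is designed to produce.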

The mRMR includes two steps:

Step 1. The first feature is selected according to eq 5, that is, the feature with the highest $I(c, x_i)$.

Step 2. From the feature set $\Omega_S$ (all features except those already selected), the jth feature $x_j$ that satisfies condition (8) is added to the feature subset S.

The mRMR algorithm results in m sequential feature subsets $S_1, S_2, S_3, \ldots, S_m$.

(3) Selecting the Compact (Optimal) Feature Subset.1 To find the optimal feature subset, the cross-validation (CV, such as the jackknife test) error of each $S_i$ (i = 1, ..., m) was calculated, and a relatively stable CV error range k (1 ≤ k ≤ m) was found. The forward search method is outlined as follows:

Step 1. Calculate the CV error: for the feature subset $S_i$ (i = 1, ..., k), the CV error of every subset $\{S_i + x_j\}$ ($x_j \in \Omega_S$) is calculated.

Step 2. Feature inclusion: the feature $x_j$ is added to $S_i$ (i = 1, ..., k) when the CV error of the subset $\{S_i + x_j\}$ is the smallest one. If the CV error of the subset $\{S_i + x_j\}$ is bigger than that of $S_i$, the feature selection is stopped; otherwise, go to Step 1.

The compact feature subset can then be extracted from the candidate feature set.
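The forward search of part (3) can be sketched as follows. This is a minimal illustration under stated assumptions: `forward_select` is a hypothetical name, the CV error is supplied by the caller as a function of the feature subset (in the paper it would come from the jackknife test with NNA), and the toy error function and feature names below are invented purely to make the example runnable.

```python
def forward_select(initial, candidates, cv_error):
    """Greedy forward search: add the candidate with the smallest CV error,
    and stop as soon as the best addition makes the error bigger."""
    selected = list(initial)
    remaining = list(candidates)
    best_error = cv_error(selected)
    while remaining:
        # Step 1: CV error of every one-feature extension of the current subset.
        trial_errors = {f: cv_error(selected + [f]) for f in remaining}
        best_feature = min(remaining, key=trial_errors.get)
        # Step 2: stop if even the best extension increases the error.
        if trial_errors[best_feature] > best_error:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_error = trial_errors[best_feature]
    return selected, best_error

def toy_cv_error(subset):
    """Hypothetical error count: useful features lower it, others raise it."""
    useful = {"hydrophobicity", "polarity"}
    hits = len(useful & set(subset))
    noise = len(set(subset) - useful)
    return 50 - 10 * hits + 5 * noise

subset, error = forward_select(
    ["composition"], ["volume", "hydrophobicity", "polarity"], toy_cv_error
)
print(subset, error)
```

Passing the error function as a parameter keeps the search independent of the classifier, so the same loop could be driven by any cross-validation scheme.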

Results

1. Results Using All Features. As shown in Table 3, part A of the data set contains 2964 proteins covering 671 Pfam entries in total and is therefore handled in a domain space of 671 dimensions. Part B of the data set consists of 1574 proteins and is predicted in the 132D feature space of physiochemical features.


At first, we performed the jackknife test on each part of the data set in the corresponding vector space using all defined features. Results are shown in Table 4. For the total prediction accuracy over the six types, the domain space reaches 97.64% and the physiochemical space 60.93%, which combine to 84.91% overall.

2. Analyzing and Exploring the Relationship in Each Feature Space. To select the most informative feature combinations, we applied mRMR followed by forward feature selection to each of the vector spaces. mRMR was first used, through its “MaxRel” (Maximum Relevance) part, to show how strongly individual features are related to the task of classifying membrane protein types. Then mRMR was used as a macroscopic control, through its “mRMR” (Minimum Redundancy) part, showing the prediction potential of the sorted features on a more general scale. If the prediction accuracy keeps increasing along the mRMR list of features, there is no need for further reduction. Otherwise, forward feature selection is introduced to check the features one by one within the region where the accuracy is unstable.

2.1. Domain Profile Space. To find the Pfam entries that contribute most to the prediction, we performed mRMR on part A of the data set. Since the maximum number of features that mRMR can sort is 500 and the total dimension of the Pfam space is 671, only a part of the Pfam entries has been sorted (results are presented in Supporting Information). Figure 1 shows the prediction accuracy of 25 subsets of these 500 candidates, where 20 features are added each time according to the list sorted by the “minimum redundancy” part of mRMR. The steadily growing accuracy of these feature combinations indicates that there is little redundancy in the space of Pfam entries; thus, there is no need to perform feature selection.

2.2. Physiochemical Space. 2.2.1. Correlation between Features and Membrane Protein Types.
The MaxRel result computed by mRMR ranks individual features according to the extent to which they are related to the classification problem. Figure 2 shows the distribution of the seven physiochemical characters as ranked by MaxRel (please see Supporting Information for detailed information). In Figure 2, hydrophobicity and polarity rank high, and some of their features occupy the most important positions in the ranking of the 132D space. As for hydrophobicity, it is already known that membrane proteins are associated with bilayer membranes mainly through hydrophobicity-driven interactions between the transmembrane segments and the bilayer interface.22 For polarity, previous experiments have shown that most of the amino acid side chains of the transmembrane segments must be nonpolar, and that if a polar side chain occurs, it often participates in an H-bond so as to be consistent with the nonpolar environment.23,24 That is, these two characters play important roles during the formation of transmembrane proteins as well as in maintaining the interaction between a transmembrane protein and the bilayer membrane. MaxRel successfully highlighted these two important characters, hydrophobicity and polarity, which supports the rationality of our vector construction for predicting the types of membrane proteins.

2.2.2. Construction of a Compact Set of Nonredundant Features. The result computed by the “minimum redundancy” part of mRMR (please see Supporting Information for detailed information) is based on the principle of minimum redundancy and provides a framework for further feature selection. To either condense the vector space or improve the prediction power, we then performed forward feature selection based on the list produced by mRMR. The 132D data set was input into mRMR. A curve showing the accuracy tendency (Figure 3) was generated by iteratively adding features according to the order of the 132 sorted features. The calculation of the curve starts with the two features scored highest on the mRMR list, followed by 10 features added each time. A jackknife test with the nearest neighbor algorithm was performed for each of these feature subsets. As shown in Figure 3, an obvious increase can be observed from V1 to V32, coupled with an unstable region from V22 to V102 in which the accuracy somewhat decreases. We therefore decided to search this region again by performing forward feature selection one feature at a time. Figure 4 shows the results of forward feature selection on the region from V22 to V102. The highest prediction accuracy, 67.03%, appears when the 56th feature has been added, which is also higher than the 60.93% obtained using all 132 features. So we finally chose these 56 features to perform the prediction.

Table 4. Performance of the Hybrid Vector Space and Feature Selection Strategy

                     part A                       part B                       total
            correct  total  accuracy     correct  total  accuracy     correct  total  accuracy
Before Feature Selection
Type I         338    348    0.9713         142    228    0.6228         480    576    0.8333
Type II        129    133    0.9699          33    121    0.2727         162    254    0.6378
Type III      2074   2086    0.9942         506    641    0.7894        2580   2727    0.9461
Type IV         14     19    0.7368          23     82    0.2805          37    101    0.3663
Type V          67     72    0.9306          55     96    0.5729         122    168    0.7262
Type VI        272    306    0.8889         200    406    0.4926         472    712    0.6629
Total         2894   2964    0.9764         959   1574    0.6093        3853   4538    0.8491
After Feature Selection
Type I         338    348    0.9713         143    228    0.6272         481    576    0.8351
Type II        129    133    0.9699          35    121    0.2893         164    254    0.6457
Type III      2074   2086    0.9942         529    641    0.8253        2603   2727    0.9545
Type IV         14     19    0.7368          30     82    0.3659          44    101    0.4356
Type V          67     72    0.9306          62     96    0.6458         129    168    0.7679
Type VI        272    306    0.8889         256    406    0.6305         528    712    0.7416
Total         2894   2964    0.9764        1055   1574    0.6703        3949   4538    0.8702
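As a check on how the overall figures reported in Table 4 fit together, the hybrid accuracy appears to be the pooled fraction of correctly predicted proteins over part A (domain space) and part B (physiochemical space). Using the counts from Table 4, and assuming this pooling is how the totals were obtained:

$$\text{before feature selection: } \frac{2894 + 959}{2964 + 1574} = \frac{3853}{4538} \approx 84.91\%$$

$$\text{after feature selection: } \frac{2894 + 1055}{2964 + 1574} = \frac{3949}{4538} \approx 87.02\%$$

The gain of about 2.1 percentage points therefore comes entirely from part B, where the number of correct predictions rises from 959 to 1055 while part A is unchanged.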

Discussion

1. Domain Distribution and mRMR Results Indicate the Effectiveness and Little Redundancy in the Domain Space. Domains are the basic units through which proteins execute their function and are intimately correlated with their classification. Domain profiles have been used successfully in many areas of protein function classification, such as subcellular location prediction,10 quaternary structure discrimination,25 and protein N-glycosylation.26 Here, we introduced the domain profile, which also proves effective in predicting the types of membrane proteins. Figure 5 shows the distribution of the Pfam entries against the six types of membrane proteins. A high percentage, 87.93% (590/671), of the Pfam entries are specifically distributed in only one type of membrane protein, which provides a reasonable explanation for the effectiveness of using the domain profile to predict subtypes of membrane proteins. After validating the applicability and the high efficiency of the domain space, we next asked whether there is redundancy among these Pfam entries, or whether any subset of them can achieve an accuracy no lower than that of the whole set. However, the steadily growing accuracy trend shown in Figure 1 indicates that little redundancy, if any, exists among these Pfam entries. Since we removed the orphan proteins and multiply defined records before we started the prediction, this result is not surprising. Finally, we defined the domain profile space keeping all the related Pfam entries.

2. Forward Feature Selection Both Highlights Important Vectors and Improves Prediction Accuracy in Physiochemical Space. In the physiochemical space, the correlation result highlighted important physical or chemical characters of protein sequences that contribute greatly to the prediction of membrane proteins. Nevertheless, there are possible redundancies among these important characters. The feature selection described above not only reduces the 132D space to an essential subset of 56D, but also improves the total prediction accuracy in this space from 60.93% to 67.03% (see Table 4). We finally chose the condensed subset containing 56 features to construct the physiochemical space.

3. Hybridizing Both Vector Spaces Provides a Powerful Tool for the Prediction of Membrane Proteins. Results from both spaces indicate that the domain-profile-based space performs better than the physiochemical one. However, because the annotation in the related databases is incomplete, not all proteins can be handled in this space. Fortunately, the physiochemical space, although it does not perform very well, requires only the sequence of a protein to perform the prediction and can be applied to any protein whose sequence is known. Thus, the latter provides a necessary complement to the domain-based space. We finally integrated the two subspaces to construct the classifier and the web server.

4. Comparing with BLAST. A homology-based strategy has also been performed for comparison. The strategy consists of running BLAST27 against the whole data set, following the logic that the type of a membrane protein can be assigned in terms of its homologous counterparts with known types. Each membrane protein was queried against all the other proteins by BLAST27 and was then predicted to be of the same type as its most homologous protein. The prediction results are presented in Table 5; the total accuracy is 85.17%. Compared with this homology-based search strategy, our hybrid space procedure shows weaker performance (84.91%) before feature selection but outperforms BLAST (87.02%) after feature selection, which shows the rationality of our strategy.

Figure 1. Accuracy tendency of the domain space. The accuracy was calculated using NNA and the jackknife test, adding 20 features each time according to the first 500 features sorted by mRMR (the minimum redundancy part).

Figure 2. Distribution of the 132 physiochemical features sorted by mRMR (the maximum relevance part). The X-axis is the site of each character after sorting. The Y-axis denotes each of the seven physiochemical characters as follows: 1, hydrophobicity; 2, secondary structure; 3, solvent accessibility; 4, normalized van der Waals volume; 5, polarity; 6, polarizability; 7, amino acid composition.

Figure 3. Accuracy tendency from the physiochemical space. The accuracy was calculated using NNA and the jackknife test, adding 10 dimensions each time according to the first 500 features sorted by mRMR (the minimum redundancy part).

Figure 4. Accuracy tendency of feature selection. The first 22 dimensions were decided by mRMR, and V23∼V102 in the picture were re-sorted by forward feature selection. The highest accuracy appears at V56.

Figure 5. Distribution of the Pfam entries against the six types of membrane proteins. Each color describes one type. The types of membrane proteins are distributed on a circle; outwardly radiating lines denote Pfam entries specific to one type, and inwardly radiating lines denote Pfam entries occurring in more than one type of membrane protein.

Table 5. Performance of Homology-Based Prediction by BLAST

              true    prediction    accuracy
Type I         576        505        0.8767
Type II        254        186        0.7323
Type III      2727       2553        0.9362
Type IV        101         32        0.3168
Type V         168        105        0.625
Type VI        712        484        0.6798
Total         4538       3865        0.8517


Conclusion

We predicted the types of membrane proteins and carried out a detailed analysis of the relationship between a protein's domain profile or physiochemical characters and its membrane protein type. Our results show that a protein's domain profile is important and effective for predicting its membrane protein type, and that both polarity and hydrophobicity are important for the prediction. After feature selection, combining the domain profile and physiochemical vectors gives better results than BLAST (87.02% versus 85.17%), which indicates the strong performance of our prediction method for membrane protein types. Feature selection was also used to construct a nonredundant and effective classifier. Finally, a classifier based on our method has been developed and made available on the Web site: http://pcal.biosino.org/TransmembraneProteinClassification.html.

Acknowledgment. This work is supported by the National Basic Research Program of China (2006CB910700, 2004CB720103, 2004CB518606, 2003CB715901) and the National High-Tech R&D Program (863): 2006AA02Z334.

Supporting Information Available: Tables listing the detailed information of the data set, domain.mRMR features, physiochemical.MaxRel features, physiochemical.mRMR features, and physiochemical feature vectors. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27 (8), 1226–1238.
(2) Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E. L. L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 2001, 305 (3), 567–580.
(3) Cai, Y. D.; Zhou, G. P.; Chou, K. C. Support vector machines for predicting membrane protein types by using functional domain composition. Biophys. J. 2003, 84 (5), 3257–3263.
(4) Cai, Y. D.; Chou, K. C. Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. J. Theor. Biol. 2006, 238 (2), 395–400.
(5) Chou, K.-C.; Cai, Y.-D. Using GO-PseAA predictor to identify membrane proteins and their types. Biochem. Biophys. Res. Commun. 2005, 327 (3), 845–847.
(6) Wang, S.-Q.; Yang, J.; Chou, K.-C. Using stacked generalization to predict membrane protein types based on pseudo-amino acid composition. J. Theor. Biol. 2006, 242 (4), 941–946.
(7) Yang, X.-G.; Luo, R.-Y.; Feng, Z.-P. Using amino acid and peptide composition to predict membrane protein types. Biochem. Biophys. Res. Commun. 2007, 353 (1), 164–169.

(8) Park, K.-J.; Gromiha, M. M.; Horton, P.; Suwa, M. Discrimination of outer membrane proteins using support vector machines. Bioinformatics 2005, 21 (23), 4223–4229.
(9) Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M.-C.; Estreicher, A.; Gasteiger, E.; Martin, M. J.; Michoud, K.; O’Donovan, C.; Phan, I.; Pilbout, S.; Schneider, M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31 (1), 365–370.
(10) Jia, P.; Qian, Z.; Zeng, Z.; Cai, Y.; Li, Y. Prediction of subcellular protein localization based on functional domain composition. Biochem. Biophys. Res. Commun. 2007, 357 (2), 366–370.
(11) Li, W.; Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22 (13), 1658–1659.
(12) Cai, Y. D.; Chou, K. C. Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem. Biophys. Res. Commun. 2003, 305 (2), 407–411.
(13) Cai, Y. D.; Doig, A. J. Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition. Bioinformatics 2004, 20 (8), 1292–1300.
(14) Finn, R. D.; Mistry, J.; Schuster-Bockler, B.; Griffiths-Jones, S.; Hollich, V.; Lassmann, T.; Moxon, S.; Marshall, M.; Khanna, A.; Durbin, R.; Eddy, S. R.; Sonnhammer, E. L. L.; Bateman, A. Pfam: clans, web tools and services. Nucleic Acids Res. 2006, 34 (Suppl. 1), D247–251.
(15) Frishman, D.; Argos, P. Seventy-five percent accuracy in protein secondary structure prediction. Proteins: Struct., Funct., Genet. 1997, 27 (3), 329–335.
(16) Mucchielli-Giorgi, M. H.; Hazout, S.; Tuffery, P. PredAcc: prediction of solvent accessibility. Bioinformatics 1999, 15 (2), 176–177.
(17) Dubchak, I.; Muchnik, I.; Mayor, C.; Dralyuk, I.; Kim, S.-H. Recognition of a protein fold in the context of the SCOP classification. Proteins: Struct., Funct., Genet. 1999, 35 (4), 401–407.
(18) Bock, J. R.; Gough, D. A. Predicting protein-protein interactions from primary structure. Bioinformatics 2001, 17 (5), 455–460.
(19) Ding, C. H. Q.; Dubchak, I. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 2001, 17 (4), 349–358.
(20) Cai, C. Z.; Wang, W. L.; Sun, L. Z.; Chen, Y. Z. Protein function classification via support vector machine approach. Math. Biosci. 2003, 185 (2), 111–122.
(21) Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinf. Comput. Biol. 2005, 3 (2), 185–205.
(22) White, S. H.; Wimley, W. C. Hydrophobic interactions of peptides with membrane interfaces. Biochim. Biophys. Acta 1998, 1376 (3), 339–352.
(23) Haltia, T.; Freire, E. Forces and factors that contribute to the structural stability of membrane proteins. Biochim. Biophys. Acta 1995, 1241 (2), 295–322.
(24) Tusnady, G. E.; Simon, I. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol. 1998, 283 (2), 489–506.
(25) Yu, X.; Wang, C.; Li, Y. Classification of protein quaternary structure by functional domain composition. BMC Bioinf. 2006, 7 (1), 187.
(26) Li, S.; Liu, B.; Cai, Y.; Li, Y. Predicting protein N-glycosylation by combining functional domain and secretion information. J. Biomol. Struct. Dyn. 2007, 25 (1), 49–54.
(27) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17), 3389–3402.

PR700715C
