Predicting the Solvent Accessibility of Transmembrane Residues from Protein Sequence

Zheng Yuan,*,† Fasheng Zhang,† Melissa J. Davis,† Mikael Bodén,‡ and Rohan D. Teasdale*,†

Institute for Molecular Bioscience and ARC Centre in Bioinformatics, The University of Queensland, St. Lucia, 4072, Australia, and School of Information Technology and Electrical Engineering, The University of Queensland, St. Lucia, 4072, Australia

Received November 13, 2005
In this study, we propose a novel method to predict the solvent accessible surface areas of transmembrane residues. For both transmembrane α-helix and β-barrel residues, the correlation coefficients between the predicted and observed accessible surface areas are around 0.65. On the basis of predicted accessible surface areas, residues exposed to the lipid environment or buried inside a protein can be identified by using certain cutoff thresholds. We have extensively examined our approach based on different definitions of accessible surface areas and a variety of sets of control parameters. Given that experimentally determining the structures of membrane proteins is very difficult and membrane proteins are actually abundant in nature, our approach is useful for theoretically modeling membrane protein tertiary structures, particularly for modeling the assembly of transmembrane domains. This approach can be used to annotate the membrane proteins in proteomes to provide extra structural and functional information.

Keywords: lipid exposed residues • transmembrane helix protein • transmembrane β-barrel protein • protein sequence analysis • support vector regression
Introduction

In proteomes of multicellular organisms, about 20-30% of native proteins are integral membrane proteins.1,2 They play crucial roles in a variety of biological functions such as signal transduction, cellular adhesion, respiration, and molecular transport. Because of their importance, they are also targets of many pharmaceutical developments. Although primary and tertiary structural data on proteins have increased rapidly, there are only a very small number of integral membrane proteins with known three-dimensional structures. More than 30 000 structures have been deposited to the Protein Data Bank (PDB),3 but only about 300 are membrane proteins,4 of which no more than 100 are structurally unique.5

For a better understanding of the functions of membrane proteins, knowledge of their overall structures is essential. Owing to experimental difficulties in the determination of transmembrane protein structures, computational approaches become very useful. Currently, there are a number of methods which can be used to predict the topologies of transmembrane proteins from sequences alone with relatively high accuracies.6-10 However, prediction of the assembly of transmembrane domains in the lipid bilayer from the primary sequence is still elusive. Recent computational analyses based on membrane protein structures have shown a number of new findings. For example, the hydrophobic core region and membrane-water interface region have different amino acid propensities and physicochemical conformations.11 This information will be helpful for accurately predicting transmembrane domain boundaries. The analyses on the lipid propensity of transmembrane residues provide a way to study helix-helix and helix-lipid interactions.12-15 Knowing whether a residue is lipid exposed or not provides valuable information for studying transmembrane segment assembly.

As a well-studied structural feature, accessible surface area (ASA) has been used to measure the extent to which a residue is in contact with its lipid or solvent environment. Recent studies in prediction of accessibility have mainly focused on the solvent accessibility of soluble proteins. A number of methods have been proposed to predict the real values of solvent accessibility (ASA or relative ASA).16-20 The reported correlation coefficient between predicted and observed solvent accessibility could be as high as 0.7.20

Here, we develop new methods to predict lipid accessible surface areas of transmembrane residues, as they are in a different environment from that of soluble proteins and therefore need separate treatment. First, we examine whether the same type of amino acid has varying ASA values depending on its location in membranous and extra-membranous parts. Second, we develop new methods for predicting the ASA values of transmembrane residues. We then use these predicted values to determine whether a residue is buried or exposed in a lipid environment. We have applied the algorithms to two types of integral membrane proteins (α-helix and β-barrel) and predict the ASA value for the whole residue or for its side chain.

* To whom correspondence should be addressed. E-mail: z.yuan@imb.uq.edu.au or [email protected].
† Institute for Molecular Bioscience and ARC Centre in Bioinformatics.
‡ School of Information Technology and Electrical Engineering.

10.1021/pr050397b CCC: $33.50 © 2006 American Chemical Society
Journal of Proteome Research 2006, 5, 1063-1070
Published on Web 03/30/2006

Table 1. Accessible Surface Areas for an Amino Acid (X) in a Gly-X-Gly Extended Conformation22

AA    whole residue (Å²)    side chain (Å²)
A     116.4                 55.4
C     141.48                82.07
D     155.37                97.8
E     187.16                132.53
F     223.29                164.18
G     83.91                 -
H     198.51                141.27
I     189.95                130.71
K     207.49                147.99
L     197.99                141.52
M     210.55                150.39
N     168.87                109.92
P     144.8                 106.44
Q     189.17                129.68
R     249.26                190.24
S     125.68                69.08
T     148.06                88.62
V     162.24                103.12
W     265.42                209.62
Y     238.3                 180.03
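The Gly-X-Gly reference values in Table 1 are what turn an absolute ASA into a relative ASA (rASA): the observed area is divided by the reference area of the same amino acid. A minimal sketch of that normalization is given below. The function name `relative_asa` and the dictionary layout are illustrative choices rather than part of the published method, and glycine is given no side-chain reference because Table 1 lists none.

```python
# Reference ASAs in Å² taken from Table 1: (whole residue, side chain).
# Glycine has no side-chain entry in Table 1, so it is stored as None.
REFERENCE_ASA = {
    "A": (116.4, 55.4),    "C": (141.48, 82.07),  "D": (155.37, 97.8),
    "E": (187.16, 132.53), "F": (223.29, 164.18), "G": (83.91, None),
    "H": (198.51, 141.27), "I": (189.95, 130.71), "K": (207.49, 147.99),
    "L": (197.99, 141.52), "M": (210.55, 150.39), "N": (168.87, 109.92),
    "P": (144.8, 106.44),  "Q": (189.17, 129.68), "R": (249.26, 190.24),
    "S": (125.68, 69.08),  "T": (148.06, 88.62),  "V": (162.24, 103.12),
    "W": (265.42, 209.62), "Y": (238.3, 180.03),
}


def relative_asa(residue, asa, side_chain=False):
    """Return ASA divided by the Gly-X-Gly reference value for `residue`.

    `residue` is a one-letter amino acid code and `asa` an area in Å².
    Set `side_chain=True` to normalize a side-chain ASA instead of a
    whole-residue ASA.
    """
    reference = REFERENCE_ASA[residue][1 if side_chain else 0]
    if reference is None:
        raise ValueError(f"no side-chain reference value for {residue}")
    return asa / reference
```

For instance, `relative_asa("L", 41.73)` gives an rASA of about 0.21 for a leucine with a whole-residue ASA of 41.73 Å².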
Methods

Solvent Accessible Surface Area. The software MSMS21 was used to calculate ASAs for all atoms in the PDB file. The PDB file may contain the atomic coordinates for a number of subunits or chains. The selected probe radius is 1.4 Å (H2O) or 2.0 Å (-CH2-), as used by Beuming and Weinstein.13 A probe radius of 1.9 Å was used by Adamian et al.12 to simulate the -CH2- group, which is not significantly different from the selection of 2.0 Å. For each residue, we obtain the ASA for the whole residue or the side chain, resulting in two values for prediction. The ASAs for a residue were divided by their reference values for normalization, and the normalized values were defined as the relative accessible surface areas (rASAs). The reference values for the whole residue and the side chain of a residue X are the corresponding accessible surface values in a Gly-X-Gly tripeptide with extended conformation. The reference values given by Samanta et al.22 are used in this work and are shown in Table 1.

Using ASA or rASA to define whether a residue is on the protein surface or inside a protein is common practice, but it also generates errors by assigning some “inside” residues as “surface” residues. These residues can only be found by inspecting protein structures manually, and they account for a small percentage of solvent accessible residues. Eyre et al.14 analyzed the distribution of those residues in a nonredundant data set of 23 membrane proteins. They found that residues with rASA greater than 5% but not on the protein surface (“pore-lining” residues) comprised only 1% of the defined solvent accessible residues. The same level of noise may apply to our data set, and the noise will not affect our analyses significantly.

Predicting Relative Accessible Surface Area from PSI-BLAST Profile. The residue rASAs are predicted from protein sequences. To code a protein sequence, we use PSI-BLAST profiles23 as used in previous work on protein secondary structure prediction.24 The PSI-BLAST profiles are obtained by running the program against the NCBI nonredundant protein sequence database for three rounds with coiled-coil and low-complexity regions masked.24 However, there are some differences. The values in the profiles are divided by 10 for normalization, instead of being normalized by a sigmoid function. Since the residues in coiled-coil and low-complexity regions do not have meaningful scores, we encoded these residues with an orthogonal scheme. This variation of the coding scheme was successfully applied in our previous work25 and was also adopted here. To study the effect of sliding window size on the prediction performance, we use different values from 11 to 21 amino acids and report their results.

To estimate the relationship between protein sequence and relative ASA, we performed ε-insensitive support vector regression (ε-SVR).26,27 The expected function for relative ASA can be formulated as

$$ f(X_i) = \langle W, \Phi(X_i) \rangle + b \qquad (1) $$

where W is the weight, b the bias, and X_i a feature vector representing a certain residue. Φ(X_i) is a nonlinear function used for mapping a data point from the input space to a feature space. To find the best W and b, the following minimization problem was adopted

$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\lVert W \rVert^{2} + C \sum_{i=1}^{M} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad & f(X_i) - y_i \le \varepsilon + \xi_i \\
& y_i - f(X_i) \le \varepsilon + \xi_i^{*} \\
& \xi_i,\ \xi_i^{*} \ge 0 \quad \text{for } i = 1, \ldots, M
\end{aligned}
\qquad (2)
$$
where C is the regularization constant that determines the tradeoff between the norm and the error penalty. Two non-negative variables, ξ_i and ξ_i*, are used to measure the deviation of samples outside the error tube, where data points with errors less than ε are not used in constructing the optimization problem. The solution of the problem is given as26,27

$$ f(X) = \sum_{i=1}^{M} \left( \alpha_i - \alpha_i^{*} \right) \langle \Phi(X_i), \Phi(X) \rangle + b \qquad (3) $$
where α_i and α_i* are Lagrange multipliers. By introducing a kernel function, K(X_i, X) = ⟨Φ(X_i), Φ(X)⟩, we only need to know the exact form of K(X_i, X) instead of Φ. In particular, the radial basis function, K(X_i, X) = exp(-γ||X_i - X||), is used here, and thus the above formula can be transformed to

$$ f(X) = \sum_{i=1}^{M} \left( \alpha_i - \alpha_i^{*} \right) \exp\left( -\gamma \lVert X_i - X \rVert \right) + b \qquad (4) $$
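To make the encoding and eq 4 concrete, the sketch below builds a sliding-window feature vector from a profile and evaluates the kernel expansion for one residue. It is an illustration under stated assumptions, not the authors' implementation: the names `window_features`, `rbf_kernel`, and `svr_predict` are hypothetical, window positions beyond the chain ends are zero-padded (the text does not specify the padding rule), the orthogonal coding of masked residues is not reproduced, and the dual coefficients (α_i - α_i*) and bias b are assumed to come from an already trained model. The kernel follows eq 4 as printed; because many SVR packages define the radial basis function with the squared distance, a `squared` switch is included.

```python
import math


def window_features(profile, center, window=15, scale=10.0):
    """Flatten the profile rows inside a window centred on `center`.

    `profile` is a list of per-residue score rows (for example, 20
    PSI-BLAST scores each). Scores are divided by `scale` (10 in the
    text). Positions outside the chain are zero-padded, which is an
    assumption of this sketch.
    """
    half = window // 2
    width = len(profile[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            features.extend(score / scale for score in profile[pos])
        else:
            features.extend([0.0] * width)
    return features


def rbf_kernel(x_i, x, gamma, squared=False):
    """exp(-gamma * ||x_i - x||), or the squared-distance variant."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x_i, x))
    dist = dist_sq if squared else math.sqrt(dist_sq)
    return math.exp(-gamma * dist)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma, squared=False):
    """Evaluate eq 4: sum_i (alpha_i - alpha_i*) K(X_i, x) + b.

    `dual_coefs[i]` holds the difference alpha_i - alpha_i* for the
    support vector `support_vectors[i]`.
    """
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):
        total += coef * rbf_kernel(sv, x, gamma, squared)
    return total
```

With a window of 15 residues and 20 profile columns, each residue is represented by a 300-dimensional vector; only the support vectors with nonzero dual coefficients contribute to the prediction.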
In this work, different values of γ and C were examined. However, ε was constantly set as 0.01.

Membrane Protein Datasets and Leave-One-Out Test. Two datasets have been used for this purpose: one containing twenty-eight α-helix protein structures and the other fourteen β-barrel protein structures. The 28 α-helix proteins contain fifty-nine unique chains and were previously used by Beuming and Weinstein13 for analyzing the characteristics of buried and exposed residues. Transmembrane residues were manually annotated by those authors based on protein structures; in particular, the boundary residues of transmembrane domains were determined as those interacting with the phospholipid headgroups, or those interacting with the solvent if the transmembrane domain was buried. The fourteen β-barrel membrane proteins were previously used by Bagos et al.28 to develop a prediction method for β-barrel membrane proteins. The transmembrane segments were manually defined by visualizing the three-dimensional structures and identifying the aromatic belts of the barrel. All protein chain names are listed in Table 2, and their transmembrane annotations are given in the Supporting Information files 1 and 2.

Because transmembrane domains of α-helix and β-barrel proteins have different structures and lengths, the SVR approach was applied to the two data sets separately and two different prediction methods were developed. To evaluate the performance of the methods, we used leave-one-out tests. Each protein in a data set was tested in turn by the function derived from the other proteins. We calculated the Pearson correlation coefficient and mean absolute error between the predicted ASA and the observed ASA based on all residues.

Table 2. Proteins Used for Development and Evaluation of the Prediction Methods(a)

transmembrane helix proteins:
1afoa 1ap9_ 1ar1a 1bgyc 1bgyd 1bgye 1bgyg 1bgyj 1bgyk 1bl8a
1e12a 1ehka 1ehkb 1ehkc 1f88a 1fx8a 1h2sa 1h2sb 1iwga 1j4na
1jb0a 1jb0b 1jb0f 1jb0i 1jb0j 1jb0l 1jb0m 1jb0x 1kpla 1kqfb
1kqfc 1kzua 1kzub 1l0vc 1l0vd 1l7va 1lgha 1lghb 1msla 1mxma
1nekc 1nekd 1occa 1occb 1occc 1occd 1occg 1occi 1occj 1occk
1occl 1occm 1prch 1prcl 1prcm 1pv7a 1pw4a 1qlac 1su4a

transmembrane β-barrel proteins:
1a0sp 1e54a 1fepa 1i78a 1k24a 1kmoa 1prn_ 1qd5a 1qj8a 1qjpa
2fcpa 2mpra 2omf_ 2por_

(a) The first four characters are the PDB name and the fifth is the chain name.

Results and Discussions

Residues in Membranous and Extra-Membranous Parts Have Different ASA Distributions. To compare the ASA distribution of residues within membrane proteins, we performed the analysis for each individual amino acid type, because different amino acids have different ASA scales and the membranous regions have an amino acid composition biased toward hydrophobic residues. For each type of amino acid, we compared its ASA values and its amino acid composition according to location (membranous or extra-membranous). For a further comparison with soluble proteins, we prepared an extra dataset consisting of 1086 protein chains with pairwise identity less than 25% using the web server of PDB-REPRDB.29 All protein and chain names are listed in the Supporting Information file 3. The amino acid composition and mean ASA values were also calculated for this dataset. In addition to amino acid compositions, Table 3 gives the mean ASA values for the whole residues and side chains of different amino acids. Here, we provide the results for the data derived from a probe radius of 1.4 Å in the calculation of ASA, as a probe radius of 2.0 Å yields a similar overall trend.

Table 3. Amino Acid Compositions and Average Accessible Surface Areas for Twenty Types of Amino Acids When They Occur in Membranous and Extra-membranous Regions of Membrane Proteins, as Well as in Soluble Proteins

                         composition (%)                whole residue (Å²)             side chain (Å²)
hydrophobicity     AA    mem.    extra-mem.  soluble    mem.    extra-mem.  soluble    mem.    extra-mem.  soluble
very hydrophobic   L     16.68   7.80        8.62       41.73   40.90       22.98      39.70   35.31       18.78
                   V     10.60   6.36        7.03       33.04   30.15       19.78      31.54   25.01       15.76
                   I     9.82    4.68        5.44       38.73   34.00       18.68      37.54   29.62       15.29
                   F     8.81    4.67        3.93       45.19   51.68       23.79      43.18   44.49       19.59
                   M     4.46    2.34        2.20       31.49   46.19       33.22      30.03   38.57       26.47
                   W     3.26    2.45        1.49       63.96   57.54       33.34      62.19   52.80       29.08
                   C     1.20    0.90        1.39       20.49   11.33       12.02      18.40   7.85        7.33
hydrophobic        A     12.05   8.15        8.69       15.76   30.70       24.30      12.51   20.14       16.04
                   G     8.96    8.42        7.86       9.20    24.39       23.79      -       -           -
                   T     5.28    5.92        5.69       19.93   40.88       39.40      18.09   32.97       32.48
                   Y     3.36    3.50        3.56       42.67   43.99       36.26      40.44   38.93       31.99
                   H     2.32    2.60        2.36       36.69   58.41       49.69      35.01   50.89       43.10
                   K     0.66    5.38        5.83       38.85   97.30       105.71     37.05   87.98       96.09
other              S     4.88    5.86        5.81       13.09   35.42       37.22      10.85   24.60       27.30
                   P     2.82    6.27        4.73       18.69   46.00       50.03      16.29   37.28       40.90
                   N     1.25    4.33        4.33       19.51   48.84       53.84      18.39   41.25       45.38
                   R     1.11    5.54        4.81       36.44   87.49       87.63      35.12   78.92       80.40
                   E     0.87    5.72        6.53       14.87   72.42       74.05      13.75   62.90       64.90
                   Q     0.88    3.84        3.76       29.36   64.61       67.47      26.88   56.74       59.62
                   D     0.71    5.25        5.95       14.88   50.75       54.29      13.00   41.48       45.10
Figure 1. Relationship of mean absolute errors (y) and correlation coefficients (x). The correlation coefficient between y and x is -0.988.

Table 4. Prediction Correlation Coefficient (CC) and Mean Absolute Error (MAE) for Different Types of Transmembrane Residues(a)

                             probe radius = 1.4 Å                              probe radius = 2.0 Å
residue                      CC                      MAE (Å²)                  CC                      MAE (Å²)
α-helix (whole residue)      0.659 (0.639 ± 0.020)   19.35 (19.89 ± 0.50)      0.637 (0.616 ± 0.020)   19.54 (20.00 ± 0.38)
α-helix (side chain)         0.659 (0.615 ± 0.060)   18.88 (20.25 ± 1.30)      0.686 (0.619 ± 0.040)   17.98 (20.00 ± 1.15)
β-barrel (whole residue)     0.652 (0.638 ± 0.011)   20.19 (20.78 ± 0.43)      0.637 (0.623 ± 0.009)   20.87 (21.43 ± 0.42)
β-barrel (side chain)        0.649 (0.638 ± 0.007)   20.25 (20.61 ± 0.24)      0.638 (0.623 ± 0.007)   21.00 (21.47 ± 0.25)

(a) In parentheses are the means and standard deviations on twenty-four models.

Figure 2. Observed and predicted solvent accessible surface areas for transmembrane residues in helix proteins. (A) The mean absolute error is 16.0 Å² calculated from the residues in the seven transmembrane domains of protein 1e12 (chain A). (B) The mean absolute error is 30.1 Å² for protein 1jb0 (chain L).
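The two accuracy measures used throughout, the Pearson correlation coefficient and the mean absolute error, together with the leave-one-out protocol, can be sketched as follows. This is a minimal illustration: `fit` and `predict` are hypothetical callables standing in for any regression model, and each protein is assumed to be supplied as a pair of per-residue feature and observed-ASA lists. Predictions from all held-out proteins are pooled before the measures are computed, in line with the statement that the measures are based on all residues.

```python
import math


def pearson_cc(predicted, observed):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def mean_absolute_error(predicted, observed):
    """Average of |predicted - observed| over all residues."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def leave_one_out(proteins, fit, predict):
    """Hold out each protein in turn and pool the held-out predictions.

    `proteins` is a list of (features, observed) pairs, one per protein.
    `fit(training)` returns a model from the remaining proteins and
    `predict(model, features)` returns one value per residue.
    """
    pooled_pred, pooled_obs = [], []
    for k, (features, observed) in enumerate(proteins):
        training = proteins[:k] + proteins[k + 1:]
        model = fit(training)
        pooled_pred.extend(predict(model, features))
        pooled_obs.extend(observed)
    return (pearson_cc(pooled_pred, pooled_obs),
            mean_absolute_error(pooled_pred, pooled_obs))
```

Holding out whole proteins rather than individual residues keeps neighbouring residues of the test chain, whose windows overlap, out of the training set.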
Amino acids have been classified into three groups. Hydrophobic amino acids, as defined by Livingstone and Barton,30 include L, V, I, F, M, W, C, A, G, T, Y, H, and K, of which the first seven were defined as very hydrophobic.31 Very hydrophobic amino acids form the first group, while the remaining hydrophobic amino acids form the second group. All other amino acids make up the third group.

On the basis of the mean ASA values, very hydrophobic amino acids tend to have greater or comparable ASAs in transmembrane domains than in other parts of membrane proteins. Because the distributions are far from normal, we performed Kolmogorov-Smirnov tests to determine whether they differ significantly. V (P < 0.05), I (P < 0.01), W (P < 0.05), and C (P < 0.01) have significantly different distributions, while L, F, and M do not show a significant difference between the two environments. In addition, very hydrophobic amino acids in membrane proteins tend to have greater ASA values than in soluble proteins. Except for cysteine (C), the distributions for the other very hydrophobic amino acids are significantly different (P < 10⁻⁵). Furthermore, all very hydrophobic residues have higher amino acid compositions in the membranous parts than in the extra-membranous parts of proteins. In contrast, residues located in extra-membranous parts of membrane proteins and in soluble proteins have very similar amino acid compositions. Very hydrophobic amino acids frequently occurring in transmembrane domains are liable to interact with the lipid environment, and therefore tend to have greater or at least comparable solvent accessible surface areas.

In membrane proteins, hydrophobic amino acids (group two) except Y show lower ASA values in transmembrane domains, and their distributions are significantly different from those in extra-membranous parts (P < 10⁻⁴). The distributions of A, G, Y, and K are significantly different in the extra-membranous parts of membrane proteins and in soluble proteins (P < 0.05). Amino acids T, Y, H, and K have lower compositions in the membranous parts of membrane proteins than in the extra-membranous parts or in soluble proteins. The residues that were not classified as hydrophobic amino acids (group three) have lower compositions and lower ASA values in membranous parts than in extra-membranous parts.
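The comparisons above rest on the two-sample Kolmogorov-Smirnov test, whose statistic is the largest vertical gap between the empirical cumulative distributions of the two samples. A minimal sketch of that statistic is given below; the function name `ks_statistic` is illustrative, and converting the statistic into a P value (as reported in the text) is left to a statistics package and is not reproduced here.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the two empirical
    cumulative distribution functions. The maximum is attained at one
    of the observed values, so only those points are checked.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

In this setting, `sample_a` and `sample_b` would hold the ASA values of one amino acid type in two environments, for example membranous and extra-membranous regions. Because the statistic compares whole distributions, it does not require the values to be normally distributed.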
For the residues not classified as hydrophobic (group three), the ASA distributions in the two locations are quite different (P < 10⁻⁶), and these residues tend to have greater ASA values in soluble proteins. Amino acids S, P, and N in extra-membranous parts of membrane proteins and in soluble proteins have significantly different distributions (P < 0.05). We do not observe a difference for amino acids R, E, Q, and D in these two environments.

It can be concluded that the majority of amino acids in membranous parts and extra-membranous parts of membrane proteins have different ASA distributions. The majority of amino acids in extra-membranous parts of membrane proteins and in soluble proteins have different ASA distributions as well. These observations suggest that most current ASA prediction methods developed for soluble proteins may not be applicable here, and therefore the transmembrane residues need separate treatment.

Correlation between Predicted and Observed Lipid Accessible Surface Areas. Although the ASA calculations were based on whole molecules, we examined our method only on unique protein chains. Fifty-nine transmembrane helix protein chains and 14 transmembrane β-barrel chains were selected to examine the prediction accuracy by the leave-one-out jackknife test. We examined different combinations of window sizes and support vector machine control parameters. Window sizes of 11, 13, 15, 17, 19, and 21 amino acids were tested. The C values (eq 2) examined were 1, 2, 5, and 7. The value of γ (eq 4) was set at 0.01. The correlation coefficient (CC) and mean absolute error were calculated for each model. Two probe radii (1.4 and 2.0 Å) were used, and the ASAs for the whole residues and the side chains were predicted accordingly. In each case, we performed 24 (6 × 4) rounds of training and testing according to the different window sizes and C values.

Accuracy is expected to vary between sets of control parameters. However, the best models are always the ones used in practice. Therefore, for each case, the best correlation coefficient is given in Table 4, followed in parentheses by the mean value (over 24 models) and the standard deviation. Note that the best correlation coefficient always coincides with the least mean absolute error. Figure 1 illustrates the relationship between correlation coefficients and mean absolute errors; this analysis is based on the predictions of whole residue ASAs of transmembrane helices with a probe radius of 1.4 Å and a window size of 17 amino acids. The correlation between CCs and mean absolute errors reaches -0.988, suggesting that they are highly correlated and that either measure alone can be used to reflect the prediction accuracy.

Eight cases with best prediction accuracies are shown in Table 4. Transmembrane helix residues are more accurately predicted than β-barrel residues owing to the larger dataset, which contains more information. After the leave-one-out test, the correlation coefficients between predicted and observed ASAs fall in the range between 0.637 and 0.686 and mean absolute errors between 17.98 and 20.87 Å². The best CC for transmembrane helix residues reaches 0.686, with a mean absolute error of 17.98 Å², when the side chain ASA and a probe radius of 2.0 Å are used. In this case, the window is 21 amino acids wide and the C value equals 7. When the window size is set as 19 amino acids, the CC equals 0.677 and the mean absolute error 18.03 Å².

There are some accuracy variations between models with different control parameters, and the accuracy also depends on the dataset. Figure 2 gives the observed and predicted ASA values for two transmembrane helix proteins. Predicted values are jackknife test results. The better prediction is found for protein 1e12, chain A (Figure 2A). The mean absolute errors for its transmembrane domains (from one to seven, starting from the protein N-terminus) are 17.0, 20.6, 13.1, 18.3, 17.7, 12.9, and 12.9 Å², with an overall mean absolute error of 16.0 Å². Protein 1jb0 (chain L) is predicted with mean absolute errors of 39.8, 26.0, and 23.3 Å² for its three transmembrane domains (Figure 2B). The overall mean absolute error is 30.1 Å², which is much greater than the mean absolute error of 19.34 Å² (Table 4) derived from all proteins.

Generally, nearly all previous ASA prediction methods were developed on larger databases, which leads to higher prediction accuracy. For example, residues in soluble proteins were predicted with a correlation coefficient of about 0.7 in our previous study,20 which is better than what we currently achieve; however, that study used about one thousand protein chains. The data sets we use here are very small; therefore, further improvement may be obtained by including more protein structures in our training set as more structures are solved experimentally.

Predicting the Lipid Exposed Residues in Transmembrane Segments. A transmembrane residue can be coarsely classified as exposed to the lipid environment or buried inside the protein. To compare different types of amino acids at the same level, the ASA values were normalized by the value for the extended conformation of the tripeptide, and the normalized value was termed the relative accessible surface area. Using different cutoff thresholds that classify exposed and buried residues based on rASA, we calculate the prediction sensitivity, specificity, and overall accuracy. Sensitivity is defined as the percentage of correctly predicted exposed residues among the total observed exposed residues, while specificity is defined as the percentage of correctly predicted exposed residues among the total predicted exposed residues. The overall accuracy is the ratio between the number of correctly predicted exposed and buried residues and the total number of residues. To describe the extent to which a threshold splits the data into exposed and buried residues, we use the exposed residue abundance, defined as the percentage of observed exposed residues among the total residues. Using the best models for the different cases (probe radius 1.4 or 2.0 Å; whole residue or side chain), we present the calculated values of specificity, sensitivity, exposed residue abundance, and overall accuracy in Figure 3 and Figure 4.

The results for transmembrane helix proteins are shown in Figure 3, which clearly indicates that the abundance of lipid exposed residues, sensitivity, and specificity are inversely related to rASA thresholds. The sensitivities drop rapidly as the threshold increases. The specificity, however, can maintain its accuracy (60-70%) over a wide range of rASA thresholds (2-40%). Using a rASA threshold that corresponds to 50% exposed residue abundance, we present the specificity, sensitivity, and overall accuracy in Table 5. For transmembrane α-helix residues, the specificities, sensitivities, and overall accuracies are roughly 65%, 90%, and 70%, respectively. We also give specificity, sensitivity, exposed residue abundance, and overall accuracy for β-barrel membrane proteins in Figure 4. When the residues are classified as half exposed and half buried, the specificities, sensitivities, and overall accuracies are roughly 72%, 80%, and 75%, respectively (Table 5).

For the two transmembrane helix proteins shown in Figure 2, the prediction accuracies were obtained by using the whole residue ASA (probe radius 1.4 Å) and a rASA threshold of 9%, which defines 50% of total residues as exposed residues. Protein 1e12 (chain A) was predicted with specificity, sensitivity, and overall accuracy of 79.0%, 92.2%, and 82.2%, respectively. The specificity, sensitivity, and overall accuracy for protein 1jb0 (chain L) are 72.2%, 86.7%, and 67.7%, respectively.

Most previous methods performed two-class prediction on soluble proteins and achieved overall accuracies between 72% and 90%,32-37 which depend on the selection of rASA thresholds and the ASA distribution in the databases. Thus, a strict comparison with this study is not valid. However, the top-left plots in Figures 3 and 4 show that the overall accuracies fall between 72 and 92% for helix proteins and between 74 and 93% for β-barrel proteins across a wide range of rASA thresholds (2-50%). These accuracies are comparable with previously reported results on soluble proteins.

Figure 3. Prediction accuracies for transmembrane helix residues according to different cases (probe radius 1.4 or 2.0 Å, whole residue or side chain). Specificity (solid), sensitivity (slashed), exposed residue abundance (dotted), and overall accuracy (dot-and-slashed) are plotted according to various relative ASA cutoff thresholds.

Figure 4. Prediction accuracies for transmembrane β-barrel residues according to different cases (probe radius 1.4 or 2.0 Å, whole residue or side chain). Specificity (solid), sensitivity (slashed), exposed residue abundance (dotted), and overall accuracy (dot-and-slashed) are plotted according to various relative ASA cutoff thresholds.

Table 5. Specificity, Sensitivity and Overall Accuracy for All Cases When the Exposed Residue Abundance Is Set As 50%

case                                  specificity (%)   sensitivity (%)   overall accuracy (%)
α-helix, whole residue, r = 1.4 Å     67.0              87.9              72.2
α-helix, side chain, r = 1.4 Å        67.5              88.0              72.4
α-helix, whole residue, r = 2.0 Å     62.8              91.3              68.7
α-helix, side chain, r = 2.0 Å        66.2              92.7              72.6
β-barrel, whole residue, r = 1.4 Å    72.8              81.4              75.3
β-barrel, side chain, r = 1.4 Å       73.5              77.4              74.6
β-barrel, whole residue, r = 2.0 Å    71.3              85.3              75.6
β-barrel, side chain, r = 2.0 Å       72.5              82.6              75.5
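The two-class measures defined in this section (sensitivity, specificity, overall accuracy, and exposed residue abundance) can be computed from predicted and observed rASA values at a chosen cutoff, as in the sketch below. The function name `exposure_metrics` is illustrative, and treating a residue as exposed when its rASA is strictly greater than the threshold is an assumption of this sketch. Note that specificity follows the definition given in the text, that is, correctly predicted exposed residues divided by all predicted exposed residues.

```python
def _ratio(numerator, denominator):
    """Safe division that yields NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def exposure_metrics(predicted_rasa, observed_rasa, threshold):
    """Classify residues as exposed (rASA > threshold) and score them.

    Returns a dict of fractions in [0, 1]; multiply by 100 to obtain
    percentages.
    """
    tp = fp = fn = tn = 0
    for pred, obs in zip(predicted_rasa, observed_rasa):
        pred_exposed = pred > threshold
        obs_exposed = obs > threshold
        if pred_exposed and obs_exposed:
            tp += 1
        elif pred_exposed:
            fp += 1
        elif obs_exposed:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        # correctly predicted exposed / observed exposed
        "sensitivity": _ratio(tp, tp + fn),
        # correctly predicted exposed / predicted exposed (text definition)
        "specificity": _ratio(tp, tp + fp),
        # correctly predicted exposed and buried / all residues
        "accuracy": _ratio(tp + tn, total),
        # observed exposed / all residues
        "abundance": _ratio(tp + fn, total),
    }
```

Sweeping `threshold` over a range of rASA values and recording the four measures at each step yields curves of the kind plotted in Figures 3 and 4; the threshold at which `abundance` is closest to 0.5 corresponds to the setting used for Table 5.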
Conclusions

Although many membrane proteins have been identified and submitted to most major sequence databases such as UniProt,38 very little is known about their three-dimensional structures, or even the structural assembly of transmembrane domains. The same type of amino acid may have different ASA distributions when it is located in different environments, i.e., a water or lipid environment. Nearly all existing solvent accessible surface area prediction methods were developed mainly on soluble proteins and therefore may not be applicable to membrane proteins. On this basis, we provide an effective approach to predict lipid accessible surface areas of transmembrane residues. The predictions will provide valuable information for studying transmembrane residue-lipid or residue-residue interactions and, further, for studying their functions. To expand the application, we will make the methods available for public use, together with the prediction methods designed for soluble proteins (http://ccb.imb.uq.edu.au/ASAP/).
Acknowledgment. This work was supported by funds from the Australian Research Council and the Australian National Health and Medical Research Council. M.B. and Z.Y. are supported by a UQ Early Research Grant. R.D.T. is supported by an NHMRC R. Douglas Wright Career Development Award. This work was performed as part of the Renal Regeneration Consortium, and was supported by the National Institutes of Health (DK63400) as part of the Stem Cell Genome Anatomy Project. We thank Melvena Teasdale for critical reading of the manuscript and Thijs Beuming and Harel Weinstein for kindly providing the annotations of transmembrane helix proteins.

Supporting Information Available: All protein chain transmembrane annotations are given in the Supporting Information files 1 and 2. All soluble protein and chain names are listed in the Supporting Information file 3. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Wallin, E.; von Heijne, G. Protein Sci. 1998, 7, 1029-1038.
(2) Kanapin, A.; Batalov, S.; Davis, M. J.; Gough, J.; Grimmond, S.; Kawaji, H.; Magrane, M.; Matsuda, H.; Schonbach, C.; Teasdale, R. D.; Yuan, Z. Genome Res. 2003, 13, 1335-1344.
(3) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res. 2000, 28, 235-242.
(4) Tusnady, G. E.; Dosztanyi, Z.; Simon, I. Bioinformatics 2004, 20, 2964-2972.
(5) White, S. H. Protein Sci. 2004, 13, 1948-1949.
(6) Cserzo, M.; Wallin, E.; Simon, I.; von Heijne, G.; Elofsson, A. Protein Eng. 1997, 10, 673-676.
(7) Tusnady, G. E.; Simon, I. Bioinformatics 2001, 17, 849-850.
(8) Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E. L. L. J. Mol. Biol. 2001, 305, 567-580.
(9) Yuan, Z.; Mattick, J. S.; Teasdale, R. D. J. Comput. Chem. 2004, 25, 632-636.
(10) Rost, B.; Yachdav, G.; Liu, J. Nucleic Acids Res. 2004, 32, W321-326.
(11) Granseth, E.; von Heijne, G.; Elofsson, A. J. Mol. Biol. 2005, 346, 377-385.
(12) Adamian, L.; Nanda, V.; DeGrado, W. F.; Liang, J. Proteins 2005, 59, 496-509.
(13) Beuming, T.; Weinstein, H. Bioinformatics 2004, 20, 1822-1835.
(14) Eyre, T. A.; Partridge, L.; Thornton, J. M. Protein Eng. Des. Sel. 2004, 17, 613-624.
(15) Ulmschneider, M. B.; Sansom, M. S.; Di Nola, A. Proteins 2005, 59, 252-265.
(16) Ahmad, S.; Gromiha, M. M.; Sarai, A. Proteins 2003, 50, 629-635.
(17) Yuan, Z.; Huang, B. Proteins 2004, 57, 558-564.
(18) Wang, J.-Y.; Lee, H.-M.; Ahmad, S. Proteins 2005, 61, 481-491.
(19) Garg, A.; Kaur, H.; Raghava, G. Proteins 2005, 61, 318-324.
(20) Yuan, Z.; Bailey, T. L. Proceedings of the 26th Annual International Conference of the IEEE EMBS 2004, 2889-2892.
(21) Sanner, M. F.; Olson, A. J.; Spehner, J. C. Biopolymers 1996, 38, 305-320.
(22) Samanta, U.; Bahadur, R. P.; Chakrabarti, P. Protein Eng. 2002, 15, 659-667.
(23) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Nucleic Acids Res. 1997, 25, 3389-3402.
(24) Jones, D. T. J. Mol. Biol. 1999, 292, 195-202.
(25) Yuan, Z.; Bailey, T. L.; Teasdale, R. D. Proteins 2005, 58, 905-912.
(26) Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, 2000.
(27) Smola, A.; Scholkopf, B. In NeuroCOLT Technical Report Series, NC-TR-1998-030, http://www.neurocolt.com, 1998.
(28) Bagos, P. G.; Liakopoulos, T. D.; Spyropoulos, I. C.; Hamodrakas, S. J. BMC Bioinformatics 2004, 5, 29.
(29) Noguchi, T.; Akiyama, Y. Nucleic Acids Res. 2003, 31, 492-493.
(30) Livingstone, C. D.; Barton, G. J. Comput. Appl. Biosci. 1993, 9, 745-756.
(31) Betts, M. J.; Russell, R. B. In Bioinformatics for Geneticists; Barnes, M. R., Gray, I. C., Eds.; Wiley: New York, 2003.
(32) Rost, B.; Sander, C. Proteins 1994, 20, 216-226.
(33) Cuff, J. A.; Barton, G. J. Proteins 1999, 34, 508-519.
(34) Gianese, G.; Bossa, F.; Pascarella, S. Protein Eng. 2003, 16, 987-992.
(35) Qin, S.; He, Y.; Pan, X. M. Proteins 2005, 61, 473-480.
(36) Sim, J.; Kim, S. Y.; Lee, J. Bioinformatics 2005, 21, 2844-2849.
(37) Yuan, Z.; Burrage, K.; Mattick, J. S. Proteins 2002, 48, 566-570.
(38) Bairoch, A.; Apweiler, R.; Wu, C. H.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Natale, D. A.; O’Donovan, C.; Redaschi, N.; Yeh, L. S. Nucleic Acids Res. 2005, 33, D154-159.