Solvent Accessible Surface Area of Amino Acid Residues in Globular


Biomacromolecules 2009, 10, 1224–1237

Solvent Accessible Surface Area of Amino Acid Residues in Globular Proteins: Correlation of Apparent Transfer Free Energies with Experimental Hydrophobicity Scales

Alexey K. Shaytan,*,† Konstantin V. Shaitan,‡ and Alexei R. Khokhlov†

Physics Department and Biology Department, Moscow State University, Moscow 119991, Russia

Received December 31, 2008; Revised Manuscript Received February 25, 2009

It is known that the distribution of amino acid residues in globular proteins between surface and interior correlates to some extent with various experimental scales based on the partitioning of amino acids or their analogs between water and organic solvents. These scales are often used in quantitative structure-activity relationship (QSAR) studies as well as for evaluating the stability of proteins. In this work we have analyzed the distribution of residues based on their solvent accessible surface area in more than 8000 protein structures. Using extensive statistical sampling, we have computed apparent free energies of residue transfer between protein interior and surface, applying various criteria for classifying residues as exposed or buried. The correlation of these statistical energies with several experimental hydrophobicity scales is discussed. We propose three types of statistical apparent transfer free energy scales and show that each of these scales correlates better with one particular experimental hydrophobicity scale (water/vapor, water/cyclohexane, and water/octanol transfer scales). The data are interpreted using the theoretical considerations of Finkelstein et al. (Protein Struct. Funct. Genet. 1995, 23, 142), which are based on the random energy model of heteropolymer globules. The deviation of apparent transfer free energies from experimental scales is discussed and analyzed. The variation of amino acid distribution in proteins with the size of the protein structure is discussed, and the final protein set is chosen to minimize this variation.

* To whom correspondence should be addressed. E-mail: alexeyshaytan@gmail.com. † Physics Department. ‡ Biology Department.

Introduction

The water environment plays a key role in the folding and stability of almost all protein structures. The hydrophobic interaction of apolar amino acids is considered to be the main driving force of protein folding.1 Understanding the principles and energetics of protein structure formation with respect to the distribution of different types of amino acids in the core and at the surface of the protein is beneficial for de novo protein sequence design2,3 and for the design of protein-inspired4 and smart polymers.5 It is known from previous studies6-16 that in globular proteins amino acid residues with polar and charged side chains tend on average to be more exposed to solvent than those with apolar side chains, and that the distribution of residues relative to the surface of the protein globule correlates approximately with the distribution of amino acids (or derived compounds) between water and a less polar solvent. This distribution resembles Boltzmann-like statistics, as if amino acids were not fixed in the protein structure but were in a dynamic equilibrium between the surface phase and the protein interior. The similarity between statistics in globular proteins and the Boltzmann statistics of separately taken elements of proteins was first noted by Pohl.17 The theoretical grounds for this fact, based on the random energy model of the heteropolymer globule,18 were laid by Finkelstein et al.19 According to their considerations, the occurrence of structural elements in stable folds of random heteropolymer chains should be governed by the equations

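$$\mathrm{OCCURENCE} \propto \exp\left(-\frac{\Delta F^{1}_{\mathrm{select}}}{RT_c}\right) \qquad (1)$$

$$\Delta F^{1}_{\mathrm{select}} = \left(\Delta E - \frac{\Delta\sigma}{2RT_c}\right) - RT_c \ln\frac{M_1}{M} \qquad (2)$$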

where $\Delta F^{1}_{\mathrm{select}}$ is the “selective (apparent) free energy” of some element (referred to as element “1”), OCCURENCE is the ratio of the number of observed native folds with the element to the number of native folds without it, $T_c$ is the conformational temperature typical for the freezing of random heteropolymers, $\Delta E$ is the average (over possible compact folds of random sequences) energy difference between the fold with and without the structural element, $\Delta\sigma$ is the change in the energy dispersion of all compact folds of the random heteropolymer with and without the element, and $M_1/M$ represents the fraction of folds that include the element. This Boltzmann-like dependence of occurrence on the mean energy of the element is applicable at least to an ensemble of globular protein structures. It arises because the number of different primary amino acid sequences (which for globular proteins resemble random heteropolymer sequences) that stabilize a protein with this structural element depends exponentially on the selective free energy of this element. The energies that can be derived using eq 1 from the analysis of the occurrence of amino acids in the interior and exterior of proteins are usually also called “apparent transfer free energies”. The conformational temperature for proteins was shown to be near 300 K by Finkelstein et al.20 However, Thomas and Dill21 argued that, with respect to interior-exterior residue partitioning, the proteins in the Protein Data Bank (PDB) may not be well modeled by the random heteropolymer assumption of Finkelstein et al., since the effective temperature $T_c$ depends on the length, composition, and compactness of the proteins in the database, while the random heteropolymer model results are independent

10.1021/bm8015169 CCC: $40.75 © 2009 American Chemical Society. Published on Web 03/31/2009

Surface Area of Amino Acid Residues in Proteins

of protein length and composition. Thomas and Dill showed that, for protein sets with different “propensities”, the conformational temperatures of residue distribution vary significantly, but the Boltzmann-like distribution still holds. Thus, smaller proteins with many nonpolar residues may appear “hotter” than bigger ones. To partially tackle this problem, in our study we explicitly analyzed the dependence of residue distribution on protein size and tried to achieve size-independent statistics. Correlations of apparent transfer free energies (and other statistical parameters, such as the type-dependent hydrophobicities derived in the recent work of Gomes et al.22) with experimental transfer free energies between water and a less polar solvent have been reported many times.8-16 Chothia6 performed the first quantitative study of the accessible surface of amino acids in globular proteins and characterized their preference to be in the core or at the surface, while Nozaki and Tanford23 had earlier established one of the first hydrophobicity scales of amino acids based on their partitioning between water, ethanol, and dioxane. Wolfenden et al.8 were among the first to study numerically the correlation between various experimental amino acid transfer scales and statistical data. This opened a long-lasting debate concerning which experimental polarity scale best represents the observed statistical distributions. While many experimental scales have been suggested (e.g., water/ethanol,23 water/vapor,8 water/cyclohexane,13 and water/octanol24,25 transfer, water/air interface adsorption,26 and water/hexane transfer and adsorption at the interface27), along with many more semiexperimental ones, only a few have been extensively tested and used: the water/vapor, water/cyclohexane, and water/octanol scales, the last being perhaps one of the most widely applied in computational studies of protein structures and quantitative structure-activity relationships in medicinal chemistry.
This tendency was partially started by Eisenberg and McLachlan,28 who calculated their atomic solvation parameters based on water-to-octanol transfer free energies of amino acids. The preference for the water/octanol transfer scale also stems from the consideration that octanol molecules, being partially polar and partially hydrophobic, resemble the heterogeneous environment of the protein interior better than purely hydrophobic or vapor phases do. However, such properties of octanol pose serious problems of interpretation, as discussed by Wolfenden16 (e.g., “water-dragging” and specific forces of attraction between indole derivatives and alcohols), which made him advocate the use of the more easily understood water/cyclohexane scale.16 Apart from the choice of the experimental scale, the method of assessing the statistical distribution of amino acid residues in proteins is important for analyzing this type of correlation. Guy9 reported that some statistical studies of residue distribution in proteins correlated better with water/vapor transfer free energies, while others correlated better with water/organic solvent transfer free energies, and concluded that this was probably related to the different methods used to classify residues as exposed or buried in statistical studies. In this study our goal is to clarify the above-mentioned problems in the correlation of experimental transfer free energies with the statistical distribution of residues in globular proteins. We attempted to examine various criteria for computing apparent transfer free energies and to continuously trace the change in amino acid composition at various levels of solvent accessibility. To perform such an analysis with sufficient statistical reliability, one requires a rather extensive protein set. In previous studies of average residue accessibility in globular proteins, the number of analyzed protein chains did not exceed 500 structures; for our analysis, this number had to be substantially increased, which persuaded us into

Biomacromolecules, Vol. 10, No. 5, 2009


elaborating a semiautomated procedure for protein set construction rather than doing it manually. Another problem addressed in the paper is the dependence of the estimated parameters on the size of the proteins and the choice of a representative protein size window for our analysis. The problem was addressed at an initial level by Miller et al.,11 who concluded that the average amino acid compositions (except for cysteine) of the protein surface and interior were essentially the same for protein groups with different molecular weights. Our data call for a more careful treatment of this issue and show that concentrations may vary considerably with the average protein size. The article is organized as follows: Methods describes the protein set and its preparation and the calculation of solvent accessible surface area and apparent transfer free energies. In Results and Discussion, the dependence of protein composition on the size of structures is analyzed, residue distributions are presented, and the correlation of statistical data with experimental hydrophobicity scales is discussed.

Methods

Choice of Protein Set. To perform the study, a protein set had to be chosen that conforms to three requirements: (1) the protein set should be nonredundant, that is, no close homologues should be present in the set, because this would introduce an “evolutionary” bias into the results and invalidate the assumptions of the random energy model theory; (2) the protein set should contain water-soluble globular proteins that maintain their structure in water solution without any additional interactions with stabilizing compounds; (3) high-resolution full protein structures should be available in the PDB database. Choosing such a protein set automatically is not an easy task, so we admit that the final set may contain some protein structures that do not strictly conform to these requirements; however, their effect on the overall statistics should be negligible. The final protein set was formed as follows. The Non-redundant PDB chain set list (from September 5, 2008) maintained by NCBI was obtained;29 it contains a list of protein chains grouped by their primary sequence similarity as assessed by the BLAST algorithm.30 From each homology group with a p-value of $10^{-7}$, one protein chain entry was taken that was marked as representative (based on high resolution and other preselected criteria). The corresponding full protein structures were obtained from the WWPDB database.31 At this step our set contained 11748 structures. Next, a series of selection and modification procedures was applied to the set. A total of 197 structures were discarded because they had an incomplete description of the atomic structure of one or more residues. PDB entries were filtered for the keywords “MEMBRANE” (819 structures were discarded), “TOXIN” (269 structures discarded), and “FIBER” and “FIBROUS” (22 structures discarded). A total of 699 structures were discarded because they contained DNA or RNA.
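As a quick consistency check, the discard counts quoted above can be tallied in a few lines of Python. This is only a bookkeeping sketch, not the authors' selection pipeline: the step labels and the helper name `remaining_after` are our own, and only the numbers come from the text. The last value it produces, 9742, is the number of structures that the text reports as remaining at the next processing step.

```python
# Bookkeeping sketch for the automated filtering steps described in the text.
# The counts are those quoted in the text; labels and names are illustrative.
initial_structures = 11748

discarded = [
    ("incomplete atomic description", 197),
    ("keyword MEMBRANE", 819),
    ("keyword TOXIN", 269),
    ("keywords FIBER / FIBROUS", 22),
    ("contains DNA or RNA", 699),
]


def remaining_after(initial, steps):
    """Return (label, structures left) after each successive discard step."""
    counts = []
    current = initial
    for label, n_discarded in steps:
        current -= n_discarded
        counts.append((label, current))
    return counts


history = remaining_after(initial_structures, discarded)
for label, left in history:
    print(f"{label:32s} -> {left} structures left")
```
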
The data obtained from the PDB (when resolved by X-ray) are in the majority of cases an asymmetric unit of the crystal. To obtain the real biological molecule (as it is expected to be in solution for globular proteins), that is, the macromolecule that has been shown to be, or is believed to be, functional, one has to apply BIOMT transformations to the asymmetric unit data. These BIOMT transformations were applied to the 9742 remaining structures to generate full biological molecules from the crystallographic asymmetric unit data. Neglecting the BIOMT transformations (given that we generated our protein set automatically) might yield monomeric units with exposed hydrophobic surfaces that are in reality buried in monomer-monomer contacts, and would thus bias our statistics. A total of 277 structures were discarded for unusual BIOMT transformations. For structures containing several models (mainly NMR structures), only the first model was kept for analysis. The next step was to exclude structures that did not represent compact protein structures, but rather huge assemblies or complexes (e.g., virus capsids), since these would be inappropriate for our study. For this purpose, the structures were filtered according to the ratio of their radius of gyration to their number of residues. In Figure 1 every protein structure is represented by a single data point whose coordinates are the radius of gyration and the number of residues of the structure. This “cloud” of data points in the log-log plot has a sharp linear lower border that evidently corresponds to the maximum packing density for a fixed number of residues; this lower border is very well approximated by the function

$$R_g = 0.2 \times N^{0.38} \qquad (3)$$

where N is the number of residues in the protein and $R_g$ is expressed in nanometers. The power dependence with an exponent of 0.38 is rather close to that found in the globular state of homopolymers, where the exponent is one-third. Protein structures whose data points in Figure 1 lie close to the line defined by eq 3 may with high probability be considered to be in the globular state. To exclude the structures that are far from this dependence, we excluded 1435 structures whose gyration radius lies outside the region $0.18 \times N^{0.38} < R_g < 0.3 \times N^{0.38}$. The final protein set for analysis contained 8022 structures. Figure 2 presents the distribution of the proteins in the set by the number of residues in the structure. The list of PDB codes of the final protein set is available as Supporting Information to this article.

Figure 1. Each data point represents a single protein structure according to its gyration radius and number of residues in the logarithmic scale. The dashed line is the approximation of the lower border of the “cloud”. Protein structures corresponding to points outside the region outlined by solid lines were excluded from further investigation (see comments in text).

Figure 2. Distribution of studied protein structures by the number of residues. The distribution was smoothed using adjacent averaging with a window of 10 residues.

Solvent Accessible Surface Area Calculations. The solvent accessible surface area was computed using the NACCESS program,32 which implements the algorithm of Lee and Richards,33 whereby a probe of given radius is rolled around the surface of the molecule, and the path traced out by its center is the accessible surface. A probe radius of 1.4 Å was used, which corresponds to the size of a water molecule. Individual atom accessibilities were calculated and then summed to obtain the solvent accessible surface area (SASA) of each amino acid residue and its side chain in every protein structure. α-Carbon atoms were considered to be part of the side chain. Relative solvent accessibilities for residues and side chains were calculated with respect to their accessibilities in an extended Ala-X-Ala tripeptide taken from ref 34.

Apparent Transfer Free Energies Calculation. In experiment, the transfer free energy of a molecule between two immiscible phases (1 and 2) is calculated as

$$F_{\mathrm{transfer}} = -kT \ln\frac{c_1}{c_2} \qquad (4)$$

Analogously, apparent transfer free energies of amino acid residues between two apparent phases can be defined. An apparent phase is defined as a set of residues that conform to some criterion (e.g., their SASA is 0). Given a criterion for attributing residues to an apparent phase, one can introduce the concentration of each type of residue by the equation

$$c_k = \frac{N_k}{\sum_i N_i} \qquad (5)$$

where $N_i$ is the number of residues of type i in the phase. Apparent transfer free energies can then be calculated from relation 4, using eq 5 to calculate the concentrations. In this study, the criteria for the definition of apparent phases were based on the relative solvent accessible surface of residue side chains.
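The two numerical criteria in Methods, the globularity window built around eq 3 and the apparent transfer free energy of eqs 4 and 5, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names, the toy residue counts, the use of the molar gas constant in kcal/(mol K), the temperature of 300 K, and the assignment of phase 1 to buried and phase 2 to exposed residues are all our own choices for the example.

```python
import math

# Assumptions for illustration only: energies per mole in kcal/mol, T = 300 K.
GAS_CONSTANT = 0.0019872  # kcal/(mol K), plays the role of k in eq 4 per mole


def is_globular(rg_nm, n_residues, low=0.18, high=0.3, exponent=0.38):
    """True if low * N^0.38 < Rg < high * N^0.38 (Rg in nanometers)."""
    scale = n_residues ** exponent
    return low * scale < rg_nm < high * scale


def concentrations(counts):
    """Eq 5: fraction of each residue type among all residues of a phase."""
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()}


def apparent_transfer_free_energy(counts_phase1, counts_phase2, temperature=300.0):
    """Eq 4 with eq 5 concentrations: F = -kT ln(c1 / c2) per residue type."""
    c1 = concentrations(counts_phase1)
    c2 = concentrations(counts_phase2)
    return {
        res: -GAS_CONSTANT * temperature * math.log(c1[res] / c2[res])
        for res in c1
        if res in c2 and c1[res] > 0 and c2[res] > 0
    }


# Hypothetical toy counts (not data from the paper).
buried = {"LEU": 30, "LYS": 10}   # phase 1 in this example
exposed = {"LEU": 10, "LYS": 30}  # phase 2 in this example
energies = apparent_transfer_free_energy(buried, exposed)
for residue, value in energies.items():
    print(f"{residue}: {value:+.3f} kcal/mol")
```

With this choice of phases, a negative value marks a residue type that is enriched in the buried phase relative to the exposed one; swapping the two arguments flips the sign.
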

Results and Discussion

Size Independent Protein Statistics. Calculating average properties of amino acid residues in an ensemble of protein structures may seem ambiguous because the conformational statistics of each protein or group of proteins may depend on many factors, making these properties different in various subsets of our protein set, as noted by Thomas and Dill.21 However, our aim here is to trace common regularities in the exposure and burial of various types of residues in water-soluble proteins. As long as we strive to study the general influence of the water environment on protein conformational statistics, our set is appropriate except for one problem: the conformational statistics of proteins, and the extent to which solvent-protein interactions influence them, may still depend on the size of the protein. Indeed, while large protein globules may have relatively big hydrophobic cores that contribute significantly to the stability of the structure, small globules have to expose the majority of their residues to the solvent and thus have to rely on other types of interactions that contribute to their stability. This is clearly illustrated by the occurrence of cysteine residues in proteins of various sizes (see Figure 3). The dramatic increase of cysteine occurrence in proteins with fewer than 100 residues highlights the importance of disulfide bonds for the stability of smaller proteins. Analogous phenomena may be present for other residues. Thus, gathering statistics on the exposure of residues over the whole set of protein structures would yield doubtful results, and we have to confine our study to some range of protein sizes. To clarify these issues, we have studied the dependence of the occurrence of various amino acids in proteins on the number of residues in the structure as well as


Figure 3. Total occurrence as well as occurrence in interior and at the surface (see text) are plotted. Horizontal lines in each case represent the occurrence averaged over all protein structures with number of residues more than 300. Each data point represents the range 40 residues wide. Standard deviations are depicted with vertical lines.

the composition of the protein interior and surface. Figure 3 presents the selected plots for analysis. Residues were considered to be in the interior if their relative SASA was less than 5% and in the surface phase otherwise. A quick analysis reveals that the overall occurrence as well as the surface and interior concentrations may vary significantly with protein size up to some threshold size, after which they become more or less constant within the statistical error. This threshold size varies between 100 and 300 residues for different amino acids. Selected examples of such dependence are presented in Figure 3. The analysis of such dependencies for all the residues allows us to roughly summarize their behavior, based on the total occurrence of residues, as follows: (1) Small proteins include fewer hydrophobic residues (ALA, ILE, LEU, VAL, PHE); this is quite natural because very small globules (fewer than 100 residues) have difficulty isolating hydrophobic side chains from water to avoid a positive contribution to the overall free energy of the system. (2) Small proteins also include fewer negatively charged residues (ASP, GLU) and slightly more positively charged residues (ARG, LYS). Because, in principle, positive and negative charges are physically equivalent, this difference can be regarded as an evolutionary bias related to the function and interactions of such small proteins. (3) Small proteins have a lot of cysteine (CYS), apparently to stabilize their structure by disulfide bonding when hydrophobic interactions still make only a weak contribution. (4) The abundance of glycine (GLY) may be attributed to the need for enhanced flexibility in small proteins. The corresponding dip at the size of 100 residues in the plot of interior occurrence supports the proposition that glycine is primarily involved in turns at this size. (5) Serine has a special type of dependency: it has a pronounced occurrence maximum at around 100 residues, with side chains primarily in the exposed state. (6) Other residues (HIS, ASN, TRP, MET, PRO, THR, TYR) have constant or slightly varying (in the region of smaller proteins) dependencies. The above analysis dealt only with the overall occurrence of residues, leaving aside the changes in their concentration in the interior and at the surface, which may be significant even if the total occurrence remains independent of the number of residues. However, it can be shown that these characteristics also become independent of protein size in the region where the number of residues is greater than 300. The main outcome of this analysis is that hereafter we confine our statistical analysis to proteins that have more than 300 residues, to avoid “finite size” effects. It is known that large proteins usually have a domain structure, that is, they consist of several subglobules, which usually have no more than 250-300 residues.35 Thus, the value of 300 residues can also be regarded as the threshold above which we mainly deal with multidomain proteins and subglobules of limiting size. Figure 4 presents the dependence of the total protein SASA on the number of residues. This question was discussed earlier by Miller et al.,11 who showed that the accessible surface area is a simple power function of the molecular weight for monomeric proteins

$$A_S \propto M^{k} \qquad (6)$$

Using a least-squares fit for a set of 46 proteins, they obtained k = 0.73, which is close to the case of solid bodies of similar shape, where the surface area is proportional to the two-thirds power of the volume. We have extended this analysis to more than eight thousand proteins and consequently to proteins of much higher molecular mass. As seen from the log-log plot in Figure 4, the “cloud” of data points is almost linear; however, its slope is somewhat different for smaller and larger proteins. The slight change in slope occurs at a number of residues (N) of about 200. Linear fits in the log-log scale for the regions N < 200 and N > 200 yield slopes of 0.74 and 0.86, respectively. These linear fits correspond to dependencies of the form

$$A_S = b \times N^{k} \qquad (7)$$

where N is the number of residues and b and k are fitting coefficients, with k being the slope of the linear fit in the log-log plot. Assuming that molecular mass is on average linearly related to the number of residues, for the region N < 200 our value of k = 0.74 agrees well with ref 11. For the region N > 200 the exponent deviates further from two-thirds and is closer to unity. This reveals that for larger proteins the “regime” of surface conformational statistics changes: their surface becomes more developed, which can be explained by the domain structure of larger proteins.

Distribution of Amino Acid Residues by Their Exposure. Now we can proceed to the main goal of this work: the detailed study of the exposure of various residue types and their side chains in our protein set. For each type of residue, we have to assess its preference for a given degree of exposure and to visualize and analyze its tendency to have different levels of exposure. We could build simple distribution histograms of residue accessibility (see Figure 5). These histograms show the preference of selected residues for a given level of relative exposure. However, such histograms are difficult to interpret and compare. The relation between the physical properties of residue side chains and the shape of the histogram is not obvious, because the histograms for all residues tend to have a similar decaying shape: the number of residues with high exposure decays in an exponential-like manner for any residue type. A good example is the histogram for asparagine (see Figure 5); although asparagine is strongly hydrophilic, its histogram decays fast at high levels of exposure, making it hard to draw any conclusions about its hydrophilicity by visual examination of the plot.
A more informative plot shows, as a function of relative solvent accessibility, the fraction of a specific residue type among all residues with the given relative accessibility. This fraction can be seen as the concentration of residues in the apparent phase consisting of all residues with the given relative accessibility. For each value of relative SASA (x%), residues having their accessibility in the range x ± 1% were considered to belong to one “apparent phase”, and the concentration of each type was calculated using eq 5. The resulting plots are presented in Figure 6. These plots outline the relative behavior of residues and are thus easier to interpret. Each plot has an additional data point at zero relative accessibility representing the concentration among all residues that are completely buried inside the protein.
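The windowed concentration described above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function names and the toy residue list are hypothetical, the window is assumed to include both of its endpoints, and the separate treatment of exactly zero accessibility follows the extra data point mentioned in the text.

```python
from collections import Counter


def concentration_profile(residues, centers, half_width=1.0):
    """Eq 5 evaluated inside each window center +/- half_width (in percent).

    residues: list of (residue_type, relative_sasa_percent) pairs.
    Returns {center: {residue_type: fraction}} for non-empty windows.
    """
    profile = {}
    for center in centers:
        in_phase = Counter(
            name for name, rel in residues
            if center - half_width <= rel <= center + half_width
        )
        total = sum(in_phase.values())
        if total:
            profile[center] = {name: n / total for name, n in in_phase.items()}
    return profile


def buried_phase(residues):
    """Composition of the finite phase with exactly zero relative accessibility."""
    in_phase = Counter(name for name, rel in residues if rel == 0.0)
    total = sum(in_phase.values())
    return {name: n / total for name, n in in_phase.items()} if total else {}


# Hypothetical toy data (not taken from the paper).
toy_residues = [
    ("ALA", 0.0), ("ALA", 0.0), ("GLY", 0.0), ("LEU", 0.5),
    ("ASN", 49.5), ("ASN", 50.2), ("LYS", 50.9), ("LEU", 80.0),
]
profile = concentration_profile(toy_residues, centers=[50.0])
core = buried_phase(toy_residues)
print(profile[50.0])
print(core)
```
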

Figure 4. Each data point represents a single protein structure according to its total SASA and number of residues in the logarithmic scale. The solid and dashed lines are the linear approximations in the log-log scale for the regions N < 200 and N > 200 with slopes 0.74 and 0.86, respectively (linear regression coefficients R are 0.96 and 0.98, respectively). They correspond to the resulting dependencies $A_S = 2.20 \times N^{0.74}$ and $A_S = 1.07 \times N^{0.86}$.
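The log-log fit behind eq 7 amounts to ordinary least squares on ln N and ln A_S. The sketch below shows the idea in plain Python, under the assumption that a simple unweighted regression is used; the function name `fit_power_law` is our own, and the input is synthetic, noise-free data generated from the two fits quoted in Figure 4, not the real protein set.

```python
import math


def fit_power_law(n_values, area_values):
    """Least-squares line in log-log space; returns (b, k) of A_S = b * N**k."""
    xs = [math.log(n) for n in n_values]
    ys = [math.log(a) for a in area_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                      # slope in the log-log plot
    b = math.exp(mean_y - k * mean_x)  # prefactor from the intercept
    return b, k


# Synthetic check: regenerate points from the quoted fits and recover them.
n_small = list(range(50, 200, 10))     # region N < 200
n_large = list(range(200, 2000, 100))  # region N >= 200 in this toy example
b_small, k_small = fit_power_law(n_small, [2.20 * n ** 0.74 for n in n_small])
b_large, k_large = fit_power_law(n_large, [1.07 * n ** 0.86 for n in n_large])
print(f"N < 200: A_S = {b_small:.2f} * N^{k_small:.2f}")
print(f"N > 200: A_S = {b_large:.2f} * N^{k_large:.2f}")
```

On real data the points scatter around the line, so the recovered coefficients would come with a regression coefficient below 1, as reported in the caption.
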

Figure 5. Example of simple distribution histogram of asparagine residue occurrence in globular proteins with respect to its relative accessible surface area.

Many plots thus have a pronounced discontinuity at zero accessibility, meaning that the composition of the shielded core of the protein, taken as a whole apparent phase, differs from the composition of the phase that includes residues that are almost completely shielded but are still positioned near the surface (relative accessibility less than 1% but more than 0%). It should be noted that, while the value of the plot at each accessibility in Figure 6 represents the concentration of residues in an infinitely small hypothetical phase, which includes residues with accessibility in an infinitely small range around the given value, the data points at zero accessibility are statistics over a phase of finite size, that is, the phase of all residues inaccessible to solvent. These discontinuity gaps mean that the shielded core of the protein globule should have its own fine internal structure in terms of burying residues even deeper from the surface. Such internal structure is overlooked when residues are classified only by their relative accessibility, because this parameter cannot serve as a continuous measure of a residue's preference to be in the very core of the structure when the residue is not exposed to the solvent at all. The direction of the change at this gap correlates well with the size of the residue side chain: small residues (GLY, ALA, CYS) have increased concentration among


Figure 6. Dependencies of residue concentration in the “apparent phase” on the relative accessibility. The residue concentration is determined as the fraction of a residue type among all residues with the given relative accessibility to the solvent. The line was generated by approximating the dependency using a natural smoothing spline. The data point at zero relative accessibility represents the data for the residues that are completely shielded from the solvent.

100% inaccessible residues, while residues with larger side chains (PHE, TRP, TYR, GLN, LYS, ASN, HIS, GLU, ASP, ARG) do the opposite. This suggests that purely steric effects can partially account for the fact that small residues are accommodated in the core more easily and in greater quantity than bigger ones. Let us now analyze the dependencies presented in Figure 6 qualitatively, grouping them by similarity. Residues with large hydrophobic side chains (ILE, LEU, VAL, PHE, MET, CYS) have concentrations that decay monotonically, in an exponential-like manner, with increasing exposure. This is quite natural because their exposure is energetically unfavorable. The alanine residue (ALA) has a slightly different dependency; it also has an exponential-like decaying part, but only up to an accessibility of 50%, which is then followed by a region of increased concentration. The reason for this difference between the first group and alanine may be that alanine has a rather small side chain (only a methyl group), so it is easier for it (in terms of sterically allowed conformations and energetic favorability) to adopt more distorted conformations with higher levels of exposure. Another peculiarity of alanine is the relatively big discontinuity gap at 0 accessibility (16% occurrence in the core of proteins vs 10% among residues with 1% accessibility); this may also be attributed to the fact that alanine is a rather small amino acid and consequently it is sterically easier for it to be completely shielded from solvent than for bigger residues. The plot for alanine can be compared with that for glycine, the smallest of all residues (its side chain is just a hydrogen atom). For glycine we see an even bigger discontinuity gap (12 vs 5%). Glycine has extreme conformational flexibility due to the absence of a side chain and hence is mainly found in unstructured regions of proteins (the inclusion of glycine in secondary structure is possible but is entropically unfavorable); this leads to an abundance of glycine among the residues with high accessibility, even at values above 100%. The concentration of glycine in the region between 0 and 60% remains stable compared to that of alanine. It may therefore be thought that the exponential-like decay in this region of the alanine plot is connected with the hydrophobic unfavorability of methyl group exposure. The proline residue (PRO) also has a hydrophobic side chain; however, it is the only residue whose side chain is connected twice to the backbone, which restricts its conformational flexibility. Due to these restrictions and the inability to form a second hydrogen bond, proline cannot form proper α-helices or β-sheets; it is mainly found in turns of the protein structure and thus has higher exposure. This fact dramatically changes the concentration versus accessibility plot; there is no exponential-like decay (although the side chain is as hydrophobic as, e.g., that of valine), but rather a steady increase of concentration among residues with higher accessibility. The divergence of the plots based on whole-residue and side-chain accessibilities is again due to the fact that proline is conformationally hindered and cannot adopt distorted conformations with high exposure of side chain atoms. Let us now consider residues with hydrophilic and charged side chains. It is natural to expect that these will have elevated concentrations at high exposure values; however, the affinity of the side chain for water seems not to be the only principle governing their exposure rates. Again, important factors are the preference of residues for particular positions in the protein structure and the composition of the side chain. Residues like ASN, ASP, and SER show an almost monotonic increase of concentration with accessibility. All these residues have relatively small hydrophilic side chains that can form hydrogen bonds. Due to the competition with hydrogen bonding of backbone

Table 1. List of 19 Amino Acids (Except Proline), Their Abbreviations, and R-H Analogs

| abbr. | one-letter code | name | R-H analog (a) |
|---|---|---|---|
| GLY | G | glycine | hydrogen |
| LEU | L | leucine | isobutane |
| ILE | I | isoleucine | n-butane |
| VAL | V | valine | propane |
| ALA | A | alanine | methane |
| PHE | F | phenylalanine | toluene |
| CYS | C | cysteine | methanethiol |
| MET | M | methionine | methylethylsulfide |
| THR | T | threonine | ethanol |
| SER | S | serine | methanol |
| TRP | W | tryptophan | 3-methylindole |
| TYR | Y | tyrosine | 4-methylphenol |
| GLN | Q | glutamine | propionamide |
| LYS | K | lysine | n-butylamine |
| ASN | N | asparagine | acetamide |
| GLU | E | glutamate | propionic acid |
| HIS | H | histidine | 4-methylimidazole |
| ASP | D | aspartate | acetic acid |
| ARG | R | arginine | N-propylguanidine |

a Name of the R-H analog of the amino acid residue, where R- is the side chain of the amino acid.

atoms, these residues prefer to avoid α-helices and β-sheets and reside mainly in loops or near the edges of helices; this in turn gives rise to a high concentration of these residues among those with high and very high accessibility. In contrast, other residues with hydrophilic side chains, namely, ARG, GLN, GLU, LYS, and HIS, have a hump-shaped dependency, although the hydration energies (see Table 2) of some of these residues may considerably exceed those of the previous group. The reason for these hump-shaped plots is that the side chains of these residues also include a considerable portion of nonpolar atoms and are amphipathic by nature; therefore, these residues have to find a balance between exposing their polar groups and burying their hydrophobic groups, and thus they are unlikely to have very high exposures. This also depends on the proportion between polar and nonpolar groups; for example, LYS and ARG have a much more pronounced hump-shaped dependency than GLU and GLN, which have half as many carbons in their side chains. The remaining residues are TRP, TYR, and THR. Side chains of TRP and TYR are mainly hydrophobic with the inclusion of one polar atom; these residues are often found in β-sheets, where they can easily accommodate their bulky side chains. Concentration versus accessibility plots for these residues are a combination of a hump-shaped plot in the region of low accessibilities with an exponential-like decay at higher ones, and bear the characteristics of the plots of both hydrophilic and hydrophobic residues. This is quite understandable: these residues try to expose the small portion of their surface that is represented by polar atoms while keeping the hydrophobic surface shielded, and further exposure is energetically unfavorable due to hydrophobic hydration of the side chain. 
The dependency in Figure for threonine (THR) is a bit confusing because it does not clearly match any of the previously discussed groups; however, it can be regarded as a transformation of the dependency for serine under the influence of one more carbon atom in the side chain. The dependency remains almost constant up to an accessibility of 60% and then goes down a bit due to the unfavorability of carbon atom exposure. We now have some understanding, at least at a qualitative level, of how the residues are distributed in protein structure according to their accessibility, what their preferences are, and which qualitative factors influence their behavior. An


Table 2. Experimental Transfer Free Energies of Amino Acid Side Chains or Analogous Molecules from Water to a Less Polar Phase at pH 7 (Expressed in kcal/mol)

| amino acid | type (a) | V > W (b) | CH > W (c) | O > W (d) | O > Wocc (e) | CH > O (f) |
|---|---|---|---|---|---|---|
| GLY | N | 2.39 | 0.94 | 0 | 0 (g) | 0 |
| LEU | N | 2.28 | 4.92 | 2.30 | 2.40 | 1.68 |
| ILE | N | 2.15 | 4.92 | 2.46 | 2.27 | 1.52 |
| VAL | N | 1.99 | 4.04 | 1.66 | 1.61 | 1.44 |
| ALA | N | 1.94 | 1.81 | 0.42 | 0.65 | 0.45 |
| PHE | A | -0.76 | 2.98 | 2.44 | 2.86 | -0.40 |
| CYS | P | -1.24 | 1.28 | 1.39 (2.10) | 1.17 | -1.05 |
| MET | N | -1.48 | 2.35 | 1.68 | 1.82 | -0.27 |
| THR | P | -4.88 | -2.57 | 0.35 | 0.90 | -3.86 |
| SER | P | -5.06 | -3.40 | -0.05 | 0.69 | -4.29 |
| TRP | A | -5.88 | 2.33 | 3.07 | 3.24 | -1.68 |
| TYR | A | -6.11 | -0.14 | 1.31 | 1.86 | -2.39 |
| GLN | P | -9.38 | -5.54 | -0.30 | 0.38 | -6.18 |
| LYS | + | -9.52 | -5.55 | -1.35 | -1.65 | -5.14 |
| ASN | P | -9.68 | -6.64 | -0.79 | 0.30 | -6.79 |
| GLU | - | -10.24 | -6.81 | -2.35 (-0.87) | -2.48 | -5.40 |
| HIS | P/+ (h) | -10.27 | -4.66 | 0.18 | -1.18 (1.04) | -5.78 |
| ASP | - | -10.95 | -8.72 | -2.46 (-1.05) | -2.49 | -7.20 |
| ARG | + | -19.92 | -14.92 | -1.37 | -0.66 | -14.49 |

a Type of residue side chain according to Lehninger.36 N, nonpolar, aliphatic; P, polar, uncharged; A, aromatic; “+”, positively charged; “-”, negatively charged.
b Vapor/water transfer free energies of R-H compounds taken from Wolfenden et al.8
c Cyclohexane/water transfer free energies taken from Wolfenden.13
d Octanol/water transfer free energies for amino acid side chains relative to glycine measured in acetyl amino acid amides (Ac-X-amides). A scale of Fauchere and Pliska24 modified by Wimley et al.25 Original values of Fauchere et al. are given in parentheses.
e A semiexperimental scale of octanol-to-water transfer free energies proposed by Wimley et al.25 Transfer free energies of side chains were measured in a series of host-guest pentapeptides (AcWL-X-LL) and then corrected for the X-dependent changes in the nonpolar surface of the host peptide. The authors present this scale as the best estimate of solvation of residues occluded by neighboring residues of moderate size.
f Cyclohexane to octanol transfer scale derived from the cyclohexane/water and octanol/water scales.
g The reference point is “virtual glycine”; for more information see the reference by Wimley et al.25
h The pK of histidine is very close to 7, so it can be in both ionized and nonionized forms. Vapor/water and cyclohexane/water transfer energies include the correction for ionization (see Radzicka and Wolfenden).13 The scale of Fauchere and Pliska uses data for the nonionized form, while the data from Wimley et al. (O > Wocc) are presented for the ionized form of histidine with the value for the nonionized form in parentheses.
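The CH > O column of Table 2 can be reproduced from the CH > W and O > W columns. The sketch below is an illustration rather than the authors' stated procedure: it assumes that the cyclohexane/water values are first re-referenced to glycine (the octanol/water scale is already relative to glycine) and that the octanol/water value is then subtracted. The names `CH_W`, `O_W`, and `cyclohexane_to_octanol` are hypothetical.

```python
# Transfer free energies copied from Table 2 (kcal/mol), in the table's row order.
RESIDUES = ["GLY", "LEU", "ILE", "VAL", "ALA", "PHE", "CYS", "MET", "THR", "SER",
            "TRP", "TYR", "GLN", "LYS", "ASN", "GLU", "HIS", "ASP", "ARG"]
CH_W = [0.94, 4.92, 4.92, 4.04, 1.81, 2.98, 1.28, 2.35, -2.57, -3.40,
        2.33, -0.14, -5.54, -5.55, -6.64, -6.81, -4.66, -8.72, -14.92]
O_W = [0.0, 2.30, 2.46, 1.66, 0.42, 2.44, 1.39, 1.68, 0.35, -0.05,
       3.07, 1.31, -0.30, -1.35, -0.79, -2.35, 0.18, -2.46, -1.37]


def cyclohexane_to_octanol(ch_w, o_w, ref_index=0):
    """Assumed rule: (CH > W relative to the reference residue) minus (O > W).

    ref_index points at glycine, the zero of the octanol/water scale.
    """
    ref = ch_w[ref_index]
    return [round((c - ref) - o, 2) for c, o in zip(ch_w, o_w)]


CH_O = dict(zip(RESIDUES, cyclohexane_to_octanol(CH_W, O_W)))
for name in ("LEU", "PHE", "ARG"):
    print(name, CH_O[name])
```

Under this assumption the computed values agree with the tabulated CH > O column, which supports reading that column as a thermodynamic-cycle difference of the two measured scales.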

important point here is that the concentration of a given type of residue among residues with a given accessibility is determined not only by the affinity of the side chain to solvent water but also by the size of the side chain, the conformational flexibility of the residue in structure formation, and the atomic constitution of the side chain. Keeping this in mind, we can proceed to compare the quantitative characteristics that we can obtain from these distributions with physical and chemical properties of amino acid side chains.

Correlation of Statistical Data with Experimental Transfer Free Energies. Let us now study the question stated at the beginning of the article, namely, the correlation between experimental transfer free energies of amino acid side chains from water to a less polar phase and statistics on the exposure of amino acids in globular proteins. For this comparison we have chosen three free energy scales: the water/vapor, water/cyclohexane, and water/octanol transfer scales. There are several reasons for this choice: (1) these scales represent basic and simple solvent-solute interactions that are easy to interpret (at least when compared to any other scale), namely hydration, dispersion interaction (in the case of solvation in cyclohexane), and the possibility of amphipathic accommodation with hydrogen bond formation (in the case of solvation in octanol); (2) there are abundant data available on the verification, accuracy, discussion, and comparison of these scales; (3) these scales are extensively


used in practical applications. Although these scales have been extensively utilized and compared in many works (see Introduction), a short summarizing discussion of their applicability and validity is appropriate in the context of this work.

Experimental Scales. The three experimental transfer free energy scales used for comparison are presented in Table 2, while amino acid abbreviations and the corresponding R-H compounds are presented in Table 1. The proline residue is not included further in our consideration because it is quite different from all other residues and does not have a side chain analog. The V > W column of Table 2 presents the data for the transfer of R-H analogs of amino acid side chains from the vapor to the water phase. R-H analogs are simple substances representing the side chains of residues with a hydrogen atom bonded in place of the Cα atom. The data were partially measured by Wolfenden et al.,8 partially collected from other sources, and partially derived indirectly from other measurements. The main points worth noting in the scale are the following: all aliphatic compounds (GLY′,37 LEU′, ILE′, VAL′, ALA′) have positive transfer free energies, while the inclusion of a polar atom (sulfur in MET′ and CYS′) or an aromatic ring (PHE′, TRP′, TYR′) already shifts the equilibrium toward a preference for the compound to reside in water. The compound with the most negative energy of transfer is ARG′, which exceeds all other residues in its preference to reside in water by 9 kcal/mol, an extremely large number when expressed in terms of concentration equilibrium (under the same conditions the concentration of ARG′ in water would exceed that of ASP′ by a factor of about 3 × 10^6). The next scale is the water/cyclohexane transfer scale for R-H analogs measured by Radzicka et al.13 using UV absorption and proton magnetic resonance (the CH > W column of Table 2). 
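As an aside on the ARG′/ASP′ comparison made earlier in this paragraph, the order of magnitude of the quoted concentration ratio follows from a Boltzmann factor applied to the difference of the two vapor-to-water transfer free energies. The sketch below is only an illustration: the temperature of 298.15 K is an assumption (the text does not state one), and the function name `concentration_ratio` is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # molar gas constant in kcal/(mol K)


def concentration_ratio(dg_a, dg_b, temperature=298.15):
    """Ratio of the water/vapor partition coefficients of compounds A and B.

    dg_a, dg_b: vapor-to-water transfer free energies in kcal/mol.
    temperature: kelvin; 298.15 K is an assumed value, not taken from the text.
    """
    return math.exp(-(dg_a - dg_b) / (R_KCAL * temperature))


# V > W values for ARG' and ASP' from Table 2 (kcal/mol).
ratio = concentration_ratio(-19.92, -10.95)
print(f"ARG'/ASP' aqueous preference ratio: {ratio:.1e}")
```

With these inputs the ratio comes out at a few million, the same order of magnitude as the figure quoted in the text; the exact prefactor depends on the temperature assumed.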
The measurements were performed for the nonionized compounds (suppressing ionization by adding KOH or HCl), and the obtained values were then corrected for the ionization occurring at pH 7 (assuming that only the nonionized part of the solute entered the organic phase). The scale was advocated by Wolfenden et al.16 because cyclohexane represents a simple apolar environment that accounts for the dispersion interaction of solute atoms with the solvent and does not have any special unpredictable interactions with solutes. Because free energy is a function of state, the values of water/cyclohexane transfer may be regarded as the hydration potential of molecules corrected for the transfer from vapor to cyclohexane. Wolfenden et al.13 showed that the energy differences between water/vapor and water/cyclohexane transfer correlate well with the total accessible surface of the molecules and thus are mainly due to dispersion interactions with the solvent. Transfer from vapor to cyclohexane is unfavorable only for the small molecules GLY′ and ALA′, because the energy of cavity formation in cyclohexane is not compensated by the gain in attraction energy. This additional attraction of solutes to the cyclohexane phase (growing with the size of the molecule) leads to the fact that many molecules that have a negative hydration potential do have positive cyclohexane-to-water transfer energies (PHE′, CYS′, MET′, TRP′).

The last scale for consideration is the 1-octanol to water transfer free energy scale. This is probably the most widely used one in medicinal chemistry but also the most controversial compared to the previous ones. The most frequently utilized scale is that of Fauchere et al.,24 calculated from the transfer of Ac-X-amides; however, some papers acknowledged that the values for several residues were ambiguous, and the scale modified by Wimley et al.25 to correct these inconsistencies is presented in Table 2. It should be noted that, unlike cyclohexane, 1-octanol is a rather polar solvent

Table 3. Correlation Coefficients of Experimental Scales Presented in Table 2

| | V > W | CH > W | O > W | O > Wocc | CH > O |
|---|---|---|---|---|---|
| V > W | 1 | | | | |
| CH > W | 0.94 | 1 | | | |
| O > W | 0.68 | 0.86 | 1 | | |
| O > Wocc | 0.62 | 0.77 | 0.64 | 1 | |
| CH > O | 0.97 | 0.98 | 0.74 | 0.64 | 1 |
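The entries of Table 3 are ordinary linear correlation coefficients between pairs of columns of Table 2. As a minimal sketch, assuming the Pearson coefficient is the measure used (the text does not name it explicitly), the V > W versus CH > W entry can be recomputed as follows; the helper name `pearson` is hypothetical.

```python
import math

# V > W and CH > W columns of Table 2 (kcal/mol), in the table's row order.
V_W = [2.39, 2.28, 2.15, 1.99, 1.94, -0.76, -1.24, -1.48, -4.88, -5.06,
       -5.88, -6.11, -9.38, -9.52, -9.68, -10.24, -10.27, -10.95, -19.92]
CH_W = [0.94, 4.92, 4.92, 4.04, 1.81, 2.98, 1.28, 2.35, -2.57, -3.40,
        2.33, -0.14, -5.54, -5.55, -6.64, -6.81, -4.66, -8.72, -14.92]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson(V_W, CH_W)
print(f"r(V > W, CH > W) = {r:.2f}")
```

The result can be compared with the corresponding entry of Table 3.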

and an effective H-bond donor; moreover, octanol is not completely immiscible with water: at saturation (which is the case for the wet octanol used in the experiments), it contains about 2.3 M of dissolved water. In principle, having a mixture of polar and apolar atoms, octanol may be a phase more similar to the heterogeneous environment of the protein core than cyclohexane; however, several considerations show that wet octanol may have special selective interactions with some solute molecules, which have to be taken into account when interpreting the results. As pointed out by Wolfenden,16 one consideration is the possible “dragging” of water molecules by the solute into octanol in the experimental setup; another example is the specific attraction between indole derivatives and alcohols, which makes TRP′ exhibit an extreme preference for octanol. To dwell further on the difference between cyclohexane and 1-octanol used as reference phases, we have constructed a cyclohexane to 1-octanol transfer scale, which is presented in the last column of Table 2. The scale clearly shows that all side chains that contain polar atoms, are charged, or have aromatic groups prefer octanol to cyclohexane. Moreover, this scale resembles that of cyclohexane/water transfer, meaning that octanol, by its solvation properties, is closer to water than to cyclohexane. This is also justified if we look at the correlation matrix of these scales (see Table 3): the octanol/cyclohexane partitioning scale is in excellent correlation with the cyclohexane/water and water/vapor scales, while the water/vapor scale is in good correlation with the water/cyclohexane scale.

Apparent Transfer Free Energies. The calculation of apparent transfer free energies deserves a separate discussion. To assess these quantities, a binary classification is usually introduced (residues are classified as buried or exposed), and eqs 4 and 5 are used to evaluate the values. 
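To make the binary-classification idea concrete, the following sketch computes an apparent transfer free energy per residue type from counts of buried and exposed residues. Eqs 4 and 5 are not reproduced in this section, so the formula used here is an assumption: the common Boltzmann-like form, minus the logarithm of the ratio of a residue's fraction in the exposed class to its fraction in the buried class, expressed in units of RT. The function name and the counts are hypothetical and serve only as an illustration.

```python
import math


def apparent_transfer_energy(n_buried, n_exposed):
    """Return {residue: -ln(f_exposed / f_buried)} in units of RT.

    Assumed form, not necessarily identical to eqs 4 and 5 of the paper.
    f_* is the fraction of the residue type within the given class.
    Positive values mean the residue type is enriched in the buried class.
    """
    total_buried = sum(n_buried.values())
    total_exposed = sum(n_exposed.values())
    energies = {}
    for res in n_buried:
        f_buried = n_buried[res] / total_buried
        f_exposed = n_exposed[res] / total_exposed
        energies[res] = -math.log(f_exposed / f_buried)
    return energies


# Made-up counts, for illustration only (not data from the paper).
buried = {"LEU": 900, "SER": 300, "LYS": 100}
exposed = {"LEU": 200, "SER": 300, "LYS": 800}

for res, value in apparent_transfer_energy(buried, exposed).items():
    print(f"{res}: {value:+.2f} RT")
```

With this sign convention a hydrophobic residue that is depleted at the surface gets a positive energy, which matches the sign pattern of the statistical scales discussed later in the text.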
Strictly speaking, a rigorous derivation of the presented equations from the ideas of Finkelstein et al. (eqs 1 and 2) and the exact relationship between ∆Fselect, ∆E, and Ftransfer (eq 4) have not been found in the literature and deserve a separate investigation. However, the suggested approach, in which Ftransfer is taken as an estimate of ∆Fselect and ∆E, is rather natural and generally accepted. To study the correlation between the set of apparent transfer free energies and the experimental ones, we do need to introduce some kind of binary classification. The question of how to classify residues as exposed to the solvent or buried may be one of the most intricate ones. In previous works the authors have suggested several methods: Chothia et al., Janin et al., and Miller et al.6,11,14 suggested classifying residues as internal if their relative accessibility was less than 5% and as surface residues otherwise. Wertz et al.7 used an original seven-step scoring procedure to determine whether each atom was considered exposed or buried; Prabhakaran and Ponnuswamy38 approximated proteins as ellipsoids, divided them into layers, and calculated the occurrence of residues in these layers. Guy,9 based on the data of Prabhakaran and Ponnuswamy, suggested a layer analysis in which he tried to determine the free energy profiles of amino acid transfer through the layers. The analogs of such free energy profiles using as


Figure 7. Correlation of different sets of residue apparent transfer free energies with experimental sets as a function of classification criterion X. The apparent free energies were determined for the transfer of residues from the virtual phase with (a) Racc = 0 and (b) 0 < Racc < 1 to the phase with X < Racc < X + 1, where Racc is the relative accessibility of the residue.

the varying parameter relative accessibility in our study are the concentration plots of Figure . In this paper we study different criteria for binary classification based on relative accessibilities of residue side chains and compare the correlations with experimental data. To rank the possible binary classification criteria, we have computed a series of apparent transfer free energy sets with different classification criteria and plotted the correlation with experimental data sets as a function of a classification parameter. In the first case (Figure 7a), the transfer between the core of the protein (Racc = 0, where Racc is the relative accessibility of the side chain of the residue expressed in percent) and residues with exposure in the range X < Racc < X + 1, characterized by the accessibility parameter X, was studied. According to Figure , the concentration of residues in the core differs from the concentration in a virtual phase with very low accessibility (0 < Racc < 1). Because this gap may be caused by the specificity of interactions in the core apart from the interactions with the solvent, we have also performed the correlation analysis of apparent transfer free energies between the virtual phase of residues with 0 < Racc < 1 and various phases with X < Racc < X + 1 (Figure 7b). Figure 7 allows only a rough analysis of the changes in the apparent transfer free energy sets upon the change of the target transfer phase, characterized by the accessibility parameter X. However, it enables us to find the regions of best correlation for the various experimental data sets. It is seen from Figure 7a that the correlation curve for the water/vapor scale starts at a correlation coefficient of around 0.7, then reaches its maximum at X = 25% (Rmax = 0.93), and gradually decreases at large values of the accessibility parameter. 
The behavior of the water/octanol scale is different: the correlation at low values of the accessibility criterion X is very poor; it then increases monotonically and reaches its maximum at the right end of the plot (Rmax = 0.95). The water/cyclohexane correlation curve exhibits its best correlation in the midrange (Rmax = 0.906 at X = 55%). In other words, with increasing accessibility parameter the apparent transfer energies first correlate better with the water/vapor scale, then with the water/cyclohexane scale, and at large values of X with the water/octanol scale. The suggested octanol/cyclohexane scale exhibits poorer results in all ranges of X, as does the semitheoretical water/octanol scale of Wimley et al.25 If we take as the interior reference phase not the core of the protein but the group of residues with very small accessibility (Figure 7b), the results are a bit different. For all the curves there is no steep rise at low values of X. It is connected with a different

interior phase taken as reference in this case. It clearly shows the difference in the correlation behavior of the octanol/water scale and the water/vapor and water/cyclohexane scales: the water/vapor scale gives better approximations to apparent free energies when the transfer is considered from the phase with low accessibility to the phase with moderate accessibility (best correlation Rmax = 0.91 at X = 12%), while the water/octanol scale is better when we consider transfer to the phases of high accessibility (best correlation Rmax = 0.95 at X = 80%). The cyclohexane/water scale is somewhere in the middle between these two (best correlation Rmax = 0.906 at X = 2%). The behavior of the correlation curves in Figure 7a and b is not drastically different, so we will confine our further consideration to the case where the solvent-inaccessible core of the protein is considered as the interior phase and study it in more detail. Let us introduce three sets of apparent transfer free energies that represent the transfer from the interior of the protein to (1) the group of residues with 10 < Racc < 20, a range that includes the plateau of high correlation with the water/vapor scale; (2) the group of residues with 50 < Racc < 60, a range that includes the plateau of high correlation with the water/cyclohexane scale; and (3) the group of residues with 95 < Racc < 105, a range that includes the region of high correlation with the water/octanol scale. The obtained data for these scales and their correlation coefficients with the experimental scales are presented in Table 4. We see that each of these three statistical scales correlates best with a different experimental scale: water/vapor energies are in the best correlation with the “10-20” set of statistical energies, water/cyclohexane energies with the “50-60” set, and water/octanol energies with the “95-105” set. What is the reason for this change? 
To gain insight into these correlations, we have further analyzed and compared the two-dimensional correlation diagrams of apparent transfer energies vs experimental sets (see Figures 8-10). But before that we will discuss some general aspects of the correlation.

Theoretical Considerations. Let us examine some grounds for the correlations between apparent transfer free energies and experimental ones. Even if we accept the assumption mentioned above, that the apparent transfer free energy measured according to eqs 4 and 5 can be interpreted as ∆E (the average free energy difference between the folds with exposed or buried side chain) in the theory of Finkelstein et al. (eq 2), we note that no exact correlation between the transfer free energy of residue side chains and the selective free energy of residues follows from these considerations. The reason is that the free energy difference ∆E must include free energy


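Table 4. Three Statistical Scales of Residue Side Chain Transfer from the Core of the Protein to Virtual Phases of Various Exposure (See Text) and Correlation Coefficients with Experimental Scales (See Table 2)

| residue | “10-20” | “50-60” | “95-105” |
|---|---|---|---|
| GLY | 0.855 | 0.828 | -0.565 |
| LEU | 0.313 | 1.507 | 2.302 |
| ILE | 0.516 | 1.738 | 2.66 |
| VAL | 0.661 | 1.571 | 2.138 |
| ALA | 0.848 | 1.097 | 0.728 |
| PHE | -0.017 | 1.465 | 2.314 |
| CYS | 0.779 | 2.292 | 3.171 |
| MET | 0.355 | 1.262 | 1.94 |
| THR | -0.249 | -0.315 | -0.061 |
| SER | -0.147 | -0.28 | -0.413 |
| TRP | -1.11 | 0.739 | 2.293 |
| TYR | -1.168 | 0.116 | 1.397 |
| GLN | 1.335 | -1.884 | -1.507 |
| LYS | -1.989 | -3.22 | -2.893 |
| ASN | -0.944 | -1.267 | -1.59 |
| GLU | -1.478 | -2.434 | -2.392 |
| HIS | -1.224 | -0.937 | -0.482 |
| ASP | -1.162 | -1.81 | -2.101 |
| ARG | -2.284 | -2.62 | -1.426 |

Correlation coefficients with experimental scales:

| experimental scale | “10-20” | “50-60” | “95-105” |
|---|---|---|---|
| V > W | 0.93 | 0.87 | 0.71 |
| CH > W | 0.80 | 0.90 | 0.84 |
| O > W | 0.57 | 0.85 | 0.93 |
| O > Wocc | 0.53 | 0.79 | 0.87 |
| CH > O | 0.83 | 0.85 | 0.74 |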
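The statement that each statistical scale correlates best with a different experimental scale can be read directly off the correlation coefficients of Table 4. The short sketch below simply picks, for each statistical set, the experimental scale with the largest tabulated coefficient; the dictionary name `CORR` is hypothetical, and the numbers are those listed in the table.

```python
# Correlation coefficients from Table 4: statistical set -> experimental scale -> r.
CORR = {
    "10-20": {"V > W": 0.93, "CH > W": 0.80, "O > W": 0.57,
              "O > Wocc": 0.53, "CH > O": 0.83},
    "50-60": {"V > W": 0.87, "CH > W": 0.90, "O > W": 0.85,
              "O > Wocc": 0.79, "CH > O": 0.85},
    "95-105": {"V > W": 0.71, "CH > W": 0.84, "O > W": 0.93,
               "O > Wocc": 0.87, "CH > O": 0.74},
}

# For every statistical set, select the experimental scale with the highest r.
best = {stat_set: max(row, key=row.get) for stat_set, row in CORR.items()}

for stat_set, scale in best.items():
    print(f"{stat_set}: best match {scale} (r = {CORR[stat_set][scale]:.2f})")
```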

Figure 8. 2-D correlation diagram of water/vapor experimental transfer free energies of amino acid side chains and two sets of apparent transfer free energies calculated from statistics of residue occurrence in globular proteins. “10-20 set” is the apparent transfer energy from the core of the protein (Racc = 0) to the phase of residues with 10 < Racc < 20, where Racc is the relative accessibility (see text). Data points for this set are depicted by the letters of the one-letter residue code (see Table 1). The dash-dotted line is the least-squares linear fit with correlation coefficient 0.93. “95-105 set” is the apparent transfer energy from the core of the protein (Racc = 0) to the phase of residues with 95 < Racc < 105. Data points for this set are depicted according to the legend above or below the one-letter code of the residue. The dotted line is the least-squares linear fit with correlation coefficient 0.71.

components not only connected with the interactions of the side chain with the solvent and the interior of the protein but also many more components, such as the interaction of the backbone with the solvent and interior (including possible hydrogen bonding rearrangement caused by the need for high side chain exposure in unusual conformations), the conformational distortion of the backbone and the influence of the side chain on the energy of this distortion (residues with high exposure are often found in loops and turns

Figure 9. 2-D correlation diagram of water/octanol experimental transfer free energies of amino acid side chains and two sets of apparent transfer free energies calculated from statistics of residue occurrence in globular proteins. “10-20 set” is the apparent transfer energy from the core of the protein (Racc = 0) to the phase of residues with 10 < Racc < 20, where Racc is the relative accessibility (see text). Data points for this set are depicted according to the legend above or below the one-letter code of the residue. The dash-dotted line is the least-squares linear fit with correlation coefficient 0.57. “95-105 set” is the apparent transfer energy from the core of the protein (Racc = 0) to the phase of residues with 95 < Racc < 105. Data points for this set are depicted by the letters of the one-letter residue code (see Table 1). The dotted line is the least-squares linear fit with correlation coefficient 0.93.

Figure 10. 2-D correlation diagram of water/cyclohexane experimental transfer free energies of amino acid side chains and two sets of apparent transfer free energies calculated from statistics of residue occurrence in globular proteins. “50-60 set” is the apparent transfer energy from the core of the protein (Racc = 0) to the phase of residues with 50 < Racc < 60, where Racc is the relative accessibility (see text). Data points for this set are depicted by the letters of the one-letter residue code (see Table 1). The dashed line is the least-squares linear fit with correlation coefficient 0.90. “95-105 set” is the apparent transfer energy from the core of the protein (Racc = 0) to the phase of residues with 95 < Racc < 105. Data points for this set are depicted according to the legend above or below the one-letter code of the residue. The dotted line is the least-squares linear fit with correlation coefficient 0.84.

which require increased flexibility of the chain), the entropic losses connected with high exposure of the side chain, and so on. All these factors, combined with problems in the definition of the protein surface as a phase and the specificity of protein interior structure


and interactions that are very different from those in any bulk organic solvent (e.g., hydrogen-bonded structure, salt bridges, disulfide bonds), make us believe that the level of correlation found between statistical data and side chain transfer energies is even higher than could be expected. Let us separate the apparent transfer free energy into several intuitive phenomenological components that we will use for further qualitative interpretation of the statistical data:

$$\Delta E = F^{sch}_{t} + F^{bb}_{t} + F^{sch}_{ii} + F_{conf} + F_{ss} + F_{func} + \ldots \qquad (8)$$

where $F^{sch}_{t}$ is the hypothetical transfer free energy of the side chain from the surface to the interior of the protein when only energy components common to all residues are considered (solvation by water, dispersion and electrostatic interactions between the atoms in the interior and at the surface); this term (in the case when the transfer between 100% exposed and buried residues is considered) should be the best approximation to the experimental data on side chain transfer between water and a hypothetical organic solvent that is similar in its average properties to the protein interior. All other components of eq 8 may lead to deviation of ∆E from the best correlation with experimental transfer energies. $F^{bb}_{t}$ is the transfer energy of the backbone atoms; $F^{sch}_{ii}$ accounts for any special interaction of the side chain in the interior of the protein, such as salt bridges between charged side chains and disulfide bonds between cysteines; $F_{conf}$ accounts for the free energy term induced by the range of backbone flexibility (which is, e.g., much bigger for GLY than for any other residue and gives it a preference for coil-like parts of the protein structure); $F_{ss}$ is the term that accounts for the relationship between the residue and its preference for various secondary structure regions: for example, residues whose side chains are capable of forming hydrogen bonds (SER, ASN, ASP, etc.) that compete with hydrogen bonding of the backbone are α-helix disruptors and thus will rarely belong to an α-helix, which affects their exposure/burial ratio. $F_{func}$ is the possible bias in free energy due to the considerable importance of the residue in the functioning of the protein; we should admit that, despite the fact that globular proteins resemble quasi-random sequences, they still have some special function, and unless this function is fulfilled, their existence will be cut off by the evolution process. 
Such a term may be important, for example, for histidine, which is a rather special functional amino acid in proteins: it is involved in catalytic sites and used as a coordinating ligand in metalloproteins. Histidine can change its protonation state during the function of the protein. Further, we will refer to the terms of eq 8 to describe important factors that alter residue exposure. We would also suggest that a possible continuation of our work would be an attempt to assess not only the apparent transfer free energies but also the terms of eq 8 that constitute or influence this apparent energy by direct additional analysis of the protein structure. For instance, one can combine exposure statistics with statistics on disulfide bonds, salt bridges, secondary structure elements, and chain conformation.

2-D Diagrams. Figure 8 outlines the correlation between water/vapor transfer energies and apparent transfer free energies for two statistical sets (“10-20” and “95-105”); straight lines represent the least-squares linear fits between experimental and statistical energies. While the correlation for the “10-20” set is 0.93, the correlation for the “95-105” set is much worse (R = 0.71). An exposure rate of 10-20% is in principle rather small, which means that we are dealing with the apparent transfer of


residues from the interior of the protein to the group of residues that are situated close to the surface but remain mainly buried inside, with only some atoms or groups at the surface. It is natural to suppose that side chains will mainly try to expose their polar atoms while leaving apolar ones buried wherever possible. This fact in principle should not favor better correlation with water/vapor transfer because the experimental data are obtained for fully solvated side chains. Indeed, from Figure 8 it is seen that the correlation (even for the “10-20” set) is not very good, and it mainly arises from the positive correlation in transfer energies between the group of hydrophilic residues (R, D, H, N, E, Q, K) and the group of hydrophobic residues (C, M, F, A, V, I, L); when each of these groups is considered separately, there is simply no reasonable correlation. If we look at the water/vapor scale, we note that residues are effectively grouped into five distinct groups with close values of hydration energy. The gap in hydration energies between these groups is rather large, while the resolution of the scale inside these groups (the difference in hydration energies of residues within a group) is poor. Let us now look at the “10-20” and “95-105” scales. The “10-20” scale is the more compressed one. It captures only the main tendencies of residues to be at the surface or in the interior, while the “95-105” scale is more sensitive to the specificity of each type of residue as well as to the different “biasing” factors of eq 8. Let us analyze this change in more detail. A, G, V, I, L form a close set of points on the “10-20” diagram, while on the “95-105” diagram this set is distorted: V, I, L become more hydrophobic in this set and remain together on the diagram, which is quite reasonable, but A and G do not follow this tendency. This deviation might be connected with the increased role of the $F_{conf}$ term in the apparent energy for this set. 
The group of C, M, and F also shifts upward, but cysteine (C) shifts considerably more. This deviation may stem from the $F^{sch}_{ii}$ term, which is again more pronounced in the “95-105” scale. Y and W also shift upward but become more separated. This means that the “95-105” scale is more sensitive to differences in residue structure and properties. The statistical energy for S and T remains approximately the same. If we look at Figure , we see that these residues are almost equally distributed at various levels of exposure. The changes in the group of highly hydrophilic residues (R, D, H, N, E, Q, K) are mixed; in principle, hydrophilic residues tend to be highly exposed and cover the surface of the protein; thus, their apparent free energies should be lower in the “95-105” set. This tendency holds for all the residues except arginine (R) and histidine (H). Histidine is actually a rather special functional amino acid, which is why $F_{func}$ may give a considerable contribution to the transfer energy. As for R, its behavior reveals its amphiphilicity and the interplay between its tendency to expose its hydrophilic end and bury its hydrophobic tail (see Figure and the hump-shaped concentration dependency); thus, high levels of exposure become unfavorable due to exposure of the hydrophobic tail. The overall conclusion that can be made from this analysis is the following: the water/vapor scale captures mainly the rough tendency of residue distribution between surface and interior, namely, residues with polar atoms tend to be at the surface while apolar ones tend to be in the core of the protein. The “10-20” scale, in comparison with the “95-105” scale, also mainly captures the rough tendency of residue distribution in proteins, while the latter is more sensitive to the type of residue, its structure, and other factors of eq 8. This may be the reason that the correlation of the water/vapor scale with the “10-20” scale is better than with the “95-105” scale.


Let us now consider Figure 9 and the correlation of the same statistical scales with water/octanol transfer free energies. As already discussed, wet octanol resembles water to a good extent in its solvation properties, which is why it is natural to suppose that water/octanol transfer energies will be less sensitive to large differences in residue side chain structure and more sensitive to small differences in solute structure and properties. In other words, the resolution of this scale between residues of similar composition and shape would exceed that of the water/vapor scale. This makes the water/octanol scale more compressed than the water/vapor scale but more uniformly populated with data points. This supposition is justified if we compare Figure 9 with Figure 8. This type of sensitivity of the water/octanol scale correlates very well with the sensitivity of the “95-105” statistical scale. Such a good correlation may be considered a coincidence; however, we will try to analyze it in terms of eq 8. In the region of octanol-to-water transfer energies of more than 1 kcal/mol, there is a nearly perfect correlation for the residues Y, V, M, L, F, and I and a noticeable deviation for C and W. However, the deviation for both TRP and CYS is quite understandable. As already mentioned, TRP (W) has a special type of attractive interaction with octanol; thus, octanol is not a perfect model of the protein interior and has its own specificity. The abnormal apparent hydrophobicity of CYS (C) is due to Fii, namely, disulfide bonding, which is, of course, impossible in octanol. If we exclude cysteine from the statistics, the correlation coefficient becomes even higher, R = 0.96. Let us now consider the position of glycine (G) and alanine (A), the residues for which the term Fconf may give a considerable contribution.
We know that due to this effect their apparent transfer energy is lower than Fsch t; however, these residues fit the correlation rather well. The explanation for this may be that the octanol/water scale was measured not from the transfer of R-H analogs but from the transfer of Ac-X-amides; thus, the occlusion effect of neighboring atoms may have contributed to the lowering of the alanine transfer free energy. As for glycine, it was taken as the reference point in the octanol/water scale, so its transfer energy is zero, while hydrogen (H2) was taken as the reference for the water/vapor scale. The correlation of the octanol scale in the region of hydrophilic residues (D, E, R, K, N, Q) is much poorer than for the other residues. Like the water/vapor scale, it fails to resolve the fine structure of their distribution in protein structures, although it is more sensitive than the water/vapor scale. Notably, arginine (R), which the water/vapor scale treats as very hydrophilic, lies close to lysine (K) in the water/octanol scale, which better fits its behavior in proteins. However, the water/octanol scale is unable to resolve the difference between K and R that is apparent in protein statistics. The overall conclusion for this scale is the following: the water/octanol scale is more sensitive to small variations in the structure and composition of residues and better resolves the residues by their structure and properties than the water/vapor scale. The water/octanol scale is in very good correlation with the “95-105” set when hydrophobic and moderately hydrophilic residues are considered. Let us now consider the last scale, water/cyclohexane (Figure 10). This scale is in good correlation with the “50-60” set; moreover, if we exclude cysteine (C) and several charged and polar residues (E, Q, K, R), the correlation becomes nearly perfect, R = 0.99.
This means that this scale performs very well for the group of apolar and moderately hydrophilic residues; in this region it is even better than the water/octanol scale (keeping in mind that the best correlation is obtained for different statistical sets). The water/cyclohexane scale is more sensitive to variations in the atomic structure of solutes than the water/vapor scale, since the additional interaction of solutes with cyclohexane is approximately proportional to the solute surface, as noted by Wolfenden et al. (ref 13). However, since the water/cyclohexane scale is still rather close to the water/vapor scale, it is not surprising that its correlation with statistical data in the region of highly hydrophilic residues (R, D, E, K, N, Q) is also very poor (R = 0.15). The overall conclusions from the analysis of these three statistical and three experimental scales are as follows. The proposed statistical scales differ in their “sensitivity”. Whereas the “10-20” scale captures only the rough tendencies of residues to be exposed or buried, the “50-60” and “95-105” scales are more sensitive to the peculiarities of residue behavior. The water/vapor experimental scale is also less “sensitive” to the peculiarities of side chain structure and thus correlates better with the “10-20” statistical scale than with the other scales. The water/cyclohexane and water/octanol scales are more “sensitive” to the peculiarities of side chain structure; however, their types of “sensitivity” differ, and they correlate better with different statistical scales. The correlation among all scales is considerably better in the group of hydrophobic and moderately hydrophilic residues than in the group of highly hydrophilic ones. If we consider the overall correlation performance of the scales and their possible applicability in QSAR studies, the water/octanol scale may be slightly preferable to the others; however, as we see from Table 4, this preference depends strongly on the statistics that are used.
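The comparison of correlation coefficients computed with and without selected residues (for example, with cysteine excluded) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the dictionaries hold synthetic placeholder numbers keyed by one-letter residue codes, not the energies reported in this work, and the function names `pearson_r` and `pearson_r_excluding` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_excluding(experimental, statistical, exclude=()):
    """Correlate two residue-keyed scales, skipping the residues in `exclude`."""
    residues = [r for r in experimental if r in statistical and r not in exclude]
    xs = [experimental[r] for r in residues]
    ys = [statistical[r] for r in residues]
    return pearson_r(xs, ys)

# Synthetic placeholder values (NOT the paper's data): four residues lie on a
# straight line, and "C" is placed off the line to act as an outlier.
experimental = {"A": 0.0, "V": 1.0, "L": 2.0, "I": 3.0, "C": 1.5}
statistical = {"A": 0.0, "V": 2.0, "L": 4.0, "I": 6.0, "C": 9.0}

r_all = pearson_r_excluding(experimental, statistical)
r_without_c = pearson_r_excluding(experimental, statistical, exclude=("C",))

print(f"all residues: R = {r_all:.2f}; without C: R = {r_without_c:.2f}")
```

Reporting both coefficients side by side, together with the list of excluded residues, makes the effect of each exclusion explicit.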

Conclusion

In this study we have analyzed the distribution of residues between interior and surface in globular proteins and traced correlations between apparent transfer free energies and experimental hydrophobicity scales, using theoretical considerations based on the random energy model of heteropolymer globules. We examined various methods for classifying residues as exposed or buried and analyzed in detail the correlations between various statistical scales and experimental scales. We propose three types of statistical apparent transfer free energy scales and show that each of these scales is in better correlation with one of the experimental hydrophobicity scales (water/vapor, water/cyclohexane, and water/octanol transfer scales). In some cases we report high linear correlation coefficients: R = 0.93 for the whole set of residues and R = 0.99 for a subset of 14 residues. However, it should be stated once more that hydrophobic attraction of residues plays a key role in the stability and folding of globular proteins, and the distribution of residues between the surface and interior of globular proteins may, under some circumstances, correlate very well with partitioning coefficients for amino acids between water and an organic solvent. Nevertheless, many more factors (which we tried to summarize in the form of eq 8) influence the distribution of residues in proteins, making these correlations unstable and dependent on many details (statistical classification criteria, experimental scale details, etc.).

Acknowledgment. We wish to thank Dr. Anna Panchenko from NCBI for useful discussion concerning the nonredundant protein set and Dr. Simon Hubbard from Manchester University for kindly providing the NACCESS program.

Supporting Information Available. List of PDB codes of the protein set containing 8022 structures. This material is available free of charge via the Internet at http://pubs.acs.org.

References and Notes

(1) Bieri, O.; Kiefhaber, T. In Kinetic models in protein folding; Pain, R. H., Ed.; Oxford University Press: Oxford, U.K., 2000.
(2) Kamtekar, S.; Schiffer, J. M.; Xiong, H.; Babik, J. M.; Hecht, M. H. Science 1993, 262, 1680–1685.
(3) Desjarlais, J. R.; Handel, T. M. Protein Sci. 1995, 4, 2006–2018.
(4) Tew, G. N.; Liu, D.; Chen, B.; Doerksen, R. J.; Kaplan, J.; Carroll, P. J.; Klein, M. L.; Degrado, W. F. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 5110–5114.
(5) Khokhlov, A. R.; Khalatur, P. G. Phys. Rev. Lett. 1999, 82, 3456–3459.
(6) Chothia, C. Nature (London) 1976, 248, 338–339.
(7) Wertz, D. H.; Scheraga, H. A. Macromolecules 1978, 11, 9–15.
(8) Wolfenden, R.; Andersson, L.; Cullis, P. M.; Southgate, C. C. Biochemistry 1981, 20, 849–855.
(9) Guy, H. R. Biophys. J. 1985, 47, 61–70.
(10) Rose, G. D.; Geselowitz, A. R.; Lesser, G. J.; Lee, R. H.; Zehfus, M. H. Science 1985, 229, 834–838.
(11) Miller, S.; Janin, J.; Lesk, A. M.; Chothia, C. J. Mol. Biol. 1987, 196, 641–656.
(12) Lawrence, C.; Auger, I.; Mannella, C. Proteins 1987, 2, 153–161.
(13) Radzicka, A.; Wolfenden, R. Biochemistry 1988, 27, 1664–1670.
(14) Janin, J.; Miller, S.; Chothia, C. J. Mol. Biol. 1988, 204, 155–164.
(15) Samanta, U.; Bahadur, R. P.; Chakrabarti, P. Protein Eng., Des. Sel. 2002, 15, 659–667.
(16) Wolfenden, R. J. Gen. Physiol. 2007, 129, 357–362.
(17) Pohl, F. M. Nat. New Biol. 1971, 234, 277–279.
(18) Derrida, B. Phys. Rev. B: Condens. Matter Mater. Phys. 1981, 24, 2613–2626.
(19) Finkelstein, A. V.; Badretdinov, A. Y.; Gutin, A. M. Proteins: Struct., Funct., Genet. 1995, 23, 142–150.
(20) Finkelstein, A. V.; Gutin, A. M.; Azat, Proteins: Struct., Funct., Genet. 1995, 23, 151–162.
(21) Thomas, P. D.; Dill, K. A. J. Mol. Biol. 1996, 257, 457–469.
(22) Gomes, A. L. C.; de Rezende, J. R.; Antônio; Shakhnovich, E. I. Proteins: Struct., Funct., Bioinf. 2007, 66, 304–320.
(23) Nozaki, Y.; Tanford, C. J. Biol. Chem. 1971, 246, 2211–2217.
(24) Fauchère, J. L.; Pliska, V. Eur. J. Med. Chem. 1983, 18, 369+.
(25) Wimley, W. C.; Creamer, T. P.; White, S. H. Biochemistry 1996, 35, 5109–5124.
(26) Bull, B. H.; Breese, K. Arch. Biochem. Biophys. 1974, 161, 665–670.
(27) Okhapkin, I.; Askadskii, A.; Markov, V.; Makhaeva, E.; Khokhlov, A. Colloid Polym. Sci. 2006, 284, 575–585.
(28) Eisenberg, D.; McLachlan, A. D. Nature (London) 1986, 319, 199–203.
(29) ftp://ftp.ncbi.nih.gov/mmdb/nrtable/nrpdb.090508.
(30) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. J. Mol. Biol. 1990, 215, 403–410.
(31) http://www.wwpdb.org.
(32) Hubbard, S. J.; Thornton, J. M. NACCESS; Department of Biochemistry and Molecular Biology, University College: London, 1993.
(33) Lee, B.; Richards, F. M. J. Mol. Biol. 1971, 55, 379–400.
(34) Hubbard, S. J.; Campbell, S. F.; Thornton, J. M. J. Mol. Biol. 1991, 220, 507–530.
(35) Finkelstein, A. V.; Ptitsyn, O. B. Protein Physics; Academic Press: New York, 2002.
(36) Nelson, D. L.; Cox, M. M. Lehninger Principles of Biochemistry; W. H. Freeman and Company: Oxford, U.K., 2008.
(37) The R-H compounds and side chains are referred to hereafter by the residue three-letter code followed by a prime.
(38) Prabhakaran, M.; Ponnuswamy, P. J. Theor. Biol. 1980, 87, 623–637.

BM8015169