Surface Accessibility of Protein Post-Translational Modifications

Chi Nam Ignatius Pang,† Andrew Hayen,‡ and Marc Ronald Wilkins*,†

Systems Biology Group, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia, and School of Mathematics and Statistics, University of New South Wales, Sydney, NSW, 2052, Australia

Received December 14, 2006

Protein post-translational modifications are crucial to the function of many proteins. In this study, we have investigated the structural environment of 8378 incidences of 44 types of post-translational modifications with 19 different approaches. We show that modified amino acids likely to be involved in protein-protein interactions, such as ester-linked phosphorylation, methylarginine, acetyllysine, sulfotyrosine, hydroxyproline, and hydroxylysine, are clearly surface associated. Other modifications, including O-GlcNAc, phosphohistidine, 4-aspartylphosphate, methyllysine, and ADP-ribosylarginine, are either not surface associated or are in a protein’s core. Artifactual modifications were found to be randomly distributed throughout the protein. We discuss how the surface accessibility of post-translational modifications can be important for protein-protein interactivity.

Keywords: post-translational modifications • protein-protein interaction • surface accessibility • intrinsic disorder • domains and linkers

Introduction

There are strong links between the biological type of a post-translational modification and its surface accessibility. An amino acid side chain that undergoes enzymatic post-translational modification (PTM) needs to be accessible on the surface of the protein. Surface-accessible regions of a protein must have more free backbone hydrogen bonds for associating with the enzymes that catalyze the modification and the reverse reaction, if required [1]. A PTM recognition domain may specifically recognize and bind to a modified amino acid only if it is on the surface of the protein or polypeptide. Modified amino acid side chains packed within the structured and ordered region of the protein would typically be inaccessible to any PTM recognition domain due to steric hindrance [2].

Certain protein structural properties correlate with the surface accessibility of amino acid residues in a folded polypeptide chain. These include protein linker regions, coils or loops, and disordered regions. Although PTMs may not always be in surface-accessible regions, surface-accessible amino acids would have a higher likelihood of being modified. For instance, the increased flexibility of disordered regions allows them to fold such that the amino acid side chains fit easily into a modifying enzyme’s catalytic site. This is one reason why PTMs such as acetylation, methylation, phosphorylation, and ADP-ribosylation occur mainly within regions of intrinsic disorder [1,2].

Protein interaction domains may bind and recognize post-translational modifications. There is a diverse variety of PTM recognition domains; for example, the Src homology 2 (SH2) domain binds phosphotyrosine (pTyr), the 14-3-3 domain binds phosphoserine (pSer), and acetylation and methylation of lysine residues in histones create binding sites for bromo- and chromo-domains, respectively [3]. Association of a modified protein with these interaction domains may be controlled and switched dynamically by the addition or removal of the PTM. The main criterion for this dynamic switching is the reversibility of the modification. Phosphorylation is a well-known example of this. Different combinations of post-translational modification sites and interaction domains can control a protein’s interaction partners and how they work in concert to orchestrate the protein-protein interaction networks [3,4].

Several studies of protein-protein interaction networks have shown the importance of intrinsic disorder in the topology of these networks. Consensus results showed that hub proteins were enriched for intrinsically disordered regions as compared to non-hub proteins [5-7]. These disordered regions confer flexibility for hub proteins to interact with a diverse number of partners, with high specificity and low affinity. This has important implications for the reversibility of these protein-protein interactions and their role in regulating the protein-protein interaction network [7]. The function of a protein can be controlled by conformational changes upon phosphorylation of protein loops, causing a disorder-order transition and allowing phosphorylation-mediated protein-protein interaction to occur [8,9]. These interactions are mediated by the hydrogen bonding of the phosphoryl group of the phosphorylated residues by the binding domain, for example, the interaction between 14-3-3 protein and 14-3-3 binding partners [10]. Furthermore, proteins with more interaction partners have a greater likelihood of being phosphorylated [11].

* To whom correspondence should be addressed. E-mail: m.wilkins@unsw.edu.au. Phone: +612 9385 3633. Fax: +612 9385 1483.
† School of Biotechnology and Biomolecular Sciences.
‡ School of Mathematics and Statistics.

10.1021/pr060674u CCC: $37.00 © 2007 American Chemical Society
Journal of Proteome Research 2007, 6, 1833-1845


Published on Web 04/12/2007

research articles

This manuscript asks fundamental questions about where modifications are found on proteins. As there is relatively little structural information for native, post-translationally modified proteins, it is not known which modifications are surface associated and which are not. Accordingly, here we aim to determine whether post-translationally modified amino acids are more likely to be within surface-accessible regions as compared to unmodified amino acids. First, we selected a set of proteins that contain post-translational modifications. Second, various scoring systems were used to predict molecular properties of the protein sequences, such as the protein’s surface accessibility and regions of intrinsic disorder. Third, the residues from each sequence were separated into the modified residue dataset and the unmodified residue dataset. Finally, hypothesis testing was performed to test whether there are any significant differences between the structural environment of the modified and unmodified residues. For all modified proteins documented in Swiss-Prot, we strikingly show that most reversible modifications are found on the surface of proteins. We discuss how the surface accessibility of post-translational modifications can be important for protein-protein interactivity.

Methods

Database of Modified Proteins. Sequences with PTMs of interest were extracted from the UniProt/Swiss-Prot database, release 49.7 [12], using Swissknife [13]. Modification information was from the MOD_RES field. The MOD_RES field includes modifications such as phosphorylation, methylation, and acetylation but does not include fatty-acid modifications or glycosylation. Swiss-Prot entries containing the O-linked N-acetylglucosamine (O-GlcNAc) modification were retrieved from Swiss-Prot release 50.4, also using Swissknife. The entries with O-GlcNAc modifications were merged with entries containing MOD_RES data to obtain the final sequence and PTM dataset for this study. Unless otherwise stated, the term post-translational modification used in this study refers only to the MOD_RES and O-GlcNAc entries.

Filtering of Modified Protein Database. As we were to subsequently examine structural properties of proteins, and had to ensure that our results were unbiased, the sequence database was filtered to remove homologues, short sequences, and membrane-bound proteins. UniRef90 merges sequences with 90% or more sequence identity from UniProt into sequence clusters [14]. Members of relevant sequence clusters were extracted using custom Perl scripts and BioPerl and grouped together into FASTA files containing multiple sequences [15]. Multiple sequence alignments for these clusters were performed with ProbCons [16], with the exception of long sequences, which were aligned with ClustalW [17]. The sequence dataset was then filtered further. Sequences with fewer than 37 amino acid residues were removed from the database. This number was chosen because it approximates the length of the smallest protein structural domain in the CATH database [18]. Membrane-associated proteins, as defined by the presence of the keyword “membrane” or variants thereof in UniProt entries, were also removed, as were any proteins that were in the same UniRef90 cluster as a membrane protein.
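The length and membrane filters described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Perl/BioPerl code used in the study: the `Entry` record, its `keywords` and `cluster` fields, and the function names are hypothetical, and the "variants thereof" rule is approximated by a case-insensitive substring match on "membrane".

```python
from dataclasses import dataclass, field

MIN_LENGTH = 37  # sequences shorter than this are removed, as stated in the text


@dataclass
class Entry:
    """Hypothetical minimal view of a Swiss-Prot entry."""
    accession: str
    sequence: str
    keywords: list = field(default_factory=list)
    cluster: str = ""  # hypothetical UniRef90 cluster identifier


def is_membrane(entry):
    """True if any keyword contains 'membrane' (assumed matching rule)."""
    return any("membrane" in kw.lower() for kw in entry.keywords)


def filter_entries(entries):
    """Drop short sequences, membrane proteins, and their cluster-mates."""
    membrane_clusters = {
        e.cluster for e in entries if e.cluster and is_membrane(e)
    }
    kept = []
    for e in entries:
        if len(e.sequence) < MIN_LENGTH:
            continue  # too short to hold a structural domain
        if is_membrane(e) or e.cluster in membrane_clusters:
            continue  # membrane protein, or clustered with one
        kept.append(e)
    return kept
```

Collecting the membrane-containing clusters first means a soluble sequence is discarded whenever any member of its cluster carries a membrane keyword, which mirrors the cluster-level exclusion in the text.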
This removed all proteins containing transmembrane domains from the modified protein database. The list of Swiss-Prot accession numbers of the proteins in the final sequence database is in Table S1 (see Supporting Information).

Journal of Proteome Research • Vol. 6, No. 5, 2007

Modified Residue Dataset. Amino acids from all proteins in our modified protein database were classified into two datasets: post-translationally modified residues and unmodified residues. The modified residue dataset contained the following information: the type and position of the modified amino acid, the Swiss-Prot accession number of the source protein, and the identification number of the source protein. Each entry in the modified residue dataset included comments on the reliability of the modification: whether it was experimentally validated, predicted by sequence alone, or assigned by taxonomic similarity. The modified residue dataset was classified by modified residue type, for example, a group for phosphoserine and another group for phosphotyrosine.

Removal of Homologous Residues from the Modified Residue Dataset. An issue with the modified residue dataset was that a number of modified residues were documented in orthologous proteins. Keeping all of these modified residues would have introduced bias into our subsequent analyses. Accordingly, the modified residue dataset was filtered to remove redundant entries when the same modified residue was found to be invariant in two or more protein orthologues. Only one modified residue was kept. When done for all modified residues from all modified proteins, this yielded a nonredundant modified residue dataset. In cases where modified residues came from proteins with different numbers of modifications, modified residues from the sequence with more modifications were retained. For example, the human 60S acidic ribosomal protein P0 (RLA0_HUMAN, P05388) has an experimentally verified phosphoserine at position 304. The mouse orthologue (RLA0_MOUSE, P14869) also has an experimentally verified phosphoserine at the invariant position 304. In this case, the modified residue from the human protein was retained, as the human protein was known to have a greater number of experimentally determined modifications.
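The redundancy rule above (keep one residue per set of invariant orthologous sites, preferring the protein with more modifications) can be sketched as follows. The tuple layout and the function name are assumptions made for illustration; how equivalent sites are grouped from the alignments is taken as given, and ties are resolved by input order, which the text does not specify.

```python
from collections import Counter


def pick_representatives(sites, groups):
    """Keep one modified residue per group of equivalent orthologous sites.

    sites:  list of (accession, position, modification) tuples.
    groups: list of lists; each inner list holds sites judged to be the
            same invariant residue in orthologous proteins.
    """
    # Number of documented modifications per source protein.
    mods_per_protein = Counter(acc for acc, _pos, _mod in sites)
    kept = []
    for group in groups:
        # Prefer the site whose protein carries more modifications;
        # max() returns the first site on a tie (an assumption).
        best = max(group, key=lambda site: mods_per_protein[site[0]])
        kept.append(best)
    return kept


# Illustration only: the second human site below is invented so that the
# human protein has more modifications than the mouse orthologue.
sites = [
    ("P05388", 304, "Phosphoserine"),
    ("P05388", 307, "Phosphoserine"),  # hypothetical site
    ("P14869", 304, "Phosphoserine"),
]
groups = [[("P05388", 304, "Phosphoserine"), ("P14869", 304, "Phosphoserine")]]
representatives = pick_representatives(sites, groups)
```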
The result of the above classification was that some sequences in the modified protein database only contained redundant modified amino acid residues. These sequences were subsequently removed from the modified protein database to ensure that there was minimal redundancy in the unmodified residue dataset. So returning to our example, RLA0_MOUSE was ultimately removed from the modified protein database.

Unmodified Residue Dataset. The unmodified residue dataset contains groups of the 20 unmodified amino acids, extracted from a subset of the sequences used to generate the modified protein dataset. The structure of this dataset was identical to that for the modified residue dataset. Although residues in the modified residue dataset were derived from more than one sequence in a UniRef90 cluster, only one sequence was chosen from each UniRef90 cluster to generate the unmodified residue dataset. Experimentally verified and potentially modified residues, including those assigned by taxonomic similarity, were excluded from the dataset. Because the N- and C-termini of proteins tend to be more solvent accessible than the rest of the protein, the first 5 and last 5 residues of each protein sequence were also excluded from the unmodified residue dataset. To reduce the size of the unmodified residue dataset, 10,000 residues were randomly sampled for each amino acid type. This became the final unmodified residue dataset.

Protein Surface Accessibility Prediction. The solvent or surface accessibility of modified and unmodified residues in protein sequences was predicted with four different approaches. The first three methods utilize propensity scores for each amino acid to be on the surface or buried within the core. These are: average surface accessibility (ASA) [19] with window sizes of 3, 5, 7, and 9; Kyte-Doolittle’s hydropathy scale [20] with a window size of 9; and a hydropathy scale (GOR) developed by Naderi-Manesh et al. (2001) [21] with a window size of 17. For the latter, a threshold accessibility of 9% was chosen because it had the best prediction accuracy out of the thresholds provided. The fourth method, RVP-net [22,23], uses a neural network to predict the relative solvent accessibility of amino acids.

The amino acid propensity scores for the ambiguous amino acid codes B (aspartic acid or asparagine), Z (glutamic acid or glutamine), and X (any of the 20 common amino acids) were estimated by the arithmetic mean of the scores of the possible amino acids. This estimation was applied to all methods that use amino acid propensity scores, including DomCut, the domain linker index, and George and Heringa’s linker propensity scores. The exception was that the geometric mean, rather than the arithmetic mean, was used for ASA’s smoothing window and ambiguous amino acid score estimation. Otherwise, the arithmetic mean of the residue propensity scores within a smoothing window was calculated and assigned to the central residue. Only odd-numbered window sizes were used. When the window extended beyond the N- or C-terminus of the sequence, it was shortened on the side that exceeded the sequence.

Prediction of Intrinsically Disordered Regions. The natively disordered or unstructured regions of sequences containing modified and unmodified residues were predicted with four neural network methods. The first predictor was RONN [24], which is a neural network method based on the bio-basis function. Three other neural networks from the DisEMBL package were also used [25]: the coils predictor, the hot loops predictor, and the X-ray structure missing co-ordinates predictor.
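The smoothing-window scheme described under Protein Surface Accessibility Prediction can be illustrated with a short sketch. It covers only the arithmetic-mean case (the geometric mean used for ASA is not shown), the function names are hypothetical, and the propensity scale is passed in as a plain dictionary rather than taken from any published table.

```python
def ambiguous_score(scale, code):
    """Arithmetic mean over the residues an ambiguity code may stand for."""
    options = {
        "B": "DN",                      # aspartic acid or asparagine
        "Z": "EQ",                      # glutamic acid or glutamine
        "X": "".join(sorted(scale)),    # any residue present in the scale
    }[code]
    return sum(scale[aa] for aa in options) / len(options)


def smooth(sequence, scale, window):
    """Assign each residue the mean score of the window centred on it.

    Near the termini the window is shortened on the side that would
    extend past the sequence, as described in the text.
    """
    if window % 2 == 0:
        raise ValueError("only odd window sizes are supported")
    half = window // 2
    raw = [
        scale[aa] if aa in scale else ambiguous_score(scale, aa)
        for aa in sequence
    ]
    smoothed = []
    for i in range(len(raw)):
        lo = max(0, i - half)               # truncate at the N-terminus
        hi = min(len(raw), i + half + 1)    # truncate at the C-terminus
        chunk = raw[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

With a real 20-residue scale, `smooth(seq, scale, 9)` would correspond to a window size of 9; the value at index `i` is the score assigned to residue `i + 1`.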
The coils predictor (DCOILS) proposes loops and coils as defined by Kabsch and Sander [26]. The hot loops predictor (HOTLOOPS) estimates regions that may have a high degree of mobility and high B-factor if the protein’s structure is elucidated with X-ray crystallography. The X-ray structure missing co-ordinate predictor (REMARK465) is trained with REMARK 465 entries in the PDB database.

Prediction of Linker versus Domain Regions. The propensity of modified and unmodified amino acids to be in a linker region was predicted with 5 different scoring systems. Linker (Linker), helical linker (Helical), and non-helical linker (Non-Helical), by George and Heringa (2003) [27], were used with window sizes of 5, 7, and 9 residues. The fourth and fifth scoring systems, DomCut [28] and the domain linker index (DLI) [29], both used a window size of 15 residues.

Secondary Structure Prediction. Secondary structure of sequence regions containing modified and unmodified residues was predicted by PSIPRED to be coiled, helical, or extended [30]. Default parameters were employed, and the Swiss-Prot sequence database release 49.5 was used. The most frequently predicted secondary structure containing each type of modified residue was recorded.

Hypothesis Testing. Hypothesis tests between the modified residue dataset and unmodified residue dataset were performed to determine whether there were significant differences between the structural environments of residues in the two sets. The hypothesis tests were used to compare each structural property for each modified residue type, such as phosphoserine, to the relevant unmodified residue type, in this case unmodified serine. Statistical analyses were performed using the R statistical package [31]. Kolmogorov-Smirnov tests (K-S tests) were performed to check whether the modified and unmodified residue datasets could be approximated by a normal distribution (p < 0.05). Levene’s test was performed to test whether there were approximately equal variances between the modified and unmodified residue datasets (p < 0.05). Depending on the result of Levene’s test, two-sided paired Student’s t-tests were performed, assuming equal variance or unequal variance (p < 0.05). Where data were not normally distributed, the two-sided Wilcoxon rank sum test was used (p < 0.05). The density plot for each pair of data was drawn to graphically explore the qualitative differences between the two datasets.

Summarizing Linker and Domain Region Prediction Results. The propensity for modified and unmodified amino acids to be found in helical or non-helical linkers was determined with a decision tree. The decision tree was manually created and is shown in Figure 1. Results of statistical tests from Linker, with window sizes of 5, 7, and 9 (Linker_5,7,9), and from DomCut scores and DLI scores, both with a window size of 15 (Domcut_15 and DLI_15, respectively), were first evaluated. A vote was cast from these general linker prediction systems to determine whether a residue is likely to be within a linker region, a domain region, or a region where no significant predictions could be made. If a residue was predicted to be found in a linker region, it was further evaluated as a helical or non-helical linker region. Statistical test results from George and Heringa’s helical and non-helical scoring systems (Helical_5,7,9 and Non-Helical_5,7,9, respectively) were used to cast a vote to decide whether the residue was likely to be within a helical or non-helical linker region. Then, to ensure consistency with secondary structure prediction results, results from PSIPRED were used to determine if helical regions were helical linkers or general linkers and if coiled and extended regions were non-helical linkers or general linkers.
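The two-stage vote described above might look like the following sketch. The exact rules of the manually built decision tree are given only in Figure 1, so everything here is an assumption: each scoring system is reduced to a single label, a simple plurality wins, ties count as not significant, and the final PSIPRED consistency check is omitted. The function names and labels are hypothetical.

```python
from collections import Counter


def majority_vote(calls):
    """Return the most frequent significant call, or 'ns' if none or tied.

    calls: one label per scoring system; 'ns' marks a test that was not
    significant and is ignored in the count.
    """
    counts = Counter(call for call in calls if call != "ns")
    ranked = counts.most_common(2)
    if not ranked:
        return "ns"
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return "ns"  # assumed tie rule: no decision
    return ranked[0][0]


def classify_region(general_calls, subtype_calls):
    """Stage 1: linker vs domain; stage 2: helical vs non-helical linker.

    general_calls: labels 'linker', 'domain', or 'ns' from the general
                   systems (Linker_5,7,9, Domcut_15, DLI_15).
    subtype_calls: labels 'helical', 'non-helical', or 'ns' from the
                   Helical and Non-Helical systems.
    """
    region = majority_vote(general_calls)
    if region != "linker":
        return region  # 'domain' or 'ns'
    subtype = majority_vote(subtype_calls)
    if subtype == "ns":
        return "general linker"
    return subtype + " linker"
```

For instance, three linker calls against one domain call, followed by a non-helical plurality, would yield `"non-helical linker"` under these assumed rules.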

Results

Databases of Proteins and Residues. To allow the surface accessibility of modified amino acids to be determined, we first prepared a database of modified and unmodified residues. Great care was taken in the preparation of this database. Regular protein tertiary structure is more likely to be absent in short polypeptides; therefore, short sequences were removed from the database. The surface accessibility of residues in membrane proteins requires more sophisticated methods to predict accurately; consequently, they were also excluded from this study. The number of sequences remaining after each step of filtering is shown in the Supporting Information (Table S2). After the removal of sequences from the database according to the methods, there were 4022 protein sequences from which modified residues were then sourced and 3796 protein sequences from which unmodified residues were then sourced. The number of modified residues resulting from this process ranged from 3255 (phosphoserine) to 2 (1-thioglycine). Where fewer than 17 incidences of a modification were found, these modifications were ignored. This ensured that only modifications with a reliable number of datapoints were analyzed.

Figure 1. Decision tree for summarizing linker and domain region prediction results. Refer to text for more details regarding the decision tree.

Interpretation of Statistical Tests. Statistical tests were used to compare the structural environments of modified and unmodified residues. In all cases, we assumed that the unmodified residues represented a mixed population of both modified and unmodified amino acids, present in a variety of structural environments. The reason why modified amino acids are likely to be present in the unmodified residue dataset is that we know only a very small proportion of all post-translational modifications. If the structural environment of a population of modified residues was significantly different to their corresponding unmodified residues, the modification was classified as either surface associated or within the protein core. We illustrate this concept in Figure 2. For the subsequent presentation of results (Tables 1-4), we use orange to show a modification is significantly surface associated, yellow to show a modification is significantly protein core, and white to show that a modification is not significantly different to unmodified residues in its structural environment. Hypothesis testing results for all the prediction methods are available as Supporting Information (Tables S3-S4).

Reversible Modifications. Enzyme-mediated reversible modifications were expected to show a preference for surface-accessible and disordered region environments. We found strong evidence that phosphoserine, phosphothreonine, phosphotyrosine, and N6-acetyllysine were more likely to be found within surface-accessible and disordered regions than their unmodified counterparts (Table 1). As an example, density plots showing differences between the structural environments of phosphoserine and unmodified serine are shown (Figures 3-5). They graphically illustrate that phosphoserine is more surface accessible than unmodified serine and that it has a greater propensity to be found in regions of intrinsic disorder and coils. The structural environment predictions for phosphoserine were consistent among the various methods used. Phosphohistidine and 4-aspartylphosphate were predicted to be in different structural environments to the other phosphoamino acids (Table 1). Phosphohistidine demonstrated no specific preference for surface or buried regions as compared to unmodified residues and no significant preferences for disordered or ordered regions.
In comparison to unmodified aspartic acid, 4-aspartylphosphate was clearly predicted to be buried within the core of proteins and within ordered regions.


Figure 2. Graphical interpretation of statistical tests. The density estimate (white) at the center represents the distribution of structural environment for the unmodified residues. As the modification status of most amino acids is unknown, it represents a mixed population of both modified and unmodified residues. The density plots at the left and right side represent the two possible scenarios for the modified residues; they are discriminated as surface associated (orange) or buried within protein core (yellow), respectively. Density plots that overlap markedly with the density estimate for the unmodified residues (white) indicate no significant difference in structural environment from unmodified residues. The color scheme used here is identical to that for Tables 1 to 4.
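The test-selection procedure from Hypothesis Testing, and the three-way interpretation shown in Figure 2, can be summarized in a small sketch. It does not compute any statistics (the study used R for that); it only encodes the branching, taking p-values from the preliminary tests as inputs. The function names are hypothetical, p < 0.05 is read as rejecting normality or equal variance, and the assumption that a higher score means a more surface-accessible environment holds only for some scales (it is reversed for hydropathy scales).

```python
ALPHA = 0.05  # significance threshold quoted in the text


def choose_test(ks_p_modified, ks_p_unmodified, levene_p, alpha=ALPHA):
    """Pick the comparison test from the preliminary test outcomes."""
    normal = ks_p_modified >= alpha and ks_p_unmodified >= alpha
    if not normal:
        return "wilcoxon_rank_sum"          # non-parametric fallback
    if levene_p >= alpha:
        return "t_test_equal_variance"
    return "t_test_unequal_variance"


def interpret(p_value, mean_modified, mean_unmodified, alpha=ALPHA):
    """Map a test result onto the three outcomes shown in Figure 2.

    Assumes larger scores indicate a more surface-accessible environment.
    """
    if p_value >= alpha:
        return "not significant"
    if mean_modified > mean_unmodified:
        return "surface associated"
    return "protein core"
```

Keeping the branching separate from the statistics makes it easy to plug in whichever implementation of the K-S, Levene, t, and Wilcoxon tests is available.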

O-GlcNAc is a reversible modification and one which exists interchangeably with phosphoserine and phosphothreonine [32]. However, we found that there was only weak evidence that O-GlcNAc-serine or -threonine was within regions of intrinsic disorder in comparison to the unmodified amino acid type. Further, O-GlcNAc-serine and -threonine do not show any preference in terms of surface accessibility as compared to unmodified serine or threonine. Note that the number of known O-GlcNAc modified residues was low and this reduced the statistical power of our analysis.

Table 1. Structural Environment of Reversible Modifications
(b) C, coils or loops; E, extended (β-sheets); H, helices. (c) L, general linker; HL, helical linker; NHL, non-helical linker; D, domain; NS, not significant.

Methylation. Methylarginine is a modification that is enzyme-mediated and may be involved in protein-protein interactions [33]. It showed a propensity to be surface accessible and in disordered regions, as compared to unmodified arginine (Table 2). The results were strikingly similar to those for reversible ester-linked phosphorylation and acetyllysine, which are mediated by enzyme catalysis and recognized by interaction domains. Interestingly, asymmetric dimethylarginine showed a very high average number of modifications per protein. This was due to the proteins nucleolin (P08199), RNA-binding protein EWS (Q07666), and polyadenylate binding protein 2 (Q28165) having 10, 29, and 14 asymmetric dimethylarginines, respectively. Methylation of lysine is enzyme-mediated and is commonly found in proteins associated with nucleic acids [4]. In contrast to the strong surface accessibility results for methylarginine, there were conflicting results for mono-, di-, and trimethyllysine. Most prediction methods for mono- and di-methyllysine did not show any significant results for surface accessibility or propensity for order or intrinsic disorder. For trimethyllysine, RVP-net predicted it to be within surface-accessible regions, whereas the ASA, KD, and GOR methods predicted the opposite. Because of these inconsistent results, trimethyllysine does not appear to have a clear preference in surface accessibility. This contrasts with our observation that trimethyllysine in histones is on surface-accessible tail regions [4].

Table 2. Structural Environment of Methylation
(b) C, coils or loops; E, extended (β-sheets); H, helices. (c) L, general linker; HL, helical linker; NHL, non-helical linker; D, domain; NS, not significant.

Irreversible Modifications. We grouped together a number of diverse but irreversible modifications (Table 3). These showed some differences in their structural environment.
All acetylated residues, with the exception of N-terminal acetylalanine, were predicted to be surface associated and found in regions of intrinsic disorder. Interestingly, N-acetylalanine does not show a preference for surface accessibility, but there was some evidence to suggest that it is found in intrinsically disordered regions. Note that although the majority of Nacetylation modifications in the dataset are found at the N-terminus, a small number of N-acetylated residues are found within the sequence of proteins but at sites of protein cleavage. Sulfotyrosine was shown to prefer surface-accessible and disordered region as compared to unmodified tyrosine. Pyruvic acid (serine) and ADP-ribosylarginine are unusual as they are suggested to be found within the protein’s core and in regions of structural order. Allysine and 4-carboxyglutamate did not show any conclusive result. Hydroxylation. It is not yet known whether hydroxylation is reversible or irreversible.4 It tends to be found multiple times in repeated protein sequences, for example in the Gly-X-X repeat of collagen. We found most hydroxylations to be clearly within surface-accessible and intrinsically disordered regions, as compared to unmodified residues (Table 4). Note, however, that the results for this modification may be biased by the high number found on proteins such as human collagen alpha-1(V) chain (P20908), which is known to have 59 hydroxylated residues of which 45 are hydroxyprolines. Amidation. Most incidences of amidation, particularly those associated with hydrophobic residues, were preferentially found within surface-accessible and disordered regions of proteins (Table 4). The exception to this was amidated arginine and Journal of Proteome Research • Vol. 6, No. 5, 2007 1837

research articles

Pang et al.

Table 3. Structural Environment of Irreversible Modificationsa

a

b

C, coils or loops; E, extended (β-sheets); H, helices. c L, general linker; HL, helical linker; NHL, non-helical linker; D, domain; NS, not significant.

lysine, which were not significantly more surface accessible or buried as compared to their unmodified residue counterparts. They did, however, show some propensity to be within disordered regions. The results for neutral hydrophobic residues were supported by a larger number of datapoints than for charged hydrophilic residues. Table 4 only presents the structural environment for selected amidation types; more amidation types are available as Supporting Information (Table S5). Spontaneous or Artifactual Modifications. Spontaneous or artifactual modifications did not show any strong preferences for surface accessibility or propensity for regions of order or intrinsic disorder (Table 4). However, pyrrolidone carboxylic acid showed some evidence to be within regions of intrinsic disorder. Linker, Domains and Preferred Secondary Structure. The linker region predictions produced by the decision tree for phosphoserine, phosphothreonine, and phosphotyrosine appeared to give high quality predictions, as they were all found to be within non-helical linker regions. This was consistent with the surface accessibility and intrinsic disorder predictions for this modification. However, for some other modification types, such as methylarginine, there was disagreement between linker/domain predictions and other structural environment parameters. Domains are typically regions of order rather than intrinsic disorder, but this was not shown for methylarginine. Of the 44 post-translational modification types presented in Tables 1-4 (and Supporting Information Table S5), 23 were predicted as domain-associated, 8 predictions were non-helical linkers, 2 were predicted as general linkers, and 11 predictions produced statistically insignificant results. For secondary struc1838

Journal of Proteome Research • Vol. 6, No. 5, 2007

tures, the majority of post-translational modifications were predicted to be within coiled regions. Only 5 out of 44 posttranslational modifications presented in Tables 1-4 and Table S5 were not predicted to be within coil regions, instead being predicted as either in helices or extended regions (β-sheets). Those five modifications were phosphohistidine, 4-aspartylphosphate, ADP-ribosylarginine, isoleucine amide, and methionine amide. Accuracy of Predictions. Although the purpose of this manuscript is to understand the surface-accessible nature of PTMs, it is also valuable to briefly compare the results of the predictions obtained from the different methods used here. Some surface accessibility prediction methods provide consistent and sensitive hypothesis testing results, for example, ASA and RVP-net. Unexpectedly, there is relatively little difference in results from different window sizes for ASA. However, ASA_3 predictions tended to be weaker then ASA with larger window sizes. The KD_9 and GOR_17 methods also tended to make weaker predictions. Intrinsic disorder prediction methods accurately predicted the structural environment of reversible post-translational modifications, because these modifications are likely to be within regions of intrinsic disorder. There were 18 out of 44 modification types where all four order/disorder region prediction methods agreed with each other. In these cases, the predictions are of very high confidence. In most other cases (20 out of 44), three out of four prediction methods were in agreement. These predictions are also likely to be of high confidence. For secondary structure prediction, we used a single method, PSIPRED.30 Although we did not use another method to provide

Surface Accessibility of Protein Modifications

research articles

Table 4. Structural Environment of Hydroxylation, Amidation, and Spontaneous or Artifactual Modificationsa

a

b

C, coils or loops; E, extended (β-sheets); H, helices. c L, general linker; HL, helical linker; NHL, non-helical linker; D, domain; NS, not significant.

confirmatory predictions, PSIPRED is widely accepted as being one of the most accurate.34,35 For the prediction of linker/domain, our decision tree (Figure 1) helped make the results more succinct and interpretable. Interestingly, of the 44 modifications studied, none were predicted to be found in helical linkers. This observation also holds true for all 117 modification types we studied (see Table S3, Supporting Information).
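The agreement counts reported for the four order/disorder predictors can be tallied mechanically. The following is a minimal sketch, assuming each method's result for a modification type has already been reduced to a single label; the function names, the labels, and the example calls are hypothetical and are not taken from the study's data.

```python
from collections import Counter

def consensus_level(calls):
    """Return (most common label, number of methods giving that label)."""
    label, count = Counter(calls).most_common(1)[0]
    return label, count

def tally_agreement(calls_by_modification):
    """Count how many modification types reach each level of agreement."""
    tally = Counter()
    for calls in calls_by_modification.values():
        _, n_agreeing = consensus_level(calls)
        tally[n_agreeing] += 1
    return tally

# Hypothetical calls from four disorder predictors (illustrative only).
example_calls = {
    "modification_A": ["disorder", "disorder", "disorder", "disorder"],
    "modification_B": ["disorder", "disorder", "NS", "disorder"],
    "modification_C": ["order", "NS", "disorder", "NS"],
}

agreement = tally_agreement(example_calls)
```

In this sketch a tally entry of 4 corresponds to the "all four methods agree" case and an entry of 3 to the "three out of four" case described in the text.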

Discussion

Here we have used a number of approaches to predict the structural environment of protein post-translational modifications. These approaches have clearly revealed that modified amino acids, with the exception of those that are spontaneous or artifactual, are found nonrandomly in proteins. As we have studied 44 types of modification, and in most cases have used more than one method to predict protein structural environments, we believe this is the most comprehensive study of modification-associated structural environments to date. In other studies, the surface accessibility and degree of intrinsic disorder have been used to aid the prediction of PTMs. Lee et al. (2006)36 used RVP-net23 to predict the surface accessibility of proteins and used this to predict phosphorylation and sulfation sites. Any residues with surface accessibility above a threshold were predicted as modified.36 Arthur et al. (2006)37 used homology modeling of protein tertiary structure and surface-accessibility calculation of the predicted structure to predict phosphorylation sites. This method, however, is limited by a lack of structural information.37 Intrinsic disorder has been used in the prediction of phosphorylation1 and methylation sites.38 Our results, in which we studied modifications in addition to those above, should be useful in the prediction of many more types of modifications on proteins.

Phosphorylation. Ester-bonded phosphorylated residues such as phosphoserine, phosphothreonine, and phosphotyrosine were strongly predicted to be within surface-accessible, intrinsically disordered, and coiled linker regions. This is consistent with their involvement in phosphorylation-mediated conformational change in proteins and in phosphoamino acid-mediated protein interactions.8

Phosphohistidine and 4-aspartylphosphate were not surface associated, and 4-aspartylphosphate was clearly predicted to be within the core of proteins. Interestingly, proteins containing phosphohistidine and 4-aspartylphosphate are both involved in the bacterial two-component system. The more complex variant of the two-component system is called the phosphorelay system (reviewed in refs 39 and 40), which involves protein-protein interactions between a histidine protein kinase, which acts as a transmembrane sensor, and a cognate


Pang et al.

Figure 3. Density estimates of surface accessibility prediction scores for phosphoserine and unmodified serine, using the ASA method. The numbers of datapoints per graph are shown within the legend. The x-axis represents the prediction scores obtained from using each of the window sizes. A higher score means that the residue is more likely to be surface accessible and vice versa. (A) window size of 3, (B) window size of 5, (C) window size of 7, and (D) window size of 9.

response regulator. Histidine kinase uses ATP to autophosphorylate a conserved histidine residue within itself.8 Subsequently, the phosphoryl group of histidine kinase is transferred to a conserved aspartic acid residue in the response regulator. In our study, phosphohistidine was predicted by PSIPRED to be within a helical region. The tertiary structure of phosphotransferase Spo0B confirms this.40 The residues surrounding phosphohistidine form an α-helix that participates in the hydrophobic interaction with the response regulator; this hydrophobic interface is well conserved throughout evolution.40 This may explain why we found phosphohistidine to be slightly hydrophobic by ASA_5, KD_9, and GOR_17. There are reasons why 4-aspartylphosphate residues are predicted to be buried within the core of proteins. In the unphosphorylated form of relevant proteins, such as Spo0F, the unmodified aspartic acid residue is solvent exposed. Spo0F can form a complex with the histidine kinase, KinA, which allows the transfer of the phosphoryl moiety from the histidine kinase to the aspartic acid residue. Phosphorylation of Spo0F leads to a conformational change by mediating an intra-protein interaction. This causes the aspartic acid residue to become less solvent exposed and prevents hydrolysis of the high-energy acyl phosphate bond.40,41

Journal of Proteome Research • Vol. 6, No. 5, 2007

The conformational change also exposes a previously buried surface that may participate in new protein-protein interactions. For example, upon phosphorylation of the aspartic acid in the N-terminal domain, the protein SpoA exposes a previously buried DNA binding site at its C-terminus.8,42,43

O-Linked GlcNAc. O-GlcNAc is interchangeable with phosphoserine and phosphothreonine.32 In contrast to phosphoamino acids, O-GlcNAc showed no preference for regions of intrinsic disorder and was not clearly surface associated. In fact, serine O-GlcNAc seemed to prefer hydrophobic regions, as shown by the results for GOR_17. The preference of O-GlcNAc transferases for hydrophobic and amphipathic regions has been documented,44 and it has been proposed that the presence of O-GlcNAc within hydrophobic domains is responsible for disrupting protein-protein interactions.44 This differs from other modifications that are required for protein-protein interactions to take place.

N-Acetyllysine. Almost half of the N-acetyllysine residues (37 out of 77) were from histone proteins. N-acetyllysine was clearly predicted to be surface associated in our study. The surface accessibility of acetyllysine, particularly in association with histone tails, allows it to be modified by acetyltransferase and


Figure 4. Density estimates of surface accessibility and coil region predictions for phosphoserine and unmodified serine using 4 different methods. The numbers of datapoints are shown within the legend. The x-axis values represent the prediction scores. (A) Hydropathy score developed by Kyte and Doolittle (1982) using a window size of 9 (KD_9); a higher score is associated with the protein core. (B) Naderi-Manesh et al. (2001) method, using a window size of 17 (GOR_17); a higher score is associated with the protein core. (C) Neural network prediction method (RVP-net); a score of 100% means the residue is fully surface accessible. (D) Coils prediction from PSIPRED; a high score predicts that a residue is likely to be within protein coils.
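To make the KD_9 score in panel A concrete, the following is a minimal sketch of a windowed Kyte-Doolittle hydropathy score. The scale values are those of Kyte and Doolittle (1982); the function name and the handling of residues near the termini (no score where a full window does not fit) are assumptions for illustration and may differ from the procedure used in the study.

```python
# Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def kd_window_scores(sequence, window=9):
    """Mean hydropathy of the window centred on each residue.

    Positions without a full window are given None (an assumption of
    this sketch). A higher score indicates a more hydrophobic
    environment, which the figure associates with the protein core.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    scores = []
    for i in range(len(sequence)):
        if i < half or i + half >= len(sequence):
            scores.append(None)
            continue
        segment = sequence[i - half:i + half + 1]
        scores.append(sum(KD_SCALE[aa] for aa in segment) / window)
    return scores

# Hypothetical peptide; only the central residue has a full 9-residue window.
example_scores = kd_window_scores("DDDDSDDDD", window=9)
```

Scores computed this way for modified residues and for unmodified residues of the same type could then be compared as two distributions, as in the density plots.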

subsequently reversed by a deacetylase. It also allows N-acetyllysine to be recognized by, and interact with, bromodomains. Bromodomains are present in many eukaryotic proteins that are involved in the regulation of gene expression.4

Arginine Methylation. Arginine methylation is involved in RNA processing, transcriptional regulation, signal transduction, and DNA repair.45 It is known to be found on proteins, such as histones, that are subject to multisite modification. We showed that it is likely to be surface associated and found in regions of intrinsic disorder. Surface-associated modifications, such as phosphoserine and phosphotyrosine, tend to be enzymatically reversible. Arginine methylation shares this structural environment, which supports the hypothesis that arginine methylation is reversible. Although arginine methylases are known, the demethylases are yet to be discovered. Members of the same protein family as the amine oxidase LSD1, a known lysine demethylase, are possible candidates for an intracellular demethylase.33,45 Recent in vivo studies such as those performed by Cuthbert et al. (2004)46 and Wang et al. (2004)47 have suggested that monomethylarginine is converted to citrulline via deimination by protein arginine deiminase 4 (PAD4). Although this may be part of the pathway for reversing arginine methylation,33 a review of the in vitro studies shows that the aforementioned enzymatic activity is unlikely;48 therefore, arginine demethylation remains a controversial issue. The structural environment of methylarginine also suggests it may be associated with specific recognition domains. However, the involvement of methylarginine in protein-protein interactions remains unknown.33 A final observation is that the consistency of the structural environment of methylarginine indicates it may be associated with a specialized sequence motif. This is supported by Daily et al. (2005),38 who showed the enrichment of glycine, and the depletion of glutamic acid and glutamine, within ±11 residues of the methylated arginine.

Irreversible Modifications. Irreversible modifications, such as 4-carboxyglutamate, allysine, pyruvic acid (serine), and ADP-ribosylarginine, were not expected to prefer surface-accessible or disordered regions for at least two reasons. First, they are not recognized by modification-dependent binding domains, which would require the modification to be physically


Figure 5. Density estimates of intrinsic disorder region prediction results for phosphoserine and unmodified serine using various neural network disorder prediction methods. The numbers of datapoints are shown within the legend. The x-axis values represent the prediction scores. (A) The bio-basis neural network predictor (RONN). Three neural network predictors from DisEMBL: (B) coils predictor (DCOILS), (C) X-ray crystallography protein structure missing coordinate predictor (REMARK465), and (D) the hot loops predictor (HOTLOOPS). For these prediction methods, the results are expressed as the probability that the residue is on the surface of the protein. For all graphs, a value close to 1 predicts that the residue is surface accessible.

accessible on a protein’s surface. Second, the irreversible nature of these modifications means that the removal of the modification by an enzyme, through binding and catalysis, does not occur. This is consistent with our results.

Hydroxylation. We found hydroxylated residues to be in regions of coiled secondary structure, high surface accessibility, and intrinsic disorder. This is consistent with the high prevalence of hydroxylation in collagen molecules and other proteins in the extracellular matrix. The hydroxylation of proline allows additional hydrogen bonding to occur between molecules of the collagen triple helix49 and bestows collagen fibers with structural rigidity. In effect, it makes possible a structurally necessary protein-protein interaction. Although the structural environment predictions for hydroxylation are strong, the presence of repetitive hydroxylation motifs within the modified residue dataset may have introduced bias in estimating surface accessibility and intrinsic disorder.

Sulfotyrosine. Sulfotyrosine is thought to be irreversible. It modulates protein-protein interactions of secreted or membrane bound proteins by providing them with hydrogen bonds


to bind other proteins.50 This similarity to the role of hydroxyproline is striking. Tyrosine sulfation is also involved in optimal receptor-ligand interactions, optimal proteolytic processing, and proteolytic activation of extracellular proteins.50,51 The importance of tyrosine sulfation for protein-protein interaction explains why this modification needs to be highly surface accessible, as predicted by our results. Note, however, that our study mostly analyzed secreted sulfated proteins because we removed membrane-associated proteins from our sequence database. There are typically 3-4 acidic amino acids within ±5 residues of the tyrosine sulfation site.50 These charged residues will cause tyrosine sulfation sites to be in highly surface-accessible and disordered regions.

Amidation. Amidation occurs in proteins after they are cleaved by endoproteinases. If the cleavage site is C-terminal to a glycine, the enzymes peptidylglycine α-hydroxylating monooxygenase and peptidyl-α-hydroxyglycine α-amidating lyase will remove this glycine and amidate the adjacent upstream amino acid.52 If the upstream amino acid is neutral and hydrophobic, it will tend to favor amidation, whereas charged


hydrophilic residues are less reactive (InterPro: IPR000134).53 To allow access of all enzymes, it is intuitive that this occurs in disordered and surface-accessible regions. This was supported by our results. Usually there are at least two basic residues after the glycine cleavage site,54 which contributes to this being a surface-associated, intrinsically disordered region.

Conformation Change is Required for ADP-Ribosylation of G-Proteins. ADP-ribosylation is one of the few modifications that we predicted to be buried in the core of proteins. All incidences of ADP-ribosylation in our dataset involved the modification of the G-protein’s Gαs subunit by the cholera enterotoxin subunit A1. It is known that the GDP-bound form of the Gαs subunit cannot be ADP-ribosylated, in contrast to the GTP-bound form.55 The binding of GTP causes a conformational change and may expose a normally buried arginine to ADP-ribosylation. This locks the G-protein in the GTP-bound form, which causes the production of cAMP and the ultimate induction of diarrhoea.55 ADP-ribosylation is a relatively bulky PTM in comparison to other modifications listed in this study; the bulky size may enable it to work as a molecular “wedge”.

N-Terminal Acetylation. Most acetylated residues in our modified residue dataset, except for lysine, were found only at the N-terminal end of a protein. N-acetylation is an irreversible co-translational modification.41 Its function is largely undefined, and it is expected that there are many more N-terminally acetylated amino acids on as-yet uncharacterised proteins.41 The N-terminal region of proteins tends to be unfolded, surface accessible, and disordered, which agrees with our structural environment predictions. N-acetylalanine does not have any preference for surface accessibility or inaccessibility; this might indicate that N-acetylalanine has a different function to other N-terminal acetylated amino acids or may just reflect its hydrophobic nature.

Spontaneous and Artifactual Modifications. Because spontaneous and artifactual modifications occur randomly and ubiquitously, they are expected to occur in any position along a protein sequence, regardless of surface accessibility or order/disorder. This was seen in our results. Pyrrolidone carboxylic acid spontaneously forms by the cyclization of N-terminal glutamine residues. It is known to be an experimental artifact.56,57 Within our modified residues dataset, 149 out of 438 pyrrolidone carboxylic acid modifications were at the N-terminus of polypeptides. A further 91 and 7 were C-terminal to arginine and lysine, respectively, potentially representing an experimental artifact formed spontaneously on trypsin cleavage of peptides during protein characterization experiments. The deamidation of asparagine may occur as a result of enzymatic modification,58 may occur spontaneously under physiological conditions,59 or may be an experimental artifact.59,60 It typically occurs in an N-G motif, which can be found anywhere in a protein.

Other Statistical Considerations. There are a number of aspects of the statistical tests that warrant discussion. The clustering of modifications by their biological function or position in the protein, prior to statistical tests, may further improve the reliability and interpretability of results. N-terminal acetylation is one example where the position of the residue in the protein is probably a more important factor than its immediate sequence environment. Repeated sequence motifs, such as those found in collagen, should also be considered to ensure minimal structural bias. However, this requires a balance between minimizing dataset redundancy and having enough data to produce statistically reliable results. In

regards to the quantity of data, the number of significant results increased with the incidence of the modification. This can be clearly observed in the Supporting Information (Table S3). Although this study successfully estimated surface accessibility for 44 types of post-translational modification, similar studies for other modifications will not be possible until a greater number of modifications are experimentally found.

Evolutionary Rates of Surface-Accessible Regions and Surface-Associated Modifications. There are many proteins within a cell that contain modified amino acids and many proteins with modification-specific binding domains. This raises the question of how the cell prevents random interactions from occurring. It is believed, for example, that the assembly of phosphorylated proteins with their corresponding protein complex takes place “just-in-time” for the required biological activity.11,61 The gene for the phosphorylated protein is regulated and is only expressed just as it is needed. When the required job is completed, the phosphorylated protein is then quickly degraded via the ubiquitin-proteasome pathway. Evidence suggests that dynamically transcribed proteins are more likely to be phosphorylated and regulated by targeted degradation.62 Moreover, proteins with more interaction partners are more likely to be phosphorylated, and these proteins are more likely to be actively targeted for degradation.11 Most importantly, these mechanisms were shown to have co-evolved independently in the cell-cycle regulation of humans, S. cerevisiae, S. pombe, and A. thaliana.62 Jensen et al. (2006)62 have shown that the evolutionary loss or gain of transcriptional regulation and post-translational modifications are highly correlated and tend to occur in unison within a short evolutionary time scale. It is also worth noting that surface-accessible regions may have faster evolutionary rates,63,64 as may disordered regions, especially those within alternatively spliced exons.65-67 These regions may evolve faster because of fewer structural constraints and more freedom for nonsynonymous amino acid substitution.66 They may have contributed to the faster evolution of phosphorylation sites, and possibly of other post-translational modifications, required for the above-mentioned co-evolution.

Conclusion

Here we have shown that reversible post-translational modifications, particularly those specifically bound by recognition domains, are mainly found within surface-accessible regions of proteins. They are also more likely to be within regions of intrinsic disorder. Some irreversible modifications show a strong preference for surface association, others show a preference for the protein core, and some show no clear preference for structural environment. This manuscript has also shown the power of combining many surface accessibility, intrinsic disorder region, and structural prediction methods to generate a consensus prediction.

Acknowledgment. C.N.I.P. is the recipient of an Australian Postgraduate Award. This research is supported in part by a University of New South Wales Faculty Research Grant. We thank Robert M. Esnouf for providing the standalone version of RONN software.

Supporting Information Available: Table S1 contains the Swiss-Prot accession numbers for the sequences used in the construction of the modified residue dataset and the unmodified residue dataset. Table S2 contains the number of sequences passing through each filtering step in creating the final sequence database. Table S3 contains the summary hypothesis testing results, including the modification types with fewer than 17 datapoints, and is similar to Tables 1-4 in the text. Table S4 is similar to Table S3 but gives the statistical testing results. It displays the p-value of the hypothesis test and whether the mean for the modified residue dataset was above, equal to, or below the mean for the unmodified residue dataset. This is followed by the mean for the unmodified dataset, the 95% confidence interval for the hypothesis test, whether the t-test or Wilcoxon rank sum test was used, and whether the variance was equal or unequal as determined by Levene’s test. Table S5 contains the structural environment of amidation and is similar to Tables 1-4 in the text. This material is available free of charge via the Internet at http://pubs.acs.org.
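As an illustration of one of the tests named for Table S4, the following is a minimal sketch of a two-sided Wilcoxon rank sum test comparing prediction scores for modified and unmodified residues. It is not the authors' code: the function name and example scores are hypothetical, and the p-value uses a normal approximation with a tie correction and no continuity correction, which is an assumption of this sketch.

```python
import math
from statistics import NormalDist

def rank_sum_test(modified, unmodified):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (U statistic for the modified sample, approximate p-value).
    """
    n1, n2 = len(modified), len(unmodified)
    pooled = sorted([(v, 0) for v in modified] + [(v, 1) for v in unmodified])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = w1 - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical; no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p_value

# Hypothetical prediction scores (illustrative only).
u_stat, p = rank_sum_test([0.9, 0.8, 0.7, 0.95, 0.85], [0.1, 0.2, 0.3, 0.4, 0.5])
```

For small samples an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch self-contained.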

References

(1) Iakoucheva, L. M.; Radivojac, P.; Brown, C. J.; O’Connor, T. R.; Sikes, J. G.; Obradovic, Z.; Dunker, A. K. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 2004, 32(3), 1037-1049.
(2) Dunker, A. K.; Brown, C. J.; Lawson, J. D.; Iakoucheva, L. M.; Obradovic, Z. Intrinsic disorder and protein function. Biochemistry 2002, 41(21), 6573-6582.
(3) Pawson, T.; Nash, P. Assembly of cell regulatory systems through protein interaction domains. Science 2003, 300(5618), 445-452.
(4) Seet, B. T.; Dikic, I.; Zhou, M. M.; Pawson, T. Reading protein modifications with interaction domains. Nat. Rev. Mol. Cell Biol. 2006, 7(7), 473-483.
(5) Patil, A.; Nakamura, H. Disordered domains and high surface charge confer hubs with the ability to interact with multiple proteins in interaction networks. FEBS Lett. 2006, 580(8), 2041-2045.
(6) Ekman, D.; Light, S.; Bjorklund, A. K.; Elofsson, A. What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol. 2006, 7(6), R45.
(7) Dunker, A. K.; Cortese, M. S.; Romero, P.; Iakoucheva, L. M.; Uversky, V. N. Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J. 2005, 272(20), 5129-5148.
(8) Johnson, L. N.; Lewis, R. J. Structural basis for control by phosphorylation. Chem. Rev. 2001, 101(8), 2209-2242.
(9) Groban, E. S.; Narayanan, A.; Jacobson, M. P. Conformational changes in protein loops and helices induced by post-translational phosphorylation. PLoS Comput. Biol. 2006, 2(4), e32.
(10) Bustos, D. M.; Iglesias, A. A. Intrinsic disorder is a key characteristic in partners that bind 14-3-3 proteins. Proteins 2006, 63(1), 35-42.
(11) Batada, N. N.; Hurst, L. D.; Tyers, M. Evolutionary and physiological importance of hub proteins. PLoS Comput. Biol. 2006, 2(7), e88.
(12) Bairoch, A.; Apweiler, R.; Wu, C. H.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Natale, D. A.; O’Donovan, C.; Redaschi, N.; Yeh, L. S. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005, 33(Database issue), D154-159.
(13) Hermjakob, H.; Fleischmann, W.; Apweiler, R. Swissknife - “lazy parsing” of SWISS-PROT entries. Bioinformatics 1999, 15(9), 771-772.
(14) Wu, C. H.; Apweiler, R.; Bairoch, A.; Natale, D. A.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Mazumder, R.; O’Donovan, C.; Redaschi, N.; Suzek, B. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34(Database issue), D187-191.
(15) Stajich, J. E.; Block, D.; Boulez, K.; Brenner, S. E.; Chervitz, S. A.; Dagdigian, C.; Fuellen, G.; Gilbert, J. G.; Korf, I.; Lapp, H.; Lehvaslaiho, H.; Matsalla, C.; Mungall, C. J.; Osborne, B. I.; Pocock, M. R.; Schattner, P.; Senger, M.; Stein, L. D.; Stupka, E.; Wilkinson, M. D.; Birney, E. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12(10), 1611-1618.
(16) Do, C. B.; Mahabhashyam, M. S.; Brudno, M.; Batzoglou, S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005, 15(2), 330-340.


(17) Thompson, J. D.; Higgins, D. G.; Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22(22), 4673-4680.
(18) Pearl, F.; Todd, A.; Sillitoe, I.; Dibley, M.; Redfern, O.; Lewis, T.; Bennett, C.; Marsden, R.; Grant, A.; Lee, D.; Akpor, A.; Maibaum, M.; Harrison, A.; Dallman, T.; Reeves, G.; Diboun, I.; Addou, S.; Lise, S.; Johnston, C.; Sillero, A.; Thornton, J.; Orengo, C. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res. 2005, 33(Database issue), D247-251.
(19) Moelbert, S.; Emberly, E.; Tang, C. Correlation between sequence hydrophobicity and surface-exposure pattern of database proteins. Protein Sci. 2004, 13(3), 752-762.
(20) Kyte, J.; Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 1982, 157(1), 105-132.
(21) Naderi-Manesh, H.; Sadeghi, M.; Arab, S.; Moosavi Movahedi, A. A. Prediction of protein surface accessibility with information theory. Proteins 2001, 42(4), 452-459.
(22) Ahmad, S.; Gromiha, M. M.; Sarai, A. Real value prediction of solvent accessibility from amino acid sequence. Proteins 2003, 50(4), 629-635.
(23) Ahmad, S.; Gromiha, M. M.; Sarai, A. RVP-net: online prediction of real valued accessible surface area of proteins from single sequences. Bioinformatics 2003, 19(14), 1849-1851.
(24) Yang, Z. R.; Thomson, R.; McNeil, P.; Esnouf, R. M. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 2005, 21(16), 3369-3376.
(25) Linding, R.; Jensen, L. J.; Diella, F.; Bork, P.; Gibson, T. J.; Russell, R. B. Protein disorder prediction: implications for structural proteomics. Structure 2003, 11(11), 1453-1459.
(26) Kabsch, W.; Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12), 2577-2637.
(27) George, R. A.; Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 2002, 15(11), 871-879.
(28) Suyama, M.; Ohara, O. DomCut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics 2003, 19(5), 673-674.
(29) Dumontier, M.; Yao, R.; Feldman, H. J.; Hogue, C. W. Armadillo: domain boundary prediction by amino acid composition. J. Mol. Biol. 2005, 350(5), 1061-1073.
(30) Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999, 292(2), 195-202.
(31) R Development Core Team. R: A language and environment for statistical computing; R Foundation for Statistical Computing: Vienna, Austria, 2005. http://www.R-project.org.
(32) Slawson, C.; Hart, G. W. Dynamic interplay between O-GlcNAc and O-phosphate: the sweet side of protein regulation. Curr. Opin. Struct. Biol. 2003, 13(5), 631-636.
(33) Bannister, A. J.; Kouzarides, T. Reversing histone methylation. Nature 2005, 436(7054), 1103-1106.
(34) Eyrich, V. A.; Marti-Renom, M. A.; Przybylski, D.; Madhusudhan, M. S.; Fiser, A.; Pazos, F.; Valencia, A.; Sali, A.; Rost, B. EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics 2001, 17(12), 1242-1243.
(35) Montgomerie, S.; Sundararaj, S.; Gallin, W. J.; Wishart, D. S. Improving the accuracy of protein secondary structure prediction using structural alignment. BMC Bioinformatics 2006, 7, 301.
(36) Lee, T. Y.; Huang, H. D.; Hung, J. H.; Huang, H. Y.; Yang, Y. S.; Wang, T. H. dbPTM: an information repository of protein posttranslational modification. Nucleic Acids Res. 2006, 34(Database issue), D622-627.
(37) Arthur, J. W.; Sanchez-Perez, A.; Cook, D. I. Scoring of predicted GRK2 phosphorylation sites in Nedd4-2. Bioinformatics 2006, 22(18), 2192-2195.
(38) Daily, K. M.; Radivojac, P.; Dunker, A. K. In Intrinsic disorder and protein modifications: building an SVM predictor for methylation; IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB: San Diego, CA, November 2005; pp 475-481.
(39) Hoch, J. A.; Varughese, K. I. Keeping signals straight in phosphorelay signal transduction. J. Bacteriol. 2001, 183(17), 4941-4949.


(40) Varughese, K. I. Molecular recognition of bacterial phosphorelay proteins. Curr. Opin. Microbiol. 2002, 5(2), 142-148.
(41) Walsh, C. T. Posttranslational Modifications of Proteins: Expanding Nature’s Inventory, 1st ed.; Roberts and Co. Publishers: Colorado, 2006.
(42) Cho, H. S.; Pelton, J. G.; Yan, D.; Kustu, S.; Wemmer, D. E. Phosphoaspartates in bacterial signal transduction. Curr. Opin. Struct. Biol. 2001, 11(6), 679-684.
(43) Kern, D.; Volkman, B. F.; Luginbuhl, P.; Nohaile, M. J.; Kustu, S.; Wemmer, D. E. Structure of a transiently phosphorylated switch in bacterial signal transduction. Nature 1999, 402(6764), 894-898.
(44) Yang, X.; Zhang, F.; Kudlow, J. E. Recruitment of O-GlcNAc transferase to promoters by corepressor mSin3A: coupling protein O-GlcNAcylation to transcriptional repression. Cell 2002, 110(1), 69-80.
(45) Bedford, M. T.; Richard, S. Arginine methylation: an emerging regulator of protein function. Mol. Cell 2005, 18(3), 263-272.
(46) Cuthbert, G. L.; Daujat, S.; Snowden, A. W.; Erdjument-Bromage, H.; Hagiwara, T.; Yamada, M.; Schneider, R.; Gregory, P. D.; Tempst, P.; Bannister, A. J.; Kouzarides, T. Histone deimination antagonizes arginine methylation. Cell 2004, 118(5), 545-553.
(47) Wang, Y.; Wysocka, J.; Sayegh, J.; Lee, Y. H.; Perlin, J. R.; Leonelli, L.; Sonbuchner, L. S.; McDonald, C. H.; Cook, R. G.; Dou, Y.; Roeder, R. G.; Clarke, S.; Stallcup, M. R.; Allis, C. D.; Coonrod, S. A. Human PAD4 regulates histone arginine methylation levels via demethylimination. Science 2004, 306(5694), 279-283.
(48) Thompson, P. R.; Fast, W. Histone citrullination by protein arginine deiminase: is arginine methylation a green light or a roadblock? ACS Chem. Biol. 2006, 1(7), 433-441.
(49) Mizuno, K.; Hayashi, T.; Bachinger, H. P. Hydroxylation-induced stabilization of the collagen triple helix. Further characterization of peptides with 4(R)-hydroxyproline in the Xaa position. J. Biol. Chem. 2003, 278(34), 32373-32379.
(50) Moore, K. L. The biology and enzymology of protein tyrosine O-sulfation. J. Biol. Chem. 2003, 278(27), 24243-24246.
(51) Monigatti, F.; Hekking, B.; Steen, H. Protein sulfation analysis-A primer. Biochim. Biophys. Acta 2006, 1764(12), 1904-1913.
(52) Martinez, A.; Treston, A. M. Where does amidation take place? Mol. Cell. Endocrinol. 1996, 123(2), 113-117.
(53) Mulder, N. J.; Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Binns, D.; Bradley, P.; Bork, P.; Bucher, P.; Cerutti, L.; Copley, R.; Courcelle, E.; Das, U.; Durbin, R.; Fleischmann, W.; Gough, J.; Haft, D.; Harte, N.; Hulo, N.; Kahn, D.; Kanapin, A.; Krestyaninova, M.; Lonsdale, D.; Lopez, R.; Letunic, I.; Madera, M.; Maslen, J.; McDowall, J.; Mitchell, A.; Nikolskaya, A. N.; Orchard, S.; Pagni, M.; Ponting, C. P.; Quevillon, E.; Selengut, J.; Sigrist, C. J.; Silventoinen, V.; Studholme, D. J.; Vaughan, R.; Wu, C. H. InterPro, progress and status in 2005. Nucleic Acids Res. 2005, 33(Database issue), D201-205.

(54) Bradbury, A. F.; Smyth, D. G. Biosynthesis of the C-terminal amide in peptide hormones. Biosci. Rep. 1987, 7(12), 907-916.
(55) Enomoto, K.; Gill, D. M. Cholera toxin activation of adenylate cyclase. Roles of nucleoside triphosphates and a macromolecular factor in the ADP ribosylation of the GTP-dependent regulatory component. J. Biol. Chem. 1980, 255(4), 1252-1258.
(56) Sanger, F.; Thompson, E. O.; Kitai, R. The amide groups of insulin. Biochem. J. 1955, 59(3), 509-518.
(57) Awade, A. C.; Cleuziat, P.; Gonzales, T.; Robert-Baudouy, J. Pyrrolidone carboxyl peptidase (Pcp): an enzyme that removes pyroglutamic acid (pGlu) from pGlu-peptides and pGlu-proteins. Proteins 1994, 20(1), 34-51.
(58) Hochstrasser, D. F. Proteome in perspective. Clin. Chem. Lab. Med. 1998, 36(11), 825-836.
(59) Sarioglu, H.; Lottspeich, F.; Walk, T.; Jung, G.; Eckerskorn, C. Deamidation as a widespread phenomenon in two-dimensional polyacrylamide gel electrophoresis of human blood plasma proteins. Electrophoresis 2000, 21(11), 2209-2218.
(60) Wright, H. T. Nonenzymatic deamidation of asparaginyl and glutaminyl residues in proteins. Crit. Rev. Biochem. Mol. Biol. 1991, 26(1), 1-52.
(61) de Lichtenberg, U.; Jensen, L. J.; Brunak, S.; Bork, P. Dynamic complex formation during the yeast cell cycle. Science 2005, 307(5710), 724-727.
(62) Jensen, L. J.; Jensen, T. S.; de Lichtenberg, U.; Brunak, S.; Bork, P. Co-evolution of transcriptional and post-translational cell-cycle regulation. Nature 2006, 443(711), 594-597.
(63) Goldman, N.; Thorne, J. L.; Jones, D. T. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 1998, 149(1), 445-458.
(64) Pal, C.; Papp, B.; Lercher, M. J. An integrated view of protein evolution. Nat. Rev. Genet. 2006, 7(5), 337-348.
(65) Brown, C. J.; Takayama, S.; Campen, A. M.; Vise, P.; Marshall, T. W.; Oldfield, C. J.; Williams, C. J.; Dunker, A. K. Evolutionary rate heterogeneity in proteins with long disordered regions. J. Mol. Evol. 2002, 55(1), 104-110.
(66) Romero, P. R.; Zaidi, S.; Fang, Y. Y.; Uversky, V. N.; Radivojac, P.; Oldfield, C. J.; Cortese, M. S.; Sickmeier, M.; LeGall, T.; Obradovic, Z.; Dunker, A. K. Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proc. Natl. Acad. Sci. U.S.A. 2006, 103(22), 8390-8395.
(67) Chen, F. C.; Wang, S. S.; Chen, C. J.; Li, W. H.; Chuang, T. J. Alternatively and constitutively spliced exons are subject to different evolutionary forces. Mol. Biol. Evol. 2006, 23(3), 675-682.

PR060674U
