Prediction of Tyrosine Sulfation with mRMR Feature Selection and Analysis Shen Niu,†,‡,| Tao Huang,†,‡,| Kaiyan Feng,| Yudong Cai,*,§,⊥ and Yixue Li*,‡,| Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China, Institute of Systems Biology, Shanghai University, Shanghai 200444, P. R. China, Shanghai Center for Bioinformation Technology, Shanghai 200235, P. R. China, and Centre for Computational Systems Biology, Fudan University, Shanghai 200433, P. R. China Received July 10, 2010
Protein tyrosine sulfation is a ubiquitous post-translational modification (PTM) of secreted and transmembrane proteins that pass through the Golgi apparatus. In this study, we developed a new method for protein tyrosine sulfation prediction based on a nearest neighbor algorithm with the maximum relevance minimum redundancy (mRMR) method followed by incremental feature selection (IFS). We incorporated features of sequence conservation, residual disorder, and amino acid factor, 229 features in total, to predict tyrosine sulfation sites. From these 229 features, 145 features were selected and deemed as the optimized features for the prediction. The prediction model achieved a prediction accuracy of 90.01% using the optimal 145-feature set. Feature analysis showed that conservation, disorder, and physicochemical/biochemical properties of amino acids all contributed to the sulfation process. Site-specific feature analysis showed that the features derived from its surrounding sites contributed profoundly to sulfation site determination in addition to features derived from the sulfation site itself. The detailed feature analysis in this paper might help understand more of the sulfation mechanism and guide the related experimental validation. Keywords: Sulfation • maximum relevance minimum redundancy • incremental feature selection • nearest neighbor algorithm
* Corresponding authors. Yixue Li: Tel.: 86-21-54065001. Fax: 86-21-54065057. E-mail: [email protected]. Yudong Cai: Tel.: 86-21-66136132. Fax: 86-21-66136109. E-mail: [email protected]. † These authors contributed equally to this work. ‡ Chinese Academy of Sciences. § Shanghai University. | Shanghai Center for Bioinformation Technology. ⊥ Fudan University.
Journal of Proteome Research 2010, 9, 6490–6497. Published on Web 10/25/2010
1. Introduction Various post-translational modifications (PTMs) of proteins play important roles in proteome structural and functional diversity and regulate various biological processes. Tyrosine sulfation, one of the PTMs, occurs in various species and cell types.1-5 It is catalyzed by one of the two tyrosylprotein sulfotransferases (TPSTs), TPST-1 (ref 6) and TPST-2 (refs 7 and 8), through transfer of the sulfuryl group from 3′-phosphoadenosine-5′-phosphosulfate (PAPS) to the phenol group of tyrosine.1 As one of the most universal PTMs in secreted and transmembrane proteins, tyrosine sulfation has been experimentally demonstrated to be essential to extracellular protein-protein interactions, intracellular protein transportation modulation, and protein proteolytic process regulation3,9-11 and has been implicated in various pathophysiological processes such as atherosclerosis, lung disease, and HIV infection.12-14 Up to approximately 1% of the tyrosines in the total protein content of a cell can be sulfated.15 Overall, identification of protein tyrosine sulfation sites is of fundamental importance for understanding the
molecular mechanism of tyrosine sulfation in biological systems. Because of the lability of sulfotyrosine, it is difficult to determine tyrosine sulfation sites using conventional experimental approaches including chemical sequencing and mass spectrometry analysis.15-17 Although many sulfated proteins have been identified, few sulfotyrosine sites have been exactly determined.1,3 In addition, determining tyrosine sulfation sites by conventional experimental approaches can be time-consuming and labor-intensive, especially for large-scale data sets. Therefore, it is much more convenient and efficient to predict tyrosine sulfation sites using in silico algorithms, especially at the proteome level. Since there are no specific sequence conservation patterns around the tyrosine sulfation site, it is difficult to predict the sulfotyrosine site.9,18 Rosenquist and Nicholas studied the effect of basic, hydrophobic, small amino acids, disulfide, N-glycosylation (sugar), and acidic sites surrounding the tyrosine sites and found that some sulfation sites were surrounded by acidic amino acids.19,20 However, many other sulfotyrosine sites have no acidic residues in their flanking regions; for instance, Tyr30 in mouse Lumican can be sulfated, but no acidic amino acids exist within ±5 residues.9 Further, Yu et al. developed a position-specific scoring matrix (PSSM) to predict tyrosine sulfation sites in seven-transmembrane peptide receptors.21 In 2002, Sulfinator22 was developed using four Hidden Markov Models for the prediction of sulfotyrosine sites using information from sequence alignment of 68 sequence windows containing tyrosine sulfation sites. Chang et al. developed a method called SulfoSite,18 which considered both structural information, such as secondary structure and accessible surface area (ASA), and other sequence information for the prediction of tyrosine sulfation sites using an SVM model, with 162 experimentally verified tyrosine sulfation sites as positive samples. However, most existing methods have their limitations. For example, Sulfinator cannot identify certain kinds of sulfated tyrosines, such as the sulfotyrosines in extracellular class II leucine-rich repeat (LRR) proteins that were identified by mass spectrometry.15,23 The prediction model of SulfoSite is a “black box” from which no useful biological information can be obtained, and no biological analysis of its features was provided.
In this work, a new computational method to predict tyrosine sulfation sites was developed based on the nearest neighbor algorithm (NNA), a machine learning approach, combined with feature selection (IFS based on mRMR). The features we used can be grouped into three categories: position-specific scoring matrix (PSSM) conservation scores, amino acid factors, and disorder scores. Our study has the following features: (1) three kinds of features were considered; (2) a nearest neighbor algorithm (NNA) was used as the prediction model, which is more effective than the HMM (Hidden Markov Model) and SVM (Support Vector Machine) used by Sulfinator and SulfoSite, respectively; (3) the jackknife cross-validation method was used to evaluate the performance of our classifier; and (4) features were selected and analyzed.
10.1021/pr1007152 © 2010 American Chemical Society
Feature analysis shows that the conservation of amino acids at certain residue sites around the tyrosine plays an important role in sulfation site prediction; it also shows that the secondary structure, codon diversity, molecular volume, polarity, and electrostatic charge of amino acids in the flanking sequences are important for the sulfation process and that the structural disorder of the flanking sequence is strongly related to sulfation.
2. Materials and Methods 2.1. Data Sets. We downloaded protein sequences containing sulfated tyrosine sites from SysPTM (version 1.1)24 and UniProt (version 2010_06).25,26 By removing redundant sequences and sequences less than 50 amino acids, 75 protein sequences were left and used in our study. We randomly separated these 75 protein sequences into two parts: 60 sequences as the training data set and 15 sequences as an independent test data set. Then we extracted consecutive peptides containing 9 residues with tyrosine itself, 4 residues upstream, and 4 residues downstream of the tyrosine (Y) for the training and independent test data set separately. The sulfated tyrosine sites are considered to be positive samples, and the unsulfated tyrosine sites are considered to be negative samples. For the training data set, there are a total of 731 samples including 102 positive samples and 629 negative samples. For the independent test data set, there are a total of 96 samples including 27 positive samples and 69 negative samples. The training and independent test data sets were given in Data set S1 and Data set S2 (Supporting Information), respectively. 2.2. Feature Construction. 2.2.1. PSSM Conservation Score Features. In biological analysis, one of the most important aspects of concern is the evolutionary conservation. A more conserved status of residues at specific protein sites may indicate that they are under stronger selective pressure and
research articles
therefore are likely to be more important for the protein functioning. There are imperfect conserved flanking sequences surrounding sulfotyrosine sites which have been demonstrated by several previous works.18,19,22 For example, there are much more D residues at the site directly upstream to the sulfotyrosine sites. Both our and previous studies considered sequence conservation features as a primary factor for the prediction of tyrosine sulfation sites.18,21,22 Position-specific iterative BLAST (PSI BLAST)27 can be used to measure the conservation status for a specified location. It denotes normalized probabilities (log odds scores) of conservation against transitions to 20 different amino acids for a specific residue by a 20-dimensional vector. All such 20-dimensional vectors for all residues in a given sequence composed a matrix called PSSM (position-specific scoring matrix). Residues conserved through cycles of PSI BLAST were suggested to be important in biological functioning. In our study, the PSSM conservation score was used to quantify the conservation status against 20 different amino acids of each residue in the protein sequence. 2.2.2. Amino Acid Factor Features. The specificity and diversity of protein structure and function are largely attributed to the composition of various properties of each of the 20 amino acids. Previous studies have shown the important effect of individual amino acid physicochemical properties in discriminating the sulfotyrosine from those nonsulfated tyrosines. Rosenquist et al. had demonstrated the effect of basic, hydrophobic, small amino acids, disulfide, N-glycosylation (sugar), and especially acidic sites surrounding the tyrosine sites on the prediction of sulfotyrosine sites.19,20 The effect of polarity, secondary structure, and charge distribution on the determination of tyrosine sulfation has also been demonstrated in ref 20. 
Atchley et al.28 performed multivariate statistical analyses on AAIndex,29 which is a database of the biochemical and physicochemical properties of amino acids. They summarized and transformed AAIndex into five highly interpretable, multidimensional numeric indices reflecting secondary structure, polarity, molecular volume, electrostatic charge, and codon diversity. We used these five numerical index scores (which we call “amino acid factors”) to represent the respective properties of each amino acid in this research.
2.2.3. Disorder Score Features. The functional importance of protein segments that lack fixed 3-D structures under physiological conditions has been increasingly recognized.30,31 The disordered regions of proteins always contain sorting signals, PTM sites, and protein ligands and consist of disordered, unstructured, and flexible regions without regular secondary structure. Protein disorder in the nonglobular segments allows for more modification sites and interaction partners, so it is of great importance for protein structure and function.30,32,33 In this study, we used VSL2 (ref 34) to calculate the disorder score, which represents the disorder status of each amino acid in the given protein sequence. The VSL2 predictors can accurately predict both long and short disordered regions in proteins.35,36 The disorder features consist of the disorder scores of the tyrosine site and of the four flanking sites on each of its C-terminal and N-terminal sides.
2.2.4. Feature Space. The feature space of our samples consists of the features of PSSM conservation scores, amino acid factors, and disorder scores. For the tyrosine (Y) site, a total of 21 features were used, including 20 PSSM conservation scores and 1 disorder score. For each of the 4 flanking amino acids on either side (C-terminal and N-terminal), a total of 26 features were used, including 20 PSSM conservation scores, 5 amino acid factors, and 1 disorder score. Overall, each sample peptide was encoded by 26 × 8 + 21 = 229 features.
2.3. mRMR Method. To rank the importance of the 229 features, we used the maximum relevance minimum redundancy (mRMR) method, which was first developed by Peng et al.37 for the analysis of microarray data. The mRMR method ranks features based on their relevance to the target while also taking the redundancy among features into account. Features that have the best trade-off between maximum relevance to the target and minimum redundancy were considered “good” features. To quantify both relevance and redundancy, mutual information (MI), which quantifies the relationship between two vectors, is defined as the following.
$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \qquad (1)$$
where x and y are vectors; p(x,y) is the joint probabilistic density; and p(x) and p(y) are the marginal probabilistic densities. Let Ω denote the whole feature set, while Ωs denotes the already-selected feature set which contains m features and Ωt denotes the to-be-selected feature set which contains n features. c denotes the class of the sample tyrosine, that is, whether it was sulfated or not. Relevance D of the feature f in Ωt with the target c can be calculated by

$$D = I(f, c) \qquad (2)$$
and redundancy R of the feature f in Ωt with all the features in Ωs can be calculated by

$$R = \frac{1}{m} \sum_{f_i \in \Omega_s} I(f, f_i) \qquad (3)$$
To obtain the feature fj in Ωt with maximum relevance and minimum redundancy, eq 2 and eq 3 are combined with the mRMR function

$$\max_{f_j \in \Omega_t} \left[ I(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} I(f_j, f_i) \right] \quad (j = 1, 2, \ldots, n) \qquad (4)$$
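The greedy ranking implied by eqs 2-4 can be sketched in a few lines of Python. This is a minimal illustration rather than the mRMR program used in the study: it assumes the features have already been discretized, estimates mutual information from simple frequency counts, and uses hypothetical names (`mutual_information`, `mrmr_rank`) and toy data that do not come from the paper.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Frequency-count estimate of I(x, y) in nats for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(a,b) * log(p(a,b) / (p(a) p(b))) with probabilities written as counts / n
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_rank(features, target):
    """Rank features greedily by relevance minus mean redundancy (eq 4).

    `features` maps a feature name to its list of discrete values.
    """
    remaining = list(features)
    selected = []
    while remaining:
        def score(name):
            relevance = mutual_information(features[name], target)      # eq 2
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)         # eq 3
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): f_b duplicates f_a, f_c is weaker but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "f_a": [0, 0, 0, 0, 1, 1, 1, 0],
    "f_b": [0, 0, 0, 0, 1, 1, 1, 0],
    "f_c": [0, 1, 0, 0, 1, 1, 0, 1],
}
ranking = mrmr_rank(toy_features, target)
print(ranking)
```

In this toy case the duplicated feature is pushed to the end of the ranking even though its relevance equals that of the first feature, which is the trade-off that the redundancy term in eq 4 is meant to capture.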
For a feature set with N (N = m + n) features, the feature evaluation continues for N rounds. After these evaluations, we obtain a feature set S by the mRMR method

$$S = \{f'_1, f'_2, \ldots, f'_h, \ldots, f'_N\} \qquad (5)$$
In this feature set S, each feature has an index h indicating the round in which the feature was extracted. Better features are extracted earlier and therefore have a smaller index h.
2.4. Nearest Neighbor Algorithm. In our study, the nearest neighbor algorithm (NNA) is used as the prediction model. The NNA makes its decision by calculating similarities between the test sample and all the training samples. In our study, the distance between vectors px and py is defined as follows38,39

$$D(p_x, p_y) = 1 - \frac{p_x \cdot p_y}{\|p_x\| \cdot \|p_y\|} \qquad (6)$$

where px · py is the inner product of px and py and ||p|| represents the modulus (Euclidean norm) of vector p. The smaller D(px,py) is, the more similar px is to py. In the NNA, given a vector pt and training set P = {p1, p2, ..., pn, ..., pN}, pt is assigned to the class of its nearest neighbor pn in P, i.e., the vector having the smallest D(pn,pt)

$$D(p_n, p_t) = \min\{D(p_1, p_t), D(p_2, p_t), \ldots, D(p_z, p_t), \ldots, D(p_N, p_t)\} \quad (z \neq t) \qquad (7)$$
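As an illustration of eqs 6 and 7, the following Python sketch implements the cosine-based distance and the nearest neighbor decision rule. The function names and the two-dimensional toy vectors are hypothetical; the sketch also assumes that no feature vector is all zeros, since the distance in eq 6 is undefined in that case.

```python
import math

def cosine_distance(px, py):
    """Eq 6: one minus the cosine of the angle between px and py."""
    dot = sum(a * b for a, b in zip(px, py))
    norm_x = math.sqrt(sum(a * a for a in px))
    norm_y = math.sqrt(sum(b * b for b in py))
    return 1.0 - dot / (norm_x * norm_y)   # assumes both norms are nonzero

def nearest_neighbor_predict(query, train_vectors, train_labels):
    """Eq 7: return the label of the training vector closest to `query`."""
    distances = [cosine_distance(query, p) for p in train_vectors]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[nearest]

# Toy example (hypothetical): label 1 = sulfated, label 0 = unsulfated.
train_vectors = [[1.0, 0.0], [0.0, 1.0]]
train_labels = [1, 0]
predicted = nearest_neighbor_predict([0.9, 0.1], train_vectors, train_labels)
print(predicted)
```

Because the distance depends only on the angle between vectors, rescaling a feature vector by a positive constant does not change its nearest neighbor.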
2.5. Jackknife Cross-Validation Method. We used the jackknife cross-validation method40-42 (also called leave-one-out cross-validation, LOOCV), which is an objective and effective way to evaluate the performance of a classifier. In the jackknife cross-validation method, every sample is tested by the predictor that is trained with all the other samples. To evaluate the performance of our sulfation site predictor, we calculated the accuracy rates for the positive, negative, and total samples separately as follows

$$\begin{cases} \text{accuracy of positive data set} = \dfrac{\text{correctly predicted positive samples}}{\text{positive samples}} \\[2ex] \text{accuracy of negative data set} = \dfrac{\text{correctly predicted negative samples}}{\text{negative samples}} \\[2ex] \text{overall accuracy} = \dfrac{\text{correctly predicted positive samples} + \text{correctly predicted negative samples}}{\text{positive samples} + \text{negative samples}} \end{cases} \qquad (8)$$
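The jackknife procedure and the three accuracy rates of eq 8 can be sketched as follows. The sketch is generic: `predict` stands for any classifier with the signature shown, and the one-dimensional stand-in predictor and toy samples are hypothetical, chosen only to keep the example short. It also assumes that both classes are present, so that neither denominator in eq 8 is zero.

```python
def jackknife_predictions(samples, labels, predict):
    """Leave-one-out: predict each sample from a model built on all the others."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(predict(samples[i], train_x, train_y))
    return predictions

def accuracy_rates(labels, predictions):
    """Eq 8: accuracy on positive samples, negative samples, and all samples."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    correct_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    correct_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return {
        "positive": correct_pos / positives,
        "negative": correct_neg / negatives,
        "overall": (correct_pos + correct_neg) / (positives + negatives),
    }

def nearest_1d(query, train_x, train_y):
    """Hypothetical stand-in predictor: label of the closest 1-D training value."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]

samples = [0.1, 0.2, 0.3, 0.9, 1.0, 0.35]
labels = [0, 0, 0, 1, 1, 1]
predictions = jackknife_predictions(samples, labels, nearest_1d)
rates = accuracy_rates(labels, predictions)
print(predictions, rates)
```

The accuracy on the positive data set corresponds to what is commonly called sensitivity, and the accuracy on the negative data set to specificity.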
2.6. Incremental Feature Selection (IFS). Although mRMR ranks the features based on their importance, the number of features needed to optimize the discrimination between sulfated and nonsulfated samples was not known. In this study, we used incremental feature selection (IFS)39,43 to determine the optimal number of features. Incremental feature selection is conducted on the ranked features. Features in the ranked feature set are added one by one from higher to lower rank. When one feature is added, a new feature set is obtained. Thus, we get N feature sets, where N is the number of features, and the ith feature set is

$$S_i = \{f_1, f_2, \ldots, f_i\} \quad (1 \le i \le N)$$

On the basis of each of the N feature sets, an NNA predictor was constructed and tested with the jackknife cross-validation test. With N overall accuracy rates, positive accuracy rates, and negative accuracy rates calculated, we obtain an IFS table with one column being the index i and the other column being the overall accuracy rate. Soptimal is the optimal feature set that achieves the highest overall accuracy rate.
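The IFS loop described above can be sketched as below. The function `evaluate` is a placeholder for "build an NNA predictor on this feature subset and return its jackknife overall accuracy"; here it is replaced by a hypothetical lookup table so that the example stays self-contained. When several subsets tie for the best accuracy, this sketch keeps the smallest one, which is an assumption and not something the text specifies.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate the nested subsets S_1, ..., S_N of a ranked feature list.

    Returns the best subset and the IFS table of (i, overall accuracy) rows.
    """
    table = []
    for i in range(1, len(ranked_features) + 1):
        table.append((i, evaluate(ranked_features[:i])))
    # max() returns the first maximal row, i.e., the smallest best subset
    best_i, _ = max(table, key=lambda row: row[1])
    return ranked_features[:best_i], table

# Hypothetical accuracies keyed by subset size, standing in for jackknife results.
toy_accuracy = {1: 0.70, 2: 0.85, 3: 0.90, 4: 0.88}
ranked = ["f1", "f2", "f3", "f4"]
best_subset, ifs_table = incremental_feature_selection(
    ranked, lambda subset: toy_accuracy[len(subset)]
)
print(best_subset, ifs_table)
```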
3. Results and Discussion 3.1. mRMR Result. Using the mRMR program, we obtained the ranked mRMR list of 229 features. Within the list, a smaller index of a feature indicates that it is deemed as a more important feature in discriminating the positive samples from the negative ones. The mRMR list was provided in Table S1 (Supporting Information) and was used in the IFS procedure for feature selection and analysis.
Figure 1. Distribution of prediction accuracy against feature number. The IFS prediction accuracy is plotted against the number of features, based on Table S2 (Supporting Information). The maximum accuracy is 0.9001 when 145 features are included. These 145 features were considered the optimal feature set of our classifier.
3.2. IFS Result. On the basis of the outputs of mRMR, we built 229 individual predictors for the 229 feature subsets to predict tyrosine sulfation sites. As described in the Materials and Methods section, we tested the predictors with one feature, two features, three features, etc., and the IFS results can be found in Table S2 (Supporting Information). Figure 1 shows the IFS curve plotted on the basis of Table S2 (Supporting Information). The maximum accuracy is 0.9001 when 145 features are included. These 145 features were considered the optimal feature set of our classifier. On the basis of these 145 features, the prediction accuracies of the positive samples and negative samples were 0.6667 and 0.9380, respectively. The 145 optimal features are given in Table S3 (Supporting Information).
3.3. Optimal Feature Set Analysis. As described in the Materials and Methods section, there were three kinds of features: PSSM conservation scores, amino acid factors, and disorder scores. The number of features of each type among the optimized 145 features was investigated and is shown in Figure 2A. Among the optimized 145 features, there were 34 amino acid factor features, 6 disorder score features, and 105 PSSM conservation score features. This suggests that all three kinds of features contribute to the prediction of protein tyrosine sulfation sites and that the conservation score may play an irreplaceable role in sulfation site prediction. Although there are only 9 disorder scores in the initial 229-feature set, 6 disorder scores were selected in the optimal feature set. This indicates the important role of disorder status in tyrosine sulfation determination. The site-specific distribution of the optimal feature set, shown in Figure 2B, demonstrates that sites 1 and 2 have the greatest influence on the prediction of tyrosine sulfation sites. Sites at the center (sites 6 and 7) and site 9 have a relatively small effect on tyrosine sulfation, and sites 3, 4, 5, and 8 have the smallest effect on tyrosine sulfation. The site-specific distribution of the optimal feature set is quite interesting, revealing that the residues at the two distal sides and the relative center are more important for tyrosine sulfation prediction than the remaining residues.
3.3.1. PSSM Conservation Feature Analysis.
Since there were 105 PSSM conservation score features, which account for the greatest proportion of the optimized 145 features, we investigated the number of each kind of amino acid of the PSSM features (Figure 3A) and found that the conservation against mutations to the 20 amino acids influences sulfation differently. Mutations to amino acids C, A, W, K, and M have a larger influence on sulfation than the mutations to other amino acids. We also investigated the number of PSSM features at each site (Figure 3B). The conservation status of the “AA1”, “AA2”, “AA5”, and “AA6” sites was very important for the sulfation site prediction. In particular, the amino acid at site 4 had been shown to be imperfectly conserved and in most cases is a D residue.18 The first feature in the mRMR feature list is the PSSM feature at site 4 against transition to amino acid D, indicating that it is the most important feature for the prediction of tyrosine sulfation sites, which is consistent with previous studies. In addition, the top 10 features in the optimal feature list contain four other PSSM conservation features: the conservation status against residue R at site 9 (index 3, “pssm9.1”), the conservation status against residue T at site 1 (index 8, “pssm1.16”), the conservation status against residue Y at site 7 (index 9, “pssm7.18”), and the conservation status against residue I at site 1 (index 10, “pssm1.9”).
Figure 2. Feature and site-specific distribution of the optimal feature set. (A) Feature distribution of the optimal feature set. Among the optimized 145 features, there were 105 features of PSSM conservation score, 34 features of amino acid factor, and 6 features of disorder score. (B) Site-specific distribution of the optimal feature set. The site-specific distribution of the optimal feature set demonstrates that sites 1 and 2 have the greatest influence on the prediction of tyrosine sulfation. Sites at the center (sites 6 and 7) and site 9 have a relatively small effect on tyrosine sulfation, and sites 3, 4, 5, and 8 have the smallest effect on tyrosine sulfation.
Figure 3. Feature and site-specific distribution of the PSSM features in the optimal feature set. (A) Feature distribution of the PSSM features in the optimal feature set. We investigated the number of each kind of amino acid of the PSSM features and found that the conservation against mutations to the 20 amino acids influences tyrosine sulfation differently. Mutations to amino acids C, A, W, K, and M have a larger influence on sulfation than the mutations to other amino acids. (B) Site-specific distribution of the PSSM features in the optimal feature set. The conservation status of site 1, site 2, site 5, and site 6 was very important for the sulfation site prediction.
3.3.2. Amino Acid Factor Analysis. The number of each type of amino acid factor feature (Figure 4A) and the number of amino acid factor features at each site (Figure 4B) were analyzed. It was found that secondary structure, molecular volume, codon diversity, and polarity were almost equally important features for sulfation site prediction. The electrostatic charge amino acid factor feature has a small influence on sulfation site prediction.
In Figure 4B, residues at sites 2, 4, and 9 have the most important effect on sulfation site prediction, and residues at sites 1, 6, 7, and 8 have a smaller effect on sulfation site prediction. Residues at site 3 have the least effect on sulfation site prediction. The site-specific distribution of the amino acid factor features is consistent with the results of previous work showing that the neighboring residues contribute moderately to sulfation, with some sites, such as site 4, having relatively more influence on sulfation site determination.20 A previous study demonstrated that the charge of the residue at site 4 is critical for tyrosine sulfation.20 The electrostatic feature of this site has an index of 5 in our mRMR feature list, indicating that it is one of the most important features for tyrosine sulfation site prediction. The residue at site 2 can influence the sulfation degree of the tyrosine site.20 The index of the polarity feature of the amino acid at this site is 2 in the mRMR feature list. This indicates that the influence of the residue at this site on tyrosine sulfation degree may be mediated by its polarity status. The presence of amino acid polarity, secondary structure, molecular volume, and electrostatic charge features in the optimal feature set is supported by the effects of these physicochemical properties on the tyrosine sulfation process demonstrated in refs 19, 20, and 44.
3.3.3. Disorder Score Analysis. An NMR study of hirudin found that the peptide chain at the tyrosine sulfation site is too flexible and disordered to determine a structure in that region.45 Rosenquist et al. also demonstrated that small amino acids near the tyrosine sulfation sites were nonuniformly distributed and should make the peptide chain following the tyrosine very flexible, which suggests that TPST may require a substrate that can make a sharp turn when binding to the enzyme.19 The effects of coil structures and turn-inducing residues on tyrosine sulfation site determination had also been demonstrated by various studies.18,44,46 Within the optimal feature set, six disorder scores were selected: the disorder scores at site 1, site 2, site 4, site 5, site 6, and site 7.
The selection of 6 out of the 9 total disorder scores indicates that the disorder status within the tyrosine region is quite important for the tyrosine sulfation process. From the site distribution of the six disorder scores, we can see that the disorder status of the Y site and three adjacent sites may have a greater effect on the sulfation process, and this is consistent with the study carried out by Rosenquist et al. showing that the peptide chain immediately following the sulfated tyrosine should be very flexible to satisfy the requirement of a sharp turn when a substrate binds to enzymes.19
Figure 4. Feature and site-specific distribution of the amino acid factor features in the optimal feature set. (A) Feature-specific distribution of the amino acid factor features in the optimal feature set. It was found that the secondary structure, molecular volume, codon diversity, and polarity were almost equally important features for sulfation site prediction. The electrostatic charge amino acid factor feature has a small influence on sulfation site prediction. (B) Site-specific distribution of the amino acid factor features in the optimal feature set. Residues at sites 2, 4, and 9 have the most important effect on sulfation site prediction, and residues at sites 1, 6, 7, and 8 have a smaller effect on sulfation site prediction. Residues at site 3 have the least effect on sulfation site prediction.
Table 1. Prediction Performance Comparison
method and data set                             sample group   number of samples   number of correctly predicted samples   prediction accuracy
our method, training data set                   positive       102                 68                                      0.6667
our method, training data set                   negative       629                 590                                     0.9380
our method, training data set                   total          731                 658                                     0.9001
our method, independent test data set           positive       27                  22                                      0.8148
our method, independent test data set           negative       69                  69                                      1.0000
our method, independent test data set           total          96                  91                                      0.9479
Sulfinator, independent test data set           positive       27                  21                                      0.7778
Sulfinator, independent test data set           negative       69                  66                                      0.9565
Sulfinator, independent test data set           total          96                  87                                      0.9063
SulfoSite, independent test data set            positive       27                  20                                      0.7407
SulfoSite, independent test data set            negative       69                  67                                      0.9710
SulfoSite, independent test data set            total          96                  87                                      0.9063
SVM-based method, training data set             positive       102                 67                                      0.6569
SVM-based method, training data set             negative       629                 592                                     0.9412
SVM-based method, training data set             total          731                 659                                     0.9015
SVM-based method, independent test data set     positive       27                  20                                      0.7407
SVM-based method, independent test data set     negative       69                  68                                      0.9855
SVM-based method, independent test data set     total          96                  88                                      0.9167
3.4. Comparisons with Existing Methods. We used an independent test data set containing 96 samples, including 27 positive samples and 69 negative samples. We submitted this data set to both our method and two previously developed
methods: Sulfinator and SulfoSite. We also ran an SVM-based method on our training data set and independent test data set. The prediction accuracies for positive, negative, and total samples are shown in Table 1. As shown in Table 1, the overall prediction accuracy of our method on the independent test data set is 0.9479, which is better than that of Sulfinator (0.9063) and SulfoSite (0.9063). The overall prediction accuracy of the SVM-based method on the training data set (0.9015) is slightly better than that of our method (0.9001), but on the independent test data set, our method (0.9479) is clearly better than the SVM-based method (0.9167). Overall, our method performs slightly better than the SVM-based method for tyrosine sulfation site prediction.
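As a quick consistency check, the accuracies in Table 1 can be recomputed from the reported sample counts with the formulas of eq 8. The short Python sketch below copies the counts and accuracies from Table 1; the dictionary layout and function names are illustrative choices, and the tolerance of about half a unit in the fourth decimal place only reflects the rounding used in the table.

```python
# (correct positives, positives, correct negatives, negatives), from Table 1
counts = {
    "our method, training": (68, 102, 590, 629),
    "our method, independent test": (22, 27, 69, 69),
    "Sulfinator, independent test": (21, 27, 66, 69),
    "SulfoSite, independent test": (20, 27, 67, 69),
    "SVM-based method, training": (67, 102, 592, 629),
    "SVM-based method, independent test": (20, 27, 68, 69),
}
# (positive, negative, total) accuracies as printed in Table 1
reported = {
    "our method, training": (0.6667, 0.9380, 0.9001),
    "our method, independent test": (0.8148, 1.0000, 0.9479),
    "Sulfinator, independent test": (0.7778, 0.9565, 0.9063),
    "SulfoSite, independent test": (0.7407, 0.9710, 0.9063),
    "SVM-based method, training": (0.6569, 0.9412, 0.9015),
    "SVM-based method, independent test": (0.7407, 0.9855, 0.9167),
}

def accuracy_rates(correct_pos, pos, correct_neg, neg):
    """Eq 8 applied to raw counts: positive, negative, and overall accuracy."""
    return correct_pos / pos, correct_neg / neg, (correct_pos + correct_neg) / (pos + neg)

def max_deviation():
    """Largest gap between a recomputed accuracy and the value printed in Table 1."""
    worst = 0.0
    for key, row in counts.items():
        for recomputed, printed in zip(accuracy_rates(*row), reported[key]):
            worst = max(worst, abs(recomputed - printed))
    return worst

print(max_deviation() < 5.1e-5)
```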
3.5. Directions for Experimental Validation. The selected features at different sites may provide guidelines for researchers to find or validate new determinants of protein tyrosine sulfation. For example, among the top 10 features in the optimal feature set, two of them, the conservation status against residue D at site 4 (index 1, “pssm4.3”)18 and the electrostatic charge property of residues at site 4 (index 5, “aai4.4”),20 had been explicitly validated by researchers. The disorder status at site 2 (index 6, “disorder2”) is consistent with the observation that the peptide chain at the tyrosine sulfation site is disordered.19,45 The polarity property of the residue at site 2 (index 2, “aai2”) suggests that the previously observed influence of site 2 on the sulfation degree20 may be mediated by its polarity status. The remaining six features, namely, the conservation status against residue R at site 9 (index 3, “pssm9.1”), the codon diversity property of residues at site 8 (index 4, “aai8.3”), the molecular volume of residues at site 6 (index 7, “aai6.2”), the conservation status against residue T at site 1 (index 8, “pssm1.16”), the conservation status against residue Y at site 7 (index 9, “pssm7.18”), and the conservation status against residue I at site 1 (index 10, “pssm1.9”), are yet to be validated by experiments.
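The feature labels quoted above appear to follow a "type, site, index" pattern. The Python sketch below enumerates a 229-feature layout in that style and decodes a label into a readable description. The naming scheme is inferred from the examples in the text and is an assumption: the PSSM column order (ARNDCQEGHILKMFPSTWYV, 0-indexed) and the factor indices 2-4 match the quoted labels, whereas the assignment of indices 0 and 1 to polarity and secondary structure follows the usual ordering of the Atchley factors and is not confirmed by the text. The tyrosine is taken to be at site 5 of the 9-residue window.

```python
PSSM_ORDER = "ARNDCQEGHILKMFPSTWYV"   # assumed 0-indexed PSSM column order
AA_FACTORS = [                          # assumed 0-indexed amino acid factor order
    "polarity", "secondary structure", "molecular volume",
    "codon diversity", "electrostatic charge",
]
TYROSINE_SITE = 5                       # center of the 9-residue window

def feature_names():
    """Enumerate the features: 26 per flanking site and 21 for the tyrosine site."""
    names = []
    for site in range(1, 10):
        names += [f"pssm{site}.{k}" for k in range(20)]
        if site != TYROSINE_SITE:       # no amino acid factors for the tyrosine itself
            names += [f"aai{site}.{k}" for k in range(5)]
        names.append(f"disorder{site}")
    return names

def describe(name):
    """Turn a label such as 'pssm4.3' into a readable description."""
    if name.startswith("pssm"):
        site, k = name[4:].split(".")
        return f"conservation status against residue {PSSM_ORDER[int(k)]} at site {site}"
    if name.startswith("aai"):
        site, k = name[3:].split(".")
        return f"{AA_FACTORS[int(k)]} of the residue at site {site}"
    if name.startswith("disorder"):
        return f"disorder score at site {name[len('disorder'):]}"
    raise ValueError(f"unrecognized feature name: {name}")

print(len(feature_names()))
print(describe("pssm4.3"))
```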
4. Conclusion In this study, we developed a method for the prediction of protein tyrosine sulfation sites. Our approach considered not only sequence conservation information but also the physicochemical features of individual amino acids and the residue disorder status within the tyrosine regions. Our method achieved an overall accuracy of 90.01%. On the basis of the feature selection algorithm, a compact set of features was selected; these features are deemed to contribute significantly to the prediction of protein tyrosine sulfation. The selected features may provide important clues to the sulfation mechanism and guide the related experimental validations.
Acknowledgment. This work was supported by grants from National High-Tech R&D Program (863) (2006AA02Z334, 2007DFA31040), National Basic Research Program of China (2006CB910700), and Key Research Program (CAS) (KSCX2-YW-R-112).
Supporting Information Available: Training and independent test data sets used in this study; mRMR list; IFS result; and the optimal feature set. This material is available free of charge via the Internet at http://pubs.acs.org.
References
(1) Moore, K. L. The biology and enzymology of protein tyrosine O-sulfation. J. Biol. Chem. 2003, 278 (27), 24243–6.
(2) Huttner, W. B. Sulphation of tyrosine residues - a widespread modification of proteins. Nature 1982, 299 (5880), 273–6.
(3) Kehoe, J. W.; Bertozzi, C. R. Tyrosine sulfation: a modulator of extracellular protein-protein interactions. Chem. Biol. 2000, 7 (3), R57–61.
(4) Moore, K. L. Protein tyrosine sulfation: a critical posttranslation modification in plants and animals. Proc. Natl. Acad. Sci. U.S.A. 2009, 106 (35), 14741–2.
(5) Varin, L.; Marsolais, F.; Richard, M.; Rouleau, M. Sulfation and sulfotransferases 6: Biochemistry and molecular biology of plant sulfotransferases. Fed. Proc. 1997, 11 (7), 517–25.
(6) Ouyang, Y.; Lane, W. S.; Moore, K. L. Tyrosylprotein sulfotransferase: purification and molecular cloning of an enzyme that catalyzes tyrosine O-sulfation, a common posttranslational modification of eukaryotic proteins. Proc. Natl. Acad. Sci. U.S.A. 1998, 95 (6), 2896–901.
(7) Beisswanger, R.; Corbeil, D.; Vannier, C.; Thiele, C.; Dohrmann, U.; Kellner, R.; Ashman, K.; Niehrs, C.; Huttner, W. B. Existence of distinct tyrosylprotein sulfotransferase genes: molecular characterization of tyrosylprotein sulfotransferase-2. Proc. Natl. Acad. Sci. U.S.A. 1998, 95 (19), 11134–9.
(8) Ouyang, Y. B.; Moore, K. L. Molecular cloning and expression of human and mouse tyrosylprotein sulfotransferase-2 and a tyrosylprotein sulfotransferase homologue in Caenorhabditis elegans. J. Biol. Chem. 1998, 273 (38), 24770–4.
(9) Yu, Y.; Hoffhines, A. J.; Moore, K. L.; Leary, J. A. Determination of the sites of tyrosine O-sulfation in peptides and proteins. Nat. Methods 2007, 4 (7), 583–8.
(10) Zhang, Y.; Jiang, H.; Go, E. P.; Desaire, H. Distinguishing phosphorylation and sulfation in carbohydrates and glycoproteins using ion-pairing and mass spectrometry. J. Am. Soc. Mass Spectrom. 2006, 17 (9), 1282–8.
(11) Huttner, W. B. Protein Tyrosine Sulfation. Trends Biochem. Sci. 1987, 12 (9), 361–363.
(12) Koltsova, E.; Ley, K. Tyrosine sulfation of leukocyte adhesion molecules and chemokine receptors promotes atherosclerosis. Arterioscler., Thromb., Vasc. Biol. 2009, 29 (11), 1709–11.
(13) Liu, J.; Louie, S.; Hsu, W.; Yu, K. M.; Nicholas, H. B.; Rosenquist, G. L. Tyrosine sulfation is prevalent in human chemokine receptors important in lung disease. Am. J. Respir. Cell Mol. Biol. 2008, 38 (6), 738–743.
(14) Farzan, M.; Babcock, G. J.; Vasilieva, N.; Wright, P. L.; Kiprilov, E.; Mirzabekov, T.; Choe, H. The role of post-translational modifications of the CXCR4 amino terminus in stromal-derived factor 1 alpha association and HIV-1 entry. J. Biol. Chem. 2002, 277 (33), 29484–9.
(15) Onnerfjord, P.; Heathfield, T. F.; Heinegard, D. Identification of tyrosine sulfation in extracellular leucine-rich repeat proteins using mass spectrometry. J. Biol. Chem. 2004, 279 (1), 26–33.
(16) Salek, M.; Costagliola, S.; Lehmann, W. D. Protein tyrosine-O-sulfation analysis by exhaustive product ion scanning with minimum collision offset in a NanoESI Q-TOF tandem mass spectrometer. Anal. Chem. 2004, 76 (17), 5136–42.
(17) Huttner, W. B.
Determination and occurrence of tyrosine O-sulfate in proteins. Methods Enzymol. 1984, 107, 200–23. Chang, W. C.; Lee, T. Y.; Shien, D. M.; Hsu, J. B.; Horng, J. T.; Hsu, P. C.; Wang, T. Y.; Huang, H. D.; Pan, R. L. Incorporating support vector machine for identifying protein tyrosine sulfation sites. J. Comput. Chem. 2009, 30 (15), 2526–37. Rosenquist, G. L.; Nicholas, H. B., Jr. Analysis of sequence requirements for protein tyrosine sulfation. Protein Sci. 1993, 2 (2), 215–22. Bundgaard, J. R.; Vuust, J.; Rehfeld, J. F. New consensus features for tyrosine O-sulfation determined by mutational analysis. J. Biol. Chem. 1997, 272 (35), 21700–5. Yu, K. M.; Liu, J.; Moy, R.; Lin, H. C.; Nicholas, H. B., Jr.; Rosenquist, G. L. Prediction of tyrosine sulfation in seven-transmembrane peptide receptors. Endocrine 2002, 19 (3), 333–8. Monigatti, F.; Gasteiger, E.; Bairoch, A.; Jung, E. The Sulfinator: predicting tyrosine sulfation sites in protein sequences. Bioinformatics 2002, 18 (5), 769–70. Monigatti, F.; Hekking, B.; Steen, H. Protein sulfation analysis--A primer. Biochim. Biophys. Acta 2006, 1764 (12), 1904–13. Li, H.; Xing, X.; Ding, G.; Li, Q.; Wang, C.; Xie, L.; Zeng, R.; Li, Y. SysPTM: a systematic resource for proteomic research on posttranslational modifications. Mol. Cell. Proteomics 2009, 8 (8), 1839– 49. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38, (Database issue), D142-8. Jain, E.; Bairoch, A.; Duvaud, S.; Phan, I.; Redaschi, N.; Suzek, B. E.; Martin, M. J.; McGarvey, P.; Gasteiger, E. Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinf. 2009, 10, 136. Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17), 3389–402. Atchley, W. R.; Zhao, J.; Fernandes, A. D.; Druke, T. Solving the protein sequence metric problem. Proc. 
Natl. Acad. Sci. U.S.A. 2005, 102 (18), 6395–400. Kawashima, S.; Kanehisa, M. AAindex: amino acid index database. Nucleic Acids Res. 2000, 28 (1), 374. Wright, P. E.; Dyson, H. J. Intrinsically unstructured proteins: reassessing the protein structure-function paradigm. J. Mol. Biol. 1999, 293 (2), 321–31. Dunker, A. K.; Brown, C. J.; Lawson, J. D.; Iakoucheva, L. M.; Obradovic, Z. Intrinsic disorder and protein function. Biochemistry 2002, 41 (21), 6573–82.
research articles
Tyrosine Sulfation with mRMR Feature Selection and Analysis (32) Liu, J.; Tan, H.; Rost, B. Loopy proteins appear conserved in evolution. J. Mol. Biol. 2002, 322 (1), 53–64. (33) Tompa, P. Intrinsically unstructured proteins. Trends Biochem. Sci. 2002, 27 (10), 527–33. (34) Peng, K.; Radivojac, P.; Vucetic, S.; Dunker, A. K.; Obradovic, Z. Length-dependent prediction of protein intrinsic disorder. BMC Bioinf. 2006, 7, 208. (35) Bordoli, L.; Kiefer, F.; Schwede, T. Assessment of disorder predictions in CASP7. Proteins 2007, 69 (Suppl 8), 129–36. (36) He, B.; Wang, K.; Liu, Y.; Xue, B.; Uversky, V. N.; Dunker, A. K. Predicting intrinsic disorder in proteins: an overview. Cell Res. 2009, 19 (8), 929–49. (37) Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and minredundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27 (8), 1226–38. (38) Qian, Z.; Cai, Y. D.; Li, Y. A novel computational method to predict transcription factor DNA binding preference. Biochem. Biophys. Res. Commun. 2006, 348 (3), 1034–7. (39) Huang, T.; Cui, W.; Hu, L.; Feng, K.; Li, Y. X.; Cai, Y. D. Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles. PLoS One 2009, 4 (12), e8126. (40) Liu, M. C.; Yasuda, S.; Idell, S. Sulfation of nitrotyrosine: biochemistry and functional implications. IUBMB Life 2007, 59 (10), 622– 7.
(41) Cai, Y.; He, J.; Li, X.; Lu, L.; Yang, X.; Feng, K.; Lu, W.; Kong, X. A novel computational approach to predict transcription factor DNA binding preference. J. Proteome Res. 2009, 8 (2), 999–1003. (42) Huang, T.; Tu, K.; Shyr, Y.; Wei, C. C.; Xie, L.; Li, Y. X. The prediction of interferon treatment effects based on time series microarray gene expression profiles. J. Transl. Med. 2008, 6, 44. (43) Huang, T.; Shi, X. H.; Wang, P.; He, Z.; Feng, K. Y.; Hu, L.; Kong, X.; Li, Y. X.; Cai, Y. D.; Chou, K. C. Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks. PLoS One 2010, 5 (6), e10972. (44) Hortin, G.; Folz, R.; Gordon, J. I.; Strauss, A. W. Characterization of Sites of Tyrosine Sulfation in Proteins and Criteria for Predicting Their Occurrence. Biochem. Biophys. Res. Commun. 1986, 141 (1), 326–333. (45) Folkers, P. J.; Clore, G. M.; Driscoll, P. C.; Dodt, J.; Kohler, S.; Gronenborn, A. M. Solution structure of recombinant hirudin and the Lys-47----Glu mutant: a nuclear magnetic resonance and hybrid distance geometry-dynamical simulated annealing study. Biochemistry 1989, 28 (6), 2601–17. (46) Niehrs, C.; Beisswanger, R.; Huttner, W. B. Protein tyrosine sulfation, 1993--an update. Chem. Biol. Interact. 1994, 92 (1-3), 257–71.
PR1007152
Journal of Proteome Research • Vol. 9, No. 12, 2010 6497